mirror of
https://github.com/Hestia-Homes/Model.git
synced 2026-06-08 11:17:27 +00:00
added architechtural decisions, added to prd
This commit is contained in:
parent
3afeeac1b5
commit
d9c1696085
4 changed files with 245 additions and 0 deletions
208
CONTEXT.md
Normal file
208
CONTEXT.md
Normal file
|
|
@ -0,0 +1,208 @@
|
|||
# Ara
|
||||
|
||||
The Domna product for domestic retrofit modelling: ingests open-source EPC data, lets users correct or supersede it with their own surveys, and produces optimised retrofit packages for each property in a portfolio.
|
||||
|
||||
## Language
|
||||
|
||||
### Product
|
||||
|
||||
**Ara**:
|
||||
The Domna product. Latin for "the altar"; named under Domna's classical-naming convention. Covers both the modelling product and the backend that powers it.
|
||||
_Avoid_: ARA (acronym style), v2 backend, the new backend
|
||||
|
||||
**Domna**:
|
||||
The company. Roman name; sibling to Ara in the same naming convention.
|
||||
|
||||
### Energy Performance Certificates
|
||||
|
||||
**EPC**:
|
||||
An Energy Performance Certificate — a government-issued document rating a dwelling's energy efficiency from A (best) to G (worst).
|
||||
_Avoid_: energy certificate, energy report
|
||||
|
||||
**Certificate Number**:
|
||||
The unique identifier assigned to an EPC by the government registry.
|
||||
_Avoid_: cert number, EPC ID
|
||||
|
||||
**Registration Date**:
|
||||
The date an EPC was lodged with the government register; used to identify the most recent certificate for a property.
|
||||
_Avoid_: assessment date, submission date
|
||||
|
||||
**EPC Band**:
|
||||
A single letter A–G representing a property's current or potential energy efficiency rating.
|
||||
_Avoid_: energy rating, EPC grade, EPC score
|
||||
|
||||
**Schema Type**:
|
||||
The versioned RdSAP or SAP schema that describes the structure of an EPC's raw data (e.g. `RdSAP-Schema-21.0.1`).
|
||||
_Avoid_: schema version, EPC format
|
||||
|
||||
**Domestic Certificate**:
|
||||
An EPC issued for a residential dwelling, as opposed to a commercial one.
|
||||
_Avoid_: residential EPC, home EPC
|
||||
|
||||
### Properties and addresses
|
||||
|
||||
**Property**:
|
||||
The Ara domain aggregate representing a single dwelling under modelling: its identity, source data, enrichments, and modelling outputs.
|
||||
_Avoid_: dwelling, unit, home, asset
|
||||
|
||||
**Properties**:
|
||||
A first-class collection of Property objects; the unit of bulk operation in services.
|
||||
_Avoid_: property list, batch (used for SQS chunks)
|
||||
|
||||
**UPRN**:
|
||||
Unique Property Reference Number — the government-issued permanent identifier for a physical address in the UK.
|
||||
_Avoid_: property ID, address ID, code
|
||||
|
||||
**Postcode**:
|
||||
A UK postal code used to group nearby addresses; the primary search key for finding EPC records.
|
||||
_Avoid_: zip code, postal code
|
||||
|
||||
**User Address**:
|
||||
A free-text address string provided by a user or imported from a customer dataset, before any normalisation or matching.
|
||||
_Avoid_: user input, raw address, user_inputed_address
|
||||
|
||||
**Comparable Properties**:
|
||||
The reference cohort matched to a target Property by both geographic proximity (postcode prefix / UPRN range) and physical similarity (property type, built form, age band); used by the EPC Prediction Service for gap-filling and anomaly detection.
|
||||
_Avoid_: neighbours, similar properties, peer set
|
||||
|
||||
### Source data
|
||||
|
||||
**Site Notes**:
|
||||
The full-coverage record produced by a Domna survey of a single Property; carries every EPC field the modelling pipeline requires, and when present supersedes the public EPC for that Property — except when the public EPC is newer.
|
||||
_Avoid_: energy assessment, site survey, field survey, Domna survey, Hestia survey
|
||||
|
||||
**Landlord Overrides**:
|
||||
Property data supplied by a landlord that may correct or supplement the public EPC for a single Property; triggers Rebaselining when applied; not applicable when Site Notes are present.
|
||||
_Avoid_: patches (deprecated), corrections, manual EPC, edits
|
||||
|
||||
### Modelling
|
||||
|
||||
**Effective EPC**:
|
||||
The EpcPropertyData scored by the modelling pipeline for a single Property, derived from either Site Notes alone or the public EPC with Landlord Overrides applied; carries source-derived physical fields and originally recorded performance values, with model-rebaselined performance held separately in Baseline Performance.
|
||||
_Avoid_: modelling EPC, working EPC, resolved EPC, derived EPC
|
||||
|
||||
**Rebaselining**:
|
||||
Recomputing a Property's Baseline Performance via ML when its Effective EPC diverges from the originally lodged public EPC, or when no previous baseline exists.
|
||||
_Avoid_: re-scoring, re-prediction, performance recomputation
|
||||
|
||||
**Baseline Performance**:
|
||||
The set of ML-predicted performance values for a single Property — SAP, carbon emissions, heat demand, annual kWh — produced by scoring the Effective EPC against the kWh model; distinct from the originally recorded performance fields on the Effective EPC.
|
||||
_Avoid_: baseline predictions, predicted baseline, rebaselined values
|
||||
|
||||
**EPC Anomaly Flag**:
|
||||
A per-field indicator that a Property's value for an EPC field differs significantly from Comparable Properties; advisory only — surfaces in the UI to prompt user review, does not block modelling.
|
||||
_Avoid_: outlier, mismatch, divergence flag
|
||||
|
||||
### Outputs
|
||||
|
||||
**Scenario**:
|
||||
A named portfolio-level container for a single modelling run, capturing the goal (e.g. Increasing EPC), budget, exclusions, and housing type; holds many Plans.
|
||||
_Avoid_: project, batch, run-set
|
||||
|
||||
**Plan**:
|
||||
The per-Property output of a single modelling run; belongs to one Scenario and carries the Property's full Recommendation list, Optimised Package, and post-retrofit predictions.
|
||||
_Avoid_: recommendation set, output, result
|
||||
|
||||
**Recommendation**:
|
||||
A single proposed retrofit measure for a Property, with its cost, SAP impact, kWh savings, carbon savings, and parts list.
|
||||
_Avoid_: suggestion, option
|
||||
|
||||
**Optimised Package**:
|
||||
The subset of a Property's Recommendations selected by the Optimiser Service for installation, chosen to satisfy the Scenario's goal subject to budget.
|
||||
_Avoid_: selected measures, default measures, optimal solution, recommended bundle
|
||||
|
||||
**Measure Type**:
|
||||
The catalogue classification of a retrofit measure (e.g. `solar_pv`, `loft_insulation`, `ashp`); one or more Recommendations reference the same Measure Type with property-specific cost and impact.
|
||||
_Avoid_: measure (ambiguous), category
|
||||
|
||||
### Address matching
|
||||
|
||||
**Lexiscore**:
|
||||
A similarity score in [0, 1] between a User Address and a candidate EPC address; combines token overlap and character-level similarity.
|
||||
_Avoid_: score, match score, similarity
|
||||
|
||||
**Lexirank**:
|
||||
Dense rank of candidates sorted by Lexiscore descending; rank 1 = best match.
|
||||
_Avoid_: rank, position
|
||||
|
||||
**UPRN Candidate**:
|
||||
An EPC Search Result that is a plausible match for a given User Address, before scoring decides the winner.
|
||||
_Avoid_: match candidate, result
|
||||
|
||||
**Score Threshold**:
|
||||
The minimum Lexiscore (currently 0.6) below which no match is returned even if a candidate exists.
|
||||
_Avoid_: minimum score, cutoff
|
||||
|
||||
**Ambiguous Match**:
|
||||
A matching outcome where two or more candidates share Lexirank 1, making it impossible to select a unique winner.
|
||||
_Avoid_: tie, draw, duplicate
|
||||
|
||||
**Best Match**:
|
||||
The single UPRN Candidate with Lexirank 1 that meets or exceeds the Score Threshold.
|
||||
_Avoid_: winner, top result
|
||||
|
||||
### API and integration
|
||||
|
||||
**EPC Search Result**:
|
||||
A lightweight record returned by the government domestic search endpoint — address lines, postcode, UPRN, band, and certificate number, but not full certificate data.
|
||||
_Avoid_: search row, EPC row, result
|
||||
|
||||
**EPC Property Data**:
|
||||
The fully mapped domain object produced after fetching and parsing a complete EPC certificate; the schema the modelling pipeline operates against.
|
||||
_Avoid_: EPC data, certificate data, parsed EPC
|
||||
|
||||
**Old EPC API**:
|
||||
The retired government API (`epc.opendatacommunities.org`) using HTTP Basic auth; decommissioned 30 May 2026.
|
||||
_Avoid_: legacy API
|
||||
|
||||
**New EPC API**:
|
||||
The replacement government API (`api.get-energy-performance-data.communities.gov.uk`) using Bearer Token auth.
|
||||
_Avoid_: new API, current API
|
||||
|
||||
**Bearer Token**:
|
||||
The auth credential required by the New EPC API; stored in the `EPC_AUTH_TOKEN` environment variable.
|
||||
_Avoid_: API key, auth token, secret
|
||||
|
||||
## Relationships
|
||||
|
||||
- A **Property** represents a single physical dwelling for modelling; identified by `(portfolio_id, UPRN)` or `(portfolio_id, landlord_property_id)`.
|
||||
- A **Property** has zero or more **EPCs** across time, exactly one **Effective EPC**, zero or one set of **Site Notes**, and zero or one set of **Landlord Overrides**.
|
||||
- An **EPC** belongs to exactly one **Property** and has one **Certificate Number**.
|
||||
- An **EPC** carries an **EPC Band** and is identifiable by its **Registration Date**; the most recent one is the current.
|
||||
- A **UPRN** identifies a physical dwelling permanently; it does not change when the property changes owner — but each portfolio gets its own **Property** keyed against it.
|
||||
- When a **Property** has both **Site Notes** and a public **EPC**, the newer of the two derives the **Effective EPC**. **Landlord Overrides** apply only when the **EPC** is the source — never when **Site Notes** are.
|
||||
- **Rebaselining** produces **Baseline Performance** for a Property; triggered when the **Effective EPC** diverges from the originally lodged EPC (because of **Site Notes**, **Landlord Overrides**, an expired EPC, or an estimated EPC).
|
||||
- The **EPC Prediction Service** uses **Comparable Properties** for both gap-filling and producing **EPC Anomaly Flags**.
|
||||
- A **Scenario** contains many **Plans** (one per Property). A **Plan** carries many **Recommendations**; the **Optimised Package** is the subset selected for installation.
|
||||
- A **Recommendation** references one **Measure Type** and carries property-specific cost and impact.
|
||||
- **Address Matching** uses a **User Address** and **Postcode** to find a **UPRN** by scoring **UPRN Candidates** from an EPC search. A **Lexirank** of 1 with no **Ambiguous Match** and a **Lexiscore** ≥ the **Score Threshold** produces a **Best Match**.
|
||||
|
||||
## Example dialogue
|
||||
|
||||
> **Dev:** "A landlord uploads a corrected boiler for one of their properties. What happens?"
|
||||
>
|
||||
> **Domain expert:** "That's a **Landlord Override** on the heating fields. Save it against the **Property**, then trigger **Rebaselining** — the **Effective EPC** has changed, so we need fresh **Baseline Performance** before regenerating **Recommendations**."
|
||||
|
||||
> **Dev:** "What if the same Property also has Site Notes?"
|
||||
>
|
||||
> **Domain expert:** "**Site Notes** supersede the public **EPC**, so **Landlord Overrides** don't apply. We model from the **Site Notes** version of the **Effective EPC**. If the public **EPC** is newer than the **Site Notes**, that's the one exception — we use the newer one."
|
||||
|
||||
> **Dev:** "After modelling we end up with a list of measures. Which ones get installed?"
|
||||
>
|
||||
> **Domain expert:** "The **Optimiser Service** picks the **Optimised Package** — a subset of **Recommendations** that hits the **Scenario** goal within budget. The rest stay in the **Plan** as alternatives the user can swap in."
|
||||
|
||||
> **Dev:** "I'm looking at a property where the EPC says cavity walls but every other house on the street has solid. Is that a bug?"
|
||||
>
|
||||
> **Domain expert:** "That's an **EPC Anomaly Flag**. We compute it against the **Comparable Properties** for that postcode. It's advisory — the UI surfaces it and the landlord can apply a **Landlord Override** if it's wrong."
|
||||
|
||||
## Flagged ambiguities
|
||||
|
||||
- **"property"** was historically warned against in favour of "dwelling"; that has been inverted. **Property** is now canonical for the Ara domain aggregate. Legacy code still uses "dwelling" in places — treat as alias.
|
||||
- **"energy assessment"** in the existing codebase (`energy_assessment_functions`, `energy_assessments_by_uprn`) refers to what is now canonically called **Site Notes**. New code uses **Site Notes**.
|
||||
- **"patch"** / `patch_epc` in the existing codebase has been merged into **Landlord Overrides**; the original concept is deprecated.
|
||||
- **"already_installed measures"** in the existing codebase is likely subsumed by **Landlord Overrides** ("we have a heat pump now" → override the heating fields). Final call deferred to implementation.
|
||||
- **"address"** appears as both the raw **User Address** (free-text) and a structured field on an **EPC Search Result** (normalised lines). Always qualify: "user address" vs "EPC address" or "address line 1".
|
||||
- **"score"** is used for `AddressMatch.score()` output, the `lexiscore` column, and informally. Prefer **Lexiscore** in domain discussions; reserve "score" for method-level code comments.
|
||||
- **"user_inputed_address"** in `backend/address2UPRN/main.py` is a misspelling and a synonym for **User Address** — the canonical term. New code should use `user_address`.
|
||||
- **"EPC"** is overloaded as both the document and the rating band letter. Use **EPC** for the document, **EPC Band** for the letter.
|
||||
- **"re-scoring"** has two meanings in the codebase — **Rebaselining** (re-predicting baseline performance after an EPC change) and post-optimisation measure re-prediction. Prefer **Rebaselining** for the former; for the latter, the **Optimiser Service** step does its own scoring without a special name.
|
||||
10
docs/adr/0001-two-source-paths.md
Normal file
10
docs/adr/0001-two-source-paths.md
Normal file
|
|
@ -0,0 +1,10 @@
|
|||
# Two source paths for a Property, not layered precedence
|
||||
|
||||
For modelling a Property we considered a strict layered precedence stack — `patches > site_notes > energy_assessment > epc > predicted` — with per-field provenance tracking. We rejected that in favour of **two strictly disjoint source paths**: a Property is modelled either from its Site Notes alone, or from the public EPC with Landlord Overrides applied on top. Site Notes are committed to being full-coverage by the domain ([CONTEXT.md](../../CONTEXT.md): _Site Notes_), so once we have them the EPC is irrelevant; conversely, Landlord Overrides are only meaningful when the EPC is the source of physical state.
|
||||
|
||||
The trade-off: layered precedence is more flexible (it tolerates a partial Site Notes survey by falling through to EPC for missing fields), but mixed-source data muddles the audit trail and undermines the "if we surveyed it, trust the survey" promise. The two-path model gives a cleaner derivation rule and an unambiguous source-of-truth per Property, at the cost of treating survey gaps as a survey-quality bug rather than a fallback signal. A Recency Tie-Break covers the one case where both exist: the newer of the two wins.
|
||||
|
||||
## Consequences
|
||||
|
||||
- Reversing this means rewriting `Property.effective_epc` and every service that reads it. Hard to roll back once 12 services depend on the two-path shape.
|
||||
- Future addition of a third path (e.g. partial-survey) is a real change, not just a config tweak — flag it as an ADR if proposed.
|
||||
14
docs/adr/0002-property-aggregate-root.md
Normal file
14
docs/adr/0002-property-aggregate-root.md
Normal file
|
|
@ -0,0 +1,14 @@
|
|||
# `Property` is the aggregate root, not `EpcPropertyData`
|
||||
|
||||
The Ara modelling pipeline produces nine slices of per-property data (EPC, geospatial, solar, baseline performance, recommendations, optimised package, etc.). We considered making `EpcPropertyData` — the rich RdSAP-21-style EPC schema — the centrepiece, with other data hanging off it. We rejected that and introduced a new **`Property` aggregate root** that holds identity, all source data (EPC, Site Notes, Landlord Overrides), enrichments, and modelling outputs as named fields. Services take `Property` (or `Properties`) and return them with one slice populated.
|
||||
|
||||
Two reasons drove this:
|
||||
1. **Geospatial, solar, recommendations, and overrides are peers to the EPC**, not properties of it. Putting them on `EpcPropertyData` conflates physical-state schema with modelling-run state.
|
||||
2. **A typed `ModellingContext` dict-bag (the obvious alternative)** is exactly what the current legacy `Property` class became — 1259 lines of accumulated stuff, hard to read, hard to test, hard to extend. Named fields on a dataclass force the type system to keep us honest.
|
||||
|
||||
The cost is more domain types up front (`Property`, `Properties`, `PropertyIdentity`, `BaselinePerformance`, `OptimisedPackage`, etc.) and the discipline of one service writing one slice. The benefit is that every service has a single job and every test injects fake repos against a small, named structure.
|
||||
|
||||
## Consequences
|
||||
|
||||
- Every service signature accepts or returns `Property` / `Properties`. Refactoring later means touching all of them.
|
||||
- `EpcPropertyData` stays a pure physical-state schema (defined in [datatypes/epc/domain/epc_property_data.py](../../datatypes/epc/domain/epc_property_data.py)) — no modelling outputs or run state on it.
|
||||
13
docs/adr/0003-strict-ingestion-modelling-separation.md
Normal file
13
docs/adr/0003-strict-ingestion-modelling-separation.md
Normal file
|
|
@ -0,0 +1,13 @@
|
|||
# Strict separation between Ingestion and Modelling
|
||||
|
||||
Data flows one way only: **Ingestion → Repos → Modelling**. Modelling services never make external HTTP calls; Ingestion services never run business logic. If Modelling needs fresh data, it sees a stale record in a repo and returns; the caller (a refresh orchestrator or the FE) decides whether to ingest first. We considered allowing modelling services to call fetchers directly on cache miss — convenient — and rejected it.
|
||||
|
||||
The trade-off is that modelling cannot "self-heal" by going to the gov EPC API when it finds stale data. The benefit is that modelling becomes a deterministic function of repository state: same Property in the repos, same modelling output. That is the property that makes modelling unit-testable against fakes (no DB, no network, no ML lambda), reproducible, and debuggable. It also enables a per-property UI flow where fetched data is shown to the user for review and possible override **before** modelling runs.
|
||||
|
||||
Under the rushed timeline this constraint is more valuable, not less. Mixing fetchers into services is the easy thing to do when shipping fast; once it's done it's hard to extract.
|
||||
|
||||
## Consequences
|
||||
|
||||
- Every modelling service depends only on Repos (and other Services / domain logic). No HTTP libraries in the modelling import graph.
|
||||
- A `RefreshOrchestrator` is the only thing that calls Ingestion then Modelling in sequence; nothing else may.
|
||||
- "Modelling is stale, refetch in-line" is a forbidden pattern — surface staleness, do not silently repair it.
|
||||
Loading…
Add table
Reference in a new issue