added architechtural decisions, added to prd

2026-06-08 11:17:27 +00:00 · 2026-05-13 21:26:18 +00:00 · 2026-05-13 21:26:18 +00:00 · d9c1696085
commit d9c1696085
parent 3afeeac1b5
4 changed files with 245 additions and 0 deletions
--- a/CONTEXT.md
+++ b/CONTEXT.md
@ -0,0 +1,208 @@
+# Ara
+
+The Domna product for domestic retrofit modelling: ingests open-source EPC data, lets users correct or supersede it with their own surveys, and produces optimised retrofit packages for each property in a portfolio.
+
+## Language
+
+### Product
+
+**Ara**:
+The Domna product. Latin for "the altar"; named under Domna's classical-naming convention. Covers both the modelling product and the backend that powers it.
+_Avoid_: ARA (acronym style), v2 backend, the new backend
+
+**Domna**:
+The company. Roman name; sibling to Ara in the same naming convention.
+
+### Energy Performance Certificates
+
+**EPC**:
+An Energy Performance Certificate — a government-issued document rating a dwelling's energy efficiency from A (best) to G (worst).
+_Avoid_: energy certificate, energy report
+
+**Certificate Number**:
+The unique identifier assigned to an EPC by the government registry.
+_Avoid_: cert number, EPC ID
+
+**Registration Date**:
+The date an EPC was lodged with the government register; used to identify the most recent certificate for a property.
+_Avoid_: assessment date, submission date
+
+**EPC Band**:
+A single letter A–G representing a property's current or potential energy efficiency rating.
+_Avoid_: energy rating, EPC grade, EPC score
+
+**Schema Type**:
+The versioned RdSAP or SAP schema that describes the structure of an EPC's raw data (e.g. `RdSAP-Schema-21.0.1`).
+_Avoid_: schema version, EPC format
+
+**Domestic Certificate**:
+An EPC issued for a residential dwelling, as opposed to a commercial one.
+_Avoid_: residential EPC, home EPC
+
+### Properties and addresses
+
+**Property**:
+The Ara domain aggregate representing a single dwelling under modelling: its identity, source data, enrichments, and modelling outputs.
+_Avoid_: dwelling, unit, home, asset
+
+**Properties**:
+A first-class collection of Property objects; the unit of bulk operation in services.
+_Avoid_: property list, batch (used for SQS chunks)
+
+**UPRN**:
+Unique Property Reference Number — the government-issued permanent identifier for a physical address in the UK.
+_Avoid_: property ID, address ID, code
+
+**Postcode**:
+A UK postal code used to group nearby addresses; the primary search key for finding EPC records.
+_Avoid_: zip code, postal code
+
+**User Address**:
+A free-text address string provided by a user or imported from a customer dataset, before any normalisation or matching.
+_Avoid_: user input, raw address, user_inputed_address
+
+**Comparable Properties**:
+The reference cohort matched to a target Property by both geographic proximity (postcode prefix / UPRN range) and physical similarity (property type, built form, age band); used by the EPC Prediction Service for gap-filling and anomaly detection.
+_Avoid_: neighbours, similar properties, peer set
+
+### Source data
+
+**Site Notes**:
+The full-coverage record produced by a Domna survey of a single Property; carries every EPC field the modelling pipeline requires, and when present supersedes the public EPC for that Property — except when the public EPC is newer.
+_Avoid_: energy assessment, site survey, field survey, Domna survey, Hestia survey
+
+**Landlord Overrides**:
+Property data supplied by a landlord that may correct or supplement the public EPC for a single Property; triggers Rebaselining when applied; not applicable when Site Notes are present.
+_Avoid_: patches (deprecated), corrections, manual EPC, edits
+
+### Modelling
+
+**Effective EPC**:
+The EpcPropertyData scored by the modelling pipeline for a single Property, derived from either Site Notes alone or the public EPC with Landlord Overrides applied; carries source-derived physical fields and originally recorded performance values, with model-rebaselined performance held separately in Baseline Performance.
+_Avoid_: modelling EPC, working EPC, resolved EPC, derived EPC
+
+**Rebaselining**:
+Recomputing a Property's Baseline Performance via ML when its Effective EPC diverges from the originally lodged public EPC, or when no previous baseline exists.
+_Avoid_: re-scoring, re-prediction, performance recomputation
+
+**Baseline Performance**:
+The set of ML-predicted performance values for a single Property — SAP, carbon emissions, heat demand, annual kWh — produced by scoring the Effective EPC against the kWh model; distinct from the originally recorded performance fields on the Effective EPC.
+_Avoid_: baseline predictions, predicted baseline, rebaselined values
+
+**EPC Anomaly Flag**:
+A per-field indicator that a Property's value for an EPC field differs significantly from Comparable Properties; advisory only — surfaces in the UI to prompt user review, does not block modelling.
+_Avoid_: outlier, mismatch, divergence flag
+
+### Outputs
+
+**Scenario**:
+A named portfolio-level container for a single modelling run, capturing the goal (e.g. Increasing EPC), budget, exclusions, and housing type; holds many Plans.
+_Avoid_: project, batch, run-set
+
+**Plan**:
+The per-Property output of a single modelling run; belongs to one Scenario and carries the Property's full Recommendation list, Optimised Package, and post-retrofit predictions.
+_Avoid_: recommendation set, output, result
+
+**Recommendation**:
+A single proposed retrofit measure for a Property, with its cost, SAP impact, kWh savings, carbon savings, and parts list.
+_Avoid_: suggestion, option
+
+**Optimised Package**:
+The subset of a Property's Recommendations selected by the Optimiser Service for installation, chosen to satisfy the Scenario's goal subject to budget.
+_Avoid_: selected measures, default measures, optimal solution, recommended bundle
+
+**Measure Type**:
+The catalogue classification of a retrofit measure (e.g. `solar_pv`, `loft_insulation`, `ashp`); one or more Recommendations reference the same Measure Type with property-specific cost and impact.
+_Avoid_: measure (ambiguous), category
+
+### Address matching
+
+**Lexiscore**:
+A similarity score in [0, 1] between a User Address and a candidate EPC address; combines token overlap and character-level similarity.
+_Avoid_: score, match score, similarity
+
+**Lexirank**:
+Dense rank of candidates sorted by Lexiscore descending; rank 1 = best match.
+_Avoid_: rank, position
+
+**UPRN Candidate**:
+An EPC Search Result that is a plausible match for a given User Address, before scoring decides the winner.
+_Avoid_: match candidate, result
+
+**Score Threshold**:
+The minimum Lexiscore (currently 0.6) below which no match is returned even if a candidate exists.
+_Avoid_: minimum score, cutoff
+
+**Ambiguous Match**:
+A matching outcome where two or more candidates share Lexirank 1, making it impossible to select a unique winner.
+_Avoid_: tie, draw, duplicate
+
+**Best Match**:
+The single UPRN Candidate with Lexirank 1 that meets or exceeds the Score Threshold.
+_Avoid_: winner, top result
+
+### API and integration
+
+**EPC Search Result**:
+A lightweight record returned by the government domestic search endpoint — address lines, postcode, UPRN, band, and certificate number, but not full certificate data.
+_Avoid_: search row, EPC row, result
+
+**EPC Property Data**:
+The fully mapped domain object produced after fetching and parsing a complete EPC certificate; the schema the modelling pipeline operates against.
+_Avoid_: EPC data, certificate data, parsed EPC
+
+**Old EPC API**:
+The retired government API (`epc.opendatacommunities.org`) using HTTP Basic auth; decommissioned 30 May 2026.
+_Avoid_: legacy API
+
+**New EPC API**:
+The replacement government API (`api.get-energy-performance-data.communities.gov.uk`) using Bearer Token auth.
+_Avoid_: new API, current API
+
+**Bearer Token**:
+The auth credential required by the New EPC API; stored in the `EPC_AUTH_TOKEN` environment variable.
+_Avoid_: API key, auth token, secret
+
+## Relationships
+
+- A **Property** represents a single physical dwelling for modelling; identified by `(portfolio_id, UPRN)` or `(portfolio_id, landlord_property_id)`.
+- A **Property** has zero or more **EPCs** across time, exactly one **Effective EPC**, zero or one set of **Site Notes**, and zero or one set of **Landlord Overrides**.
+- An **EPC** belongs to exactly one **Property** and has one **Certificate Number**.
+- An **EPC** carries an **EPC Band** and is identifiable by its **Registration Date**; the most recent one is the current.
+- A **UPRN** identifies a physical dwelling permanently; it does not change when the property changes owner — but each portfolio gets its own **Property** keyed against it.
+- When a **Property** has both **Site Notes** and a public **EPC**, the newer of the two derives the **Effective EPC**. **Landlord Overrides** apply only when the **EPC** is the source — never when **Site Notes** are.
+- **Rebaselining** produces **Baseline Performance** for a Property; triggered when the **Effective EPC** diverges from the originally lodged EPC (because of **Site Notes**, **Landlord Overrides**, an expired EPC, or an estimated EPC).
+- The **EPC Prediction Service** uses **Comparable Properties** for both gap-filling and producing **EPC Anomaly Flags**.
+- A **Scenario** contains many **Plans** (one per Property). A **Plan** carries many **Recommendations**; the **Optimised Package** is the subset selected for installation.
+- A **Recommendation** references one **Measure Type** and carries property-specific cost and impact.
+- **Address Matching** uses a **User Address** and **Postcode** to find a **UPRN** by scoring **UPRN Candidates** from an EPC search. A **Lexirank** of 1 with no **Ambiguous Match** and a **Lexiscore** ≥ the **Score Threshold** produces a **Best Match**.
+
+## Example dialogue
+
+> **Dev:** "A landlord uploads a corrected boiler for one of their properties. What happens?"
+>
+> **Domain expert:** "That's a **Landlord Override** on the heating fields. Save it against the **Property**, then trigger **Rebaselining** — the **Effective EPC** has changed, so we need fresh **Baseline Performance** before regenerating **Recommendations**."
+
+> **Dev:** "What if the same Property also has Site Notes?"
+>
+> **Domain expert:** "**Site Notes** supersede the public **EPC**, so **Landlord Overrides** don't apply. We model from the **Site Notes** version of the **Effective EPC**. If the public **EPC** is newer than the **Site Notes**, that's the one exception — we use the newer one."
+
+> **Dev:** "After modelling we end up with a list of measures. Which ones get installed?"
+>
+> **Domain expert:** "The **Optimiser Service** picks the **Optimised Package** — a subset of **Recommendations** that hits the **Scenario** goal within budget. The rest stay in the **Plan** as alternatives the user can swap in."
+
+> **Dev:** "I'm looking at a property where the EPC says cavity walls but every other house on the street has solid. Is that a bug?"
+>
+> **Domain expert:** "That's an **EPC Anomaly Flag**. We compute it against the **Comparable Properties** for that postcode. It's advisory — the UI surfaces it and the landlord can apply a **Landlord Override** if it's wrong."
+
+## Flagged ambiguities
+
+- **"property"** was historically warned against in favour of "dwelling"; that has been inverted. **Property** is now canonical for the Ara domain aggregate. Legacy code still uses "dwelling" in places — treat as alias.
+- **"energy assessment"** in the existing codebase (`energy_assessment_functions`, `energy_assessments_by_uprn`) refers to what is now canonically called **Site Notes**. New code uses **Site Notes**.
+- **"patch"** / `patch_epc` in the existing codebase has been merged into **Landlord Overrides**; the original concept is deprecated.
+- **"already_installed measures"** in the existing codebase is likely subsumed by **Landlord Overrides** ("we have a heat pump now" → override the heating fields). Final call deferred to implementation.
+- **"address"** appears as both the raw **User Address** (free-text) and a structured field on an **EPC Search Result** (normalised lines). Always qualify: "user address" vs "EPC address" or "address line 1".
+- **"score"** is used for `AddressMatch.score()` output, the `lexiscore` column, and informally. Prefer **Lexiscore** in domain discussions; reserve "score" for method-level code comments.
+- **"user_inputed_address"** in `backend/address2UPRN/main.py` is a misspelling and a synonym for **User Address** — the canonical term. New code should use `user_address`.
+- **"EPC"** is overloaded as both the document and the rating band letter. Use **EPC** for the document, **EPC Band** for the letter.
+- **"re-scoring"** has two meanings in the codebase — **Rebaselining** (re-predicting baseline performance after an EPC change) and post-optimisation measure re-prediction. Prefer **Rebaselining** for the former; for the latter, the **Optimiser Service** step does its own scoring without a special name.
--- a/docs/adr/0001-two-source-paths.md
+++ b/docs/adr/0001-two-source-paths.md
@ -0,0 +1,10 @@
+# Two source paths for a Property, not layered precedence
+
+For modelling a Property we considered a strict layered precedence stack — `patches > site_notes > energy_assessment > epc > predicted` — with per-field provenance tracking. We rejected that in favour of **two strictly disjoint source paths**: a Property is modelled either from its Site Notes alone, or from the public EPC with Landlord Overrides applied on top. Site Notes are committed to being full-coverage by the domain ([CONTEXT.md](../../CONTEXT.md): _Site Notes_), so once we have them the EPC is irrelevant; conversely, Landlord Overrides are only meaningful when the EPC is the source of physical state.
+
+The trade-off: layered precedence is more flexible (it tolerates a partial Site Notes survey by falling through to EPC for missing fields), but mixed-source data muddles the audit trail and undermines the "if we surveyed it, trust the survey" promise. The two-path model gives a cleaner derivation rule and an unambiguous source-of-truth per Property, at the cost of treating survey gaps as a survey-quality bug rather than a fallback signal. A Recency Tie-Break covers the one case where both exist: the newer of the two wins.
+
+## Consequences
+
+- Reversing this means rewriting `Property.effective_epc` and every service that reads it. Hard to roll back once 12 services depend on the two-path shape.
+- Future addition of a third path (e.g. partial-survey) is a real change, not just a config tweak — flag it as an ADR if proposed.
--- a/docs/adr/0002-property-aggregate-root.md
+++ b/docs/adr/0002-property-aggregate-root.md
@ -0,0 +1,14 @@
+# `Property` is the aggregate root, not `EpcPropertyData`
+
+The Ara modelling pipeline produces nine slices of per-property data (EPC, geospatial, solar, baseline performance, recommendations, optimised package, etc.). We considered making `EpcPropertyData` — the rich RdSAP-21-style EPC schema — the centrepiece, with other data hanging off it. We rejected that and introduced a new **`Property` aggregate root** that holds identity, all source data (EPC, Site Notes, Landlord Overrides), enrichments, and modelling outputs as named fields. Services take `Property` (or `Properties`) and return them with one slice populated.
+
+Two reasons drove this:
+1. **Geospatial, solar, recommendations, and overrides are peers to the EPC**, not properties of it. Putting them on `EpcPropertyData` conflates physical-state schema with modelling-run state.
+2. **A typed `ModellingContext` dict-bag (the obvious alternative)** is exactly what the current legacy `Property` class became — 1259 lines of accumulated stuff, hard to read, hard to test, hard to extend. Named fields on a dataclass force the type system to keep us honest.
+
+The cost is more domain types up front (`Property`, `Properties`, `PropertyIdentity`, `BaselinePerformance`, `OptimisedPackage`, etc.) and the discipline of one service writing one slice. The benefit is that every service has a single job and every test injects fake repos against a small, named structure.
+
+## Consequences
+
+- Every service signature accepts or returns `Property` / `Properties`. Refactoring later means touching all of them.
+- `EpcPropertyData` stays a pure physical-state schema (defined in [datatypes/epc/domain/epc_property_data.py](../../datatypes/epc/domain/epc_property_data.py)) — no modelling outputs or run state on it.
--- a/docs/adr/0003-strict-ingestion-modelling-separation.md
+++ b/docs/adr/0003-strict-ingestion-modelling-separation.md
@ -0,0 +1,13 @@
+# Strict separation between Ingestion and Modelling
+
+Data flows one way only: **Ingestion → Repos → Modelling**. Modelling services never make external HTTP calls; Ingestion services never run business logic. If Modelling needs fresh data, it sees a stale record in a repo and returns; the caller (a refresh orchestrator or the FE) decides whether to ingest first. We considered allowing modelling services to call fetchers directly on cache miss — convenient — and rejected it.
+
+The trade-off is that modelling cannot "self-heal" by going to the gov EPC API when it finds stale data. The benefit is that modelling becomes a deterministic function of repository state: same Property in the repos, same modelling output. That is the property that makes modelling unit-testable against fakes (no DB, no network, no ML lambda), reproducible, and debuggable. It also enables a per-property UI flow where fetched data is shown to the user for review and possible override **before** modelling runs.
+
+Under the rushed timeline this constraint is more valuable, not less. Mixing fetchers into services is the easy thing to do when shipping fast; once it's done it's hard to extract.
+
+## Consequences
+
+- Every modelling service depends only on Repos (and other Services / domain logic). No HTTP libraries in the modelling import graph.
+- A `RefreshOrchestrator` is the only thing that calls Ingestion then Modelling in sequence; nothing else may.
+- "Modelling is stale, refetch in-line" is a forbidden pattern — surface staleness, do not silently repair it.