From d9c169608528475e0ed33048b8e637521143b9ad Mon Sep 17 00:00:00 2001 From: Khalim Conn-Kowlessar Date: Wed, 13 May 2026 21:26:18 +0000 Subject: [PATCH] added architechtural decisions, added to prd --- CONTEXT.md | 208 ++++++++++++++++++ docs/adr/0001-two-source-paths.md | 10 + docs/adr/0002-property-aggregate-root.md | 14 ++ ...3-strict-ingestion-modelling-separation.md | 13 ++ 4 files changed, 245 insertions(+) create mode 100644 CONTEXT.md create mode 100644 docs/adr/0001-two-source-paths.md create mode 100644 docs/adr/0002-property-aggregate-root.md create mode 100644 docs/adr/0003-strict-ingestion-modelling-separation.md diff --git a/CONTEXT.md b/CONTEXT.md new file mode 100644 index 00000000..bd71d6b5 --- /dev/null +++ b/CONTEXT.md @@ -0,0 +1,208 @@ +# Ara + +The Domna product for domestic retrofit modelling: ingests open-source EPC data, lets users correct or supersede it with their own surveys, and produces optimised retrofit packages for each property in a portfolio. + +## Language + +### Product + +**Ara**: +The Domna product. Latin for "the altar"; named under Domna's classical-naming convention. Covers both the modelling product and the backend that powers it. +_Avoid_: ARA (acronym style), v2 backend, the new backend + +**Domna**: +The company. Roman name; sibling to Ara in the same naming convention. + +### Energy Performance Certificates + +**EPC**: +An Energy Performance Certificate — a government-issued document rating a dwelling's energy efficiency from A (best) to G (worst). +_Avoid_: energy certificate, energy report + +**Certificate Number**: +The unique identifier assigned to an EPC by the government registry. +_Avoid_: cert number, EPC ID + +**Registration Date**: +The date an EPC was lodged with the government register; used to identify the most recent certificate for a property. +_Avoid_: assessment date, submission date + +**EPC Band**: +A single letter A–G representing a property's current or potential energy efficiency rating. +_Avoid_: energy rating, EPC grade, EPC score + +**Schema Type**: +The versioned RdSAP or SAP schema that describes the structure of an EPC's raw data (e.g. `RdSAP-Schema-21.0.1`). +_Avoid_: schema version, EPC format + +**Domestic Certificate**: +An EPC issued for a residential dwelling, as opposed to a commercial one. +_Avoid_: residential EPC, home EPC + +### Properties and addresses + +**Property**: +The Ara domain aggregate representing a single dwelling under modelling: its identity, source data, enrichments, and modelling outputs. +_Avoid_: dwelling, unit, home, asset + +**Properties**: +A first-class collection of Property objects; the unit of bulk operation in services. +_Avoid_: property list, batch (used for SQS chunks) + +**UPRN**: +Unique Property Reference Number — the government-issued permanent identifier for a physical address in the UK. +_Avoid_: property ID, address ID, code + +**Postcode**: +A UK postal code used to group nearby addresses; the primary search key for finding EPC records. +_Avoid_: zip code, postal code + +**User Address**: +A free-text address string provided by a user or imported from a customer dataset, before any normalisation or matching. +_Avoid_: user input, raw address, user_inputed_address + +**Comparable Properties**: +The reference cohort matched to a target Property by both geographic proximity (postcode prefix / UPRN range) and physical similarity (property type, built form, age band); used by the EPC Prediction Service for gap-filling and anomaly detection. +_Avoid_: neighbours, similar properties, peer set + +### Source data + +**Site Notes**: +The full-coverage record produced by a Domna survey of a single Property; carries every EPC field the modelling pipeline requires, and when present supersedes the public EPC for that Property — except when the public EPC is newer. +_Avoid_: energy assessment, site survey, field survey, Domna survey, Hestia survey + +**Landlord Overrides**: +Property data supplied by a landlord that may correct or supplement the public EPC for a single Property; triggers Rebaselining when applied; not applicable when Site Notes are present. +_Avoid_: patches (deprecated), corrections, manual EPC, edits + +### Modelling + +**Effective EPC**: +The EpcPropertyData scored by the modelling pipeline for a single Property, derived from either Site Notes alone or the public EPC with Landlord Overrides applied; carries source-derived physical fields and originally recorded performance values, with model-rebaselined performance held separately in Baseline Performance. +_Avoid_: modelling EPC, working EPC, resolved EPC, derived EPC + +**Rebaselining**: +Recomputing a Property's Baseline Performance via ML when its Effective EPC diverges from the originally lodged public EPC, or when no previous baseline exists. +_Avoid_: re-scoring, re-prediction, performance recomputation + +**Baseline Performance**: +The set of ML-predicted performance values for a single Property — SAP, carbon emissions, heat demand, annual kWh — produced by scoring the Effective EPC against the kWh model; distinct from the originally recorded performance fields on the Effective EPC. +_Avoid_: baseline predictions, predicted baseline, rebaselined values + +**EPC Anomaly Flag**: +A per-field indicator that a Property's value for an EPC field differs significantly from Comparable Properties; advisory only — surfaces in the UI to prompt user review, does not block modelling. +_Avoid_: outlier, mismatch, divergence flag + +### Outputs + +**Scenario**: +A named portfolio-level container for a single modelling run, capturing the goal (e.g. Increasing EPC), budget, exclusions, and housing type; holds many Plans. +_Avoid_: project, batch, run-set + +**Plan**: +The per-Property output of a single modelling run; belongs to one Scenario and carries the Property's full Recommendation list, Optimised Package, and post-retrofit predictions. +_Avoid_: recommendation set, output, result + +**Recommendation**: +A single proposed retrofit measure for a Property, with its cost, SAP impact, kWh savings, carbon savings, and parts list. +_Avoid_: suggestion, option + +**Optimised Package**: +The subset of a Property's Recommendations selected by the Optimiser Service for installation, chosen to satisfy the Scenario's goal subject to budget. +_Avoid_: selected measures, default measures, optimal solution, recommended bundle + +**Measure Type**: +The catalogue classification of a retrofit measure (e.g. `solar_pv`, `loft_insulation`, `ashp`); one or more Recommendations reference the same Measure Type with property-specific cost and impact. +_Avoid_: measure (ambiguous), category + +### Address matching + +**Lexiscore**: +A similarity score in [0, 1] between a User Address and a candidate EPC address; combines token overlap and character-level similarity. +_Avoid_: score, match score, similarity + +**Lexirank**: +Dense rank of candidates sorted by Lexiscore descending; rank 1 = best match. +_Avoid_: rank, position + +**UPRN Candidate**: +An EPC Search Result that is a plausible match for a given User Address, before scoring decides the winner. +_Avoid_: match candidate, result + +**Score Threshold**: +The minimum Lexiscore (currently 0.6) below which no match is returned even if a candidate exists. +_Avoid_: minimum score, cutoff + +**Ambiguous Match**: +A matching outcome where two or more candidates share Lexirank 1, making it impossible to select a unique winner. +_Avoid_: tie, draw, duplicate + +**Best Match**: +The single UPRN Candidate with Lexirank 1 that meets or exceeds the Score Threshold. +_Avoid_: winner, top result + +### API and integration + +**EPC Search Result**: +A lightweight record returned by the government domestic search endpoint — address lines, postcode, UPRN, band, and certificate number, but not full certificate data. +_Avoid_: search row, EPC row, result + +**EPC Property Data**: +The fully mapped domain object produced after fetching and parsing a complete EPC certificate; the schema the modelling pipeline operates against. +_Avoid_: EPC data, certificate data, parsed EPC + +**Old EPC API**: +The retired government API (`epc.opendatacommunities.org`) using HTTP Basic auth; decommissioned 30 May 2026. +_Avoid_: legacy API + +**New EPC API**: +The replacement government API (`api.get-energy-performance-data.communities.gov.uk`) using Bearer Token auth. +_Avoid_: new API, current API + +**Bearer Token**: +The auth credential required by the New EPC API; stored in the `EPC_AUTH_TOKEN` environment variable. +_Avoid_: API key, auth token, secret + +## Relationships + +- A **Property** represents a single physical dwelling for modelling; identified by `(portfolio_id, UPRN)` or `(portfolio_id, landlord_property_id)`. +- A **Property** has zero or more **EPCs** across time, exactly one **Effective EPC**, zero or one set of **Site Notes**, and zero or one set of **Landlord Overrides**. +- An **EPC** belongs to exactly one **Property** and has one **Certificate Number**. +- An **EPC** carries an **EPC Band** and is identifiable by its **Registration Date**; the most recent one is the current. +- A **UPRN** identifies a physical dwelling permanently; it does not change when the property changes owner — but each portfolio gets its own **Property** keyed against it. +- When a **Property** has both **Site Notes** and a public **EPC**, the newer of the two derives the **Effective EPC**. **Landlord Overrides** apply only when the **EPC** is the source — never when **Site Notes** are. +- **Rebaselining** produces **Baseline Performance** for a Property; triggered when the **Effective EPC** diverges from the originally lodged EPC (because of **Site Notes**, **Landlord Overrides**, an expired EPC, or an estimated EPC). +- The **EPC Prediction Service** uses **Comparable Properties** for both gap-filling and producing **EPC Anomaly Flags**. +- A **Scenario** contains many **Plans** (one per Property). A **Plan** carries many **Recommendations**; the **Optimised Package** is the subset selected for installation. +- A **Recommendation** references one **Measure Type** and carries property-specific cost and impact. +- **Address Matching** uses a **User Address** and **Postcode** to find a **UPRN** by scoring **UPRN Candidates** from an EPC search. A **Lexirank** of 1 with no **Ambiguous Match** and a **Lexiscore** ≥ the **Score Threshold** produces a **Best Match**. + +## Example dialogue + +> **Dev:** "A landlord uploads a corrected boiler for one of their properties. What happens?" +> +> **Domain expert:** "That's a **Landlord Override** on the heating fields. Save it against the **Property**, then trigger **Rebaselining** — the **Effective EPC** has changed, so we need fresh **Baseline Performance** before regenerating **Recommendations**." + +> **Dev:** "What if the same Property also has Site Notes?" +> +> **Domain expert:** "**Site Notes** supersede the public **EPC**, so **Landlord Overrides** don't apply. We model from the **Site Notes** version of the **Effective EPC**. If the public **EPC** is newer than the **Site Notes**, that's the one exception — we use the newer one." + +> **Dev:** "After modelling we end up with a list of measures. Which ones get installed?" +> +> **Domain expert:** "The **Optimiser Service** picks the **Optimised Package** — a subset of **Recommendations** that hits the **Scenario** goal within budget. The rest stay in the **Plan** as alternatives the user can swap in." + +> **Dev:** "I'm looking at a property where the EPC says cavity walls but every other house on the street has solid. Is that a bug?" +> +> **Domain expert:** "That's an **EPC Anomaly Flag**. We compute it against the **Comparable Properties** for that postcode. It's advisory — the UI surfaces it and the landlord can apply a **Landlord Override** if it's wrong." + +## Flagged ambiguities + +- **"property"** was historically warned against in favour of "dwelling"; that has been inverted. **Property** is now canonical for the Ara domain aggregate. Legacy code still uses "dwelling" in places — treat as alias. +- **"energy assessment"** in the existing codebase (`energy_assessment_functions`, `energy_assessments_by_uprn`) refers to what is now canonically called **Site Notes**. New code uses **Site Notes**. +- **"patch"** / `patch_epc` in the existing codebase has been merged into **Landlord Overrides**; the original concept is deprecated. +- **"already_installed measures"** in the existing codebase is likely subsumed by **Landlord Overrides** ("we have a heat pump now" → override the heating fields). Final call deferred to implementation. +- **"address"** appears as both the raw **User Address** (free-text) and a structured field on an **EPC Search Result** (normalised lines). Always qualify: "user address" vs "EPC address" or "address line 1". +- **"score"** is used for `AddressMatch.score()` output, the `lexiscore` column, and informally. Prefer **Lexiscore** in domain discussions; reserve "score" for method-level code comments. +- **"user_inputed_address"** in `backend/address2UPRN/main.py` is a misspelling and a synonym for **User Address** — the canonical term. New code should use `user_address`. +- **"EPC"** is overloaded as both the document and the rating band letter. Use **EPC** for the document, **EPC Band** for the letter. +- **"re-scoring"** has two meanings in the codebase — **Rebaselining** (re-predicting baseline performance after an EPC change) and post-optimisation measure re-prediction. Prefer **Rebaselining** for the former; for the latter, the **Optimiser Service** step does its own scoring without a special name. diff --git a/docs/adr/0001-two-source-paths.md b/docs/adr/0001-two-source-paths.md new file mode 100644 index 00000000..82615810 --- /dev/null +++ b/docs/adr/0001-two-source-paths.md @@ -0,0 +1,10 @@ +# Two source paths for a Property, not layered precedence + +For modelling a Property we considered a strict layered precedence stack — `patches > site_notes > energy_assessment > epc > predicted` — with per-field provenance tracking. We rejected that in favour of **two strictly disjoint source paths**: a Property is modelled either from its Site Notes alone, or from the public EPC with Landlord Overrides applied on top. Site Notes are committed to being full-coverage by the domain ([CONTEXT.md](../../CONTEXT.md): _Site Notes_), so once we have them the EPC is irrelevant; conversely, Landlord Overrides are only meaningful when the EPC is the source of physical state. + +The trade-off: layered precedence is more flexible (it tolerates a partial Site Notes survey by falling through to EPC for missing fields), but mixed-source data muddles the audit trail and undermines the "if we surveyed it, trust the survey" promise. The two-path model gives a cleaner derivation rule and an unambiguous source-of-truth per Property, at the cost of treating survey gaps as a survey-quality bug rather than a fallback signal. A Recency Tie-Break covers the one case where both exist: the newer of the two wins. + +## Consequences + +- Reversing this means rewriting `Property.effective_epc` and every service that reads it. Hard to roll back once 12 services depend on the two-path shape. +- Future addition of a third path (e.g. partial-survey) is a real change, not just a config tweak — flag it as an ADR if proposed. diff --git a/docs/adr/0002-property-aggregate-root.md b/docs/adr/0002-property-aggregate-root.md new file mode 100644 index 00000000..1114bc15 --- /dev/null +++ b/docs/adr/0002-property-aggregate-root.md @@ -0,0 +1,14 @@ +# `Property` is the aggregate root, not `EpcPropertyData` + +The Ara modelling pipeline produces nine slices of per-property data (EPC, geospatial, solar, baseline performance, recommendations, optimised package, etc.). We considered making `EpcPropertyData` — the rich RdSAP-21-style EPC schema — the centrepiece, with other data hanging off it. We rejected that and introduced a new **`Property` aggregate root** that holds identity, all source data (EPC, Site Notes, Landlord Overrides), enrichments, and modelling outputs as named fields. Services take `Property` (or `Properties`) and return them with one slice populated. + +Two reasons drove this: +1. **Geospatial, solar, recommendations, and overrides are peers to the EPC**, not properties of it. Putting them on `EpcPropertyData` conflates physical-state schema with modelling-run state. +2. **A typed `ModellingContext` dict-bag (the obvious alternative)** is exactly what the current legacy `Property` class became — 1259 lines of accumulated stuff, hard to read, hard to test, hard to extend. Named fields on a dataclass force the type system to keep us honest. + +The cost is more domain types up front (`Property`, `Properties`, `PropertyIdentity`, `BaselinePerformance`, `OptimisedPackage`, etc.) and the discipline of one service writing one slice. The benefit is that every service has a single job and every test injects fake repos against a small, named structure. + +## Consequences + +- Every service signature accepts or returns `Property` / `Properties`. Refactoring later means touching all of them. +- `EpcPropertyData` stays a pure physical-state schema (defined in [datatypes/epc/domain/epc_property_data.py](../../datatypes/epc/domain/epc_property_data.py)) — no modelling outputs or run state on it. diff --git a/docs/adr/0003-strict-ingestion-modelling-separation.md b/docs/adr/0003-strict-ingestion-modelling-separation.md new file mode 100644 index 00000000..68361ba9 --- /dev/null +++ b/docs/adr/0003-strict-ingestion-modelling-separation.md @@ -0,0 +1,13 @@ +# Strict separation between Ingestion and Modelling + +Data flows one way only: **Ingestion → Repos → Modelling**. Modelling services never make external HTTP calls; Ingestion services never run business logic. If Modelling needs fresh data, it sees a stale record in a repo and returns; the caller (a refresh orchestrator or the FE) decides whether to ingest first. We considered allowing modelling services to call fetchers directly on cache miss — convenient — and rejected it. + +The trade-off is that modelling cannot "self-heal" by going to the gov EPC API when it finds stale data. The benefit is that modelling becomes a deterministic function of repository state: same Property in the repos, same modelling output. That is the property that makes modelling unit-testable against fakes (no DB, no network, no ML lambda), reproducible, and debuggable. It also enables a per-property UI flow where fetched data is shown to the user for review and possible override **before** modelling runs. + +Under the rushed timeline this constraint is more valuable, not less. Mixing fetchers into services is the easy thing to do when shipping fast; once it's done it's hard to extract. + +## Consequences + +- Every modelling service depends only on Repos (and other Services / domain logic). No HTTP libraries in the modelling import graph. +- A `RefreshOrchestrator` is the only thing that calls Ingestion then Modelling in sequence; nothing else may. +- "Modelling is stale, refetch in-line" is a forbidden pattern — surface staleness, do not silently repair it.