docs(adr): EPC Prediction from Comparable Properties (ADR-0029)

Grill-with-docs outcome: deterministic neighbour synthesis (NOT ML) of an EPC-less Property's EpcPropertyData picture, scored via Sap10Calculator.

Six decisions: predict-components-not-SAP; deterministic k-NN; fetch-phase fallback behind a pure EpcPrediction service + ComparableProperties port; hybrid synthesis (cohort-mode categoricals + coherent template structure + overrides); filter-then-relax cohort weighted geo x recency x similarity; dual-use gap-fill + anomaly flags. Frozen postcode-clustered corpus backs leave-one-out validation. CONTEXT.md: new EPC Prediction term, Comparable Properties refined, ML framing corrected.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 13:10:47 +00:00 · 2026-06-13 23:36:19 +00:00 · 2026-06-13 23:36:19 +00:00 · 008a1b2783
commit 008a1b2783
parent 015ab9d17b
2 changed files with 57 additions and 5 deletions
--- a/CONTEXT.md
+++ b/CONTEXT.md
@@ -62,9 +62,13 @@ A structured dataclass (`domain.addresses.user_address.UserAddress`) capturing a
 _Avoid_: user input, raw address, user_inputed_address

 **Comparable Properties**:
-The reference cohort matched to a target Property by both geographic proximity (postcode prefix / UPRN range) and physical similarity (property type, built form, age band); used by the EPC Prediction Service for gap-filling and anomaly detection.
+The reference cohort matched to a target Property, used by **EPC Prediction** for gap-filling and anomaly detection. Selected by a **filter-then-relax ladder**: hard filters on identity (property type, built form) and any known **Landlord Override** (e.g. a known solid-brick wall) while at least *k* remain, widening the geographic scope (postcode → postcode-prefix) or demoting a known field to a weight when sparse. Survivors are weighted for prediction by **geographic proximity × recency × physical similarity** — closer, newer (newer EPCs are higher quality), more-similar comparables count more; pre-SAP10 / very old certs are dropped.
 _Avoid_: neighbours, similar properties, peer set

+**EPC Prediction**:
+Producing a Property's `EpcPropertyData` picture from its **Comparable Properties** when it has no EPC (~30% of UK homes, typically long-tenure). **Deterministic** neighbour synthesis (k-NN-style — *not* ML; no trained model): take the cohort **mode** for the homogeneous categoricals (wall / roof / floor construction + insulation, construction age band), copy a single representative comparable's **structure** wholesale (building parts, per-window dimensions + orientations, floor dimensions) so the picture stays internally consistent for the calculator, then apply **Landlord Overrides** and the known inputs on top. The result is scored through **SAP10 Calculation** like any other **Effective EPC**, so a predicted Property flows through Rebaselining, Bill Derivation, and Modelling unchanged — marked as predicted so the UI can flag it. The same cohort machinery also produces **EPC Anomaly Flags** for Properties that *do* have an EPC. A future learned-weighting refinement is possible but separate, as with the calculator's ML residual head.
+_Avoid_: EpcPredictionService (no "service" suffix — name the operation), ML prediction (it is deterministic), EPC estimation
+
 ### Survey documents

 **Ventilation Audit**:
@@ -88,15 +92,15 @@ _Avoid_: patches (deprecated), corrections, manual EPC, edits
 ### Modelling

 **Effective EPC**:
-The assembled `EpcPropertyData` picture the modelling pipeline scores for a single Property. Assembled from whichever source applies: Site Notes alone; or the public EPC with **Landlord Overrides** applied; or — when the EPC is **old** — its schema re-mapped to current via **Reduced-Field Synthesis** (deterministic, from the cert plus calibrated coefficients — no neighbour data); or — when there is **no EPC** — components **estimated from surrounding properties** (a separate neighbour-prediction ML mechanism, not yet implemented). Carries source-derived physical fields and originally recorded performance values; the performance scored from this picture is held separately in **Baseline Performance**.
+The assembled `EpcPropertyData` picture the modelling pipeline scores for a single Property. Assembled from whichever source applies: Site Notes alone; or the public EPC with **Landlord Overrides** applied; or — when the EPC is **old** — its schema re-mapped to current via **Reduced-Field Synthesis** (deterministic, from the cert plus calibrated coefficients — no neighbour data); or — when there is **no EPC** — components **estimated from surrounding properties** via **EPC Prediction** (deterministic neighbour synthesis). Carries source-derived physical fields and originally recorded performance values; the performance scored from this picture is held separately in **Baseline Performance**.
 _Avoid_: modelling EPC, working EPC, resolved EPC, derived EPC

 **Rebaselining**:
-Establishing a Property's **Effective Performance** (SAP score, EPC Band, CO2, Primary Energy Intensity, space-heating & hot-water kWh) by **assembling the Effective EPC picture and scoring it** through **SAP10 Calculation** (the deterministic `Sap10Calculator`, which superseded the old ML-API rebaseliner; an ML residual head over the calculator is future — ADR-0009/0013). The *assembly* is the substance: apply **Landlord Overrides** (e.g. boiler → ASHP, wall insulated) as a simulation on the `EpcPropertyData`; re-map an old-schema EPC to current via **Reduced-Field Synthesis** (deterministic, cert-only); estimate components from surrounding properties when there is no EPC (neighbour-prediction gap-fill — a separate ML mechanism, not yet implemented). The calculator is the **scoring engine at the tail**, not the whole of Rebaselining — so its call lives inside the Rebaseliner, after assembly. Triggered whenever the assembled picture differs from the lodged record: (a) the EPC was lodged under a methodology the calculator supersedes (`sap_version < 10.2`), (b) Overrides / Site Notes changed the physical state (walls / heating / windows / etc.), or (c) the picture is estimated or remapped rather than a real current EPC. Produces Effective Performance; Lodged Performance is preserved unchanged. The same single scoring also yields the per-end-use kWh that **Bill Derivation** prices — one scoring, two products. kWh is an ML target per ADR-0007 — see [[epc-ml-transform]].
+Establishing a Property's **Effective Performance** (SAP score, EPC Band, CO2, Primary Energy Intensity, space-heating & hot-water kWh) by **assembling the Effective EPC picture and scoring it** through **SAP10 Calculation** (the deterministic `Sap10Calculator`, which superseded the old ML-API rebaseliner; an ML residual head over the calculator is future — ADR-0009/0013). The *assembly* is the substance: apply **Landlord Overrides** (e.g. boiler → ASHP, wall insulated) as a simulation on the `EpcPropertyData`; re-map an old-schema EPC to current via **Reduced-Field Synthesis** (deterministic, cert-only); estimate components from surrounding properties when there is no EPC (**EPC Prediction** — deterministic neighbour gap-fill). The calculator is the **scoring engine at the tail**, not the whole of Rebaselining — so its call lives inside the Rebaseliner, after assembly. Triggered whenever the assembled picture differs from the lodged record: (a) the EPC was lodged under a methodology the calculator supersedes (`sap_version < 10.2`), (b) Overrides / Site Notes changed the physical state (walls / heating / windows / etc.), or (c) the picture is estimated or remapped rather than a real current EPC. Produces Effective Performance; Lodged Performance is preserved unchanged. The same single scoring also yields the per-end-use kWh that **Bill Derivation** prices — one scoring, two products. kWh is an ML target per ADR-0007 — see [[epc-ml-transform]].
 _Avoid_: re-scoring, re-prediction, performance recomputation, refresh (for cache-freshness)

 **Reduced-Field Synthesis**:
-Deterministically translating an **old / reduced-data EPC schema** into the current `EpcPropertyData`, synthesising the *measured* fields the target expects from the source's *reduced or categorical* fields, using only the cert itself plus fixed calibrated coefficients — never neighbour data. Used when re-mapping a **pre-SAP10** cert (e.g. `RdSAP-Schema-20.0.0`) as part of assembling the **Effective EPC**: e.g. a glazing-area *band* + floor area → window m²; bath/shower *room counts* → bath and shower counts. A *best attempt* with no ground truth to validate against (per the **Validation Cohort** rule, a pre-SAP10 cert has no same-spec lodged figure to check), so each synthesis assumption is recorded explicitly in code and tests to keep it debuggable. Distinct from **neighbour-prediction gap-fill** (ML estimation of genuinely-absent fields from surrounding properties — the no-EPC path, a separate mechanism not yet implemented) and from the calculator's own RdSAP Table-5 defaulting in `cert_to_inputs` (which expands `EpcPropertyData` into the full SAP input set downstream).
+Deterministically translating an **old / reduced-data EPC schema** into the current `EpcPropertyData`, synthesising the *measured* fields the target expects from the source's *reduced or categorical* fields, using only the cert itself plus fixed calibrated coefficients — never neighbour data. Used when re-mapping a **pre-SAP10** cert (e.g. `RdSAP-Schema-20.0.0`) as part of assembling the **Effective EPC**: e.g. a glazing-area *band* + floor area → window m²; bath/shower *room counts* → bath and shower counts. A *best attempt* with no ground truth to validate against (per the **Validation Cohort** rule, a pre-SAP10 cert has no same-spec lodged figure to check), so each synthesis assumption is recorded explicitly in code and tests to keep it debuggable. Distinct from **EPC Prediction** (deterministic neighbour estimation of genuinely-absent fields from surrounding properties — the no-EPC path) and from the calculator's own RdSAP Table-5 defaulting in `cert_to_inputs` (which expands `EpcPropertyData` into the full SAP input set downstream).
 _Avoid_: gap-fill (means the neighbour-ML path), reduced-data expansion (overloaded with the calculator's Table-5 step), remapping (the schema-translation part only)

 **Baseline Performance**:
@@ -361,7 +365,7 @@ _Avoid_: API key, auth token, secret
 - A Property's **Baseline Performance** holds two halves: **Lodged Performance** (the gov register's SAP / band / carbon / heat) and **Effective Performance** (what the modelling pipeline scored against). The two are equal unless **Rebaselining** fires.
 - **Rebaselining** produces **Effective Performance** by ML re-prediction across SAP score, CO2 emissions, Primary Energy Intensity, space heating kWh, and hot water kWh, when either (a) the Effective EPC was lodged under a pre-SAP10 schema, or (b) the Effective EPC's physical state diverges from the lodged EPC. **Lodged Performance** is never overwritten.
 - **Bill Derivation** derives **fuel split** and **bills** from kWh values (sourced from the EPC's `renewable_heat_incentive` fields for baseline SAP10 properties, or from ML when Rebaselining fires), reading current **Fuel Rates** and **Carbon Factors** from their respective repos.
-- The **EPC Prediction Service** uses **Comparable Properties** for both gap-filling and producing **EPC Anomaly Flags**.
+- **EPC Prediction** uses **Comparable Properties** for both gap-filling (the no-EPC path) and producing **EPC Anomaly Flags** (the has-EPC path).
 - Triggering the model against N **Scenarios** produces N **Plans** per Property. Each **Plan** holds one **Optimised Package** — its selected **Plan Measures** — plus the Property's post-retrofit figures.
 - A **Scenario Snapshot** is pinned at trigger time per (task, scenario) so mid-run edits to the live Scenario do not affect an in-flight modelling job.
 - A **Recommendation** references one **Measure Type** and carries property-specific cost and impact.
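
The filter-then-relax ladder and the geo × recency × similarity weighting in the revised **Comparable Properties** entry can be sketched roughly as below. This is only an illustration under stated assumptions, not the project's implementation: `Comparable`, `select_cohort`, `weight`, the three-rung ordering, the use of the outward code as the "postcode-prefix", and the exponential decay scales are all hypothetical names and choices that do not appear in the commit.

```python
from dataclasses import dataclass
from math import exp


@dataclass(frozen=True)
class Comparable:
    """Hypothetical, flattened stand-in for one comparable's EpcPropertyData."""
    postcode: str           # full postcode, e.g. "AB1 2CD"
    property_type: str
    built_form: str
    wall_construction: str
    distance_m: float       # distance from the target Property
    cert_age_years: float   # age of the certificate
    similarity: float       # physical similarity to the target, in [0, 1]


def select_cohort(candidates, target_postcode, property_type, built_form,
                  known_wall=None, k=5):
    """Try the strictest rung first and relax until at least k comparables remain."""
    # Identity (property type, built form) is a hard filter on every rung.
    identity = [c for c in candidates
                if c.property_type == property_type and c.built_form == built_form]
    prefix = target_postcode.split()[0]  # assumption: prefix = outward code

    def wall_ok(c):
        return known_wall is None or c.wall_construction == known_wall

    rungs = [
        # Same postcode, known field enforced as a hard filter.
        lambda c: c.postcode == target_postcode and wall_ok(c),
        # Geography widened to the postcode prefix, known field still enforced.
        lambda c: c.postcode.split()[0] == prefix and wall_ok(c),
        # Known field demoted: no longer a filter (assumed folded into similarity).
        lambda c: c.postcode.split()[0] == prefix,
    ]
    survivors = []
    for rung in rungs:
        survivors = [c for c in identity if rung(c)]
        if len(survivors) >= k:
            break
    return survivors  # may hold fewer than k; the no-cohort fallback is out of scope


def weight(c, distance_scale_m=200.0, recency_scale_years=5.0):
    """Geographic proximity x recency x physical similarity (decay scales are assumed)."""
    geo = exp(-c.distance_m / distance_scale_m)
    recency = exp(-c.cert_age_years / recency_scale_years)
    return geo * recency * c.similarity
```

Keeping each rung as a plain predicate makes the relaxation order explicit and easy to test in isolation; a real ladder would presumably also drop pre-SAP10 certificates before any rung runs.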
--- a/docs/adr/0029-epc-prediction-from-comparable-properties.md
+++ b/docs/adr/0029-epc-prediction-from-comparable-properties.md
@ -0,0 +1,48 @@
+# EPC Prediction from Comparable Properties
+
+~30% of UK homes (typically long-tenure) have no EPC. **EPC Prediction** produces a Property's `EpcPropertyData` picture from its **Comparable Properties** so an EPC-less Property flows through the rest of the pipeline (Rebaselining, Bill Derivation, Modelling) unchanged. This records the load-bearing design decisions taken in a grill-with-docs session.
+
+## Status
+
+Accepted (design). Implementation pending.
+
+## Decisions
+
+### 1. Predict the physical picture, score it with our calculator — never a SAP scalar
+
+Prediction outputs a structured `EpcPropertyData` (building parts, windows, floor dimensions, construction + insulation, age band); SAP / CO2 / PEI / per-end-use kWh come from running `Sap10Calculator` on it. This is the same "assemble a picture, score once" mechanic as every other **Effective EPC** path (Landlord Overrides, Reduced-Field Synthesis), so a predicted Property is fully usable downstream (bills, measures, optimiser) — a directly-aggregated SAP scalar (legacy `SearchEpc`) would be a dead-end number. It also makes the component-classification accuracy metrics meaningful and keeps errors traceable to a wrong component rather than an opaque regression.
+
+### 2. Deterministic neighbour synthesis, not ML
+
+No trained model, no learned weights, no fit pipeline: filter a cohort, take categorical modes, copy a representative template, apply overrides. CONTEXT's prior "ML mechanism" framing is corrected — calling it ML invited the wrong architecture (training data, model artifacts, retraining). A future *learned-weighting* refinement is possible but separate, mirroring the calculator's flagged-future ML residual head. The domain class is `EpcPrediction` (no "Service" suffix, per the `BillDerivation` convention).
+
+### 3. Fetch-phase fallback, behind a domain service + a cohort repository port
+
+A pure **`EpcPrediction`** domain service (cohort of comparable `EpcPropertyData` in → predicted `EpcPropertyData` out) sits behind a **`ComparableProperties`** repository port that owns the cohort IO (postcode search → per-cert fetch, cached). It wires into `IngestionOrchestrator._fetch`: when `epc_fetcher.get_by_uprn` returns `None`, fetch the cohort and predict, persisting the picture **marked as predicted** (so the UI flags it and the Validation Cohort excludes it). Baseline and Modelling are untouched. Chosen over a fetcher-decorator (hides a heavy cohort fetch behind `get_by_uprn`) and a dedicated stage (a whole stage for "fill the gap when absent", duplicating IO ingestion already does). The heavy cohort IO stays visible in the no-unit IO phase.
+
+### 4. Hybrid synthesis: cohort-mode categoricals + a coherent structural template
+
+You cannot average a list of windows (counts differ; a mean orientation is meaningless) or building parts. So:
+- **Homogeneous categoricals** (wall / roof / floor construction + insulation, age band) → cohort **mode** (robust to one oddball; drives the classification-rate metrics).
+- **Structure & geometry** (building parts, per-window dimensions + orientations, floor dimensions) → copied wholesale from a **single representative comparable** chosen to be consistent with those modes and closest on geo + size (internally consistent for the calculator; drives the window-area / building-parts / floor-area residual metrics).
+- **Landlord Overrides** and the known inputs are applied **on top**.
+
+Rejected: field-by-field aggregation (legacy — incoherent, may not score sensibly) and single-nearest-neighbour copied wholesale (one atypical neighbour sets the categoricals → weaker classification).
+
+### 5. Cohort selection: filter-then-relax ladder, weighted by geo × recency × similarity
+
+Selection hard-filters on identity (property type, built form) and any **known Landlord Override** (e.g. solid brick — the mixed-street border case) **while ≥ k comparables remain**, widening the geographic scope (postcode → postcode-prefix) or demoting a known to a strong weight when sparse. Survivors are weighted by **geographic proximity** (true coordinates via `GeospatialRepository`, not the legacy house-number proxy) **× recency** (newer EPCs are higher quality) **× physical similarity**; pre-SAP10 / very old certs are dropped. So a known field acts twice: upstream on cohort selection, and again as an override on the final picture.
+
+### 6. Dual use: gap-fill (no EPC) and anomaly flags (has EPC)
+
+The same cohort + comparison machinery produces **EPC Anomaly Flags** for Properties that *do* have an EPC (e.g. "all neighbours are 1930s; this lodges 1950 — correct?") — advisory, surfaced for user review. The no-EPC gap-fill lands first; the always-on anomaly-flag wiring is a follow-on increment.
+
+## Validation
+
+A **frozen postcode-clustered corpus** (a one-off fetch caches N postcodes × all their certs as `EpcPropertyData`) backs an offline, deterministic, repeatable **leave-one-out** harness over thousands of properties: drop a property with an EPC from its own cohort, predict it, compare predicted vs actual. Metrics: **classification rate** on wall / roof / floor construction + insulation and construction age band; **residuals** on SAP, total window area + window count, building-parts count, total floor area. SAP is reported three ways to attribute error — predicted-then-calculated vs (a) lodged SAP (end-to-end), (b) calculator-on-actual-components (isolates prediction error), (c) a direct neighbour-mean SAP baseline (proves predict-then-calculate beats naïve averaging).
+
+## Open (implementation-level)
+
+- **Provenance marker** on the picture (predicted vs real) — exact representation TBD; needed for the UI flag and Validation Cohort exclusion.
+- **No-cohort fallback** when zero comparables survive even after relaxing (low-confidence national property-type + age defaults, or skip-with-flag).
+- **Confidence signal** (cohort size + agreement) carried for the UI and anomaly thresholds.
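
The hybrid synthesis in decision 4 (cohort mode for the categoricals, one representative comparable copied wholesale for structure, overrides on top) might look something like the following sketch. It is written under loud assumptions: `Picture` is a heavily reduced, hypothetical stand-in for `EpcPropertyData`; the field names, the unweighted mode, and the template ranking (mode agreement first, then distance, then floor-area gap) are illustrative choices that the ADR does not specify.

```python
from collections import Counter
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Picture:
    """Hypothetical, heavily reduced stand-in for EpcPropertyData."""
    wall_construction: str
    roof_insulation: str
    age_band: str
    windows: tuple            # ((width_m, height_m, orientation), ...)
    floor_area_m2: float
    distance_m: float = 0.0   # distance to the target; used only to rank templates


CATEGORICALS = ("wall_construction", "roof_insulation", "age_band")


def cohort_mode(cohort, field):
    """Most common value of one categorical field (unweighted, for simplicity)."""
    return Counter(getattr(p, field) for p in cohort).most_common(1)[0][0]


def synthesise(cohort, target_floor_area_m2=None, overrides=None):
    """Cohort-mode categoricals + one coherent structural template + overrides."""
    modes = {f: cohort_mode(cohort, f) for f in CATEGORICALS}

    def rank(p):
        # Prefer comparables that agree with the modes, then the closest, then
        # the nearest in size when the target's floor area is known.
        agreement = sum(getattr(p, f) == modes[f] for f in CATEGORICALS)
        size_gap = (abs(p.floor_area_m2 - target_floor_area_m2)
                    if target_floor_area_m2 is not None else 0.0)
        return (-agreement, p.distance_m, size_gap)

    template = min(cohort, key=rank)          # structure is copied wholesale
    predicted = replace(template, **modes)    # categoricals come from the cohort
    return replace(predicted, **(overrides or {}))  # known inputs win last
```

Because the windows and floor dimensions always come from a single comparable, the result stays internally consistent even when the categoricals are overwritten by the modes or by an override.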