docs(adr): ADR-0030 — SAP-version-aware, component-first EPC Prediction validation

Records the grilling-session decisions amending ADR-0029's validation: - Source cohort keeps all cert vintages (components are agnostic of the SAP methodology that rated them); only the held-out validation TARGET is restricted to SAP 10.2. Amends ADR-0029 decision 5 ("pre-SAP10 dropped"). - Component Accuracy (predicted vs API actual components) is the primary, calculator-independent signal. calc(predicted) vs calc(actual) rejected (circular ground truth, hides calculator error); neighbour-mean-lodged-SAP baseline rejected (mixes SAP versions). calc(predicted) vs API-lodged SAP/carbon/PE kept as a secondary, calculator-floored guard. - Two tiers: committed anonymized fixture (ratcheting CI gate) + bulk-export national battle-test on harness/epc_bulk.py + harness/cohort.py, emitting accuracy + a failure taxonomy, re-baselining the gate floors. CONTEXT.md: Comparable Properties corrected to all-vintage source; new Component Accuracy term. ADR-0029 Validation section marked superseded. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-30 13:10:47 +00:00 · 2026-06-14 08:47:58 +00:00 · 2026-06-14 08:47:58 +00:00 · 35a7c07812
commit 35a7c07812
parent c3d56b00dd
3 changed files with 60 additions and 2 deletions
--- a/CONTEXT.md
+++ b/CONTEXT.md
@ -62,7 +62,7 @@ A structured dataclass (`domain.addresses.user_address.UserAddress`) capturing a
 _Avoid_: user input, raw address, user_inputed_address

 **Comparable Properties**:
-The reference cohort matched to a target Property, used by **EPC Prediction** for gap-filling and anomaly detection. Selected by a **filter-then-relax ladder**: hard filters on identity (property type, built form) and any known **Landlord Override** (e.g. a known solid-brick wall) while at least *k* remain, widening the geographic scope (postcode → postcode-prefix) or demoting a known field to a weight when sparse. Survivors are weighted for prediction by **geographic proximity × recency × physical similarity** — closer, newer (newer EPCs are higher quality), more-similar comparables count more; pre-SAP10 / very old certs are dropped.
+The reference cohort matched to a target Property, used by **EPC Prediction** for gap-filling and anomaly detection. Selected by a **filter-then-relax ladder**: hard filters on identity (property type, built form) and any known **Landlord Override** (e.g. a known solid-brick wall) while at least *k* remain, widening the geographic scope (postcode → postcode-prefix) or demoting a known field to a weight when sparse. Survivors are weighted for prediction by **geographic proximity × recency × physical similarity** — closer, newer (newer EPCs are higher quality), more-similar comparables count more. **All cert vintages are kept** as source evidence — a building's physical components (wall / roof / floor / heating fuel / age) are agnostic of the SAP methodology that rated them, so a pre-SAP10 neighbour is valid evidence; recency is a graduated **weight**, never a hard drop, and matters most for the one component that genuinely goes stale (the heating system, when a boiler is replaced). Only the **validation target** is restricted to SAP 10.2 — see [[validation-cohort]] / [[sap-spec-version]] — because performance can only be checked against a same-spec lodged figure.
 _Avoid_: neighbours, similar properties, peer set

 **EPC Prediction**:
@ -129,6 +129,10 @@ _Avoid_: SAP version (ambiguous with the `sap_version` field on the cert, which

 **Validation Cohort**:
 The subset of corpus certs used to validate **SAP10 Calculation** against **Lodged Performance**, filtered to certs lodged after the calculator's target **SAP Spec Version** rolled out in commercial assessor software — currently `inspection_date ≥ 2025-07-01` (a buffer past 14-03-2025 to allow vendor rollout). Smaller than the full corpus but each cert is comparable under the same spec, so probe MAE is a clean signal of calculator-vs-spec correctness rather than spec-version mixture noise. ADR-0010.
+
+**Component Accuracy**:
+The primary, **calculator-independent** measure of **EPC Prediction** quality: how closely the predicted `EpcPropertyData` *components* (heating fuel + category + controls, hot water, wall / roof / floor construction + insulation, age band, glazing, doors, floor area + geometry) match the actual ones, scored by leave-one-out over a held-out target. Categoricals score as a classification hit-rate, numerics as a residual. Load-bearing principle: **predict the components well and correct SAP / carbon / PE fall out once calculator gaps close** — so Component Accuracy is what prediction is tuned against, while `calc(predicted)` vs API-lodged SAP / carbon / PE is a secondary, **calculator-floored** end-to-end check. The held-out target must be a **SAP 10.2** cert (`sap_version == 10.2`) — the only vintage with full-fidelity lodged components — but the source **Comparable Properties** cohort keeps all vintages. Never validated by `calc(predicted)` vs `calc(actual)`: that cancels (and so hides) calculator error against a circular ground truth.
+_Avoid_: prediction accuracy (vague), SAP accuracy (that is the calculator-floored end-to-end check, not the primary signal)
 _Avoid_: parity cohort, validation set, corpus sample

 **Measure Application**:
--- a/docs/adr/0029-epc-prediction-from-comparable-properties.md
+++ b/docs/adr/0029-epc-prediction-from-comparable-properties.md
@ -31,7 +31,7 @@ Rejected: field-by-field aggregation (legacy — incoherent, may not score sensi

 ### 5. Cohort selection: filter-then-relax ladder, weighted by geo × recency × similarity

-Selection hard-filters on identity (property type, built form) and any **known Landlord Override** (e.g. solid brick — the mixed-street border case) **while ≥ k comparables remain**, widening the geographic scope (postcode → postcode-prefix) or demoting a known to a strong weight when sparse. Survivors are weighted by **geographic proximity** (true coordinates via `GeospatialRepository`, not the legacy house-number proxy) **× recency** (newer EPCs are higher quality) **× physical similarity**; pre-SAP10 / very old certs are dropped. So a known field acts twice: upstream on cohort selection, and again as an override on the final picture.
+Selection hard-filters on identity (property type, built form) and any **known Landlord Override** (e.g. solid brick — the mixed-street border case) **while ≥ k comparables remain**, widening the geographic scope (postcode → postcode-prefix) or demoting a known to a strong weight when sparse. Survivors are weighted by **geographic proximity** (true coordinates via `GeospatialRepository`, not the legacy house-number proxy) **× recency** (newer EPCs are higher quality) **× physical similarity**; ~~pre-SAP10 / very old certs are dropped~~ (amended by [ADR-0030](0030-epc-prediction-validation-is-sap-version-aware-and-component-first.md): all vintages are kept — components are methodology-agnostic — with recency as a graduated weight; only the *validation target* must be SAP 10.2). So a known field acts twice: upstream on cohort selection, and again as an override on the final picture.

 ### 6. Dual use: gap-fill (no EPC) and anomaly flags (has EPC)

@ -39,6 +39,8 @@ The same cohort + comparison machinery produces **EPC Anomaly Flags** for Proper

 ## Validation

+> **Superseded by [ADR-0030](0030-epc-prediction-validation-is-sap-version-aware-and-component-first.md).** The SAP-version mixing in the cohort makes the lodged-SAP comparisons below (esp. the neighbour-mean baseline) invalid; validation is now component-first over SAP-10.2-only targets. The frozen-corpus leave-one-out shape stands.
+
 A **frozen postcode-clustered corpus** (a one-off fetch caches N postcodes × all their certs as `EpcPropertyData`) backs an offline, deterministic, repeatable **leave-one-out** harness over thousands of properties: drop a property with an EPC from its own cohort, predict it, compare predicted vs actual. Metrics: **classification rate** on wall / roof / floor construction + insulation and construction age band; **residuals** on SAP, total window area + window count, building-parts count, total floor area. SAP is reported three ways to attribute error — predicted-then-calculated vs (a) lodged SAP (end-to-end), (b) calculator-on-actual-components (isolates prediction error), (c) a direct neighbour-mean SAP baseline (proves predict-then-calculate beats naïve averaging).

 ## Open (implementation-level)
--- a/docs/adr/0030-epc-prediction-validation-is-sap-version-aware-and-component-first.md
+++ b/docs/adr/0030-epc-prediction-validation-is-sap-version-aware-and-component-first.md
@ -0,0 +1,52 @@
+# EPC Prediction validation is SAP-version-aware and component-first
+
+**Status: Accepted.** Supersedes the **Validation** section of [ADR-0029](0029-epc-prediction-from-comparable-properties.md) and amends its decision 5 (cohort selection). All other ADR-0029 decisions stand (predict a picture and score it; deterministic neighbour synthesis; fetch-phase fallback; hybrid mode + template synthesis; dual gap-fill / anomaly use).
+
+## Why this ADR exists
+
+ADR-0029's validation rested on a three-way SAP comparison, including a **neighbour-mean-of-lodged-SAP baseline** that predict-then-calculate was meant to beat. A second-order problem was invisible when that was written: the gov EPC register spans **multiple SAP spec versions**, and a property's neighbours are mostly *older* certs. In our development corpus only **~16% of certs are SAP 10.2** (`sap_version == 10.2`, schema 21.0.0 / 21.0.1); the rest were lodged under SAP 2012 (RdSAP 9.x). The same dwelling scores a *different* SAP under different spec versions, so:
+
+- **Averaging neighbours' lodged SAP is invalid** — it blends 2012 and 10.2 numbers to estimate a 10.2 target. The ADR-0029 "baseline" was never a fair comparator; on the real corpus it appeared to *beat* prediction purely as an artifact of this mixing. It is removed.
+- **Comparing our calculator's output to a neighbour's lodged figure is only meaningful within a same-spec cohort** — the existing **SAP Spec Version** / **Validation Cohort** rule (ADR-0010) already said this for calculator validation; it applies equally here.
+
+Separately, measuring `calc(predicted)` against the held-out cert's **lodged** SAP conflates two unrelated errors: the *prediction* error and the calculator's own **API-path residual** (~3 SAP on random gov-API certs today — a known, *separate* workstream, since the calculator pins at 1e-4 only on the Elmhurst worksheets). A perfect prediction still scores ~3 off lodged. So lodged-SAP error is the wrong thing to tune prediction against.
+
+## Decisions
+
+### 1. The source cohort keeps all cert vintages; only the validation **target** is SAP 10.2
+
+A building's physical **components** (wall / roof / floor / heating fuel / age band) are agnostic of the SAP methodology that rated them — a pre-SAP10 neighbour is valid *evidence* about the street. Dropping pre-SAP10 certs from the cohort (ADR-0029 decision 5) would discard ~84% of neighbours and gut prediction. So: **all vintages stay in the Comparable Properties cohort**, with recency as a graduated *weight* (never a hard drop), mattering most for the one component that genuinely goes stale — the heating system, when a boiler is replaced. Only the held-out **validation target** is restricted to `sap_version == 10.2`, the only vintage with full-fidelity lodged components to check against. (Target selection uses the API `sap_version` field directly, not the `inspection_date ≥ 2025-07-01` proxy.)
+
+### 2. **Component Accuracy** is the primary, calculator-independent signal
+
+Prediction is tuned against how closely the predicted `EpcPropertyData` *components* match the actual ones — **not** against any SAP score. Scored by leave-one-out over a 10.2 target: categoricals as a classification hit-rate (with `None` = not-applicable excluded from the denominator), numerics as a residual. Coverage spans the SAP-load-bearing components, led by **heating** (the proven dominant lever — ablating heating to the actual cuts the SAP error from ~7 to ~4.5):
+
+- **Heating** — main fuel, main category, main control, water-heating fuel/code, has-cylinder, cylinder insulation, secondary heating
+- **Fabric** — wall / roof / floor construction + insulation, age band (plus a **±1-band** rate, since adjacent bands ≈ same U-value), room-in-roof
+- **Glazing** — modal glazing type; window count + total-area residuals
+- **Counts / geometry** — door count, building-parts count, floor area
+- **Renewables** — PV presence, solar water heating
+
+Load-bearing principle: **predict the components well and correct SAP / carbon / PE fall out once calculator gaps close.** Component Accuracy makes progress even while the calculator moves underneath us.
+
+### 3. `calc(predicted)` vs **API-lodged** SAP / carbon / PE is a secondary, calculator-floored check — and two comparisons are rejected
+
+The end-to-end number — does the predicted picture score like the official 10.2 EPC — is reported but **not** the thing we drive to zero: it is floored by the calculator's API-path residual and improves as *both* prediction and the calculator workstream land. Carbon and PE are *more* version-sensitive than SAP (grid factors shifted sharply between SAP 2012 and 10.2), so they too are compared only on 10.2 targets.
+
+Rejected:
+- **`calc(predicted)` vs `calc(actual)`** — cancels (and so *hides*) calculator error against a *circular* ground truth (our own calculator); a systematically wrong prediction in the calculator's blind spot would score perfectly. Not a validation signal; at most an internal attribution diagnostic.
+- **neighbour-mean-of-lodged-SAP baseline** — mixes SAP versions (see above).
+
+No synthetic SAP-weighted Component Accuracy index: weighting components by SAP impact reintroduces the calculator. The per-component table stays flat; the end-to-end MAE *is* the holistic rollup.
+
+### 4. Two validation tiers, one shared scorer
+
+- **Tier 1 — committed CI gate.** A small, **anonymized**, frozen fixture under `tests/fixtures/` (addresses hashed — the predictor uses address only as a dedup key — `post_town` dropped; postcode + component fields retained; gov data is OGL). A **ratcheting regression gate**: each per-component floor / residual ceiling is the currently-measured value and only ever *tightens* (honouring the repo's no-tolerance-widening ethos); a regression fails the build. End-to-end SAP / carbon / PE thresholds are loose and explicitly **calculator-floored** — gross-regression guards, not targets. Gates when the fixture is present; skips with a message otherwise.
+- **Tier 2 — offline national battle-test.** Built on `harness/epc_bulk.py` (streams the gov **bulk export** via HTTP range requests, filtered by `sap_version`) and `harness/cohort.py` (offline sweep that **captures per-cert raises** instead of aborting). Streams the register and **buckets by postcode** — because bulk is the *whole* register, every postcode is dense, giving national breadth *and* dense cohorts at once. Over tens of thousands of 10.2 targets it emits the Component Accuracy table, the end-to-end MAE, **and a failure taxonomy** (unsupported-schema, mapper raise, calculator raise, no-cohort, no-10.2-target) — the battle-test half. Not committed, not CI-gated; its numbers periodically **re-baseline the Tier-1 floors**.
+
+Both tiers run the *same* `compare_prediction` + calculator logic — one scorer, two harnesses.
+
+## Consequences
+
+- ADR-0029's "Validation" section and its decision-5 clause "pre-SAP10 / very old certs are dropped" are superseded by the above. The CONTEXT terms **Comparable Properties** (all-vintage source) and **Component Accuracy** (new) are updated to match.
+- The Tier-1 fixture is the first committed gov-EPC fixture sized for *statistical* stability rather than worksheet-exact pinning — a deliberate departure from the repo's 1e-4 pin convention, justified by prediction's irreducible error.