mirror of
https://github.com/Hestia-Homes/Model.git
synced 2026-06-30 13:10:47 +00:00
docs(adr): ADR-0030 — SAP-version-aware, component-first EPC Prediction validation
Records the grilling-session decisions amending ADR-0029's validation:
- Source cohort keeps all cert vintages (components are agnostic of the SAP
methodology that rated them); only the held-out validation TARGET is
restricted to SAP 10.2. Amends ADR-0029 decision 5 ("pre-SAP10 dropped").
- Component Accuracy (predicted vs API actual components) is the primary,
calculator-independent signal. calc(predicted) vs calc(actual) rejected
(circular ground truth, hides calculator error); neighbour-mean-lodged-SAP
baseline rejected (mixes SAP versions). calc(predicted) vs API-lodged
SAP/carbon/PE kept as a secondary, calculator-floored guard.
- Two tiers: committed anonymized fixture (ratcheting CI gate) + bulk-export
national battle-test on harness/epc_bulk.py + harness/cohort.py, emitting
accuracy + a failure taxonomy, re-baselining the gate floors.
CONTEXT.md: Comparable Properties corrected to all-vintage source; new
Component Accuracy term. ADR-0029 Validation section marked superseded.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This commit is contained in:
parent
c3d56b00dd
commit
35a7c07812
3 changed files with 60 additions and 2 deletions
|
|
@ -62,7 +62,7 @@ A structured dataclass (`domain.addresses.user_address.UserAddress`) capturing a
|
|||
_Avoid_: user input, raw address, user_inputed_address
|
||||
|
||||
**Comparable Properties**:
|
||||
The reference cohort matched to a target Property, used by **EPC Prediction** for gap-filling and anomaly detection. Selected by a **filter-then-relax ladder**: hard filters on identity (property type, built form) and any known **Landlord Override** (e.g. a known solid-brick wall) while at least *k* remain, widening the geographic scope (postcode → postcode-prefix) or demoting a known field to a weight when sparse. Survivors are weighted for prediction by **geographic proximity × recency × physical similarity** — closer, newer (newer EPCs are higher quality), more-similar comparables count more; pre-SAP10 / very old certs are dropped.
|
||||
The reference cohort matched to a target Property, used by **EPC Prediction** for gap-filling and anomaly detection. Selected by a **filter-then-relax ladder**: hard filters on identity (property type, built form) and any known **Landlord Override** (e.g. a known solid-brick wall) while at least *k* remain, widening the geographic scope (postcode → postcode-prefix) or demoting a known field to a weight when sparse. Survivors are weighted for prediction by **geographic proximity × recency × physical similarity** — closer, newer (newer EPCs are higher quality), more-similar comparables count more. **All cert vintages are kept** as source evidence — a building's physical components (wall / roof / floor / heating fuel / age) are agnostic of the SAP methodology that rated them, so a pre-SAP10 neighbour is valid evidence; recency is a graduated **weight**, never a hard drop, and matters most for the one component that genuinely goes stale (the heating system, when a boiler is replaced). Only the **validation target** is restricted to SAP 10.2 — see [[validation-cohort]] / [[sap-spec-version]] — because performance can only be checked against a same-spec lodged figure.
|
||||
_Avoid_: neighbours, similar properties, peer set
|
||||
|
||||
**EPC Prediction**:
|
||||
|
|
@ -129,6 +129,10 @@ _Avoid_: SAP version (ambiguous with the `sap_version` field on the cert, which
|
|||
|
||||
**Validation Cohort**:
|
||||
The subset of corpus certs used to validate **SAP10 Calculation** against **Lodged Performance**, filtered to certs lodged after the calculator's target **SAP Spec Version** rolled out in commercial assessor software — currently `inspection_date ≥ 2025-07-01` (a buffer past 14-03-2025 to allow vendor rollout). Smaller than the full corpus but each cert is comparable under the same spec, so probe MAE is a clean signal of calculator-vs-spec correctness rather than spec-version mixture noise. ADR-0010.
|
||||
|
||||
**Component Accuracy**:
|
||||
The primary, **calculator-independent** measure of **EPC Prediction** quality: how closely the predicted `EpcPropertyData` *components* (heating fuel + category + controls, hot water, wall / roof / floor construction + insulation, age band, glazing, doors, floor area + geometry) match the actual ones, scored by leave-one-out over a held-out target. Categoricals score as a classification hit-rate, numerics as a residual. Load-bearing principle: **predict the components well and correct SAP / carbon / PE fall out once calculator gaps close** — so Component Accuracy is what prediction is tuned against, while `calc(predicted)` vs API-lodged SAP / carbon / PE is a secondary, **calculator-floored** end-to-end check. The held-out target must be a **SAP 10.2** cert (`sap_version == 10.2`) — the only vintage with full-fidelity lodged components — but the source **Comparable Properties** cohort keeps all vintages. Never validated by `calc(predicted)` vs `calc(actual)`: that cancels (and so hides) calculator error against a circular ground truth.
|
||||
_Avoid_: prediction accuracy (vague), SAP accuracy (that is the calculator-floored end-to-end check, not the primary signal)
|
||||
_Avoid_: parity cohort, validation set, corpus sample
|
||||
|
||||
**Measure Application**:
|
||||
|
|
|
|||
|
|
@ -31,7 +31,7 @@ Rejected: field-by-field aggregation (legacy — incoherent, may not score sensi
|
|||
|
||||
### 5. Cohort selection: filter-then-relax ladder, weighted by geo × recency × similarity
|
||||
|
||||
Selection hard-filters on identity (property type, built form) and any **known Landlord Override** (e.g. solid brick — the mixed-street border case) **while ≥ k comparables remain**, widening the geographic scope (postcode → postcode-prefix) or demoting a known to a strong weight when sparse. Survivors are weighted by **geographic proximity** (true coordinates via `GeospatialRepository`, not the legacy house-number proxy) **× recency** (newer EPCs are higher quality) **× physical similarity**; pre-SAP10 / very old certs are dropped. So a known field acts twice: upstream on cohort selection, and again as an override on the final picture.
|
||||
Selection hard-filters on identity (property type, built form) and any **known Landlord Override** (e.g. solid brick — the mixed-street border case) **while ≥ k comparables remain**, widening the geographic scope (postcode → postcode-prefix) or demoting a known to a strong weight when sparse. Survivors are weighted by **geographic proximity** (true coordinates via `GeospatialRepository`, not the legacy house-number proxy) **× recency** (newer EPCs are higher quality) **× physical similarity**; ~~pre-SAP10 / very old certs are dropped~~ (amended by [ADR-0030](0030-epc-prediction-validation-is-sap-version-aware-and-component-first.md): all vintages are kept — components are methodology-agnostic — with recency as a graduated weight; only the *validation target* must be SAP 10.2). So a known field acts twice: upstream on cohort selection, and again as an override on the final picture.
|
||||
|
||||
### 6. Dual use: gap-fill (no EPC) and anomaly flags (has EPC)
|
||||
|
||||
|
|
@ -39,6 +39,8 @@ The same cohort + comparison machinery produces **EPC Anomaly Flags** for Proper
|
|||
|
||||
## Validation
|
||||
|
||||
> **Superseded by [ADR-0030](0030-epc-prediction-validation-is-sap-version-aware-and-component-first.md).** The SAP-version mixing in the cohort makes the lodged-SAP comparisons below (esp. the neighbour-mean baseline) invalid; validation is now component-first over SAP-10.2-only targets. The frozen-corpus leave-one-out shape stands.
|
||||
|
||||
A **frozen postcode-clustered corpus** (a one-off fetch caches N postcodes × all their certs as `EpcPropertyData`) backs an offline, deterministic, repeatable **leave-one-out** harness over thousands of properties: drop a property with an EPC from its own cohort, predict it, compare predicted vs actual. Metrics: **classification rate** on wall / roof / floor construction + insulation and construction age band; **residuals** on SAP, total window area + window count, building-parts count, total floor area. SAP is reported three ways to attribute error — predicted-then-calculated vs (a) lodged SAP (end-to-end), (b) calculator-on-actual-components (isolates prediction error), (c) a direct neighbour-mean SAP baseline (proves predict-then-calculate beats naïve averaging).
|
||||
|
||||
## Open (implementation-level)
|
||||
|
|
|
|||
|
|
@ -0,0 +1,52 @@
|
|||
# EPC Prediction validation is SAP-version-aware and component-first
|
||||
|
||||
**Status: Accepted.** Supersedes the **Validation** section of [ADR-0029](0029-epc-prediction-from-comparable-properties.md) and amends its decision 5 (cohort selection). All other ADR-0029 decisions stand (predict a picture and score it; deterministic neighbour synthesis; fetch-phase fallback; hybrid mode + template synthesis; dual gap-fill / anomaly use).
|
||||
|
||||
## Why this ADR exists
|
||||
|
||||
ADR-0029's validation rested on a three-way SAP comparison, including a **neighbour-mean-of-lodged-SAP baseline** that predict-then-calculate was meant to beat. A second-order problem was invisible when that was written: the gov EPC register spans **multiple SAP spec versions**, and a property's neighbours are mostly *older* certs. In our development corpus only **~16% of certs are SAP 10.2** (`sap_version == 10.2`, schema 21.0.0 / 21.0.1); the rest were lodged under SAP 2012 (RdSAP 9.x). The same dwelling scores a *different* SAP under different spec versions, so:
|
||||
|
||||
- **Averaging neighbours' lodged SAP is invalid** — it blends 2012 and 10.2 numbers to estimate a 10.2 target. The ADR-0029 "baseline" was never a fair comparator; on the real corpus it appeared to *beat* prediction purely as an artifact of this mixing. It is removed.
|
||||
- **Comparing our calculator's output to a neighbour's lodged figure is only meaningful within a same-spec cohort** — the existing **SAP Spec Version** / **Validation Cohort** rule (ADR-0010) already said this for calculator validation; it applies equally here.
|
||||
|
||||
Separately, measuring `calc(predicted)` against the held-out cert's **lodged** SAP conflates two unrelated errors: the *prediction* error and the calculator's own **API-path residual** (~3 SAP on random gov-API certs today — a known, *separate* workstream, since the calculator pins at 1e-4 only on the Elmhurst worksheets). A perfect prediction still scores ~3 off lodged. So lodged-SAP error is the wrong thing to tune prediction against.
|
||||
|
||||
## Decisions
|
||||
|
||||
### 1. The source cohort keeps all cert vintages; only the validation **target** is SAP 10.2
|
||||
|
||||
A building's physical **components** (wall / roof / floor / heating fuel / age band) are agnostic of the SAP methodology that rated them — a pre-SAP10 neighbour is valid *evidence* about the street. Dropping pre-SAP10 certs from the cohort (ADR-0029 decision 5) would discard ~84% of neighbours and gut prediction. So: **all vintages stay in the Comparable Properties cohort**, with recency as a graduated *weight* (never a hard drop), mattering most for the one component that genuinely goes stale — the heating system, when a boiler is replaced. Only the held-out **validation target** is restricted to `sap_version == 10.2`, the only vintage with full-fidelity lodged components to check against. (Target selection uses the API `sap_version` field directly, not the `inspection_date ≥ 2025-07-01` proxy.)
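
A minimal sketch of the two rules in this decision, assuming a hypothetical `Cert` record and an exponential half-life for the recency weight. The ADR fixes neither the decay shape nor a half-life, so `recency_weight`, `half_life_years` and the sample certs are illustrative assumptions, not the repo's implementation. The point is that an old cert is down-weighted but never removed, while target eligibility is a hard check on `sap_version`.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Cert:
    """Hypothetical minimal cert record; field names are illustrative only."""
    cert_id: str
    sap_version: float
    inspection_date: date


def recency_weight(cert: Cert, today: date, half_life_years: float = 10.0) -> float:
    """Graduated recency weight: halves every `half_life_years`, never reaches zero."""
    age_years = max((today - cert.inspection_date).days / 365.25, 0.0)
    return 0.5 ** (age_years / half_life_years)


def is_validation_target(cert: Cert) -> bool:
    """Only SAP 10.2 certs may be held out as validation targets."""
    return cert.sap_version == 10.2


today = date(2026, 1, 1)
cohort = [
    Cert("a", 9.94, date(2016, 1, 1)),  # pre-SAP10 (illustrative value): kept as source evidence
    Cert("b", 10.2, date(2025, 9, 1)),  # SAP 10.2: source evidence and eligible target
]
weights = {c.cert_id: recency_weight(c, today) for c in cohort}
targets = [c.cert_id for c in cohort if is_validation_target(c)]
```

Both certs receive a positive weight, so both contribute to prediction; only `"b"` can be held out and scored.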
|
||||
|
||||
### 2. **Component Accuracy** is the primary, calculator-independent signal
|
||||
|
||||
Prediction is tuned against how closely the predicted `EpcPropertyData` *components* match the actual ones, **not** against any SAP score. Scored by leave-one-out over a 10.2 target: categoricals as a classification hit-rate (with `None` = not-applicable excluded from the denominator), numerics as a residual. Coverage spans the SAP-load-bearing components, led by **heating** (the proven dominant lever: in an ablation, replacing the predicted heating with the actual heating cuts the SAP error from ~7 to ~4.5):
|
||||
|
||||
- **Heating** — main fuel, main category, main control, water-heating fuel/code, has-cylinder, cylinder insulation, secondary heating
|
||||
- **Fabric** — wall / roof / floor construction + insulation, age band (plus a **±1-band** rate, since adjacent bands ≈ same U-value), room-in-roof
|
||||
- **Glazing** — modal glazing type; window count + total-area residuals
|
||||
- **Counts / geometry** — door count, building-parts count, floor area
|
||||
- **Renewables** — PV presence, solar water heating
|
||||
|
||||
Load-bearing principle: **predict the components well and correct SAP / carbon / PE fall out once calculator gaps close.** Component Accuracy makes progress even while the calculator moves underneath us.
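
The three scoring shapes described above (categorical hit-rate, the ±1 age-band rate, numeric residual) can be sketched as follows. This is an illustration under stated assumptions, not the repo's `compare_prediction`: the function names are hypothetical, "not-applicable" is taken to mean the *actual* value is `None`, and age bands are assumed to be an ordered sequence of labels.

```python
from typing import Optional, Sequence


def hit_rate(predicted: Sequence, actual: Sequence) -> Optional[float]:
    """Classification hit-rate; pairs whose actual value is None are not scored."""
    scored = [(p, a) for p, a in zip(predicted, actual) if a is not None]
    if not scored:
        return None
    return sum(p == a for p, a in scored) / len(scored)


def within_one_band_rate(predicted: Sequence, actual: Sequence,
                         bands: Sequence[str]) -> Optional[float]:
    """Share of scored pairs whose predicted band is at most one step from actual."""
    order = {band: i for i, band in enumerate(bands)}
    scored = [(p, a) for p, a in zip(predicted, actual) if a in order]
    if not scored:
        return None
    hits = sum(1 for p, a in scored if p in order and abs(order[p] - order[a]) <= 1)
    return hits / len(scored)


def mean_abs_residual(predicted: Sequence, actual: Sequence) -> Optional[float]:
    """Mean absolute residual over pairs where both values are present."""
    pairs = [(p, a) for p, a in zip(predicted, actual) if p is not None and a is not None]
    if not pairs:
        return None
    return sum(abs(p - a) for p, a in pairs) / len(pairs)


fuel_rate = hit_rate(["gas", "oil", "gas"], ["gas", "gas", None])
band_rate = within_one_band_rate(["B", "E"], ["C", "A"], bands="ABCDE")
area_mae = mean_abs_residual([80.0, 100.0], [90.0, 95.0])
```

Note that none of these functions calls a calculator, which is what makes the resulting table calculator-independent.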
|
||||
|
||||
### 3. `calc(predicted)` vs **API-lodged** SAP / carbon / PE is a secondary, calculator-floored check — and two comparisons are rejected
|
||||
|
||||
The end-to-end number — does the predicted picture score like the official 10.2 EPC — is reported but **not** the thing we drive to zero: it is floored by the calculator's API-path residual and improves as *both* prediction and the calculator workstream land. Carbon and PE are *more* version-sensitive than SAP (grid factors shifted sharply between SAP 2012 and 10.2), so they too are compared only on 10.2 targets.
|
||||
|
||||
Rejected:
|
||||
- **`calc(predicted)` vs `calc(actual)`** — cancels (and so *hides*) calculator error against a *circular* ground truth (our own calculator); a systematically wrong prediction in the calculator's blind spot would score perfectly. Not a validation signal; at most an internal attribution diagnostic.
|
||||
- **neighbour-mean-of-lodged-SAP baseline** — mixes SAP versions (see above).
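
To make the first rejection concrete, here is a toy illustration. `toy_calc` is a deliberately simplified, hypothetical stand-in with a blind spot (it ignores `secondary_heating`); it is not the real calculator, and the component names and scores are invented for the example. A prediction that is wrong exactly in the blind spot scores a perfect zero on `calc(predicted)` vs `calc(actual)`, while a component-level comparison still catches it.

```python
def toy_calc(components: dict) -> float:
    """Hypothetical stand-in calculator; note it never reads 'secondary_heating'."""
    score = 50.0
    if components["wall_insulation"] == "insulated":
        score += 10.0
    if components["main_fuel"] == "gas":
        score += 5.0
    return score


actual = {"wall_insulation": "insulated", "main_fuel": "gas", "secondary_heating": "none"}
predicted = {"wall_insulation": "insulated", "main_fuel": "gas", "secondary_heating": "electric"}

# Circular comparison: the same calculator on both sides, so its blind spot cancels.
circular_error = abs(toy_calc(predicted) - toy_calc(actual))

# Component-level comparison: no calculator involved, so the miss is visible.
component_hit_rate = sum(predicted[k] == actual[k] for k in actual) / len(actual)
```

Here `circular_error` is zero even though one of three components is wrong, which is why the comparison is at most an attribution diagnostic.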
|
||||
|
||||
No synthetic SAP-weighted Component Accuracy index: weighting components by SAP impact reintroduces the calculator. The per-component table stays flat; the end-to-end MAE *is* the holistic rollup.
|
||||
|
||||
### 4. Two validation tiers, one shared scorer
|
||||
|
||||
- **Tier 1 — committed CI gate.** A small, **anonymized**, frozen fixture under `tests/fixtures/` (addresses hashed — the predictor uses address only as a dedup key — `post_town` dropped; postcode + component fields retained; gov data is OGL). A **ratcheting regression gate**: each per-component floor / residual ceiling is the currently-measured value and only ever *tightens* (honouring the repo's no-tolerance-widening ethos); a regression fails the build. End-to-end SAP / carbon / PE thresholds are loose and explicitly **calculator-floored** — gross-regression guards, not targets. Gates when the fixture is present; skips with a message otherwise.
|
||||
- **Tier 2 — offline national battle-test.** Built on `harness/epc_bulk.py` (streams the gov **bulk export** via HTTP range requests, filtered by `sap_version`) and `harness/cohort.py` (offline sweep that **captures per-cert raises** instead of aborting). Streams the register and **buckets by postcode** — because bulk is the *whole* register, every postcode is dense, giving national breadth *and* dense cohorts at once. Over tens of thousands of 10.2 targets it emits the Component Accuracy table, the end-to-end MAE, **and a failure taxonomy** (unsupported-schema, mapper raise, calculator raise, no-cohort, no-10.2-target) — the battle-test half. Not committed, not CI-gated; its numbers periodically **re-baseline the Tier-1 floors**.
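
A sweep that captures per-cert raises and tallies them into a failure taxonomy might look like the sketch below. The exception classes, `sweep`, and `demo_score` are hypothetical names for illustration; the actual `harness/cohort.py` interface is not shown in this ADR, and only three of the taxonomy buckets listed above are modelled.

```python
from collections import Counter
from typing import Callable, Iterable


class UnsupportedSchema(Exception):
    """Hypothetical: the cert uses a schema the mapper does not handle."""


class MapperError(Exception):
    """Hypothetical: the cert could not be mapped to component data."""


class NoCohort(Exception):
    """Hypothetical: no comparable properties were found for the target."""


def sweep(cert_ids: Iterable[str], score_one: Callable[[str], float]):
    """Score every cert; record failures by category instead of aborting the run."""
    results: dict = {}
    taxonomy: Counter = Counter()
    for cert_id in cert_ids:
        try:
            results[cert_id] = score_one(cert_id)
        except UnsupportedSchema:
            taxonomy["unsupported-schema"] += 1
        except MapperError:
            taxonomy["mapper raise"] += 1
        except NoCohort:
            taxonomy["no-cohort"] += 1
        except Exception as exc:  # anything unexpected is still counted, not fatal
            taxonomy[f"unclassified: {type(exc).__name__}"] += 1
    return results, taxonomy


def demo_score(cert_id: str) -> float:
    """Stand-in scorer that fails for some ids, to exercise the taxonomy."""
    if cert_id.startswith("old-"):
        raise UnsupportedSchema(cert_id)
    if cert_id.startswith("lonely-"):
        raise NoCohort(cert_id)
    return 1.0


results, taxonomy = sweep(["ok-1", "old-2", "lonely-3", "ok-4"], demo_score)
```

Catching the broad `Exception` last keeps one malformed cert from ending a national-scale run, while the named buckets keep the taxonomy readable.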
|
||||
|
||||
Both tiers run the *same* `compare_prediction` + calculator logic — one scorer, two harnesses.
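
The Tier-1 ratchet can be sketched in a few lines. This is an assumption-laden illustration: `check_ratchet`, `tighten`, and the floor values are hypothetical, and only hit-rate floors are handled (a residual ceiling would ratchet in the opposite direction, using `min`).

```python
def check_ratchet(measured: dict, floors: dict) -> list:
    """Return one message per component whose hit-rate fell below its floor."""
    return [
        f"{name}: {measured.get(name, 0.0):.3f} < floor {floor:.3f}"
        for name, floor in floors.items()
        if measured.get(name, 0.0) < floor
    ]


def tighten(measured: dict, floors: dict) -> dict:
    """New floors after a passing run: each floor may rise, never fall."""
    return {name: max(floor, measured.get(name, floor)) for name, floor in floors.items()}


floors = {"wall_construction": 0.80, "roof_construction": 0.70}
measured = {"wall_construction": 0.85, "roof_construction": 0.65}

regressions = check_ratchet(measured, floors)  # non-empty, so the build should fail
new_floors = tighten(measured, floors)         # what a re-baseline would record
```

In this example the roof hit-rate regressed, so the gate reports one failure, and a re-baseline would raise the wall floor while leaving the roof floor where it was.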
|
||||
|
||||
## Consequences
|
||||
|
||||
- ADR-0029's "Validation" section and its decision-5 clause "pre-SAP10 / very old certs are dropped" are superseded by the above. The CONTEXT terms **Comparable Properties** (all-vintage source) and **Component Accuracy** (new) are updated to match.
|
||||
- The Tier-1 fixture is the first committed gov-EPC fixture sized for *statistical* stability rather than worksheet-exact pinning — a deliberate departure from the repo's 1e-4 pin convention, justified by prediction's irreducible error.
|
||||