docs: ADR-0010 retargets calculator to SAP 10.2; rewrite handover

Adds ADR-0010 superseding ADR-0009's spec-version target, PCDB
sequencing, and cert-calibration layer. Captures the conclusions
of a grill-with-docs session:

  1. Active spec target is SAP 10.2 (14-03-2025), not SAP 10.3 — no
     SAP-10.3-lodged certs exist in the corpus to validate against.
  2. table_12_cert_calibration is deleted (not "re-derived at the
     end"). It was pre-March-2025 spec prices fit against a mixture
     distribution of two spec-version regimes, with downstream-
     component bugs absorbed into the fit — not Elmhurst deviation.
  3. Validation Cohort: filter the corpus to inspection_date ≥
     2025-07-01 so every cert in the probe was lodged on SAP 10.2
     (14-03-2025) prices. One spec, one signal.
  4. PCDB integration is promoted from "Session C deferred" to
     prerequisite P4 — dominates residual variance on heat pumps and
     the 78% of gas-boiler certs lodging main_heating_data_source=1.
  5. Trace mode (SapResult.intermediate) and BRE worked-example
     fixtures replace the 7 cert-based golden fixtures, which
     contained compensating errors.
  6. Strict-type EpcPropertyData via codes.csv-derived canonical
     enums (P6) — the in-source motivation lives at
     dimensions.py:74-82 (Khalim's comment, included in this commit).
  7. Worksheet-faithful structure is a sweep-time principle: each
     worksheet module mirrors SAP 10.2 worksheet line numbering.

CONTEXT.md additions:
  - Refined "Calculated SAP10 Performance" and "SAP10 Calculation"
    to reference SAP 10.2 + ADR-0010.
  - New term "SAP Spec Version" — domain-meaningful because the
    same EpcPropertyData yields different sap_score under different
    spec revisions.
  - New term "Validation Cohort" — the version-locked sub-corpus.

HANDOVER_SYSTEMATIC_REVIEW.md is rewritten section-by-section to
reflect ADR-0010: §1 framing, §2 status pointer, new §2.5 with the
six prerequisites P1–P6 in dependency order, §3 diagnosis (cert-cal
was stale prices, not Elmhurst deviation), §4 scope (PCDB IN,
SAP 10.3 stays OUT), §5 approach (worksheet-faithful principle as
§5.5), §7 tension dissolved, §7b findings re-framed, §8 dead-ends
re-classified as conditional, §9 cohort filter, §10 fixture
strategy, §11 trace mode as prerequisite, §12 prereqs-first,
§13 Phase 0/Phase 1 workflow, §14 ADR-0010 reference, §15 final
note.

P2.1 (commit ac1aa56a) already lands the first ADR-0010 slice
(probe swap to spec prices).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Khalim Conn-Kowlessar 2026-05-19 09:54:24 +00:00
parent ac1aa56ab1
commit bb9c5ac017
4 changed files with 551 additions and 209 deletions

@ -98,12 +98,20 @@ The SAP / EPC Band / carbon emissions / heat demand the modelling pipeline actua
 _Avoid_: modelled performance, rebaselined performance (only correct when rebaselining ran), scored values
 **Calculated SAP10 Performance**:
-The SAP score, EPC Band, CO2 emissions, Primary Energy Intensity, space heating kWh, and hot water kWh produced by **SAP10 Calculation** from a Property's EpcPropertyData. Distinct from Effective Performance (ML output) and Lodged Performance (gov register) during the validation phase. Surfaced alongside Effective Performance in the UI; may supersede Effective Performance in a later ADR once parity is confirmed against the cert-reported SAP across ≥1000 sample certs. ADR-0009.
+The SAP score, EPC Band, CO2 emissions, Primary Energy Intensity, space heating kWh, and hot water kWh produced by **SAP10 Calculation** from a Property's EpcPropertyData. Distinct from Effective Performance (ML output) and Lodged Performance (gov register) during the validation phase. Surfaced alongside Effective Performance in the UI; may supersede Effective Performance in a later ADR once parity is confirmed against the cert-reported SAP across ≥1000 sample certs lodged on the calculator's target spec version (see [[sap-spec-version]]). ADR-0009 (as amended by ADR-0010).
 _Avoid_: calculator output, computed performance, worksheet performance, SAP10 output
 **SAP10 Calculation**:
-The process that runs the deterministic SAP 10.3 worksheet over a Property's EpcPropertyData and emits **Calculated SAP10 Performance**. Implemented by the `Sap10Calculator` service class in `domain/sap/`. Reads cert fabric/heating/geometry fields, applies the RdSAP 10 cert→input mapping, executes the 12-month heat balance per SAP 10.3 §§1-14, and returns a `SapResult` carrying the five Calculated SAP10 Performance quantities plus a monthly breakdown and worksheet-line audit trail. Distinct from **Rebaselining**, which is ML-based. ADR-0009.
+The process that runs the deterministic SAP 10.2 (14-03-2025 amendment) worksheet over a Property's EpcPropertyData and emits **Calculated SAP10 Performance**. Implemented by the `Sap10Calculator` service class in `domain/sap/`. Reads cert fabric/heating/geometry fields, applies the RdSAP 10 (10-06-2025) cert→input mapping, executes the 12-month heat balance per SAP 10.2 §§1-14, looks up boiler/heat-pump performance in the **PCDB** when the cert lodges a product index, and returns a `SapResult` carrying the five Calculated SAP10 Performance quantities plus a monthly breakdown and worksheet-line audit trail. Distinct from **Rebaselining**, which is ML-based. ADR-0009 originally targeted SAP 10.3 (13-01-2026); ADR-0010 retargets to SAP 10.2 (14-03-2025) until the cert corpus migrates.
-_Avoid_: SAP calculation (ambiguous with the gov calculator), SAP scoring, calculator run
+_Avoid_: SAP calculation (ambiguous with the gov calculator), SAP scoring, calculator run, SAP 10.3 calculation (active target is 10.2 — see [[sap-spec-version]])
+**SAP Spec Version**:
+The dated revision of the SAP specification that produced a given SAP/PEUI/CO2 value. Domain-meaningful because the same EpcPropertyData yields different `sap_score` under different spec versions — fuel-price tables, CO2 factors, PCDB references, and rating-equation deflators all change between revisions. **Lodged Performance** carries the version current when the cert was lodged (mostly SAP 10.1 / SAP 10.2 pre- and post-14-03-2025 amendment in the corpus). **Calculated SAP10 Performance** is locked to SAP 10.2 (14-03-2025). A 1-to-1 Lodged-vs-Calculated comparison therefore only makes sense within a **Validation Cohort** of certs lodged on the same spec version.
+_Avoid_: SAP version (ambiguous with the `sap_version` field on the cert, which only carries the major version like 10.2 — not the amendment date), spec revision
+**Validation Cohort**:
+The subset of corpus certs used to validate **SAP10 Calculation** against **Lodged Performance**, filtered to certs lodged after the calculator's target **SAP Spec Version** rolled out in commercial assessor software — currently `inspection_date ≥ 2025-07-01` (a buffer past 14-03-2025 to allow vendor rollout). Smaller than the full corpus but each cert is comparable under the same spec, so probe MAE is a clean signal of calculator-vs-spec correctness rather than spec-version mixture noise. ADR-0010.
+_Avoid_: parity cohort, validation set, corpus sample
 **Measure Application**:
 The process that translates an Optimised Package into cert-field changes and produces the "ending state snapshot" EpcPropertyData that Plan Phase persists. Implemented by the `MeasureApplicator` service class in `domain/sap/` (or a sibling package). Each Measure Type's translation rules (e.g. `loft_insulation` → `roof_insulation_thickness_mm = 270mm`, `ashp` → `main_heating_details[0]` replacement) live here. Pure function — does not run SAP10 Calculation itself; the caller chains `MeasureApplicator.apply(epc, package) → Sap10Calculator.calculate(post_epc)`. ADR-0009.

@ -0,0 +1,68 @@
# Retarget Sap10Calculator to SAP 10.2 (14-03-2025); delete cert-calibration; validate on a spec-version-locked cohort
**Status: Accepted.** Supersedes the spec-version target, the PCDB sequencing, and the cert-calibration layer of [ADR-0009](0009-deterministic-sap-calculator.md). Adds strict typing of `EpcPropertyData` (P6) and a worksheet-faithful structural principle for the `domain/sap/worksheet/*` modules — both new concerns ADR-0009 didn't address. All other ADR-0009 decisions stand (Calculated SAP10 Performance as a glossary term, MeasureApplicator/Sap10Calculator chain, MCS boolean default-false, global thermal-bridging y factor, Table 27 living-area fraction, Table 11 secondary-heating allocation, MeasureOverrides rejection).
## Why this ADR exists
ADR-0009 was written before a second-order problem in the validation corpus was visible: the 250k-cert training parquet spans **multiple SAP spec versions** (SAP 10.1 from 2019, SAP 10.2 pre- and post-14-March-2025 amendment), each of which was the active table when its certs were lodged. The prior session's `domain.sap.tables.table_12_cert_calibration` layer was implicitly absorbing this version mixture into a single "best fit" price set ~10–25 % lower than the SAP 10.2 (14-03-2025) spec — closer to the SAP 10.1 era prices. Every spec-correctness slice that touched a downstream component (HW cylinder zero-loss, gas standing charges, Table 12a fractional blending) registered as a regression on the parity probe because the cert-cal layer had been numerically calibrated against the buggy state of every other component.
This ADR resolves four entangled decisions at once. They are coupled — none of them is the right call in isolation.
## Decisions
### 1. Active spec target is **SAP 10.2 (14-03-2025)**, not SAP 10.3
ADR-0009 named SAP 10.3 (13-01-2026) as the calculator's target. No SAP-10.3-lodged certs exist in the corpus; assessor software has not migrated. Targeting SAP 10.3 produces a calculator whose output is verifiable against no cert. The active target is SAP 10.2 (14-03-2025 amendment) — both the document RdSAP 10 (10-06-2025) cross-references for heating-system identification, and the amendment that current assessor software is on.
`packages/domain/src/domain/sap/tables/table_12.py` is re-labelled as SAP 10.2 (14-03-2025). Its CO2 factors are corrected to spec (0.210 kg/kWh mains gas, 0.136 kg/kWh standard electricity — the file currently has SAP 10.3 values 0.214 and 0.086). Prices already match SAP 10.2 (3.64 p mains gas, 16.49 p standard electricity, etc.) — the misleading "+25 % shift from SAP 10.2 to 10.3" comment is removed; the 13.19 p figure is from SAP 10.1, not SAP 10.2.
A future ADR retargets to SAP 10.3 once the cert corpus migrates (expected late 2026 or 2027 once BRE updates RdSAP to reference SAP 10.3).
### 2. `table_12_cert_calibration` is deleted
The cert-calibration table is bug-masking. Its prices are pre-March-2025 SAP values fit against the average cert in a mixed-version corpus, with downstream-component bugs absorbed into the fit. Removing it forces upstream errors to surface where they live, in the component that owns them, instead of being silently compensated for by a price tweak.
This includes the `cert_calibration_e7_codes` extension that routes codes 191–196 (direct-electric) and 691–696 (room heaters) to off-peak rates — Table 12a is explicit that "other direct-acting electric heating" bills 100 % at the high rate on a 7-hour tariff. The S-B14 finding that motivated this hack is in §8 of the handover as a documented dead-end.
`domain.sap.tables.table_12.unit_price_p_per_kwh` becomes the only price API. Parity probes are updated to use it.
### 3. Validation Cohort is filtered to a single spec-version window
Probe MAE against the full 250k-cert corpus measures both calculator correctness *and* the spec-version drift across certs lodged at different times. Without separating them, every spec-correctness improvement is noisy.
The **Validation Cohort** is the subset of corpus certs with `inspection_date ≥ 2025-07-01` — chosen to allow ~4 months past the 14-March-2025 SAP 10.2 amendment for commercial assessor software to roll out the new tables. Filtering to this cohort yields a probe where every cert was lodged on the same spec version the calculator targets. MAE on the Validation Cohort is the only metric used for spec-sweep go/no-go.
This requires re-extracting the training parquet to include `inspection_date` (currently dropped by the ETL — 202 columns, none of them dates). That extraction is a prerequisite slice.
### 4. PCDB integration is promoted from Session C to a prerequisite
ADR-0009 deferred PCDB to Session C and shipped a `NoOpPcdbLookup` stub. The handover's own measurements show PCDB absence accounts for ~19 SAP points of MAE on heat-pump certs (Table 4a fallback SCOP 2.30 vs typical PCDB 2.80–3.50) and most per-cert variance on the 78 % of gas-boiler certs lodging `main_heating_data_source=1` (category-default 0.80 vs typical PCDB 0.88–0.94). The handover's rationale for deferral ("cert-cal absorbs PCDB gaps") collapses with decision (2).
PCDB lookup against `main_heating_index_number` is built before the section-by-section sweep starts. Data source: https://www.ncm-pcdb.org.uk — CSV exports of boilers and heat pumps. Per-product fields needed: seasonal efficiency, secondary efficiency, output kW, flow-temperature curve (heat pumps). The `NoOpPcdbLookup` seam from ADR-0009 grill outcome #1 is the integration point; the stub returns None and the calculator falls back to Table 4a only when the cert lodges no `main_heating_index_number` or the PCDB has no matching record.
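The lookup seam and its fallback rule can be sketched as follows. This is a minimal illustration, not the repository's code: the `seasonal_efficiency` method name, the `DictPcdbLookup` class, and `resolve_efficiency` are hypothetical, and the index number and efficiency values are made up. Only `NoOpPcdbLookup`, the `None` return, and the Table 4a fallback conditions come from this ADR.

```python
from typing import Optional, Protocol


class PcdbLookup(Protocol):
    """Assumed shape of the seam; the real protocol may expose more fields."""

    def seasonal_efficiency(self, index_number: int) -> Optional[float]: ...


class NoOpPcdbLookup:
    """Stub: never finds a product, so callers always fall back."""

    def seasonal_efficiency(self, index_number: int) -> Optional[float]:
        return None


class DictPcdbLookup:
    """Hypothetical in-memory lookup keyed on main_heating_index_number."""

    def __init__(self, records: dict[int, float]) -> None:
        self._records = records

    def seasonal_efficiency(self, index_number: int) -> Optional[float]:
        return self._records.get(index_number)


def resolve_efficiency(
    pcdb: PcdbLookup,
    index_number: Optional[int],
    table_4a_default: float,
) -> float:
    """Use the PCDB value when available, otherwise the Table 4a default."""
    if index_number is None:  # cert lodges no main_heating_index_number
        return table_4a_default
    found = pcdb.seasonal_efficiency(index_number)
    # No matching PCDB record: fall back as well.
    return table_4a_default if found is None else found
```

With this shape, swapping the stub for a real lookup changes no calculator code: only the object passed in differs.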
## Verification infrastructure (also prerequisites)
Three pieces of infrastructure are built before the section sweep so per-section verification has unambiguous signal:
1. **Trace mode populated.** ADR-0009 specced `SapResult.intermediate: dict[str, float]` and it was never built. Every named SAP 10.2 worksheet variable (heat transfer coefficient, mean internal temperature, monthly solar gains, utilisation factor, ECF, etc.) is exposed on `intermediate` so any single cert can be diffed against a hand-computed value, a BRE worked example, or a future Elmhurst reference trace.
2. **BRE worked-example unit tests.** SAP 10.2 spec appendices and RdSAP 10 worked examples are transcribed as fixtures keyed on per-intermediate expected values, not aggregate SAP score. These replace the 7 cert-based golden fixtures (which contained compensating errors per the handover §10). The cert fixtures are retired.
3. **Strict typing of `EpcPropertyData` via canonical domain enums.** Bare `str` and `Union[int, str]` fields (the latter because the gov API gives ints and Site Notes give strings) cascade defensive type-handling into every consumer — the calculator's `dimensions.py:74-82` is Khalim's documented example. The domain holds one canonical enum per field, derived from `datatypes/epc/domain/epc_codes.csv` (union of keys across schema versions, hand-authored). The API mapper and Site Notes mapper each adapt their raw input to the canonical enum. Repo-wide test compatibility is a hard constraint — every consumer of `EpcPropertyData` (calculator, ML pipeline, recommendations, ETL) continues working after the typing pass. Pyright `strict` mode stays clean.
These map to prerequisites P5 (trace mode + BRE fixtures) and P6 (strict typing) in the handover §2.5.
## Worksheet-faithful structure (sweep-time principle)
Each `domain/sap/worksheet/*.py` module must mirror the SAP 10.2 worksheet structure for its section — function names reference their worksheet-line origin (e.g. `heat_transfer_coefficient` aligns with worksheet line (40)), compound calculations split into one function per line where possible, defensive type-handling replaced by typed-enum dispatch. This is not a prerequisite slice; the refactor lands as part of each section's sweep slice, verified by the BRE worked examples (which assert per-intermediate values).
## Consequences
- ADR-0009's "MAE ≤ 1.0 SAP-point on typical subset" success criterion is restated against the Validation Cohort (not the full corpus). The "typical subset" exclusions in ADR-0009 (sap_score ≤ 5, ≥ 100, multi-heating, conservatory, RIR) still apply on top of the cohort filter.
- The training parquet schema bumps when `inspection_date` is added — a non-breaking MINOR addition under [ADR-0008](0008-physics-as-feature.md)'s `Feature Schema Version` discipline.
- The handover document `docs/sap-spec/HANDOVER_SYSTEMATIC_REVIEW.md` is rewritten in lockstep: §3 (diagnosis), §4 (scope), §7 (state-A-vs-state-B framing deleted), §7b (findings re-framed), §10 (fixture strategy), and a new §2.5 listing the six prerequisites.
- Sessions A/B/C from ADR-0009 collapse into a single sequence: prerequisites land, then the section sweep runs against a clean probe with PCDB available.
## Considered alternatives
- **Build versioned Table 12 (pre/post 14-March-2025) keyed on `inspection_date` and validate across the full corpus.** Rejected as more work for no signal benefit during the spec sweep — the filtered cohort gets us to a clean probe faster. A versioned table is still future work if Calculated SAP10 Performance ever needs to reproduce historical cert SAP for products that compare against Lodged Performance directly.
- **Keep cert-cal during the sweep and re-derive at the end** (the handover's prescription). Rejected for the reasons in decision (2): the cert-cal layer corrupts the signal during the sweep, which is precisely when the signal needs to be cleanest.
- **Pay for an Elmhurst license, lock fixtures to its output.** Held in reserve. BRE worked examples are free and spec-derived; an Elmhurst trace would add value as a per-component reference but is not a prerequisite.

@ -24,10 +24,12 @@ The SAP/RdSAP energy assessment splits cleanly into two roles:
 takes the lodged fields and produces SAP score, CO2 emissions,
 primary energy (PEUI), CO2 per m², EI rating, etc.
-**Our calculator is replicating role #2.** Where Elmhurst's
-implementation diverges from spec, we follow Elmhurst, but we don't
-guess at divergence; we localise it via reference traces or
-empirically against the cert corpus.
+**Our calculator is replicating role #2.** Assessor software
+implements the SAP 10.2 spec faithfully; the question of "where does
+Elmhurst diverge from spec?" is no longer the operative one (per
+ADR-0010 + §3 below). Our job is to enumerate every spec
+table / formula / footnote and verify each against the published SAP
+10.2 (14-03-2025) and RdSAP 10 (10-06-2025) PDFs.
 There is no "assessor judgement" knob to tune. Each field on the cert
 has a deterministic interpretation per the spec. Each spec table /
@ -57,109 +59,259 @@ all of them and verify each.
 Tolerance: `|SAP residual| ≤ 1`, `|PE residual| ≤ 10 kWh/m²`. Known
 caveat: some of these are compensating-error matches (e.g. cert
 `7536-3827`'s PE matches but cost is £143 under cert's implied cost
-due to multi-factor offsetting bugs).
+due to multi-factor offsetting bugs). **These fixtures are retired
+per ADR-0010 and §10 below — they lock buggy compensating outputs
+in place and will fight the spec sweep.**
+> **Read this before anything else.** [ADR-0010](../adr/0010-sap10-calculator-spec-target-and-validation.md)
+> supersedes the spec-version target, the PCDB sequencing, and the
+> cert-calibration layer of ADR-0009. This handover document was
+> originally written under the rejected framing; §3, §4, §7, §7b,
+> §10 below have been rewritten in lockstep. §2.5 lists the six
+> prerequisites that land **before** the section-by-section sweep
+> starts.
 ---
-## 3. Why we are pivoting to systematic review
+## 2.5. Prerequisites before the sweep starts
Six blockers, in dependency order. The section sweep does not start
until all six are merged. Together they convert the parity probe
from a noisy mixture-distribution signal into a clean per-section
verification tool.
### P1 — Re-extract the training parquet with `inspection_date`
The 250k-cert parquet has 202 columns; **none of them are dates**.
Without `inspection_date` on each cert we cannot construct the
Validation Cohort (P3). The ETL currently drops the dates; add them
back as a non-breaking MINOR Feature Schema Version bump (per
ADR-0008). `EpcPropertyData.inspection_date` and `.registration_date`
both exist on the domain object and are populated upstream — the
parquet writer just needs to include them.
### P2 — Delete `domain.sap.tables.table_12_cert_calibration`; correct `domain.sap.tables.table_12`
Per ADR-0010 §2 and §1:
- Remove `table_12_cert_calibration.py` and every call site
(`cert_calibration_prices()`, `cert_calibration_e7_codes`, the
`PriceTable` constructor argument that defaults to it).
- Re-label `table_12.py` as `SAP 10.2 Table 12 (14-03-2025 amendment)`.
- Correct CO2 factors: mains gas 0.214 → **0.210**, standard electricity 0.086 → **0.136** (the file currently mixes SAP 10.2 prices with SAP 10.3 CO2 factors).
- Delete the misleading "+25 % shift from SAP 10.2" comment — 13.19 p
is SAP 10.1 (or SAP 10.2 amendment 0), not SAP 10.2 (14-03-2025).
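As a rough sketch of what the single price API could look like once the calibration layer is gone: the prices and CO2 factors below are the SAP 10.2 (14-03-2025) figures quoted in ADR-0010, while the `FuelFactors` type, the string fuel keys, and the `unit_price_p_per_kwh(fuel)` signature are assumptions for illustration. The real `table_12.py` covers many more fuels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FuelFactors:
    unit_price_p_per_kwh: float
    co2_kg_per_kwh: float


# Figures quoted in ADR-0010 for SAP 10.2 Table 12 (14-03-2025 amendment).
# The dictionary keys are hypothetical identifiers, not the repo's fuel codes.
TABLE_12 = {
    "mains_gas": FuelFactors(unit_price_p_per_kwh=3.64, co2_kg_per_kwh=0.210),
    "standard_electricity": FuelFactors(unit_price_p_per_kwh=16.49, co2_kg_per_kwh=0.136),
}


def unit_price_p_per_kwh(fuel: str) -> float:
    """The only price lookup: spec prices, no calibration overlay."""
    return TABLE_12[fuel].unit_price_p_per_kwh


def annual_fuel_cost_gbp(fuel: str, kwh: float) -> float:
    """Unit-rate cost only; standing charges are deliberately left out here."""
    return unit_price_p_per_kwh(fuel) * kwh / 100.0
```

Keeping prices and CO2 factors in one record per fuel makes it harder for the file to mix values from two spec versions again.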
### P3 — Filter the parity probe to the Validation Cohort
`Validation Cohort` is defined in `CONTEXT.md` and ADR-0010 §3:
`inspection_date ≥ 2025-07-01`. Modify
`services/ml_training_data/src/ml_training_data/sap_parity_probe.py`
to apply the filter before sampling. The probe sample size and seed
remain configurable; `sap_score ∈ [5, 99]` remains the typicality
filter on top of the cohort filter.
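A minimal sketch of the filter-then-sample order described above, using only the standard library. The `CertRow` type and both function names are hypothetical; the real probe reads from the parquet and may be structured quite differently.

```python
import random
from dataclasses import dataclass
from datetime import date
from typing import Iterable

COHORT_START = date(2025, 7, 1)  # inspection_date >= 2025-07-01


@dataclass(frozen=True)
class CertRow:
    cert_id: str  # hypothetical identifier column
    inspection_date: date
    sap_score: int


def validation_cohort(rows: Iterable[CertRow]) -> list[CertRow]:
    """Cohort filter first, then the sap_score typicality filter on top."""
    return [
        r
        for r in rows
        if r.inspection_date >= COHORT_START and 5 <= r.sap_score <= 99
    ]


def sample_probe(rows: Iterable[CertRow], size: int, seed: int) -> list[CertRow]:
    """Filter before sampling so every sampled cert is in the cohort."""
    cohort = validation_cohort(rows)
    return random.Random(seed).sample(cohort, min(size, len(cohort)))
```

Sampling after filtering keeps the requested sample size meaningful; sampling first and filtering afterwards would shrink the probe unpredictably.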
### P4 — Implement `PcdbLookup` (replace `NoOpPcdbLookup`)
Per ADR-0010 §4. Download boiler + heat-pump CSVs from
https://www.ncm-pcdb.org.uk. Build a lookup keyed on
`main_heating_index_number`. Surface seasonal efficiency, secondary
efficiency, output kW, and (for HPs) flow-temperature curve. ~half-day
of work per the original handover estimate. The
`Sap10Calculator.__init__(pcdb: Optional[PcdbLookup])` seam from
ADR-0009 grill outcome #1 is the integration point; no calculator-side
changes needed beyond reading `index_number` and routing PCDB-returns
to space-heating / hot-water efficiency lookups instead of Table 4a.
### P5 — Populate `SapResult.intermediate` + transcribe BRE worked examples
Per ADR-0010 "Verification infrastructure":
- Populate every named SAP 10.2 worksheet variable on
`SapResult.intermediate` as sketched in §11. This is mechanical —
thread the values from each worksheet module into the dict.
- Transcribe the BRE worked examples from the SAP 10.2 appendices and
RdSAP 10 worked-example annex into unit tests
(`tests/test_bre_worked_examples.py`) that lock per-intermediate
values, not aggregate SAP. These replace the retired cert fixtures.
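One way a per-intermediate check could be written is sketched below. The helper name `diff_intermediates`, the tolerance, and the variable names and numbers in the usage note are illustrative assumptions; real fixtures would carry values transcribed from the BRE worked examples.

```python
import math
from typing import Optional


def diff_intermediates(
    actual: dict[str, float],
    expected: dict[str, float],
    rel_tol: float = 1e-3,
) -> dict[str, tuple[Optional[float], float]]:
    """Return {name: (actual, expected)} for each expected variable that is
    missing from `actual` or differs by more than the relative tolerance."""
    mismatches: dict[str, tuple[Optional[float], float]] = {}
    for name, want in expected.items():
        got = actual.get(name)
        if got is None or not math.isclose(got, want, rel_tol=rel_tol):
            mismatches[name] = (got, want)
    return mismatches
```

A fixture test would then assert `diff_intermediates(result.intermediate, expected) == {}`, so a failure names the worksheet variable that drifted instead of reporting only a changed SAP score.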
### P6 — Strict-type `EpcPropertyData` via canonical domain enums
The current `EpcPropertyData` and its nested types carry many bare
`str` fields and `Union[int, str]` fields (the latter because the
gov API gives ints and Site Notes give strings). The defensive
type-handling cascades into the calculator (`cert_to_inputs.py`,
`dimensions.py`, etc.) — `dimensions.py:74-82` is Khalim's documented
example: `SapBuildingPart.identifier` carries main-vs-extension
information but is bare `str`, so the dimensions code defensively
iterates instead of dispatching on a typed kind.
The fix:
1. **One canonical enum per field**, union of all keys appearing
across all schema versions in
`datatypes/epc/domain/epc_codes.csv`. Hand-author the 18 enum
classes (`built_form`, `construction_age_band`, `energy_tariff`,
`glazed_area`, `glazed_type`, `heat_loss_corridor`, `main_fuel`,
`mechanical_ventilation`, `property_type`, `tenure`,
`transaction_type`, `ventilation_type`, `water_heating_fuel`,
`cylinder_insulation_thickness`, `energy_efficiency_rating`,
`improvement_description`, `improvement_summary`, `code`) plus
`BuildingPartKind` (Main Dwelling / Extension N). codes.csv is
the reference; a dedup script can optionally verify coverage but
is not a build dependency.
2. **The API mapper** parses raw ints into the canonical enum.
3. **The Site Notes mapper** parses raw strings into the canonical
enum.
4. **The domain object** (`EpcPropertyData` and nested) holds only
the canonical enums — no `Union[int, str]`, no bare `str` for
coded fields.
5. **Every consumer** (calculator, ML pipeline, recommendations,
ETL, scenario builder) reads from the typed fields.
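The two-mappers-one-enum shape can be sketched like this. The member names, integer codes, and Site Notes labels are illustrative placeholders and are not copied from `epc_codes.csv`; the function names are hypothetical too.

```python
from enum import Enum


class BuiltForm(Enum):
    # Placeholder members and codes; the real enum is hand-authored
    # from the union of keys in epc_codes.csv.
    DETACHED = 1
    SEMI_DETACHED = 2
    MID_TERRACE = 4


def built_form_from_api(raw: int) -> BuiltForm:
    """API mapper: raw int to canonical enum (ValueError on unknown codes)."""
    return BuiltForm(raw)


_SITE_NOTES_LABELS = {
    "detached": BuiltForm.DETACHED,
    "semi-detached": BuiltForm.SEMI_DETACHED,
    "mid-terrace": BuiltForm.MID_TERRACE,
}


def built_form_from_site_notes(raw: str) -> BuiltForm:
    """Site Notes mapper: raw string to the same canonical enum."""
    key = raw.strip().lower()
    try:
        return _SITE_NOTES_LABELS[key]
    except KeyError:
        raise ValueError(f"unknown built_form label: {raw!r}") from None
```

Both mappers fail loudly on unknown input, so a new upstream code surfaces at the boundary and consumers never need an `isinstance` check.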
**Constraint**: repo-wide tests must keep passing. The calculator
is one consumer; the ML pipeline, recommendations, and the Site
Notes ingestion path also consume `EpcPropertyData`. Each mapper-
layer change is paired with adapter updates that preserve the
behaviour the existing tests cover.
Pyright `strict` mode must remain clean (CLAUDE.md).
### Expected outcome of P1–P6
After all six land, run the probe against the Validation Cohort. The
expected baseline MAE on the clean probe is much smaller than the
current 4.61 — likely 1.5–2.5 SAP-points based on what we know about
the residual breakdown (heat pumps closed by P4, gas boilers tightened
by P4, price-version noise removed by P2+P3). The remaining residual
is the genuine spec sweep target — and per-section fixes will move
the probe in measurable, distinguishable amounts because there's no
compensating layer to mask them, and there's no defensive type
branching obscuring which input value drove which intermediate.
---
## 3. Why the prior diagnosis was wrong and how we fixed it
 The prior session shipped ten slices (S-B23 → S-B31) by debugging the
 biggest residuals one at a time:
 - **PE MAE dropped substantially: 57.28 → 43.32 (−14)** — real progress
 on the demand-side calculation.
-- **SAP MAE barely moved: 5.34 → 4.61 (−0.73)** — the cost-side is
-bottlenecked by cert-calibration prices that absorb multiple
-structural deviations from spec, making any single slice that fixes
-one component break the calibration for others.
+- **SAP MAE barely moved: 5.34 → 4.61 (−0.73)** — diagnosed at the time
+as "cert-calibration absorbs multiple spec deviations".
-Two failed slice attempts in the prior session exposed the pattern:
+Three slice attempts looked like they "proved" the cert-cal-absorbs-
+deviations diagnosis:
-- **Standing charges**: spec note Table 12 (a) clearly says gas standing
-charge of £92 is added to space + water heating costs for energy
-ratings. Empirically: adding it pushed SAP bias from +0.98 to −2.62.
-Reverted before committing.
-- **Cat=10 room heaters off-peak routing**: Table 12a clearly says
-"Other direct-acting electric heating" bills 100% high rate on
-7-hour tariff. Empirically: switching cat=10 from off-peak to
-standard rate inverted the bias from +5.88 to −6.00 without
-improving MAE. Reverted before committing.
-- **Hot water cylinder loss (uncommitted)**: spec Table 2 footer +
-Table 3 footer clearly say combi boilers using Table 4b efficiency
-have zero storage + primary loss. Empirically: zeroing them dropped
-PE MAE −6.64 (huge improvement) but raised SAP MAE +0.39 AND broke
-3 of 7 golden fixtures. Reverted because no way to know whether to
-follow spec (PE-correct) or Elmhurst (SAP-MAE-correct) without
-reference traces.
+- **Standing charges**: spec Table 12 note (a) requires £92/yr gas
+standing charge on space + water heating. Adding it pushed SAP bias
++0.98 → −2.62. Reverted.
+- **Cat=10 room heaters off-peak routing**: Table 12a says "other
+direct-acting electric heating" bills 100 % high rate on 7-hour
+tariff. Switching cat=10 from off-peak to standard rate inverted
+the bias +5.88 → −6.00. Reverted.
+- **HW cylinder zero-loss for combi** (uncommitted): Table 2 + Table
+3 footers require zero storage + primary loss when efficiency comes
+from Table 4b. Zeroing them dropped PE MAE −6.64 but raised SAP
+MAE +0.39 and broke 3 of 7 golden fixtures. Reverted.
-The pattern: **the cert-calibration prices** (in
-`domain.sap.tables.table_12_cert_calibration`) **were reverse-engineered
-to match Elmhurst's output assuming all our other calculations are
-correct.** When we fix a spec-violation bug in some other component, we
-break the calibration and SAP MAE goes up even though we're more
-spec-correct.
+The prior agent concluded: *cert-calibration absorbs Elmhurst's
+deviations from spec — we can't fix one without re-deriving the
+calibration, so do a full spec sweep first and re-derive cert-cal at
+the end.* This diagnosis is **wrong** and the proposed remedy
+amplifies the problem.
-This means **whack-a-mole on the biggest residual won't converge**. We
-need to systematically verify every component against the spec, then
-re-derive the cert-calibration once at the end.
+### What was actually going on
+The 250k-cert corpus spans multiple SAP spec-version regimes:
+- **Pre-2025-03-14**: certs lodged under SAP 10.1 / SAP 10.2 amendment
+0 prices — mains gas ~3.48 p, standard electricity 13.19 p.
+- **Post-2025-03-14**: certs lodged under SAP 10.2 (14-03-2025) prices
+— mains gas 3.64 p, standard electricity 16.49 p.
The `table_12_cert_calibration` prices (3.48 p / 13.19 p) are **the
older spec's prices**, not Elmhurst deviations from the spec. They
are an empirical "best fit" across a mixture distribution of two
price regimes, with downstream-component bugs (PCDB absence, HW
cylinder loss applied to combi, etc.) absorbed into the fit. The
table looks like compensation for assessor-software quirks because we
were never told which spec each cert was on.
Each "spec-correct fix that worsened MAE" in the failed slices above
was actually correct. The MAE regressed because:
1. The cert-cal prices (pre-March-2025 spec) cancelled with one set
of downstream errors to produce a quasi-stable cost.
2. The spec-correct fix landed → that cancellation broke → the
probe MAE went up.
3. But the spec-correct fix was *right* — what regressed was a
compensating-error equilibrium, not the calculator's truth.
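A toy calculation makes the cancellation concrete. The two electricity prices are the ones quoted above; the 4000 kWh demand and the 25 % overestimate are made-up numbers, chosen because 16.49 / 13.19 is almost exactly 1.25.

```python
STALE_PRICE_P = 13.19  # pre-March-2025 electricity price (p/kWh)
SPEC_PRICE_P = 16.49   # SAP 10.2 (14-03-2025) electricity price (p/kWh)

true_kwh = 4000.0             # hypothetical correct annual demand
buggy_kwh = true_kwh * 1.25   # hypothetical downstream bug: demand 25 % too high


def cost_gbp(price_p: float, kwh: float) -> float:
    return price_p * kwh / 100.0


# What a post-March-2025 cert would have lodged: spec price, correct demand.
lodged = cost_gbp(SPEC_PRICE_P, true_kwh)

# Stale price combined with the demand bug: the two errors nearly cancel.
before_fix = cost_gbp(STALE_PRICE_P, buggy_kwh)

# Fix the demand bug but keep the stale price: the cancellation breaks.
after_fix = cost_gbp(STALE_PRICE_P, true_kwh)

error_before = abs(before_fix - lodged)  # roughly 0.10 GBP
error_after = abs(after_fix - lodged)    # roughly 132 GBP
```

The spec-correct fix makes the cost error about a thousand times larger in this toy case, even though the calculator is closer to the spec afterwards. That is the same signature the three reverted slices showed on the probe.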
The prior session's "re-derive cert-cal at the end" plan would
re-establish a new compensating-error equilibrium across the new bug
set. It does not converge on spec-correctness.
### The fix (per ADR-0010)
1. **Stop fitting against a mixture distribution.** Filter the
validation corpus to a single spec-version window (Validation
Cohort, `inspection_date ≥ 2025-07-01`). Every cert in the cohort
was lodged on SAP 10.2 (14-03-2025) prices.
2. **Delete the cert-calibration layer.** Use spec prices everywhere
(`domain.sap.tables.table_12`). The only price-routing decision
left is Table 12a fractional high-rate blending — a real spec
feature, not a calibration.
3. **Build PCDB**, because it dominates residual variance and the
reason it was deferred (cert-cal-absorbs-PCDB) no longer holds.
4. **Build trace mode and BRE worked-example fixtures**, so
per-section verification works against single-cert intermediates
instead of aggregate corpus MAE.
This is what §2.5 lists as the six prerequisites. Once they land,
the section-by-section spec sweep produces clean, monotonic
improvements.
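The cohort filter in step 1 can be sketched as a single date predicate. This is a minimal illustration: the cutoff date is the one stated above, while `CertRow`, `in_validation_cohort`, and the sample rows are hypothetical names and data, not the probe's actual code.

```python
from dataclasses import dataclass
from datetime import date

# Cutoff from the handover text; everything else here is illustrative.
VALIDATION_COHORT_START = date(2025, 7, 1)


@dataclass(frozen=True)
class CertRow:
    """Hypothetical minimal view of one corpus row."""
    cert_id: str
    inspection_date: date


def in_validation_cohort(row: CertRow) -> bool:
    """True when the cert falls inside the single spec-version window."""
    return row.inspection_date >= VALIDATION_COHORT_START


rows = [
    CertRow("a", date(2025, 3, 13)),
    CertRow("b", date(2025, 6, 30)),
    CertRow("c", date(2025, 7, 1)),
    CertRow("d", date(2026, 1, 15)),
]
cohort_ids = [r.cert_id for r in rows if in_validation_cohort(r)]
print(cohort_ids)
```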
--- ---
## 4. Scope decisions ## 4. Scope decisions (per ADR-0010)
### IN scope ### IN scope
- **RdSAP 10 specification (10-06-2025)** — full document, all sections - **SAP 10.2 (14-03-2025 amendment)** is the active spec target.
(`docs/sap-spec/rdsap-10-specification-2025-06-10.pdf`, 114 pages). `docs/sap-spec/sap-10-2-full-specification-2025-03-14.pdf`, 199 pages.
- **SAP 10.2 full specification (14-03-2025)** — the worksheet, tables, - **RdSAP 10 (10-06-2025)** — the cert→input mapping layer that
appendices that RdSAP 10 references cross-references SAP 10.2. `docs/sap-spec/rdsap-10-specification-2025-06-10.pdf`,
(`docs/sap-spec/sap-10-2-full-specification-2025-03-14.pdf`, 199 pages). 114 pages.
- **PCDB integration.** Moved from "Session C deferred" to **P4
prerequisite** (§2.5). Heat pumps and the 78 % of gas-boiler certs
lodging `main_heating_data_source=1` need PCDB-sourced efficiency
for the calculator to be spec-correct. Data source:
https://www.ncm-pcdb.org.uk; lookup keyed on `main_heating_index_number`;
fields: seasonal efficiency, secondary efficiency, output kW,
flow-temperature curve (HPs).
- **All RdSAP 10 sections in document order.** §1 → §19, plus
Tables 27 / 28 / 29 / 30 / 31. The verification approach in §5 is
unchanged — only the precondition changes: the sweep runs against a
clean probe (Validation Cohort + spec prices + PCDB + trace mode).
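A rough sketch of the PCDB seam described in the list above, assuming a lookup keyed on `main_heating_index_number` with a fallback to a category default. Only the names `PcdbLookup` and `NoOpPcdbLookup` appear in this handover; the `find` method, the `PcdbRecord` shape, `InMemoryPcdbLookup`, and the sample record are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Protocol


@dataclass(frozen=True)
class PcdbRecord:
    """Hypothetical subset of the per-product fields listed above."""
    index_number: int
    seasonal_efficiency: float


class PcdbLookup(Protocol):
    # Assumed method name and signature; the real seam may differ.
    def find(self, index_number: int) -> Optional[PcdbRecord]: ...


class NoOpPcdbLookup:
    """Stub that never finds a product, so callers always fall back."""

    def find(self, index_number: int) -> Optional[PcdbRecord]:
        return None


class InMemoryPcdbLookup:
    """Dict-backed lookup, e.g. loaded from a parsed PCDB export."""

    def __init__(self, records: Iterable[PcdbRecord]) -> None:
        self._by_id = {r.index_number: r for r in records}

    def find(self, index_number: int) -> Optional[PcdbRecord]:
        return self._by_id.get(index_number)


def heating_efficiency(
    index_number: Optional[int], lookup: PcdbLookup, fallback: float
) -> float:
    """Prefer the PCDB efficiency; otherwise use the category default."""
    if index_number is not None:
        record = lookup.find(index_number)
        if record is not None:
            return record.seasonal_efficiency
    return fallback


pcdb = InMemoryPcdbLookup([PcdbRecord(index_number=12345, seasonal_efficiency=0.91)])
with_pcdb = heating_efficiency(12345, pcdb, fallback=0.80)
without_pcdb = heating_efficiency(12345, NoOpPcdbLookup(), fallback=0.80)
```

Keeping the fallback as an explicit argument means the category default stays visible at the call site instead of being buried inside the lookup.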
### OUT of scope (for now) ### OUT of scope
- **Full SAP assessments.** Full-SAP certs lodge a measured/calculated - **Full SAP assessments.** Full-SAP certs lodge measured/calculated
U-value in `walls[i].description` (e.g. U-values in `walls[i].description` (e.g.
"Average thermal transmittance 0.18 W/m²K"). These are a separate "Average thermal transmittance 0.18 W/m²K"). These are a separate
calculation path (BS EN ISO 6946) and a different corpus. **Park them calculation path (BS EN ISO 6946) and a different corpus. Park
until the RdSAP 10 base case matches Elmhurst.** S-B24 / S-B29 until the RdSAP 10 base case parity is reached. S-B24 / S-B29
attempted partial handling; those slices can stay or be reverted at attempted partial handling; those slices can stay or be reverted at
your discretion when you reach §§4-7 of RdSAP and §3 of SAP 10.2. your discretion when you reach §§4-7 of RdSAP and §3 of SAP 10.2.
- **PCDB (Product Characteristics Database).** ADR-0009 deferred this - **SAP 10.3 (13-01-2026).** No SAP-10.3-lodged certs in the corpus,
to Session C. **This is a real future task, not a permanent so it cannot be validated. Calculator targets SAP 10.2 until the
exclusion.** Heat pumps (cat=4) have catastrophic per-cert MAE (19 corpus migrates (expected late 2026 / 2027 once BRE updates RdSAP
SAP points) because we use Table 4a fallback efficiency 2.30 to reference SAP 10.3). Note: `table_12.py` currently mixes SAP
instead of PCDB SCOP (typically 2.80-3.50). Gas boilers with 10.2 prices with SAP 10.3 CO2 factors — corrected as part of P2.
`main_heating_data_source=1` (78% of corpus boiler certs) fall back - **Historical-spec cert reproduction.** Calculating what cert SAP
to a category-default 0.80 vs typical PCDB-listed condensing-boiler *would have been* under SAP 10.1 / pre-March-2025 SAP 10.2 prices is
efficiencies of 0.88-0.94 — that's most of the per-cert SAP residual not the calculator's job. Lodged Performance carries the historical
variance on gas certs. value; Calculated SAP10 Performance is current-spec only. The
Validation Cohort filter operationalises this — older certs are
A `NoOpPcdbLookup` stub seam exists in Session A (per ADR-0009 grill out of the validation loop, not because they're "wrong" but because
outcome #1). The fetch+parse work is non-trivial: they're a different spec's output.
- **Data source**: BRE PCDB at https://www.ncm-pcdb.org.uk — - **Re-deriving cert-cal at the end.** The prior session's plan. The
boilers + heat pumps are downloadable CSVs (thousands of rows cert-calibration layer is deleted in P2, not re-fit.
each).
- **Lookup key**: cert lodges `main_heating_index_number` which is
the PCDB product ID. Match by that.
- **Per-product fields needed**: seasonal efficiency, secondary
efficiency, output kW, flow-temperature curve (for HPs).
- **Effort**: ~half-day for the lookup + tests; ongoing maintenance
when BRE publishes new PCDB revisions.
**Recommended sequencing**: complete the systematic RdSAP spec
sweep first. Once the spec-correct engine is built and cert-cal
re-derived, PCDB integration should drop heat-pump residuals from
19 SAP points to ~1, and tighten the gas-boiler residual variance.
At that point heat pumps (cat=4) and PCDB-listed boilers
(`main_heating_data_source=1`) become accessible.
**Why not now**: the cert-calibration prices currently absorb the
missing PCDB efficiency (HP costs at off-peak rate compensates for
too-low SCOP). Fixing PCDB without re-deriving cert-cal would push
HP certs in the wrong direction. Same lesson as the other reverted
fixes in §7b — fix the spec layer first, the calibration layer
later.
- **SAP 10.3** (13-01-2026). The corpus is SAP 10.2. SAP 10.3 has
identical Table 12 codes (only values shift). Don't update spec
references to 10.3 until the corpus migrates.
--- ---
@@ -208,22 +360,49 @@ For each table, formula, footnote, exception:
3. Are there spec-defined edge cases / footnotes we're missing? 3. Are there spec-defined edge cases / footnotes we're missing?
### 5.4. When a gap is found ### 5.4. When a gap is found
- Write a failing unit test that asserts the spec-correct behaviour. - Write a failing unit test that asserts the spec-correct behaviour
— wherever possible, write it as an assertion on `intermediate`
values rather than on aggregate SAP, using a BRE worked example
if one covers the section.
- Implement the fix. - Implement the fix.
- Run **all 7 golden fixtures** plus the broader probe. Note both - Run `test_bre_worked_examples.py` plus the Validation Cohort
direction and magnitude of change. probe. Note both direction and magnitude of change.
- If the fix is spec-correct but breaks a golden fixture, this is - If a BRE worked-example breaks, the new code is wrong (revert).
evidence that the fixture was a compensating-error case — proceed BRE examples are spec-derived and cannot regress from a
with the spec-correct fix and update the fixture (with a comment spec-correct change.
noting it was a compensating case). - Commit per-slice: one section → one commit. Reference the spec
- Commit per-slice as before: one section → one commit. Reference the section in the commit message.
spec section in the commit message.
### 5.5. Use trace mode when you need it ### 5.5. Sweep-time principle: worksheet-faithful structure
ADR-0009 specifies a `SapResult.intermediate: dict[str, float]` field
that was never populated. Adding this is highly recommended for the Each `worksheet/*.py` module must mirror the SAP 10.2 worksheet
systematic pass — each section's verification benefits from structure for its section. As you verify a section, also restructure
inspecting the intermediate values. See §11 below for a sketch. its module so that:
1. **Each function name references its worksheet-line origin** (e.g.
`heat_transfer_coefficient` aligns with worksheet line (40);
`mean_internal_temperature` aligns with worksheet line (93)).
2. **Compound calculations are split** into one function per
worksheet line where possible — easier to verify against
`intermediate[...]` and against BRE worked-example values.
3. **Defensive type-handling disappears**. Once P6 lands, the input
is a typed enum or numeric — branching on `isinstance(x, int)` is
replaced by enum dispatch.
4. **Domain-typed inputs flow directly**. `SapBuildingPart.kind ==
BuildingPartKind.MAIN_DWELLING` replaces string sniffing of
`identifier`. The dimensions.py "unnecessarily complicated"
pattern Khalim flagged is the canonical example of what *not*
to do.
The principle applies during section-sweep slices. It is **not**
a separate prerequisite — the refactor lands with the verification
slice for the section it touches.
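Points 3 and 4 above might look roughly like the following. `BuildingPartKind.MAIN_DWELLING` and the `identifier` strings are taken from this handover; the simplified `SapBuildingPart` stand-in, the `EXTENSION` member, and both helper functions are assumptions for illustration and do not mirror the real domain model.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable


class BuildingPartKind(Enum):
    MAIN_DWELLING = "main_dwelling"
    EXTENSION = "extension"  # assumed second member


@dataclass(frozen=True)
class SapBuildingPart:
    """Simplified stand-in for the real domain type."""
    kind: BuildingPartKind
    floor_area_m2: float


def kind_from_identifier(identifier: str) -> BuildingPartKind:
    """Mapper-boundary conversion: the only place the raw string is read."""
    text = identifier.strip().lower()
    if text == "main dwelling":
        return BuildingPartKind.MAIN_DWELLING
    if text.startswith("extension"):
        return BuildingPartKind.EXTENSION
    raise ValueError(f"unrecognised building part identifier: {identifier!r}")


def main_dwelling_area(parts: Iterable[SapBuildingPart]) -> float:
    """Worksheet code dispatches on the enum, never on strings."""
    return sum(
        p.floor_area_m2 for p in parts if p.kind is BuildingPartKind.MAIN_DWELLING
    )


parts = [
    SapBuildingPart(kind_from_identifier("Main Dwelling"), 80.0),
    SapBuildingPart(kind_from_identifier("Extension 1"), 12.5),
]
area = main_dwelling_area(parts)
```

The string parsing lives in one function at the mapping boundary, so an unexpected identifier fails loudly there rather than silently falling through inside a worksheet module.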
### 5.6. Use trace mode when you need it
P5 populates `SapResult.intermediate: dict[str, float]` with every
named SAP 10.2 worksheet variable. Each section's verification
benefits from inspecting these values per-cert. See §11 below for
the sketch.
--- ---
@@ -350,59 +529,60 @@ touched and what the current state is.
--- ---
## 7. The cert-calibration vs spec-correctness tension ## 7. The cert-calibration "tension" is dissolved (per ADR-0010)
This is THE central architectural decision you have to make as you This section originally framed cert-calibration vs spec-correctness as
work through the spec. two end-states the calculator had to choose between. That framing is
wrong (see §3 for the actual diagnosis): the cert-cal values are
pre-March-2025 SAP prices, not Elmhurst deviations from SAP 10.2.
Once the corpus is filtered to the Validation Cohort (P3) and the
cert-cal layer is deleted (P2), the false dichotomy disappears.
### Two tables of fuel prices ### What replaces this section
- `domain.sap.tables.table_12.UNIT_PRICE_P_PER_KWH` — SAP 10.2 spec
values (3.64p gas, 16.49p standard elec).
- `domain.sap.tables.table_12_cert_calibration.UNIT_PRICE_P_PER_KWH`
— empirically lower values (3.48p gas, 13.19p elec) that match the
cert assessor software's output.
### Two possible end states for the calculator - **One price table.** `domain.sap.tables.table_12` (re-labelled SAP
10.2 14-03-2025 amendment, CO2 factors corrected per P2).
- **One validation cohort.** `inspection_date ≥ 2025-07-01`, every
cert lodged on the calculator's target spec version.
- **One verification mechanism.** Trace-mode intermediates + BRE
worked-example unit tests for per-section verification; Validation
Cohort probe MAE for aggregate go/no-go.
**End state A — Spec-perfect.** Use spec prices, apply every spec rule Cert-software deviations from spec, if they exist at all, are
(standing charges, Table 12a fractions, combi zero-loss, etc.). The expected to be small and localised. They surface as residual after
calculator output is then what a *correct SAP 10.2 implementation* the spec sweep completes against a clean probe — and at that point
would produce. SAP MAE against the corpus will likely worsen because the question is whether to chase them at all (Elmhurst-deviation
Elmhurst doesn't perfectly implement spec. fixes have low domain value compared to spec-correctness, given the
calculator's product use case is scoring counterfactuals for the
**End state B — Elmhurst-perfect.** Use cert-cal prices and reproduce MeasureApplicator chain, not reproducing historical certs).
Elmhurst's deviations exactly. The calculator output matches cert
SAP scores. The calculator becomes a "reverse-engineered Elmhurst
clone" rather than a SAP 10.2 implementation.
### The pragmatic recommendation
**Aim for state A but track state B as the parity probe.** Concretely:
1. Verify each spec section in isolation; fix spec violations
regardless of MAE impact, but commit each fix WITH a measured
probe delta in the commit message.
2. After the spec sweep is complete, the calculator's output is
spec-correct. The corpus residual at that point is Elmhurst's
deviation from spec.
3. THEN re-derive the cert-calibration prices to match Elmhurst's
deviation pattern. The calibration becomes a thin Elmhurst-
compatibility layer on top of a spec-correct engine.
This avoids the whack-a-mole problem because state A is unambiguous:
each fix is either spec-correct or not. State B is iterative on top
of state A, not entangled with it.
--- ---
## 7b. Outstanding findings to pick up during the systematic pass ## 7b. Outstanding findings to pick up during the systematic pass
The prior session identified several spec-correct fixes that were The prior session identified several spec-correct fixes that were
**reverted because they made SAP MAE worse against the corpus, but the reverted because they made SAP MAE worse against the **full corpus**.
spec basis is unambiguous and the fixes WILL be the right answer once The empirical signal that "reverted" them was version-mixture noise
the cert-calibration is re-derived against a clean engine.** Treat (see §3) plus compensating-error breakage in the 7 retired golden
these as TODOs the systematic pass should encounter when it reaches fixtures. Each fix below is **expected to land cleanly** once the
the relevant section. They're listed here so the work isn't lost. six prerequisites in §2.5 are done, because:
- The Validation Cohort (P3) is on a single spec version — the price
mismatch that drove the bias regression on standing charges and
cat=10 routing disappears.
- The cert-cal layer is gone (P2) — no calibration to "break".
- PCDB is integrated (P4) — the heat-pump and gas-boiler residuals
that dominated per-cert MAE collapse before any of these findings
even matter.
- The fixtures are now BRE worked examples (P5 + §10) — they cannot
be broken by spec-correct changes because they are themselves
derived from the spec.
Treat each finding as a section-sweep TODO. The empirical impacts
below were measured against the **dirty probe** (full corpus + cert-cal
+ no PCDB) and are **not predictive** of behaviour on the clean probe.
Re-measure each fix against the Validation Cohort after prerequisites
land.
### Finding 1 — HW cylinder zero-loss rule for combi boilers ### Finding 1 — HW cylinder zero-loss rule for combi boilers
**Status**: spec-correct fix exists in working-tree-only form **Status**: spec-correct fix exists in working-tree-only form
@@ -587,32 +767,45 @@ direction.
## 8. Don't repeat — known dead-ends ## 8. Don't repeat — known dead-ends
> **Re-read after §3 + §7b.** Three entries below were classified as
> "dead-ends because cert-cal absorbs" — that diagnosis is wrong.
> They are spec-correct fixes that were measured under a noisy probe.
> Now flagged as **conditional dead-ends**: dead only if you try them
> before P1–P5 land. After prerequisites: they are expected
> improvements, not dead-ends. See ADR-0010.
- ❌ **Switching "NI" wall thickness to None alone** (S-B5 in history) — - ❌ **Switching "NI" wall thickness to None alone** (S-B5 in history) —
over-corrected because it routed to the (Unfilled cavity, 50mm) row over-corrected because it routed to the (Unfilled cavity, 50mm) row
instead of the dedicated Filled cavity row. The right fix landed in instead of the dedicated Filled cavity row. The right fix landed in
S-B23 with a `WALL_INSULATION_FILLED_CAVITY` dispatcher. S-B23 with a `WALL_INSULATION_FILLED_CAVITY` dispatcher.
- ❌ **Aggressive efficiency rescue for missing `sap_main_heating_code`** - ❌ **Aggressive efficiency rescue for missing `sap_main_heating_code`**
(S-B5) — over-corrected. The category fallback (cat=4 → 2.30) is (S-B5) — over-corrected. The category fallback (cat=4 → 2.30) is
intentionally conservative; PCDB is needed for real efficiency. intentionally conservative; PCDB (P4 prerequisite) supplies the
- ❌ **Using SAP 10.2 spec prices for parity validation** — cert assessor real efficiency.
uses lower prices despite reporting `sap_version=10.2` (S-B9, S-B10). - ⚠️ **Using SAP 10.2 spec prices for parity validation** — under
Use `cert_calibration_prices()` for the probe. the dirty probe, cert-cal prices fit better. **Inverts under the
clean probe (P2 + P3): SAP 10.2 spec prices are correct because the
Validation Cohort is on the 14-03-2025 amendment.** Listed here
only as a warning if you start the sweep before prerequisites land.
- ❌ **Always applying 10% secondary heating** — must be conditional on - ❌ **Always applying 10% secondary heating** — must be conditional on
cert lodging or main system being electric storage (S-B20). See cert lodging or main system being electric storage (S-B20). See
spec Appendix A.4. spec Appendix A.4.
- ❌ **Respecting `main_heating_fraction` for secondary allocation** - ❌ **Respecting `main_heating_fraction` for secondary allocation**
(failed S-B30) — the field is the multi-main allocation (system 1 vs (failed S-B30) — the field is the multi-main allocation (system 1 vs
system 2), not main-vs-secondary. SAP MAE 4.69 → 4.85 (worse). system 2), not main-vs-secondary. SAP MAE 4.69 → 4.85 (worse).
- ❌ **Switching cat=10 room heaters off off-peak** (failed S-B32) — - ⚠️ **Switching cat=10 room heaters off off-peak** (failed S-B32) —
spec-correct per Table 12a but inverts bias direction. Cert-cal spec-correct per Table 12a. The bias inversion under the dirty
calibration absorbs the deviation. probe was driven by cert-cal compensating; on the clean probe this
- ❌ **Adding gas standing charges** (4-mode probe, unimplemented) — is just spec-correct. Land as part of the §12 spec sweep after
spec-correct per Table 12 note (a) but pushes SAP bias from +0.98 prerequisites.
to 2.62. Cert-cal calibration absorbs. - ⚠️ **Adding gas standing charges** (4-mode probe, unimplemented) —
- ❌ **Zeroing storage + primary loss for combi boilers** (uncommitted spec-correct per Table 12 note (a). Same logic: bias drift under
S-B32) — spec-correct per Table 2 + Table 3 footers and drops PE dirty probe is version-mixture + missing-PCDB noise, not Elmhurst
MAE 6.64 (huge win) BUT raises SAP MAE +0.39 and breaks 3 golden deviation. Land as part of §12 spec sweep.
fixtures. Decision deferred to systematic pass. - ⚠️ **Zeroing storage + primary loss for combi boilers** (uncommitted
S-B32) — spec-correct per Table 2 + Table 3 footers. SAP MAE
regression was driven by the now-retired golden fixtures (§10) and
cert-cal absorption. Land as part of §4 / Appendix J sweep.
--- ---
@@ -620,10 +813,15 @@ direction.
### Sample ### Sample
`data/ml_training/runs/2025_2026_n250000_v18a/data.parquet` is the `data/ml_training/runs/2025_2026_n250000_v18a/data.parquet` is the
250k-cert parquet. The probe filters to `sap_score ∈ [5, 99]` and 250k-cert parquet. **After P1 lands** the parquet carries
`inspection_date`; the probe then filters to the **Validation Cohort**
(`inspection_date ≥ 2025-07-01`) plus `sap_score ∈ [5, 99]` and
samples 300 at seed=7 by default. Filtering rationale: samples 300 at seed=7 by default. Filtering rationale:
- ≤ 5 is heritage/anomaly stock (sub-3% of corpus) - ≤ 5 is heritage/anomaly stock (sub-3 % of corpus)
- ≥ 99 is full-SAP new-builds the parquet excludes anyway - ≥ 99 is full-SAP new-builds the parquet excludes anyway
- `inspection_date ≥ 2025-07-01` ensures every cert was lodged on
SAP 10.2 (14-03-2025 amendment) — see [CONTEXT.md](../../CONTEXT.md)
/ "Validation Cohort" and ADR-0010 §3.
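The score filter and seeded sample could be sketched as below. This assumes the interval `[5, 99]` is inclusive at both ends and that the sample is drawn with Python's `random.Random`; the real probe may filter and sample differently, and `select_probe_sample` plus the synthetic scores are hypothetical.

```python
import random


def select_probe_sample(
    scores_by_cert: dict[str, int], n: int = 300, seed: int = 7
) -> list[str]:
    """Filter to sap_score in [5, 99] (assumed inclusive), then sample n certs.

    Eligible ids are sorted first so the draw depends only on the seed,
    not on dict insertion order.
    """
    eligible = sorted(c for c, s in scores_by_cert.items() if 5 <= s <= 99)
    if len(eligible) <= n:
        return eligible
    return random.Random(seed).sample(eligible, n)


# Synthetic corpus: cert-0 .. cert-119 with score equal to the index.
scores = {f"cert-{i}": i for i in range(120)}
sample_a = select_probe_sample(scores, n=10)
sample_b = select_probe_sample(scores, n=10)
```

Using a local `random.Random(seed)` rather than the module-level generator keeps the sample reproducible even when other code also draws random numbers.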
### Run the probe ### Run the probe
```bash ```bash
@@ -654,38 +852,58 @@ main(['300','7'])
--- ---
## 10. The 7 golden fixtures ## 10. Fixtures: retire the 7 cert-based golden fixtures, replace with BRE worked examples (per ADR-0010 + P5)
The 7 cert-based fixtures at
`packages/domain/src/domain/sap/rdsap/tests/test_golden_fixtures.py` `packages/domain/src/domain/sap/rdsap/tests/test_golden_fixtures.py`
locks 7 corpus certs as regression anchors: were locked in against the current calculator state — *with* cert-cal,
*without* PCDB, *with* HW cylinder loss always applied, *with* the
lighting heuristic, etc. They are documented in §3 / the prior
handover as containing compensating errors. Once the prerequisites
land, every spec-correct fix breaks at least one of them. They will
fight the spec sweep.
| Cert | TFA | Cat | Notes | ### Replacement strategy
|---|---|---|---|
| `0240-0200-5706-2365-8010` | 202 | 2 | Detached, age J, oil boiler, Table 4b code 130 |
| `0300-2747-7640-2526-2135` | 526 | 2 | Semi-detached, age D, gas PCDB |
| `0390-2954-3640-2196-4175` | 360 | 2 | Detached, age F, oil PCDB |
| `6035-7729-2309-0879-2296` | 128 | 2 | Mid-terrace, age A, gas combi code 104 |
| `7536-3827-0600-0600-0276` | 152 | 2 | Detached + extensions, age D, gas PCDB. Cleanest PE match (0.29 kWh/m²) |
| `8135-1728-8500-0511-3296` | 102 | 2 | Semi-detached, age C, gas PCDB |
| `9390-2722-3520-2105-8715` | 75 | 6 | Mid-floor flat, age D, heat network code 301 |
Tolerance: `|SAP residual| ≤ 1`, `|PE residual| ≤ 10`. **Tighten as **Primary regression suite: BRE worked-example fixtures.**
the spec sweep progresses.**
The cert JSONs are stored under `fixtures/golden/<cert>.json` Transcribe the worked examples from:
frozen at extraction time so the test is reproducible without - SAP 10.2 spec appendices (especially Appendix R — reference values
bulk-zip access. The probe extraction script for new fixtures is and the worked example dwelling).
inlined in the test history (see commit `f4a8d2a0`). - RdSAP 10 (10-06-2025) worked-example annex.
**Important caveat**: some of these 7 are compensating-error matches Each worked example becomes a unit test that locks **per-intermediate
(see §3). When a spec-correct slice breaks one, the fixture is expected values** (HLP, HTC, mean internal temperature monthly, MIT,
probably the compensating case — investigate before reverting. ECF, SAP score) rather than the aggregate SAP score alone. Because
they are spec-derived, no spec-correct change can break them — any
break is an implementation bug, unambiguously.
These tests live at
`packages/domain/src/domain/sap/tests/test_bre_worked_examples.py`
(new module — separate from the cert-based fixtures module).
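A worked-example test that locks per-intermediate values might compare the populated `intermediate` mapping against transcribed expectations, as in this sketch. The helper `mismatched_lines`, the tolerance, and every number below are placeholders; the key names reuse the function names mentioned in §5.5, and none of the values are taken from a BRE worked example.

```python
import math


def mismatched_lines(
    actual: dict[str, float], expected: dict[str, float], rel_tol: float = 1e-3
) -> list[str]:
    """Return the names of expected intermediates that are missing or off."""
    bad = []
    for name, want in expected.items():
        got = actual.get(name)
        if got is None or not math.isclose(got, want, rel_tol=rel_tol):
            bad.append(name)
    return bad


# Placeholder values only; real tests would transcribe the spec's figures.
expected = {
    "heat_transfer_coefficient": 250.0,
    "mean_internal_temperature": 18.5,
}
actual = {
    "heat_transfer_coefficient": 250.1,  # within tolerance
    "mean_internal_temperature": 17.9,   # outside tolerance
}
failures = mismatched_lines(actual, expected)
print(failures)
```

Reporting the failing names, rather than a single pass/fail on the SAP score, points directly at the worksheet line that diverged.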
**Cert-based fixtures retired.**
The current `test_golden_fixtures.py` is either deleted or repurposed
as a *very loose* smoke-test integration suite (e.g. `|SAP residual|
≤ 5`) that catches catastrophic regressions only. The 7 cert JSONs
under `fixtures/golden/<cert>.json` can be kept on disk as reference
data, but they no longer drive go/no-go decisions in the sweep.
**Optional future addition.**
If/when a current Elmhurst (or Stroma / Quidos / NHER) license is
available, run a handful of representative corpus certs through it
and lock those outputs as a second-tier regression suite — Elmhurst-
parity fixtures alongside spec-parity fixtures. Not a prerequisite.
--- ---
## 11. Trace mode (recommended infrastructure) ## 11. Trace mode (prerequisite P5 — implementation sketch)
ADR-0009 proposed: This section was originally labelled "recommended"; it is now
**prerequisite P5** per ADR-0010. The sweep does not start until
`intermediate` is populated everywhere. ADR-0009 proposed:
```python ```python
@dataclass(frozen=True) @dataclass(frozen=True)
class SapResult: class SapResult:
@@ -765,14 +983,39 @@ This single session should produce zero behaviour changes if §1-3 are
correctly implemented, but expect to find at least one issue in §3 correctly implemented, but expect to find at least one issue in §3
geometry (per the reviewer's "biggest SAP error sources" list). geometry (per the reviewer's "biggest SAP error sources" list).
Run the golden fixtures + probe at the end of each session; expect no **Important:** Session 1 only starts after all six prerequisites in
movement until you start hitting actual gaps. §2.5 have landed and the Validation Cohort probe baseline has been
captured. Until then, running per-section verification produces noisy
signal.
Run the BRE worked-example fixtures (P5) + Validation Cohort probe
(P3) at the end of each session; expect no movement until you start
hitting actual gaps.
--- ---
## 13. Workflow recap ## 13. Workflow recap
For each section, in order: **Phase 0 — Prerequisites (§2.5).** Land P1–P6 first, in dependency
order:
| | Slice | Depends on |
|---|---|---|
| P1 | Re-extract parquet with `inspection_date` | — |
| P2 | Delete cert-cal; correct `table_12.py` CO2 factors | — |
| P3 | Filter parity probe to Validation Cohort | P1 |
| P4 | Implement `PcdbLookup` | — (P2 helpful) |
| P5 | Populate `SapResult.intermediate` + transcribe BRE worked examples | — |
| P6 | Strict-type `EpcPropertyData` via codes.csv-derived enums | — |
P1, P2, P4, P5, P6 can run in parallel. P3 needs P1. Capture a
Validation Cohort probe baseline once all six land — that is the new
MAE starting line. Repo-wide tests stay green throughout P6 (Site
Notes consumers, ML pipeline, recommendations, etc. all need the
mapper updates that accompany each typing change).
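The dependency table can be checked mechanically with the standard library's `graphlib`. The sketch below encodes only the hard dependency from the table (P3 needs P1) and treats "P2 helpful" for P4 as non-blocking, which is an interpretation rather than something the table enforces.

```python
from graphlib import TopologicalSorter

# Map each prerequisite to the set of prerequisites it depends on.
deps = {
    "P1": set(),
    "P2": set(),
    "P3": {"P1"},
    "P4": set(),  # "P2 helpful" is treated as a soft, non-blocking hint
    "P5": set(),
    "P6": set(),
}

sorter = TopologicalSorter(deps)
sorter.prepare()

# Everything with no unmet dependency can start in parallel.
first_wave = sorted(sorter.get_ready())
sorter.done(*first_wave)

# Whatever becomes ready next was blocked on the first wave.
second_wave = sorted(sorter.get_ready())

print(first_wave, second_wave)
```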
**Phase 1 — Section sweep.** For each RdSAP 10 section, in document
order:
1. Read the spec section text + cited tables. 1. Read the spec section text + cited tables.
2. Identify code location(s). 2. Identify code location(s).
@@ -780,22 +1023,36 @@ For each section, in order:
- Does our code implement it? - Does our code implement it?
- Does the implementation match? - Does the implementation match?
- Edge cases / fallback paths handled? - Edge cases / fallback paths handled?
4. For each gap: AAA unit test → minimal implementation → commit. 4. For each gap: AAA unit test (preferring a BRE worked-example
5. After each commit: run golden fixtures (`pytest test_golden_fixtures.py`) assertion on `intermediate` values when possible) → minimal
and the parity probe. Note both deltas in the commit message. implementation → commit.
6. If a golden fixture breaks: investigate. Either fixture was a 5. **Apply the worksheet-faithful structure principle** (§5.5) as
compensating case (acceptable to break) or the new code is wrong part of this slice: name functions after worksheet lines, split
(revert). compound calculations, replace any remaining defensive
type-handling with typed-enum dispatch.
6. After each commit: run `test_bre_worked_examples.py` + Validation
Cohort probe. Note both deltas in the commit message.
7. If a BRE worked-example breaks: the new code is wrong (revert).
The worked examples are spec-derived and cannot be broken by
spec-correct changes.
Stick to this. The prior session's mistake was jumping between Stick to this. The prior session's mistake was jumping between
sections based on residual-size. Don't. sections based on residual-size **on a dirty probe**. Clean probe
plus document-order discipline plus worksheet-faithful structure is
what makes the sweep converge.
--- ---
## 14. Useful references ## 14. Useful references
- **ADR-0010** `docs/adr/0010-sap10-calculator-spec-target-and-validation.md`
— the binding decisions reflected in this rewrite: SAP 10.2 target,
cert-cal deletion, Validation Cohort, PCDB-as-prerequisite, fixture
retirement. **Read first.**
- **ADR-0009** `docs/adr/0009-deterministic-sap-calculator.md` - **ADR-0009** `docs/adr/0009-deterministic-sap-calculator.md`
decision rationale + Session A/B/C plan. original calculator decision rationale + Session A/B/C plan. Read
for context; spec-version target / PCDB sequencing / cert-cal
rationale are superseded by ADR-0010.
- **Spec coverage map** - **Spec coverage map**
`docs/sap-spec/SPEC_COVERAGE.md` — pre-existing coverage tracker. `docs/sap-spec/SPEC_COVERAGE.md` — pre-existing coverage tracker.
Update as you go. Update as you go.
@@ -817,19 +1074,19 @@ sections based on residual-size. Don't.
## 15. Final note ## 15. Final note
The prior session demonstrated that **moving SAP MAE down requires The prior session's framing — *"the cert-calibration layer absorbs
either spec-correctness OR Elmhurst-perfect calibration, not both Elmhurst's spec deviations; we'll re-derive it at the end"* — was
simultaneously**. The cert-cal layer absorbs Elmhurst's spec load-bearing on a false diagnosis. The cert-cal layer is
deviations; any spec-correct fix risks breaking it. pre-March-2025 SAP prices fit against a mixture distribution of two
spec-version regimes. Once you separate the regimes (Validation
Cohort) and use spec prices everywhere, the "tension" disappears.
The systematic pass clears this by separating the layers: After P1–P5 land, the section sweep is straightforward: every
1. Build the spec-correct engine first. spec-correct fix is unambiguously the right answer, BRE
2. Re-fit the cert-cal compatibility layer once at the end. worked-example fixtures lock the result, and Validation Cohort probe
MAE moves monotonically downward. The fixes the prior session marked
as "spec-correct but probe-regressed" become trivially landable.
Don't be discouraged when SAP MAE rises temporarily during the spec **Welcome to the project. Read ADR-0010, land the six prerequisites,
sweep. PE residual is the truer signal of engine correctness. SAP then walk the spec in document order. The deterministic answer is in
MAE convergence will follow once cert-cal is re-derived against the there.**
clean engine.
**Welcome to the project. Read the spec, follow the order, commit one
section at a time. The deterministic answer is in there.**


@@ -21,7 +21,6 @@ from typing import Final
from datatypes.epc.domain.epc_property_data import EpcPropertyData, SapBuildingPart from datatypes.epc.domain.epc_property_data import EpcPropertyData, SapBuildingPart
_DEFAULT_STOREY_HEIGHT_M: Final[float] = 2.5 _DEFAULT_STOREY_HEIGHT_M: Final[float] = 2.5
@@ -72,6 +71,16 @@ def dimensions_from_cert(epc: EpcPropertyData) -> Dimensions:
"""Build the `Dimensions` aggregate from an EpcPropertyData.""" """Build the `Dimensions` aggregate from an EpcPropertyData."""
parts = epc.sap_building_parts or [] parts = epc.sap_building_parts or []
# Khalim Comments - this section seems to implement the
# worksheet section in page 132 and is unnecessarily
# complicated. The sap building parts are pre-ordered, from
# main building part to the extensions and the
# "identifier" field tells us if the part is the Main Dwelling
# or it's an extension. E.g. if it's an extension, identifier
# should be "Extension 1".
# We should strictly type the values on the EpcPropertyData
# domain model
ground_area = 0.0 ground_area = 0.0
ground_perim = 0.0 ground_perim = 0.0
top_area = 0.0 top_area = 0.0