Model/docs/sap-spec/HANDOVER_SYSTEMATIC_REVIEW.md
Khalim Conn-Kowlessar bb9c5ac017 docs: ADR-0010 retargets calculator to SAP 10.2; rewrite handover
Adds ADR-0010 superseding ADR-0009's spec-version target, PCDB
sequencing, and cert-calibration layer. Captures the conclusions
of a grill-with-docs session:

  1. Active spec target is SAP 10.2 (14-03-2025), not SAP 10.3 — no
     SAP-10.3-lodged certs exist in the corpus to validate against.
  2. table_12_cert_calibration is deleted (not "re-derived at the
     end"). It was pre-March-2025 spec prices fit against a mixture
     distribution of two spec-version regimes, with downstream-
     component bugs absorbed into the fit — not Elmhurst deviation.
  3. Validation Cohort: filter the corpus to inspection_date ≥
     2025-07-01 so every cert in the probe was lodged on SAP 10.2
     (14-03-2025) prices. One spec, one signal.
  4. PCDB integration is promoted from "Session C deferred" to
     prerequisite P4 — dominates residual variance on heat pumps and
     the 78% of gas-boiler certs lodging main_heating_data_source=1.
  5. Trace mode (SapResult.intermediate) and BRE worked-example
     fixtures replace the 7 cert-based golden fixtures, which
     contained compensating errors.
  6. Strict-type EpcPropertyData via codes.csv-derived canonical
     enums (P6) — the in-source motivation lives at
     dimensions.py:74-82 (Khalim's comment, included in this commit).
  7. Worksheet-faithful structure is a sweep-time principle: each
     worksheet module mirrors SAP 10.2 worksheet line numbering.

CONTEXT.md additions:
  - Refined "Calculated SAP10 Performance" and "SAP10 Calculation"
    to reference SAP 10.2 + ADR-0010.
  - New term "SAP Spec Version" — domain-meaningful because the
    same EpcPropertyData yields different sap_score under different
    spec revisions.
  - New term "Validation Cohort" — the version-locked sub-corpus.

HANDOVER_SYSTEMATIC_REVIEW.md is rewritten section-by-section to
reflect ADR-0010: §1 framing, §2 status pointer, new §2.5 with the
six prerequisites P1–P6 in dependency order, §3 diagnosis (cert-cal
was stale prices, not Elmhurst deviation), §4 scope (PCDB IN,
SAP 10.3 stays OUT), §5 approach (worksheet-faithful principle as
§5.5), §7 tension dissolved, §7b findings re-framed, §8 dead-ends
re-classified as conditional, §9 cohort filter, §10 fixture
strategy, §11 trace mode as prerequisite, §12 prereqs-first,
§13 Phase 0/Phase 1 workflow, §14 ADR-0010 reference, §15 final
note.

P2.1 (commit ac1aa56a) already lands the first ADR-0010 slice
(probe swap to spec prices).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-19 09:54:24 +00:00


Handover — Systematic Section-by-Section RdSAP 10 / SAP 10.2 Review

Audience: A fresh agent picking up the deterministic SAP calculator at packages/domain/src/domain/sap/. Read this first, then the spec PDFs, then the code.

Goal: Match the cert software (Elmhurst / Stroma / etc.) output exactly for RdSAP 10 / SAP 10.2 input certs. This is a deterministic, mechanical calculation — not a model — so MAE should approach zero on certs whose inputs are fully populated.


1. Critical framing — this is NOT a judgement call

The SAP/RdSAP energy assessment splits cleanly into two roles:

  1. The assessor — a person who surveys the dwelling and lodges measured/observed fields onto the cert (areas, perimeters, construction codes, insulation thicknesses, fuel types, etc.). The assessor makes NO calculation decisions.
  2. The cert software (Elmhurst, Stroma, Quidos, NHER, ECMK) — a deterministic implementation of the RdSAP 10 + SAP 10.2 specs. It takes the lodged fields and produces SAP score, CO2 emissions, primary energy (PEUI), CO2 per m², EI rating, etc.

Our calculator is replicating role #2. Assessor software implements the SAP 10.2 spec faithfully; the question of "where does Elmhurst diverge from spec?" is no longer the operative one (per ADR-0010 + §3 below). Our job is to enumerate every spec table / formula / footnote and verify each against the published SAP 10.2 (14-03-2025) and RdSAP 10 (10-06-2025) PDFs.

There is no "assessor judgement" knob to tune. Each field on the cert has a deterministic interpretation per the spec. Each spec table / formula has a deterministic implementation. Our job is to enumerate all of them and verify each.


2. Current state (2026-05-19)

  • Branch: ara-backend-design-prd

  • Last clean commit: f4a8d2a0 ("tests: golden-fixture regression set — 7 currently-correct corpus certs")

  • 301 tests passing

  • Parity probe (300 random certs from data/ml_training/runs/2025_2026_n250000_v18a/data.parquet, seed=7, sap_score ∈ [5, 99]):

    | Metric | Value |
    | --- | --- |
    | SAP MAE | 4.61 |
    | SAP bias | +0.87 |
    | PE MAE | 43.32 kWh/m² |
    | PE bias | +37.69 kWh/m² |
  • 7 "golden" regression certs locked in packages/domain/src/domain/sap/rdsap/tests/test_golden_fixtures.py. Tolerance: |SAP residual| ≤ 1, |PE residual| ≤ 10 kWh/m². Known caveat: some of these are compensating-error matches (e.g. cert 7536-3827's PE matches but cost is £143 under cert's implied cost due to multi-factor offsetting bugs). These fixtures are retired per ADR-0010 and §10 below — they lock buggy compensating outputs in place and will fight the spec sweep.

Read this before anything else. ADR-0010 supersedes the spec-version target, the PCDB sequencing, and the cert-calibration layer of ADR-0009. This handover document was originally written under the rejected framing; §3, §4, §7, §7b, §10 below have been rewritten in lockstep. §2.5 lists the six prerequisites (P1–P6) that land before the section-by-section sweep starts.


2.5. Prerequisites before the sweep starts

Six blockers (P1–P6), in dependency order. The section sweep does not start until all six are merged. Together they convert the parity probe from a noisy mixture-distribution signal into a clean per-section verification tool.

P1 — Re-extract the training parquet with inspection_date

The 250k-cert parquet has 202 columns; none of them are dates. Without inspection_date on each cert we cannot construct the Validation Cohort (P3). The ETL currently drops the dates; add them back as a non-breaking MINOR Feature Schema Version bump (per ADR-0008). EpcPropertyData.inspection_date and .registration_date both exist on the domain object and are populated upstream — the parquet writer just needs to include them.

P2 — Delete domain.sap.tables.table_12_cert_calibration; correct domain.sap.tables.table_12

Per ADR-0010 §2 and §1:

  • Remove table_12_cert_calibration.py and every call site (cert_calibration_prices(), cert_calibration_e7_codes, the PriceTable constructor argument that defaults to it).
  • Re-label table_12.py as SAP 10.2 Table 12 (14-03-2025 amendment).
  • Correct CO2 factors: mains gas 0.214 → 0.210, standard electricity 0.086 → 0.136 (the file currently mixes SAP 10.2 prices with SAP 10.3 CO2 factors).
  • Delete the misleading "+25 % shift from SAP 10.2" comment — 13.19 p is SAP 10.1 (or SAP 10.2 amendment 0), not SAP 10.2 (14-03-2025).

P3 — Filter the parity probe to the Validation Cohort

Validation Cohort is defined in CONTEXT.md and ADR-0010 §3: inspection_date ≥ 2025-07-01. Modify services/ml_training_data/src/ml_training_data/sap_parity_probe.py to apply the filter before sampling. The probe sample size and seed remain configurable; sap_score ∈ [5, 99] remains the typicality filter on top of the cohort filter.
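A minimal sketch of the cohort-then-typicality ordering, assuming each cert is available as a mapping with an `inspection_date` (`datetime.date`) and a `sap_score`. The real probe reads the parquet; the function names below are hypothetical and are not taken from sap_parity_probe.py.

```python
from __future__ import annotations

import random
from datetime import date
from typing import Iterable, Mapping

# Cut-off quoted in this handover (ADR-0010 §3).
VALIDATION_COHORT_START = date(2025, 7, 1)


def in_validation_cohort(cert: Mapping[str, object]) -> bool:
    """True when the cert was inspected on or after the cohort cut-off."""
    inspected = cert.get("inspection_date")
    return isinstance(inspected, date) and inspected >= VALIDATION_COHORT_START


def is_typical(cert: Mapping[str, object]) -> bool:
    """Typicality filter kept from the existing probe: sap_score in [5, 99]."""
    score = cert.get("sap_score")
    return isinstance(score, (int, float)) and 5 <= score <= 99


def sample_probe(
    certs: Iterable[Mapping[str, object]], n: int = 300, seed: int = 7
) -> list[Mapping[str, object]]:
    """Filter to the cohort first, then apply typicality, then sample."""
    eligible = [c for c in certs if in_validation_cohort(c) and is_typical(c)]
    rng = random.Random(seed)
    return rng.sample(eligible, min(n, len(eligible)))
```

Certs with no `inspection_date` are excluded rather than guessed at, which is why P1 has to land before this filter is usable.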

P4 — Implement PcdbLookup (replace NoOpPcdbLookup)

Per ADR-0010 §4. Download boiler + heat-pump CSVs from https://www.ncm-pcdb.org.uk. Build a lookup keyed on main_heating_index_number. Surface seasonal efficiency, secondary efficiency, output kW, and (for HPs) flow-temperature curve. ~half-day of work per the original handover estimate. The Sap10Calculator.__init__(pcdb: Optional[PcdbLookup]) seam from ADR-0009 grill outcome #1 is the integration point; no calculator-side changes needed beyond reading index_number and routing PCDB-returns to space-heating / hot-water efficiency lookups instead of Table 4a.
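A hedged sketch of the lookup seam, assuming a CSV with columns named `index_number` and `seasonal_efficiency_pct`. Those column names, `PcdbRecord`, `CsvPcdbLookup`, and `space_heating_efficiency_pct` are hypothetical; the real PCDB export layout needs checking against the downloaded files. Only `PcdbLookup` and `NoOpPcdbLookup` are names from this handover, and their real signatures may differ.

```python
from __future__ import annotations

import csv
import io
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass(frozen=True)
class PcdbRecord:
    index_number: int
    seasonal_efficiency_pct: float


class PcdbLookup(Protocol):
    def find(self, index_number: Optional[int]) -> Optional[PcdbRecord]: ...


class NoOpPcdbLookup:
    """Always misses, so callers fall back to Table 4a."""

    def find(self, index_number: Optional[int]) -> Optional[PcdbRecord]:
        return None


class CsvPcdbLookup:
    """In-memory lookup built from CSV text (hypothetical column names)."""

    def __init__(self, csv_text: str) -> None:
        self._records: dict[int, PcdbRecord] = {}
        for row in csv.DictReader(io.StringIO(csv_text)):
            record = PcdbRecord(
                index_number=int(row["index_number"]),
                seasonal_efficiency_pct=float(row["seasonal_efficiency_pct"]),
            )
            self._records[record.index_number] = record

    def find(self, index_number: Optional[int]) -> Optional[PcdbRecord]:
        if index_number is None:
            return None
        return self._records.get(index_number)


def space_heating_efficiency_pct(
    pcdb: PcdbLookup, index_number: Optional[int], table_4a_efficiency_pct: float
) -> float:
    """Prefer the PCDB value; fall back to the Table 4a value on a miss."""
    record = pcdb.find(index_number)
    if record is None:
        return table_4a_efficiency_pct
    return record.seasonal_efficiency_pct
```

Keeping the fallback in one function means a cert without an index number behaves exactly as it does today under NoOpPcdbLookup.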

P5 — Populate SapResult.intermediate + transcribe BRE worked examples

Per ADR-0010 "Verification infrastructure":

  • Populate every named SAP 10.2 worksheet variable on SapResult.intermediate as sketched in §11. This is mechanical — thread the values from each worksheet module into the dict.
  • Transcribe the BRE worked examples from the SAP 10.2 appendices and RdSAP 10 worked-example annex into unit tests (tests/test_bre_worked_examples.py) that lock per-intermediate values, not aggregate SAP. These replace the retired cert fixtures.

P6 — Strict-type EpcPropertyData via canonical domain enums

The current EpcPropertyData and its nested types carry many bare str fields and Union[int, str] fields (the latter because the gov API gives ints and Site Notes give strings). The defensive type-handling cascades into the calculator (cert_to_inputs.py, dimensions.py, etc.) — dimensions.py:74-82 is Khalim's documented example: SapBuildingPart.identifier carries main-vs-extension information but is bare str, so the dimensions code defensively iterates instead of dispatching on a typed kind.

The fix:

  1. One canonical enum per field, union of all keys appearing across all schema versions in datatypes/epc/domain/epc_codes.csv. Hand-author the 18 enum classes (built_form, construction_age_band, energy_tariff, glazed_area, glazed_type, heat_loss_corridor, main_fuel, mechanical_ventilation, property_type, tenure, transaction_type, ventilation_type, water_heating_fuel, cylinder_insulation_thickness, energy_efficiency_rating, improvement_description, improvement_summary, code) plus BuildingPartKind (Main Dwelling / Extension N). codes.csv is the reference; a dedup script can optionally verify coverage but is not a build dependency.
  2. The API mapper parses raw ints into the canonical enum.
  3. The Site Notes mapper parses raw strings into the canonical enum.
  4. The domain object (EpcPropertyData and nested) holds only the canonical enums — no Union[int, str], no bare str for coded fields.
  5. Every consumer (calculator, ML pipeline, recommendations, ETL, scenario builder) reads from the typed fields.
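The mapper split in steps 2 to 4 could look roughly like this. It is a sketch only: the `BuiltForm` members shown are an illustrative subset, and the integer codes and string labels are placeholders that must be replaced with the values in epc_codes.csv. The function names are hypothetical.

```python
from __future__ import annotations

from enum import Enum


class BuiltForm(Enum):
    # Illustrative subset; epc_codes.csv is the authority for the member list.
    DETACHED = "detached"
    SEMI_DETACHED = "semi_detached"
    MID_TERRACE = "mid_terrace"


# ASSUMPTION: placeholder codes and labels, not verified against epc_codes.csv.
_API_CODES: dict[int, BuiltForm] = {
    1: BuiltForm.DETACHED,
    2: BuiltForm.SEMI_DETACHED,
    4: BuiltForm.MID_TERRACE,
}
_SITE_NOTES_LABELS: dict[str, BuiltForm] = {
    "detached": BuiltForm.DETACHED,
    "semi-detached": BuiltForm.SEMI_DETACHED,
    "mid-terrace": BuiltForm.MID_TERRACE,
}


def built_form_from_api(raw: int) -> BuiltForm:
    """API mapper: raw int code to canonical enum."""
    try:
        return _API_CODES[raw]
    except KeyError:
        raise ValueError(f"unknown built_form code: {raw!r}") from None


def built_form_from_site_notes(raw: str) -> BuiltForm:
    """Site Notes mapper: raw string label to the same canonical enum."""
    try:
        return _SITE_NOTES_LABELS[raw.strip().lower()]
    except KeyError:
        raise ValueError(f"unknown built_form label: {raw!r}") from None
```

Both mappers return the same enum member, so nothing downstream of the mapper layer needs a `Union[int, str]`.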

Constraint: repo-wide tests must keep passing. The calculator is one consumer; the ML pipeline, recommendations, and the Site Notes ingestion path also consume EpcPropertyData. Each mapper-layer change is paired with adapter updates that preserve the behaviour the existing tests cover.

Pyright strict mode must remain clean (CLAUDE.md).

Expected outcome of P1–P6

After all six land, run the probe against the Validation Cohort. The expected baseline MAE on the clean probe is much smaller than the current 4.61, likely 1.5–2.5 SAP points based on what we know about the residual breakdown (heat pumps closed by P4, gas boilers tightened by P4, price-version noise removed by P2+P3). The remaining residual is the genuine spec sweep target. Per-section fixes will move the probe in measurable, distinguishable amounts because there's no compensating layer to mask them and no defensive type branching obscuring which input value drove which intermediate.


3. Why the prior diagnosis was wrong and how we fixed it

The prior session shipped ten slices (S-B23 → S-B31) by debugging the biggest residuals one at a time:

  • PE MAE dropped substantially: 57.28 → 43.32 (−14), real progress on the demand-side calculation.
  • SAP MAE barely moved: 5.34 → 4.61 (−0.73), diagnosed at the time as "cert-calibration absorbs multiple spec deviations".

Three slice attempts looked like they "proved" the cert-cal-absorbs-deviations diagnosis:

  • Standing charges: spec Table 12 note (a) requires £92/yr gas standing charge on space + water heating. Adding it pushed SAP bias +0.98 → −2.62. Reverted.
  • Cat=10 room heaters off-peak routing: Table 12a says "other direct-acting electric heating" bills 100 % high rate on 7-hour tariff. Switching cat=10 from off-peak to standard rate inverted the bias +5.88 → −6.00. Reverted.
  • HW cylinder zero-loss for combi (uncommitted): Table 2 + Table 3 footers require zero storage + primary loss when efficiency comes from Table 4b. Zeroing them dropped PE MAE by 6.64 but raised SAP MAE by 0.39 and broke 3 of 7 golden fixtures. Reverted.

The prior agent concluded: cert-calibration absorbs Elmhurst's deviations from spec — we can't fix one without re-deriving the calibration, so do a full spec sweep first and re-derive cert-cal at the end. This diagnosis is wrong and the proposed remedy amplifies the problem.

What was actually going on

The 250k-cert corpus spans multiple SAP spec-version regimes:

  • Pre-2025-03-14: certs lodged under SAP 10.1 / SAP 10.2 amendment 0 prices — mains gas ~3.48 p, standard electricity 13.19 p.
  • Post-2025-03-14: certs lodged under SAP 10.2 (14-03-2025) prices — mains gas 3.64 p, standard electricity 16.49 p.

The table_12_cert_calibration prices (3.48 p / 13.19 p) are the older spec's prices, not Elmhurst deviations from the spec. They are an empirical "best fit" across a mixture distribution of two price regimes, with downstream-component bugs (PCDB absence, HW cylinder loss applied to combi, etc.) absorbed into the fit. The table looks like compensation for assessor-software quirks because we were never told which spec each cert was on.
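The size of the gap between the two price regimes can be checked with a few lines of arithmetic, using only the four prices quoted above. The helper names are illustrative.

```python
# Prices in pence per kWh, as quoted in this section.
OLD_PRICES = {"mains_gas": 3.48, "standard_electricity": 13.19}
SPEC_PRICES = {"mains_gas": 3.64, "standard_electricity": 16.49}


def shortfall_pct(old: float, new: float) -> float:
    """How far the old price sits below the new one, as a % of the new price."""
    return (new - old) / new * 100


def uplift_pct(old: float, new: float) -> float:
    """How far the new price sits above the old one, as a % of the old price."""
    return (new - old) / old * 100


for fuel in OLD_PRICES:
    old, new = OLD_PRICES[fuel], SPEC_PRICES[fuel]
    print(f"{fuel}: old is {shortfall_pct(old, new):.1f}% below spec, "
          f"spec is {uplift_pct(old, new):.1f}% above old")
```

Gas comes out about 4.4 % below the 14-03-2025 price and electricity about 20 % below (equivalently, the new electricity price is about 25 % above the old one). A uniform calibration cannot represent a gap that differs this much by fuel, which is consistent with the mixture-distribution reading.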

Each "spec-correct fix that worsened MAE" in the failed slices above was actually correct. The MAE regressed because:

  1. The cert-cal prices (pre-March-2025 spec) cancelled with one set of downstream errors to produce a quasi-stable cost.
  2. The spec-correct fix landed → that cancellation broke → the probe MAE went up.
  3. But the spec-correct fix was right — what regressed was a compensating-error equilibrium, not the calculator's truth.

The prior session's "re-derive cert-cal at the end" plan would re-establish a new compensating-error equilibrium across the new bug set. It does not converge on spec-correctness.

The fix (per ADR-0010)

  1. Stop fitting against a mixture distribution. Filter the validation corpus to a single spec-version window (Validation Cohort, inspection_date ≥ 2025-07-01). Every cert in the cohort was lodged on SAP 10.2 (14-03-2025) prices.
  2. Delete the cert-calibration layer. Use spec prices everywhere (domain.sap.tables.table_12). The only price-routing decision left is Table 12a fractional high-rate blending — a real spec feature, not a calibration.
  3. Build PCDB, because it dominates residual variance and the reason it was deferred (cert-cal-absorbs-PCDB) no longer holds.
  4. Build trace mode and BRE worked-example fixtures, so per-section verification works against single-cert intermediates instead of aggregate corpus MAE.

This is what §2.5 lists as prerequisites P1–P5; P6 (strict typing) lands alongside them. Once all six land, the section-by-section spec sweep produces clean, monotonic improvements.


4. Scope decisions (per ADR-0010)

IN scope

  • SAP 10.2 (14-03-2025 amendment) is the active spec target. docs/sap-spec/sap-10-2-full-specification-2025-03-14.pdf, 199 pages.
  • RdSAP 10 (10-06-2025) — the cert→input mapping layer that cross-references SAP 10.2. docs/sap-spec/rdsap-10-specification-2025-06-10.pdf, 114 pages.
  • PCDB integration. Moved from "Session C deferred" to P4 prerequisite (§2.5). Heat pumps and the 78 % of gas-boiler certs lodging main_heating_data_source=1 need PCDB-sourced efficiency for the calculator to be spec-correct. Data source: https://www.ncm-pcdb.org.uk; lookup keyed on main_heating_index_number; fields: seasonal efficiency, secondary efficiency, output kW, flow-temperature curve (HPs).
  • All RdSAP 10 sections in document order. §1 → §19, plus Tables 27 / 28 / 29 / 30 / 31. The verification approach in §5 is unchanged; only the precondition changes: the sweep runs against a clean probe (Validation Cohort + spec prices + PCDB + trace mode).

OUT of scope

  • Full SAP assessments. Full-SAP certs lodge measured/calculated U-values in walls[i].description (e.g. "Average thermal transmittance 0.18 W/m²K"). These are a separate calculation path (BS EN ISO 6946) and a different corpus. Park until the RdSAP 10 base case parity is reached. S-B24 / S-B29 attempted partial handling; those slices can stay or be reverted at your discretion when you reach §§4-7 of RdSAP and §3 of SAP 10.2.
  • SAP 10.3 (13-01-2026). No SAP-10.3-lodged certs in the corpus, so it cannot be validated. Calculator targets SAP 10.2 until the corpus migrates (expected late 2026 / 2027 once BRE updates RdSAP to reference SAP 10.3). Note: table_12.py currently mixes SAP 10.2 prices with SAP 10.3 CO2 factors — corrected as part of P2.
  • Historical-spec cert reproduction. Calculating what cert SAP would have been under SAP 10.1 / pre-March-2025 SAP 10.2 prices is not the calculator's job. Lodged Performance carries the historical value; Calculated SAP10 Performance is current-spec only. The Validation Cohort filter operationalises this — older certs are out of the validation loop, not because they're "wrong" but because they're a different spec's output.
  • Re-deriving cert-cal at the end. The prior session's plan. The cert-calibration layer is deleted in P2, not re-fit.

5. The approach — section-by-section spec verification

Work through the RdSAP 10 spec in document order, starting at §1. For each section:

5.1. Read the spec section

Read the section text fully. Note every rule, table reference, and defaulting cascade.

5.2. Find the corresponding code

Map the section to the source file(s) implementing it. The current mapping (some sections are split across modules):

| RdSAP 10 section | Code location |
| --- | --- |
| §1 Introduction / general | n/a |
| §2 Property descriptors | datatypes/epc/domain/epc_property_data.py |
| §3 Dimensions | packages/domain/src/domain/sap/worksheet/dimensions.py |
| §4 Ventilation | packages/domain/src/domain/sap/worksheet/ventilation.py |
| §5 Construction / U-values | packages/domain/src/domain/ml/rdsap_uvalues.py + worksheet/heat_transmission.py |
| §6 Windows / doors / overshading | worksheet/solar_gains.py + rdsap/cert_to_inputs.py |
| §7 Heating systems (refers to SAP 10.2 Appendix A) | domain.ml.sap_efficiencies + rdsap/cert_to_inputs.py |
| §8 Heating controls (Table 4e) | rdsap/cert_to_inputs.py |
| §9 Heat emitters / flow temperatures | not implemented |
| §10 Space and water heating (Appendix A) | rdsap/cert_to_inputs.py |
| §11 Additional items (PV, batteries, wind, hydro, shutters) | partial in cert_to_inputs.py (PV only) |
| §12 Electricity tariff | rdsap/cert_to_inputs.py (_is_off_peak_meter, fuel routing) |
| §13 Addendum to EPCs | n/a |
| §14 Special cases (e.g. flats above commercial) | not implemented |
| §15 Improvements (recommendations) | n/a (not rating) |
| §16-19 RdSAP-specific SAP rating equations | worksheet/rating.py |
| Table 27: Living-area fraction | rdsap/cert_to_inputs.py:_living_area_fraction |
| Table 28: Cylinder size defaults | domain.ml.demand:_CYLINDER_VOLUME_L |
| Table 29: Heating + HW parameters | partial in cert_to_inputs.py |
| Table 30: Mechanical ventilation | not implemented |
| Table 31: Data to be collected | n/a |

5.3. For each spec rule in the section, check our code

For each table, formula, footnote, exception:

  1. Does our code implement it?
  2. Does the implementation match the spec values exactly?
  3. Are there spec-defined edge cases / footnotes we're missing?

5.4. When a gap is found

  • Write a failing unit test that asserts the spec-correct behaviour — wherever possible, write it as an assertion on intermediate values rather than on aggregate SAP, using a BRE worked example if one covers the section.
  • Implement the fix.
  • Run test_bre_worked_examples.py plus the Validation Cohort probe. Note both direction and magnitude of change.
  • If a BRE worked-example breaks, the new code is wrong (revert). BRE examples are spec-derived and cannot regress from a spec-correct change.
  • Commit per-slice: one section → one commit. Reference the spec section in the commit message.
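One possible shape for the "assert on intermediates" step is a small comparison helper that reports every worksheet value that disagrees with a worked example, rather than stopping at the first. This is a sketch: the helper name and the key names used in the usage note are hypothetical, and the real keys depend on how P5 populates SapResult.intermediate.

```python
from __future__ import annotations

import math
from typing import Optional


def intermediate_mismatches(
    actual: dict[str, float],
    expected: dict[str, float],
    abs_tol: float = 0.01,
) -> dict[str, tuple[Optional[float], float]]:
    """Return {name: (got, want)} for every expected value that is missing
    from `actual` or differs from it by more than `abs_tol`."""
    mismatches: dict[str, tuple[Optional[float], float]] = {}
    for name, want in expected.items():
        got = actual.get(name)
        if got is None or not math.isclose(got, want, rel_tol=0.0, abs_tol=abs_tol):
            mismatches[name] = (got, want)
    return mismatches
```

A test would then read `assert intermediate_mismatches(result.intermediate, worked_example) == {}`, so that a failure message names the offending worksheet lines directly.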

5.5. Sweep-time principle: worksheet-faithful structure

Each worksheet/*.py module must mirror the SAP 10.2 worksheet structure for its section. As you verify a section, also restructure its module so that:

  1. Each function name references its worksheet-line origin (e.g. heat_transfer_coefficient aligns with worksheet line (40); mean_internal_temperature aligns with worksheet line (93)).
  2. Compound calculations are split into one function per worksheet line where possible — easier to verify against intermediate[...] and against BRE worked-example values.
  3. Defensive type-handling disappears. Once P6 lands, the input is a typed enum or numeric — branching on isinstance(x, int) is replaced by enum dispatch.
  4. Domain-typed inputs flow directly. SapBuildingPart.kind == BuildingPartKind.MAIN_DWELLING replaces string sniffing of identifier. The dimensions.py "unnecessarily complicated" pattern Khalim flagged is the canonical example of what not to do.
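As an illustration of points 3 and 4, the string sniffing can be confined to the mapper so that worksheet code dispatches on the enum only. This sketch assumes identifiers of the form "Main Dwelling" and "Extension N"; `TypedBuildingPart`, `parse_identifier`, and `main_dwelling_floor_area` are hypothetical names, not the current SapBuildingPart API.

```python
from __future__ import annotations

import re
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class BuildingPartKind(Enum):
    MAIN_DWELLING = "main_dwelling"
    EXTENSION = "extension"


@dataclass(frozen=True)
class TypedBuildingPart:
    kind: BuildingPartKind
    extension_number: Optional[int]
    floor_area_m2: float


_EXTENSION_RE = re.compile(r"^extension\s+(\d+)$", re.IGNORECASE)


def parse_identifier(identifier: str) -> tuple[BuildingPartKind, Optional[int]]:
    """Mapper-layer parse: the only place that inspects the raw string."""
    text = identifier.strip()
    if text.lower() == "main dwelling":
        return BuildingPartKind.MAIN_DWELLING, None
    match = _EXTENSION_RE.match(text)
    if match:
        return BuildingPartKind.EXTENSION, int(match.group(1))
    raise ValueError(f"unrecognised building part identifier: {identifier!r}")


def main_dwelling_floor_area(parts: list[TypedBuildingPart]) -> float:
    """Worksheet-side code dispatches on the enum, never on the string."""
    return sum(
        p.floor_area_m2 for p in parts if p.kind is BuildingPartKind.MAIN_DWELLING
    )
```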

The principle applies during section-sweep slices. It is not a separate prerequisite — the refactor lands with the verification slice for the section it touches.

5.6. Use trace mode when you need it

P5 populates SapResult.intermediate: dict[str, float] with every named SAP 10.2 worksheet variable. Each section's verification benefits from inspecting these values per-cert. See §11 below for the sketch.


6. What's already been done — section by section

This is your starting map. Each row says whether the section has been touched and what the current state is.

Walls / construction (§5)

  • S-B23 (committed 9a509e41): Table 6 "Filled cavity" row dispatch when wall_insulation_type=2 AND wall_construction=4. Spec-anchored.
  • S-B24 (committed 15613309): Parse walls[i].description for "Average thermal transmittance X W/m²K". PARK — full-SAP path.
  • S-B25 (committed 6b934710): Description-based dispatch for cavity "as built, insulated (assumed)" + similar (type=4 with descriptive signal). Spec-anchored via legacy epc_wall_description_map.
  • S-B26 (committed 361f9154): _insulation_bucket(0, True) → 50 fix (the "NI" thickness sentinel) + description-based override of wall_ins_present for non-cavity walls. Spec footnote (Table 6).
  • S-B27 (committed 1f49fa03): Floor _insulation_bucket analog — Table 19 footnote (2) "max(50, age-band default)" when description signals retrofit.
  • S-B28 (committed 25261d5c): Roof NI thickness + insulated description → §5.11.4 footnote 50mm joist row.
  • S-B29 (committed 3ab09845): Floor + roof "Average thermal transmittance" parse. PARK — full-SAP path.

Still to verify in §5:

  • Stone wall U-values for Scotland / Wales / NIR (Tables 7-10) — only England is fully transcribed; country overrides are partial.
  • Cob U-values (§5.6) — table only, no formula implementation.
  • Stone formula §5.6 / §5.7 for non-standard wall thicknesses.
  • Curtain wall §5.18 — not implemented.
  • Party wall U-values (Table 15) — implemented in u_party_wall, verify table values.
  • Thermal bridging (Table 21) — implemented as global y factor, verify per-age-band values.
  • §5.16 Thermal mass — Table 22 (only 100 / 250 kJ/m²K, dispatched by construction type with internal insulation). Currently we hardcode 250 (see cert_to_inputs.py:_DEFAULT_THERMAL_MASS_PARAMETER_KJ_PER_M2_K). This is wrong for timber-frame / cob / internally-insulated masonry (should be 100).

Heating systems (§§7-10, SAP Appendix A)

  • S-B20 (in history): Table 11 secondary heating allocation, conditional on cert lodging secondary or being electric storage.
  • Failed S-B30 (reverted): respect main_heating_fraction — shown empirically wrong. Field is multi-main allocation, not main-vs-secondary. Spec verified against SAP 10.2 Appendix A1/A4.
  • S-B31 (committed afdf297f): Table 12c DLF on heat-network main. Spec §C3.1 + Table 12c.
  • Failed S-B32 (room heater off-peak routing, reverted): Table 12a says cat=10 room heaters on 7-hour tariff bill 100% high rate. Our cert-cal extends off-peak to codes 691-696. Spec-correct fix inverted bias direction — calibration was absorbing this.
  • Uncommitted HW cylinder fix: spec-correct (combi → zero storage/primary loss per Table 2 + Table 3 footers) but breaks 3 golden fixtures. Decision deferred to systematic pass.

Still to verify in heating:

  • Table 4a efficiency values for every code (heat pumps, storage heaters, room heaters, CPSU). The category-fallback (cat=4 → 2.30) is documented as a known limitation.
  • Boiler interlock penalty (5%) — spec §9.2.1: "The efficiency of gas and liquid fuel boilers for both space and water heating is reduced by 5% if the boiler is not interlocked for space and water heating." We don't apply this. Known gap.
  • Table 4c condensing-boiler / heat-pump emitter-temperature adjustment — we don't apply this.
  • Table 12a high-rate fractions for off-peak dwellings — we apply 100% off-peak or 100% standard, never fractional blending.

Hot water (§4 SAP + Appendix J)

  • Storage loss factor table (Table 2) — current values in domain.ml.demand:_STORAGE_LOSS_FACTOR are ~3× off from spec (verified). Known under-prediction of cylinder loss for storage systems; cancelled by over-prediction of primary loss for combi systems in aggregate.
  • Primary loss formula (Table 3) — implemented as 245/60 kWh by age band. Spec is a per-month formula nₘ × 14 × [{0.0091·p + 0.0245·(1-p)}·h + 0.0263] with p (pipework insulation fraction) and h (circulation hours). Known approximation.
  • Combi-boiler zero-loss rule (Table 2 + Table 3 footers): currently NOT applied (the failed uncommitted slice). Adding it drops PE MAE by 6.64 but raises SAP MAE by 0.39.
  • Appendix J Vd formula 25N + 36 — currently the simple form, not the full per-component (shower / bath / other) breakdown. Useful HW demand is ~7% under spec value.
  • ΔT: currently a 43°C constant (55 − 12). Spec uses monthly Tcold and hot water at 52°C, not 55°C. Per-month variance unmodelled.
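The per-month primary loss formula quoted in the list above, nₘ × 14 × [{0.0091·p + 0.0245·(1−p)}·h + 0.0263], can be written down directly. This is a sketch: the function names are illustrative, and the circulation hours h for each month are left as a caller-supplied input because the applicable values must be read from the Table 3 notes rather than assumed here.

```python
from __future__ import annotations

# Days per month for a non-leap year.
DAYS_IN_MONTH = (31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)


def primary_loss_kwh(days_in_month: int, p: float, h: float) -> float:
    """Monthly primary circuit loss per the formula quoted in this handover.

    p: fraction of primary pipework that is insulated (0..1)
    h: hours per day of primary circulation for that month
    """
    return days_in_month * 14 * ((0.0091 * p + 0.0245 * (1 - p)) * h + 0.0263)


def annual_primary_loss_kwh(p: float, hours_by_month: list[float]) -> float:
    """Sum the monthly formula over a year; hours_by_month has 12 entries."""
    if len(hours_by_month) != 12:
        raise ValueError("hours_by_month must have 12 entries")
    return sum(
        primary_loss_kwh(n, p, h) for n, h in zip(DAYS_IN_MONTH, hours_by_month)
    )
```

Because the result depends on p and h, a flat per-age-band constant cannot reproduce it for every cert, which is why the current 245/60 kWh values are listed as an approximation.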

Lighting (Appendix L)

  • predicted_lighting_kwh in domain.ml.demand uses the heuristic 9.3 × TFA × (1 − 0.5·led_share − 0.4·cfl_share).
  • Spec is L1-L12: daylight correction, fixed-lighting capacity, top-up + portable shares, monthly profile.
  • For LED-dominant home (50+ LEDs): our heuristic gives ~465 kWh, spec gives ~94 kWh. Known over-prediction by ~5× for new-build LED homes.
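A short sketch of the current heuristic, 9.3 × TFA × (1 − 0.5·led_share − 0.4·cfl_share), shows why it cannot reach the spec value for LED-dominant homes. The function name and signature are illustrative rather than the real predicted_lighting_kwh, and the 100 m² floor area is an assumption chosen because it reproduces the ~465 kWh figure quoted above.

```python
def heuristic_lighting_kwh(tfa_m2: float, led_share: float, cfl_share: float) -> float:
    """Current heuristic as described in this handover (not Appendix L)."""
    return 9.3 * tfa_m2 * (1 - 0.5 * led_share - 0.4 * cfl_share)


# ASSUMPTION: 100 m2 total floor area, all lamps LED.
all_led = heuristic_lighting_kwh(100.0, led_share=1.0, cfl_share=0.0)
print(round(all_led))           # heuristic result for the assumed dwelling
print(round(all_led / 94, 1))   # ratio to the ~94 kWh spec figure quoted above
```

With led_share + cfl_share ≤ 1 the bracket never drops below 0.5, so the heuristic has a floor of 4.65 kWh per m² of floor area however efficient the lamps are.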

Internal gains (§5 SAP)

  • worksheet/internal_gains.py implements metabolic + cooking + appliances + lighting (the four positive rows of Table 5).
  • Missing: Water heating row (1000 × (65)ₘ / (nₘ × 24) — i.e. HW losses recycled as heated-space gains) and Losses row (40 × N for cold inflow + evaporation). Both documented in S-B23 gap list.
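The two missing rows are simple enough to sketch from the expressions quoted above. Function and parameter names are illustrative; `heat_gains_kwh_month` stands for worksheet value (65)ₘ and `occupancy_n` for N.

```python
def water_heating_gains_w(heat_gains_kwh_month: float, days_in_month: int) -> float:
    """Water heating row: 1000 x (65)m / (nm x 24), converting kWh/month to W."""
    return 1000 * heat_gains_kwh_month / (days_in_month * 24)


def losses_gains_w(occupancy_n: float) -> float:
    """Losses row: 40 W per occupant, entered as a negative gain."""
    return -40 * occupancy_n


def total_internal_gains_w(positive_rows_w: float, heat_gains_kwh_month: float,
                           days_in_month: int, occupancy_n: float) -> float:
    """Existing four positive rows plus the two rows currently missing."""
    return (
        positive_rows_w
        + water_heating_gains_w(heat_gains_kwh_month, days_in_month)
        + losses_gains_w(occupancy_n)
    )
```

The two rows have opposite signs, so adding only one of them would shift gains in one direction; they are best landed in the same slice.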

Ventilation (§4 / Table 5)

  • Wind-shelter factor implemented in S-B21.
  • Mechanical ventilation (MVHR, MEV, PIV) — not implemented; cert rarely lodges. Spec §4.2 + Table 4g.
  • Pressure-test override (worksheet lines 17-18) — not implemented.

Tariff / cost (§12 + Table 12 / 12a / 12c)

  • Cert-calibration prices in domain.sap.tables.table_12_cert_calibration are an EMPIRICAL fit to Elmhurst's output. They are LOWER than the published Table 12 spec values by 4-25%. Known divergence; investigation deferred.
  • Standing charges (Table 12 note (a)) — NOT applied. Adding them empirically worsens MAE (calibration absorbs).
  • Table 12a high-rate fractions: currently 100% off-peak for E7-eligible codes, 100% standard otherwise. No fractional blending.
  • Heat network DLF (Table 12c) — applied per S-B31 only to main heating + HW from main. HW-only-from-heat-network is a separate slice.

7. The cert-calibration "tension" is dissolved (per ADR-0010)

This section originally framed cert-calibration vs spec-correctness as two end-states the calculator had to choose between. That framing is wrong (see §3 for the actual diagnosis): the cert-cal values are pre-March-2025 SAP prices, not Elmhurst deviations from SAP 10.2. Once the corpus is filtered to the Validation Cohort (P3) and the cert-cal layer is deleted (P2), the false dichotomy disappears.

What replaces this section

  • One price table. domain.sap.tables.table_12 (re-labelled SAP 10.2 14-03-2025 amendment, CO2 factors corrected per P2).
  • One validation cohort. inspection_date ≥ 2025-07-01, every cert lodged on the calculator's target spec version.
  • One verification mechanism. Trace-mode intermediates + BRE worked-example unit tests for per-section verification; Validation Cohort probe MAE for aggregate go/no-go.

Cert-software deviations from spec, if they exist at all, are expected to be small and localised. They surface as residual after the spec sweep completes against a clean probe — and at that point the question is whether to chase them at all (Elmhurst-deviation fixes have low domain value compared to spec-correctness, given the calculator's product use case is scoring counterfactuals for the MeasureApplicator chain, not reproducing historical certs).


7b. Outstanding findings to pick up during the systematic pass

The prior session identified several spec-correct fixes that were reverted because they made SAP MAE worse against the full corpus. The empirical signal that "reverted" them was version-mixture noise (see §3) plus compensating-error breakage in the 7 retired golden fixtures. Each fix below is expected to land cleanly once the six prerequisites in §2.5 are done, because:

  • The Validation Cohort (P3) is on a single spec version — the price mismatch that drove the bias regression on standing charges and cat=10 routing disappears.
  • The cert-cal layer is gone (P2) — no calibration to "break".
  • PCDB is integrated (P4) — the heat-pump and gas-boiler residuals that dominated per-cert MAE collapse before any of these findings even matter.
  • The fixtures are now BRE worked examples (P5 + §10) — they cannot be broken by spec-correct changes because they are themselves derived from the spec.

Treat each finding as a section-sweep TODO. The empirical impacts below were measured against the dirty probe (full corpus + cert-cal + no PCDB) and are not predictive of behaviour on the clean probe. Re-measure each fix against the Validation Cohort after prerequisites land.

Finding 1 — HW cylinder zero-loss rule for combi boilers

Status: spec-correct fix exists in working-tree-only form (uncommitted). Reverted at end of last session.

Spec basis:

  • SAP 10.2 Table 2 footer (page 158): "In the case of a combination boiler: a) the storage loss factor is zero if the efficiency is taken from Table 4b"
  • SAP 10.2 Table 3 footer (page 160): "Primary loss is set to zero for the following: Electric immersion heater, Combi boiler (including when it is part of a combined heat pump and boiler package and provides all the hot water), CPSU (including electric CPSU), Boiler and thermal store within a single casing, Separate boiler and thermal store connected by no more than 1.5 m of insulated pipework, Direct-acting electric boiler, Heat pump (not combined heat pump and boiler package with a non-combi boiler) from PCDB with hot water vessel integral to package"

The bug: our calculator currently adds storage loss (~135 kWh) and primary loss (~245 kWh) for ALL certs with an age band lodged, ignoring whether the dwelling has a cylinder. 67% of corpus certs explicitly lodge has_hot_water_cylinder=False (the modal combi boiler case) — we add 380 kWh of fictional HW losses for each.

The fix (sketch, ~10 lines):

  1. Add has_cylinder: bool = True keyword to predicted_hot_water_kwh in packages/domain/src/domain/ml/demand.py.
  2. When has_cylinder=False, set storage_loss = 0 and primary_loss = 0.
  3. In cert_to_inputs.py (around line 829), pass has_cylinder=epc.has_hot_water_cylinder and not is_instantaneous.
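The three steps above can be sketched as follows. The signature is a simplification for illustration: the real predicted_hot_water_kwh takes other arguments, and the 135 / 245 kWh figures in the usage note are the approximate losses quoted in this finding, not spec constants.

```python
def predicted_hot_water_kwh(
    useful_demand_kwh: float,
    *,
    storage_loss_kwh: float,
    primary_loss_kwh: float,
    has_cylinder: bool = True,
) -> float:
    """Annual hot water energy; losses are zeroed when there is no cylinder."""
    if not has_cylinder:
        storage_loss_kwh = 0.0
        primary_loss_kwh = 0.0
    return useful_demand_kwh + storage_loss_kwh + primary_loss_kwh


def cylinder_flag(has_hot_water_cylinder: bool, is_instantaneous: bool) -> bool:
    """Call-site expression from step 3, pulled out so it can be unit-tested."""
    return has_hot_water_cylinder and not is_instantaneous
```

For a combi cert, `predicted_hot_water_kwh(2000.0, storage_loss_kwh=135.0, primary_loss_kwh=245.0, has_cylinder=False)` returns the useful demand alone, removing the 380 kWh described above. Defaulting `has_cylinder` to True keeps existing callers unchanged until they opt in.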

Empirical impact (measured on 300-cert probe):

  • PE MAE: 43.32 → 36.68 (−6.64) ← biggest single fix found this session
  • PE bias: 37.69 → 30.41 (−7.28)
  • SAP MAE: 4.61 → 5.00 (+0.39, regression)
  • 3 of 7 golden fixtures break

Why it was reverted: the SAP regression + broken fixtures indicate the fictional HW losses were partially compensating for OTHER bugs (likely lighting over-prediction for LED-dominant homes). The right ordering is: fix the spec-clear cases (HW cylinder, lighting per Appendix L, etc.) together, then re-derive cert-cal.

When to pick up: when you reach §4 / Appendix J during the systematic pass. Pair with the lighting Appendix L fix to avoid breaking the golden fixtures individually.

Finding 2 — Standing charges (Table 12 note (a))

Status: spec-correct, never implemented. Empirically rejected by 4-mode probe.

Spec basis: SAP 10.2 Table 12 note (a), page 190:

"For calculations including regulated energy uses only (e.g. regulation compliance, energy ratings):

  • The standing charge for electricity standard tariff is omitted
  • The standing charge for off-peak electricity is added to space and water heating costs where either main heating or hot water uses off-peak electricity
  • The standing charge for gas fuels is added to space and water heating costs where the gas fuel is used for space heating (main or secondary) or for water heating"

The bug: our calculator never adds standing charges. Per spec, a gas-heated dwelling should have £92/yr added to the ECF numerator.
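A minimal sketch of the note (a) rule for regulated-energy calculations, assuming the caller already knows which fuels serve which end uses. The function name and its parameters are hypothetical; the £92 gas figure is the one quoted above, and the off-peak electricity standing charge is left as a required argument because this handover does not quote it.

```python
def standing_charges_gbp(
    *,
    gas_for_space_or_water: bool,
    off_peak_for_main_or_water: bool,
    off_peak_standing_gbp: float,
    gas_standing_gbp: float = 92.0,
) -> float:
    """Standing charges to add to space and water heating costs (GBP/yr).

    The electricity standard-tariff standing charge is always omitted,
    so it does not appear here at all.
    """
    total = 0.0
    if gas_for_space_or_water:
        # Gas used for space heating (main or secondary) or for water heating.
        total += gas_standing_gbp
    if off_peak_for_main_or_water:
        # Main heating or hot water runs on off-peak electricity.
        total += off_peak_standing_gbp
    return total
```

For a gas-heated dwelling on standard-tariff electricity this returns only the gas standing charge, whatever value is passed for the off-peak charge.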

Empirical impact (4-mode probe, 300 certs):

| Mode | All certs | Gas-only |
| --- | --- | --- |
| cert-cal, no standing (current) | MAE 4.69, bias +0.98 | MAE 4.01, bias +0.80 |
| cert-cal + gas standing | MAE 4.94, bias −2.62 | MAE 4.31, bias −3.53 |

Adding standing charges shifts SAP bias by ~3.5 points downward — clearly the wrong direction. The cert-cal prices (3.48p gas vs spec 3.64p) implicitly absorb the standing-charge contribution.

When to pick up: when you reach §12 / Table 12. Apply alongside spec-correct unit prices (3.64p gas, 16.49p elec) and re-derive cert-cal to match Elmhurst's residual deviation pattern.

Finding 3 — Cat=10 room heaters off-peak routing

Status: the spec-correct behaviour is known; the calculator currently bills room heaters at the off-peak rate on E7 dwellings. The fix was empirically rejected.

Spec basis: SAP 10.2 Table 12a (page 191):

"Other direct-acting electric heating (including electric secondary heating): 7-hour tariff 1.00 high rate; 10-hour tariff 0.50 high rate"

The bug: our cert-calibration (cert_calibration_e7_codes) extends the off-peak set to include codes 691-696 (room heaters). That's the S-B14 empirical extension — the previous agent found it helped some specific certs. Per Table 12a it's WRONG: room heaters on E7 should bill 100% at HIGH rate, not at low rate.

Empirical impact: switching from off-peak (5.50p cert-cal) to standard rate (13.19p) — closer to spec but still not the high rate (15.29p cert-cal) — inverted the bias from +5.88 to −6.00 without improving MAE.

The real issue: Table 12a defines FRACTIONAL blending (e.g. "90% high, 10% low" for direct-acting electric boiler on 7-hour tariff), not binary on/off-peak. Our calculator only supports binary. A proper implementation needs per-system high-rate fractions.

When to pick up: when you reach §12 / Table 12a. Implement fractional blending for all the rows of Table 12a, not just cat=10.
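Fractional blending can be sketched as a weighted average of the high and low unit prices. The function name, the dictionary keys, and the lookup shape are all hypothetical; the table holds only the three fractions quoted in this finding, and the remaining Table 12a rows would need transcribing from the spec.

```python
def blended_price_p_per_kwh(
    high_rate_fraction: float, high_rate_p: float, low_rate_p: float
) -> float:
    """Blend high and low unit prices by the fraction billed at the high rate."""
    if not 0.0 <= high_rate_fraction <= 1.0:
        raise ValueError("high_rate_fraction must lie in [0, 1]")
    return high_rate_fraction * high_rate_p + (1.0 - high_rate_fraction) * low_rate_p


# Only the fractions quoted in this handover; keys are illustrative labels.
HIGH_RATE_FRACTION = {
    ("other_direct_acting_electric", "7-hour"): 1.00,
    ("other_direct_acting_electric", "10-hour"): 0.50,
    ("direct_acting_electric_boiler", "7-hour"): 0.90,
}

# Room heater on a 7-hour tariff: billed entirely at the high rate.
# The 15.29p and 5.50p prices are the cert-cal figures quoted above.
room_heater_price = blended_price_p_per_kwh(
    HIGH_RATE_FRACTION[("other_direct_acting_electric", "7-hour")], 15.29, 5.50
)
```

The current binary on/off-peak behaviour is the special case where the fraction is exactly 0 or 1, so a per-system fraction generalises it without a second code path.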

Finding 4 — Lighting (Appendix L proper)

Status: gap. Current code uses a 9.3 kWh/m² heuristic with simple LED/CFL reductions; spec is the L1-L12 cascade with daylight correction, fixed-lighting capacity, top-up + portable shares, monthly profile.

Spec basis: SAP 10.2 Appendix L §L1 (pages 88-90), equations L1-L12.

The bug: for a 100 m² LED-dominant home (e.g. cert 7536-3827 with 51 LEDs), our heuristic returns 465 kWh/yr; spec returns ~94 kWh/yr. Over-prediction by ~5× on LED-dominant homes (which is most modern stock).

Empirical impact (estimated):

  • ~5-6 kWh/m² PEUI over-prediction for LED-dominant population
  • Corpus-weighted: ~3-4 kWh/m² PEUI bias contribution

When to pick up: when you reach Appendix L. Pair with the HW cylinder fix (Finding 1) to avoid the SAP MAE regression.

Finding 5 — Internal-gains Table 5 missing rows

Status: gap. Spec Table 5 has 7 rows for internal gains; our worksheet/internal_gains.py implements 4.

Spec basis: SAP 10.2 Table 5 (page 177).

Missing rows:

  • Water heating: 1000 × (65)ₘ / (nₘ × 24) W — the HW losses (cylinder + distribution + primary) recycled as heated-space gains via worksheet line (65). Reduces space heating demand.
  • Losses: −40 × N W — heat to incoming cold water and evaporation. Negative contribution.

Empirical impact (estimated):

  • For N=2.7: HW gains ≈ +75 W, losses ≈ −108 W, net ≈ −33 W. Currently we miss both → our gains are 33 W too high → space heating demand too low → PE under-predicted by ~3 kWh/m² (rough).
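The two missing rows and the arithmetic above can be sketched as below. The function names are hypothetical, and the line (65) value of 55.8 kWh over a 31-day month is not a measured figure: it is back-calculated purely so that the example reproduces the ≈75 W quoted above.

```python
def water_heating_gain_w(line_65_kwh: float, days_in_month: int) -> float:
    """Table 5 water heating row: 1000 x (65)m / (nm x 24), in watts."""
    return 1000.0 * line_65_kwh / (days_in_month * 24)


def losses_gain_w(occupancy_n: float) -> float:
    """Table 5 losses row: a negative gain of 40 W per occupant."""
    return -40.0 * occupancy_n


# Illustrative inputs only (see lead-in): N = 2.7, (65)m = 55.8 kWh, nm = 31.
hw_gain = water_heating_gain_w(55.8, 31)   # about +75 W
losses = losses_gain_w(2.7)                # about -108 W
net_missing_gain = hw_gain + losses        # about -33 W
```

Because the net term is negative, omitting both rows leaves the modelled internal gains too high, which is the direction described in the bullet above.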

When to pick up: when you reach §5 / Table 5. Worksheet line (65) also needs implementation — the HW losses already exist in our calc (see demand.py:_cylinder_storage_loss_kwh etc.), they just need piping into internal_gains.

Finding 6 — Storage-loss-factor table values are wrong

Status: gap. Affects only certs with has_hot_water_cylinder=True (33% of corpus).

Spec basis: SAP 10.2 Table 2 (page 158).

The bug: domain.ml.demand:_STORAGE_LOSS_FACTOR values are ~3× LOWER than spec. E.g. for 38mm foam our value is 0.0056, spec is 0.0181. Effect: we UNDER-predict cylinder storage loss by ~300 kWh for storage systems, partly cancelling the over-prediction from Finding 1.

When to pick up: when you reach §4 / Table 2. Fix WITH Finding 1 (combi zero-loss) so the cancellation doesn't dominate the direction.

Finding 7 — Heat-pump fallback efficiency 2.30

Status: gap that requires PCDB. See §8b.

Finding 8 — Other smaller gaps (carry forward)

  • Boiler interlock 5% penalty (§9.2.1) — never applied
  • Table 4c condensing boiler / HP emitter temperature adjustment — never applied
  • Control-temperature adjustment from Table 4e — always 0 in code, spec varies
  • Wall U-values for Scotland / Wales / NIR — only England fully transcribed
  • Per-junction thermal bridging (Table R2) — global y approximation only
  • Multi-main heating (main_heating_fraction ≠ 1) — first main only
  • Cooling §10 — not implemented (rare in UK)
  • FEE §11 — not implemented (new-build only)

8. Don't repeat — known dead-ends

Re-read after §3 + §7b. Three entries below were classified as "dead-ends because cert-cal absorbs" — that diagnosis is wrong. They are spec-correct fixes that were measured under a noisy probe. Now flagged as conditional dead-ends: dead only if you try them before P1–P5 land. After prerequisites: they are expected improvements, not dead-ends. See ADR-0010.

  • Switching "NI" wall thickness to None alone (S-B5 in history) — over-corrected because it routed to the (Unfilled cavity, 50mm) row instead of the dedicated Filled cavity row. The right fix landed in S-B23 with a WALL_INSULATION_FILLED_CAVITY dispatcher.
  • Aggressive efficiency rescue for missing sap_main_heating_code (S-B5) — over-corrected. The category fallback (cat=4 → 2.30) is intentionally conservative; PCDB (P4 prerequisite) supplies the real efficiency.
  • ⚠️ Using SAP 10.2 spec prices for parity validation — under the dirty probe, cert-cal prices fit better. Inverts under the clean probe (P2 + P3): SAP 10.2 spec prices are correct because the Validation Cohort is on the 14-03-2025 amendment. Listed here only as a warning if you start the sweep before prerequisites land.
  • Always applying 10% secondary heating — must be conditional on cert lodging or main system being electric storage (S-B20). See spec Appendix A.4.
  • Respecting main_heating_fraction for secondary allocation (failed S-B30) — the field is the multi-main allocation (system 1 vs system 2), not main-vs-secondary. SAP MAE 4.69 → 4.85 (worse).
  • ⚠️ Switching cat=10 room heaters off off-peak (failed S-B32) — spec-correct per Table 12a. The bias inversion under the dirty probe was driven by cert-cal compensating; on the clean probe this is just spec-correct. Land as part of the §12 spec sweep after prerequisites.
  • ⚠️ Adding gas standing charges (4-mode probe, unimplemented) — spec-correct per Table 12 note (a). Same logic: bias drift under dirty probe is version-mixture + missing-PCDB noise, not Elmhurst deviation. Land as part of §12 spec sweep.
  • ⚠️ Zeroing storage + primary loss for combi boilers (uncommitted S-B32) — spec-correct per Table 2 + Table 3 footers. SAP MAE regression was driven by the now-retired golden fixtures (§10) and cert-cal absorption. Land as part of §4 / Appendix J sweep.

9. The cert corpus and parity probe

Sample

data/ml_training/runs/2025_2026_n250000_v18a/data.parquet is the 250k-cert parquet. After P1 lands the parquet carries inspection_date; the probe then filters to the Validation Cohort (inspection_date ≥ 2025-07-01) plus sap_score ∈ [5, 99] and samples 300 at seed=7 by default. Filtering rationale:

  • Below 5 is heritage/anomaly stock (under 3% of corpus)
  • Above 99 is full-SAP new-builds the parquet excludes anyway
  • inspection_date ≥ 2025-07-01 ensures every cert was lodged on SAP 10.2 (14-03-2025 amendment) — see CONTEXT.md / "Validation Cohort" and ADR-0010 §3.
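The filter and sampling described above can be sketched in plain Python. This is an illustration of the selection logic only: the real probe reads the parquet and may use a different sampler, so the function name, the dict-based record shape, and the use of `random.Random` are assumptions.

```python
import random
from datetime import date

COHORT_START = date(2025, 7, 1)  # first inspection date in the Validation Cohort


def validation_cohort_sample(certs: list[dict], n: int = 300, seed: int = 7) -> list[dict]:
    """Filter to the Validation Cohort and draw a reproducible sample."""
    eligible = [
        cert
        for cert in certs
        if cert["inspection_date"] >= COHORT_START and 5 <= cert["sap_score"] <= 99
    ]
    rng = random.Random(seed)  # a fixed seed gives the same sample on every run
    return rng.sample(eligible, min(n, len(eligible)))
```

Clamping the sample size to the number of eligible certs avoids a `ValueError` from `random.sample` when a small test corpus holds fewer than `n` rows.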

Run the probe

python -c "
import sys
sys.path.insert(0, 'packages/domain/src')
sys.path.insert(0, '.')
sys.path.insert(0, 'services/ml_training_data/src')
from ml_training_data.sap_parity_probe import main
main(['300','7'])
"

What the probe shows

  • Aggregate SAP MAE / RMSE / bias
  • Aggregate PE MAE / RMSE / bias
  • Per-end-use PEUI breakdown (space / HW / lighting / pumps)
  • Stratification by main_heating_category, construction_age_band, dwelling_type
  • Worst-15 residuals (SAP and PE)
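For reference, the three aggregate statistics can be computed as in the sketch below. The sign convention (residual = predicted minus lodged, so a positive bias means over-prediction) is an assumption; check it against `sap_parity_probe` before comparing numbers. The function name is hypothetical.

```python
import math


def residual_summary(predicted: list[float], lodged: list[float]) -> dict[str, float]:
    """Return MAE, RMSE and bias for paired predictions and lodged values."""
    if len(predicted) != len(lodged) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    # Assumed convention: residual = predicted - lodged.
    residuals = [p - l for p, l in zip(predicted, lodged)]
    n = len(residuals)
    return {
        "mae": sum(abs(r) for r in residuals) / n,
        "rmse": math.sqrt(sum(r * r for r in residuals) / n),
        "bias": sum(residuals) / n,
    }


summary = residual_summary([70.0, 65.0, 80.0], [68.0, 69.0, 80.0])
```

MAE and bias diverge when errors cancel: in this example the residuals are +2, −4 and 0, so the MAE is 2.0 while the bias is only about −0.67.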

Known parquet limitations

  • ~0.7% of parquet certs have construction_age_band=None vs 15% in the raw bulk-zip. The parquet filters out full-SAP new-builds upstream. Don't measure full-SAP-path slices against the parquet.
  • Heat-pump certs (cat=4) are under-represented and concentrated in the worst-residual tail because PCDB efficiency is unavailable.

10. Fixtures: retire the 7 cert-based golden fixtures, replace with BRE worked examples (per ADR-0010 + P5)

The 7 cert-based fixtures at packages/domain/src/domain/sap/rdsap/tests/test_golden_fixtures.py were locked in against the current calculator state — with cert-cal, without PCDB, with HW cylinder loss always applied, with the lighting heuristic, etc. They are documented in §3 / the prior handover as containing compensating errors. Once the prerequisites land, every spec-correct fix breaks at least one of them. They will fight the spec sweep.

Replacement strategy

Primary regression suite: BRE worked-example fixtures.

Transcribe the worked examples from:

  • SAP 10.2 spec appendices (especially Appendix R — reference values and the worked example dwelling).
  • RdSAP 10 (10-06-2025) worked-example annex.

Each worked example becomes a unit test that locks per-intermediate expected values (HLP, HTC, monthly mean internal temperature (MIT), ECF, SAP score) rather than the aggregate SAP score alone. Because they are spec-derived, no spec-correct change can break them: any break is unambiguously an implementation bug.

These tests live at packages/domain/src/domain/sap/tests/test_bre_worked_examples.py (new module — separate from the cert-based fixtures module).
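One possible shape for the per-intermediate assertion is a helper that reports every mismatching key at once, so a failing test shows which worksheet line diverged first. The helper name and tolerance are hypothetical, and the numbers in the example are placeholders, not values from any BRE worked example.

```python
import math


def intermediate_mismatches(
    actual: dict[str, float], expected: dict[str, float], rel_tol: float = 1e-3
) -> dict[str, tuple]:
    """Map each failing key to (actual, expected); empty when everything matches."""
    mismatches = {}
    for key, want in expected.items():
        got = actual.get(key)
        if got is None or not math.isclose(got, want, rel_tol=rel_tol):
            mismatches[key] = (got, want)
    return mismatches


# Placeholder numbers for illustration only.
expected = {"heat_loss_parameter_w_per_m2k": 2.50, "ecf": 1.80}
actual = {"heat_loss_parameter_w_per_m2k": 2.50, "ecf": 1.95}
failures = intermediate_mismatches(actual, expected)
```

A test would then assert that `failures` is empty; here it contains only `"ecf"`, which localises the error to the cost stages rather than the fabric stages.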

Cert-based fixtures retired.

The current test_golden_fixtures.py is either deleted or repurposed as a very loose smoke-test integration suite (e.g. |SAP residual| ≤ 5) that catches catastrophic regressions only. The 7 cert JSONs under fixtures/golden/<cert>.json can be kept on disk as reference data, but they no longer drive go/no-go decisions in the sweep.

Optional future addition.

If/when a current Elmhurst (or Stroma / Quidos / NHER) license is available, run a handful of representative corpus certs through it and lock those outputs as a second-tier regression suite — Elmhurst-parity fixtures alongside spec-parity fixtures. Not a prerequisite.


11. Trace mode (prerequisite P5 — implementation sketch)

This section was originally labelled "recommended"; it is now prerequisite P5 per ADR-0010. The sweep does not start until intermediate is populated everywhere. ADR-0009 proposed:

from dataclasses import dataclass

@dataclass(frozen=True)
class SapResult:
    sap_score: float
    ...  # remaining result fields elided
    intermediate: dict[str, float]  # trace values keyed by name

The intermediate field was never populated. Suggested implementation for the systematic pass:

intermediate = {
    # §1 dimensions
    "tfa_m2": tfa,
    "volume_m3": volume,
    "storey_count": storeys,
    # §3 heat transmission
    "walls_w_per_k": ht.walls_w_per_k,
    "roof_w_per_k": ht.roof_w_per_k,
    "floor_w_per_k": ht.floor_w_per_k,
    "party_walls_w_per_k": ht.party_walls_w_per_k,
    "windows_w_per_k": ht.windows_w_per_k,
    "doors_w_per_k": ht.doors_w_per_k,
    "thermal_bridging_w_per_k": ht.thermal_bridging_w_per_k,
    "infiltration_ach": infiltration,
    "infiltration_w_per_k": infiltration * volume * 0.33,
    "heat_transfer_coefficient_w_per_k": hlc,
    "heat_loss_parameter_w_per_m2k": hlp,
    "time_constant_h": tau_h,
    # §5 internal gains (annual averages)
    "internal_gains_annual_avg_w": ...,
    # §7 mean internal temperature (annual avg)
    "mean_internal_temp_annual_avg_c": ...,
    # §9 space heating
    "useful_space_heating_kwh_per_yr": space_heating_kwh,
    # §12 fuel costs (per end-use)
    "main_heating_cost_gbp": ...,
    "hot_water_cost_gbp": ...,
    "lighting_cost_gbp": ...,
    "pumps_fans_cost_gbp": ...,
    # §13 rating
    "ecf": ecf,
    "deflator": 0.36,
    # §14 primary energy and CO2 per end-use
    "space_heating_pe_kwh_per_m2": ...,
    "hot_water_pe_kwh_per_m2": ...,
    ...
}

Once populated, the differential debugging the reviewer recommended becomes possible: change one input field, compare deltas against an Elmhurst export.
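A minimal sketch of that differential step: run the calculator twice, once on the baseline inputs and once with a single field changed, then diff the two `intermediate` dicts. The function name is hypothetical and the values are placeholders.

```python
def intermediate_deltas(
    baseline: dict[str, float], variant: dict[str, float], abs_tol: float = 1e-9
) -> dict[str, float]:
    """Return variant - baseline for every shared key that actually moved."""
    deltas = {}
    for key in sorted(baseline.keys() & variant.keys()):
        delta = variant[key] - baseline[key]
        if abs(delta) > abs_tol:
            deltas[key] = delta
    return deltas


# Placeholder traces: only the wall term differs between the two runs.
baseline = {"walls_w_per_k": 50.0, "roof_w_per_k": 12.0, "ecf": 2.0}
variant = {"walls_w_per_k": 45.0, "roof_w_per_k": 12.0, "ecf": 2.0}
moved = intermediate_deltas(baseline, variant)
```

In this placeholder case the wall term moved but `ecf` did not, which is exactly the kind of dropped propagation the trace is meant to expose.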


12. Specific section-1 starting tasks (suggested first session)

A concrete pickup point:

Session 1 — §1 (Introduction), §2 (Property Descriptors), §3 (Dimensions)

  • §1 is prose; nothing to verify.
  • §2 maps to EpcPropertyData. Verify that every field RdSAP §2 enumerates is present and correctly typed on the domain object. Specifically check: dwelling_type, built_form, property_type, construction_age_band, country_code. Note that construction_age_band is per-building-part, not dwelling-level, and the primary age band drives most defaults.
  • §3 maps to worksheet/dimensions.py. Verify:
    • Total floor area sum across building parts equals TFA
    • Volume calculation per storey × area × height
    • Storey count handling for extensions and room-in-roof
    • Multi-storey heat-loss-perimeter rules
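The first two §3 checks amount to simple sums over storeys, sketched below. The `Storey` type and both helpers are hypothetical and do not reflect the actual shapes in `worksheet/dimensions.py`; the dimensions are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Storey:
    floor_area_m2: float
    height_m: float


def total_floor_area_m2(building_parts: list[list[Storey]]) -> float:
    """TFA is the sum of storey floor areas across every building part."""
    return sum(s.floor_area_m2 for part in building_parts for s in part)


def volume_m3(building_parts: list[list[Storey]]) -> float:
    """Volume is the sum of floor area x storey height over all storeys."""
    return sum(s.floor_area_m2 * s.height_m for part in building_parts for s in part)


# Illustrative dwelling: a two-storey main part plus a single-storey extension.
main_part = [Storey(40.0, 2.4), Storey(40.0, 2.6)]
extension = [Storey(10.0, 2.4)]
tfa = total_floor_area_m2([main_part, extension])
volume = volume_m3([main_part, extension])
```

A verification test would compare these sums against the TFA and volume the worksheet module reports for the same building parts.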

This single session should produce zero behaviour changes if §1-3 are correctly implemented, but expect to find at least one issue in §3 geometry (per the reviewer's "biggest SAP error sources" list).

Important: Session 1 only starts after all prerequisites in §2.5 have landed and the Validation Cohort probe baseline has been captured. Until then, running per-section verification produces noisy signal.

Run the BRE worked-example fixtures (P5) + Validation Cohort probe (P3) at the end of each session; expect no movement until you start hitting actual gaps.


13. Workflow recap

Phase 0 — Prerequisites (§2.5). Land P1–P6 first, in dependency order:

| Slice | Depends on |
| --- | --- |
| P1 Re-extract parquet with inspection_date | none |
| P2 Delete cert-cal; correct table_12.py CO2 factors | none |
| P3 Filter parity probe to Validation Cohort | P1 |
| P4 Implement PcdbLookup | none (P2 helpful) |
| P5 Populate SapResult.intermediate + transcribe BRE worked examples | none |
| P6 Strict-type EpcPropertyData via codes.csv-derived enums | none |

P1, P2, P4, P5, P6 can run in parallel. P3 needs P1. Capture a Validation Cohort probe baseline once all six land — that is the new MAE starting line. Repo-wide tests stay green throughout P6 (Site Notes consumers, ML pipeline, recommendations, etc. all need the mapper updates that accompany each typing change).
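The dependency order can be checked mechanically with the standard library's `graphlib`. The graph below encodes only the hard dependency stated above (P3 needs P1); the soft "P2 helpful" note on P4 is deliberately not modelled, and the helper name is hypothetical.

```python
from graphlib import TopologicalSorter

# Each slice maps to the set of slices it must wait for.
DEPENDS_ON = {
    "P1": set(),
    "P2": set(),
    "P3": {"P1"},
    "P4": set(),
    "P5": set(),
    "P6": set(),
}


def parallel_batches(depends_on: dict[str, set[str]]) -> list[list[str]]:
    """Group slices into batches; everything within a batch can run in parallel."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


batches = parallel_batches(DEPENDS_ON)
```

With this graph the first batch holds P1, P2, P4, P5 and P6, and the second holds P3 alone.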

Phase 1 — Section sweep. For each RdSAP 10 section, in document order:

  1. Read the spec section text + cited tables.
  2. Identify code location(s).
  3. For each rule / table / footnote:
    • Does our code implement it?
    • Does the implementation match?
    • Edge cases / fallback paths handled?
  4. For each gap: AAA unit test (preferring a BRE worked-example assertion on intermediate values when possible) → minimal implementation → commit.
  5. Apply the worksheet-faithful structure principle (§5.5) as part of this slice: name functions after worksheet lines, split compound calculations, replace any remaining defensive type-handling with typed-enum dispatch.
  6. After each commit: run test_bre_worked_examples.py + Validation Cohort probe. Note both deltas in the commit message.
  7. If a BRE worked-example breaks: the new code is wrong (revert). The worked examples are spec-derived and cannot be broken by spec-correct changes.

Stick to this. The prior session's mistake was jumping between sections based on residual-size on a dirty probe. Clean probe plus document-order discipline plus worksheet-faithful structure is what makes the sweep converge.


14. Useful references

  • ADR-0010 docs/adr/0010-sap10-calculator-spec-target-and-validation.md — the binding decisions reflected in this rewrite: SAP 10.2 target, cert-cal deletion, Validation Cohort, PCDB-as-prerequisite, fixture retirement. Read first.
  • ADR-0009 docs/adr/0009-deterministic-sap-calculator.md — original calculator decision rationale + Session A/B/C plan. Read for context; spec-version target / PCDB sequencing / cert-cal rationale are superseded by ADR-0010.
  • Spec coverage map docs/sap-spec/SPEC_COVERAGE.md — pre-existing coverage tracker. Update as you go.
  • Parity findings docs/sap-spec/PARITY_FINDINGS.md — empirical findings from prior sessions.
  • Earlier handover docs/sap-spec/HANDOVER_FRESH_REVIEW.md — orientation from the previous fresh-context pass.
  • Reviewer feedback (informal) — ChatGPT critique of the slice-by-slice approach. Key recommendations: two-layer architecture (RdSAP expansion → SAP worksheet), trace mode, golden-master methodology, differential debugging, reference traces from Elmhurst/Stroma/Quidos.
  • Commit log: git log --oneline shows the slice history; each S-Bxx commit message documents the spec ref + measured impact.

15. Final note

The prior session's framing — "the cert-calibration layer absorbs Elmhurst's spec deviations; we'll re-derive it at the end" — was load-bearing on a false diagnosis. The cert-cal layer is pre-March-2025 SAP prices fit against a mixture distribution of two spec-version regimes. Once you separate the regimes (Validation Cohort) and use spec prices everywhere, the "tension" disappears.

After P1–P5 land, the section sweep is straightforward: every spec-correct fix is unambiguously the right answer, BRE worked-example fixtures lock the result, and Validation Cohort probe MAE moves monotonically downward. The fixes the prior session marked as "spec-correct but probe-regressed" become trivially landable.

Welcome to the project. Read ADR-0010, land the prerequisites, then walk the spec in document order. The deterministic answer is in there.