Adds ADR-0010 superseding ADR-0009's spec-version target, PCDB
sequencing, and cert-calibration layer. Captures the conclusions
of a grill-with-docs session:
1. Active spec target is SAP 10.2 (14-03-2025), not SAP 10.3 — no
SAP-10.3-lodged certs exist in the corpus to validate against.
2. table_12_cert_calibration is deleted (not "re-derived at the
end"). It was pre-March-2025 spec prices fit against a mixture
distribution of two spec-version regimes, with downstream-
component bugs absorbed into the fit — not Elmhurst deviation.
3. Validation Cohort: filter the corpus to inspection_date ≥
2025-07-01 so every cert in the probe was lodged on SAP 10.2
(14-03-2025) prices. One spec, one signal.
4. PCDB integration is promoted from "Session C deferred" to
prerequisite P4 — dominates residual variance on heat pumps and
the 78% of gas-boiler certs lodging main_heating_data_source=1.
5. Trace mode (SapResult.intermediate) and BRE worked-example
fixtures replace the 7 cert-based golden fixtures, which
contained compensating errors.
6. Strict-type EpcPropertyData via codes.csv-derived canonical
enums (P6) — the in-source motivation lives at
dimensions.py:74-82 (Khalim's comment, included in this commit).
7. Worksheet-faithful structure is a sweep-time principle: each
worksheet module mirrors SAP 10.2 worksheet line numbering.
CONTEXT.md additions:
- Refined "Calculated SAP10 Performance" and "SAP10 Calculation"
to reference SAP 10.2 + ADR-0010.
- New term "SAP Spec Version" — domain-meaningful because the
same EpcPropertyData yields different sap_score under different
spec revisions.
- New term "Validation Cohort" — the version-locked sub-corpus.
HANDOVER_SYSTEMATIC_REVIEW.md is rewritten section-by-section to
reflect ADR-0010: §1 framing, §2 status pointer, new §2.5 with the
six prerequisites P1–P6 in dependency order, §3 diagnosis (cert-cal
was stale prices, not Elmhurst deviation), §4 scope (PCDB IN,
SAP 10.3 stays OUT), §5 approach (worksheet-faithful principle as
§5.5), §7 tension dissolved, §7b findings re-framed, §8 dead-ends
re-classified as conditional, §9 cohort filter, §10 fixture
strategy, §11 trace mode as prerequisite, §12 prereqs-first,
§13 Phase 0/Phase 1 workflow, §14 ADR-0010 reference, §15 final
note.
P2.1 (commit ac1aa56a) already lands the first ADR-0010 slice
(probe swap to spec prices).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Handover — Systematic Section-by-Section RdSAP 10 / SAP 10.2 Review
Audience: A fresh agent picking up the deterministic SAP calculator at
packages/domain/src/domain/sap/. Read this first, then the spec PDFs,
then the code.
Goal: Match the cert software (Elmhurst / Stroma / etc.) output exactly for RdSAP 10 / SAP 10.2 input certs. This is a deterministic, mechanical calculation — not a model — so MAE should approach zero on certs whose inputs are fully populated.
1. Critical framing — this is NOT a judgement call
The SAP/RdSAP energy assessment splits cleanly into two roles:
- The assessor — a person who surveys the dwelling and lodges measured/observed fields onto the cert (areas, perimeters, construction codes, insulation thicknesses, fuel types, etc.). The assessor makes NO calculation decisions.
- The cert software (Elmhurst, Stroma, Quidos, NHER, ECMK) — a deterministic implementation of the RdSAP 10 + SAP 10.2 specs. It takes the lodged fields and produces SAP score, CO2 emissions, primary energy (PEUI), CO2 per m², EI rating, etc.
Our calculator is replicating role #2. Assessor software implements the SAP 10.2 spec faithfully; the question of "where does Elmhurst diverge from spec?" is no longer the operative one (per ADR-0010 + §3 below). Our job is to enumerate every spec table / formula / footnote and verify each against the published SAP 10.2 (14-03-2025) and RdSAP 10 (10-06-2025) PDFs.
There is no "assessor judgement" knob to tune. Each field on the cert has a deterministic interpretation per the spec. Each spec table / formula has a deterministic implementation. Our job is to enumerate all of them and verify each.
2. Current state (2026-05-19)
- Branch: ara-backend-design-prd
- Last clean commit: f4a8d2a0 ("tests: golden-fixture regression set — 7 currently-correct corpus certs")
- 301 tests passing
- Parity probe (300 random certs from data/ml_training/runs/2025_2026_n250000_v18a/data.parquet, seed=7, sap_score ∈ [5, 99]):

  | Metric | Value |
  |---|---|
  | SAP MAE | 4.61 |
  | SAP bias | +0.87 |
  | PE MAE | 43.32 kWh/m² |
  | PE bias | +37.69 kWh/m² |

- 7 "golden" regression certs locked in packages/domain/src/domain/sap/rdsap/tests/test_golden_fixtures.py. Tolerance: |SAP residual| ≤ 1, |PE residual| ≤ 10 kWh/m². Known caveat: some of these are compensating-error matches (e.g. cert 7536-3827's PE matches but cost is £143 under cert's implied cost due to multi-factor offsetting bugs). These fixtures are retired per ADR-0010 and §10 below: they lock buggy compensating outputs in place and will fight the spec sweep.
Read this before anything else. ADR-0010 supersedes the spec-version target, the PCDB sequencing, and the cert-calibration layer of ADR-0009. This handover document was originally written under the rejected framing; §3, §4, §7, §7b, §10 below have been rewritten in lockstep. §2.5 lists the six prerequisites (P1–P6) that land before the section-by-section sweep starts.
2.5. Prerequisites before the sweep starts
Six blockers (P1–P6), in dependency order. The section sweep does not start until all six are merged. Together they convert the parity probe from a noisy mixture-distribution signal into a clean per-section verification tool.
P1 — Re-extract the training parquet with inspection_date
The 250k-cert parquet has 202 columns; none of them are dates.
Without inspection_date on each cert we cannot construct the
Validation Cohort (P3). The ETL currently drops the dates; add them
back as a non-breaking MINOR Feature Schema Version bump (per
ADR-0008). EpcPropertyData.inspection_date and .registration_date
both exist on the domain object and are populated upstream — the
parquet writer just needs to include them.
P2 — Delete domain.sap.tables.table_12_cert_calibration; correct domain.sap.tables.table_12
Per ADR-0010 §2 and §1:
- Remove table_12_cert_calibration.py and every call site (cert_calibration_prices(), cert_calibration_e7_codes, the PriceTable constructor argument that defaults to it).
- Re-label table_12.py as SAP 10.2 Table 12 (14-03-2025 amendment).
- Correct CO2 factors: mains gas 0.214 → 0.210, standard electricity 0.086 → 0.136 (the file currently mixes SAP 10.2 prices with SAP 10.3 CO2 factors).
- Delete the misleading "+25 % shift from SAP 10.2" comment — 13.19 p is SAP 10.1 (or SAP 10.2 amendment 0), not SAP 10.2 (14-03-2025).
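As a rough illustration of the end state, the sketch below holds only the two fuel rows whose figures are quoted in this handover. The FuelRow type, its field names, the dict keys and the assumed units (p/kWh and kg CO2 per kWh) are hypothetical and are not the real table_12.py layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FuelRow:
    """One fuel row. The shape is an assumption, not the real table_12.py layout."""
    unit_price_p_per_kwh: float
    co2_kg_per_kwh: float


# Figures quoted in this handover for SAP 10.2 (14-03-2025).
TABLE_12 = {
    "mains_gas": FuelRow(unit_price_p_per_kwh=3.64, co2_kg_per_kwh=0.210),
    "standard_electricity": FuelRow(unit_price_p_per_kwh=16.49, co2_kg_per_kwh=0.136),
}


def annual_cost_gbp(fuel: str, kwh: float) -> float:
    """Fuel cost in pounds for the given annual kWh, excluding standing charges."""
    return TABLE_12[fuel].unit_price_p_per_kwh * kwh / 100.0
```

Keeping one frozen row per fuel means a price and its CO2 factor cannot drift onto different spec revisions independently, which is the mix-up P2 corrects.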
P3 — Filter the parity probe to the Validation Cohort
Validation Cohort is defined in CONTEXT.md and ADR-0010 §3:
inspection_date ≥ 2025-07-01. Modify
services/ml_training_data/src/ml_training_data/sap_parity_probe.py
to apply the filter before sampling. The probe sample size and seed
remain configurable; sap_score ∈ [5, 99] remains the typicality
filter on top of the cohort filter.
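A minimal sketch of the filter order described above: cohort filter first, typicality filter second, seeded sampling last. CertRow and sample_validation_cohort are hypothetical names; the real probe reads the parquet rather than a list of objects.

```python
import random
from dataclasses import dataclass
from datetime import date

COHORT_START = date(2025, 7, 1)  # Validation Cohort boundary per ADR-0010 §3


@dataclass(frozen=True)
class CertRow:
    """Hypothetical minimal view of one corpus row."""
    cert_id: str
    inspection_date: date
    sap_score: int


def sample_validation_cohort(rows, n, seed):
    """Keep cohort certs with sap_score in [5, 99], then draw a seeded sample."""
    eligible = [
        r for r in rows
        if r.inspection_date >= COHORT_START and 5 <= r.sap_score <= 99
    ]
    rng = random.Random(seed)
    return rng.sample(eligible, min(n, len(eligible)))
```

Filtering before sampling matters: sampling 300 certs first and filtering afterwards would shrink the probe by whatever share of the corpus predates the cohort boundary.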
P4 — Implement PcdbLookup (replace NoOpPcdbLookup)
Per ADR-0010 §4. Download boiler + heat-pump CSVs from
https://www.ncm-pcdb.org.uk. Build a lookup keyed on
main_heating_index_number. Surface seasonal efficiency, secondary
efficiency, output kW, and (for HPs) flow-temperature curve. ~half-day
of work per the original handover estimate. The
Sap10Calculator.__init__(pcdb: Optional[PcdbLookup]) seam from
ADR-0009 grill outcome #1 is the integration point; no calculator-side
changes needed beyond reading index_number and routing PCDB-returns
to space-heating / hot-water efficiency lookups instead of Table 4a.
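One possible shape for the seam, sketched under assumptions: the PcdbRecord fields, the CSV column names (index_number, seasonal_efficiency, output_kw) and the space_heating_efficiency helper are all hypothetical, and the real PCDB files will need their own column mapping.

```python
import csv
import io
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass(frozen=True)
class PcdbRecord:
    """Hypothetical subset of the fields P4 surfaces."""
    seasonal_efficiency_pct: float
    output_kw: Optional[float]


class PcdbLookup(Protocol):
    def get(self, index_number: int) -> Optional[PcdbRecord]: ...


class NoOpPcdbLookup:
    """Always misses, so callers fall back to the Table 4a value."""

    def get(self, index_number: int) -> Optional[PcdbRecord]:
        return None


class CsvPcdbLookup:
    """In-memory lookup keyed on main_heating_index_number."""

    def __init__(self, csv_text: str) -> None:
        self._rows = {}
        for row in csv.DictReader(io.StringIO(csv_text)):
            self._rows[int(row["index_number"])] = PcdbRecord(
                seasonal_efficiency_pct=float(row["seasonal_efficiency"]),
                output_kw=float(row["output_kw"]) if row["output_kw"] else None,
            )

    def get(self, index_number: int) -> Optional[PcdbRecord]:
        return self._rows.get(index_number)


def space_heating_efficiency(pcdb, index_number, table_4a_default):
    """Prefer the PCDB efficiency; use the Table 4a default on a miss."""
    record = pcdb.get(index_number) if index_number is not None else None
    return record.seasonal_efficiency_pct if record else table_4a_default
```

Because NoOpPcdbLookup satisfies the same protocol, existing tests that construct the calculator without PCDB data can keep running unchanged.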
P5 — Populate SapResult.intermediate + transcribe BRE worked examples
Per ADR-0010 "Verification infrastructure":
- Populate every named SAP 10.2 worksheet variable on SapResult.intermediate as sketched in §11. This is mechanical: thread the values from each worksheet module into the dict.
- Transcribe the BRE worked examples from the SAP 10.2 appendices and RdSAP 10 worked-example annex into unit tests (tests/test_bre_worked_examples.py) that lock per-intermediate values, not aggregate SAP. These replace the retired cert fixtures.
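A worked-example test might compare intermediates one name at a time and report every mismatch at once, so a single run shows which worksheet values drifted. The helper below is a hypothetical sketch; assert_intermediates_match is not an existing function, and the keys and tolerance are placeholders.

```python
import math


def assert_intermediates_match(actual, expected, rel_tol=1e-3):
    """Compare named worksheet values one by one and report every mismatch."""
    problems = []
    for name, want in expected.items():
        if name not in actual:
            problems.append(f"{name}: missing from intermediate")
        elif not math.isclose(actual[name], want, rel_tol=rel_tol):
            problems.append(f"{name}: got {actual[name]}, expected {want}")
    if problems:
        raise AssertionError("; ".join(problems))
```

In a real test, `expected` would be transcribed from a BRE worked example and `actual` would be SapResult.intermediate for the same inputs.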
P6 — Strict-type EpcPropertyData via canonical domain enums
The current EpcPropertyData and its nested types carry many bare
str fields and Union[int, str] fields (the latter because the
gov API gives ints and Site Notes give strings). The defensive
type-handling cascades into the calculator (cert_to_inputs.py,
dimensions.py, etc.) — dimensions.py:74-82 is Khalim's documented
example: SapBuildingPart.identifier carries main-vs-extension
information but is bare str, so the dimensions code defensively
iterates instead of dispatching on a typed kind.
The fix:
- One canonical enum per field, union of all keys appearing across all schema versions in datatypes/epc/domain/epc_codes.csv. Hand-author the 18 enum classes (built_form, construction_age_band, energy_tariff, glazed_area, glazed_type, heat_loss_corridor, main_fuel, mechanical_ventilation, property_type, tenure, transaction_type, ventilation_type, water_heating_fuel, cylinder_insulation_thickness, energy_efficiency_rating, improvement_description, improvement_summary, code) plus BuildingPartKind (Main Dwelling / Extension N). codes.csv is the reference; a dedup script can optionally verify coverage but is not a build dependency.
- The API mapper parses raw ints into the canonical enum.
- The Site Notes mapper parses raw strings into the canonical enum.
- The domain object (EpcPropertyData and nested) holds only the canonical enums: no Union[int, str], no bare str for coded fields.
- Every consumer (calculator, ML pipeline, recommendations, ETL, scenario builder) reads from the typed fields.
Constraint: repo-wide tests must keep passing. The calculator
is one consumer; the ML pipeline, recommendations, and the Site
Notes ingestion path also consume EpcPropertyData. Each mapper-
layer change is paired with adapter updates that preserve the
behaviour the existing tests cover.
Pyright strict mode must remain clean (CLAUDE.md).
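The two-mapper pattern could look roughly like this. Everything concrete here is a placeholder: the GlazedType members, the integer codes and the string labels are invented for illustration, and the real key set has to come from epc_codes.csv.

```python
from enum import Enum


class GlazedType(Enum):
    """Illustrative members only; the real key set comes from epc_codes.csv."""
    SINGLE = "single"
    DOUBLE = "double"
    TRIPLE = "triple"


# Placeholder encodings, not taken from the spec or from codes.csv.
_API_CODES = {1: GlazedType.SINGLE, 2: GlazedType.DOUBLE, 3: GlazedType.TRIPLE}
_SITE_NOTES_LABELS = {member.value: member for member in GlazedType}


def glazed_type_from_api(raw: int) -> GlazedType:
    """API mapper: raw int code to canonical enum."""
    try:
        return _API_CODES[raw]
    except KeyError:
        raise ValueError(f"unknown glazed_type code: {raw!r}") from None


def glazed_type_from_site_notes(raw: str) -> GlazedType:
    """Site Notes mapper: raw string label to canonical enum."""
    try:
        return _SITE_NOTES_LABELS[raw.strip().lower()]
    except KeyError:
        raise ValueError(f"unknown glazed_type label: {raw!r}") from None
```

Both mappers fail loudly on an unknown key, so an unmapped code surfaces at ingestion instead of as a silent default inside the calculator.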
Expected outcome of P1–P6
After all six land, run the probe against the Validation Cohort. The expected baseline MAE on the clean probe is much smaller than the current 4.61 — likely 1.5–2.5 SAP-points based on what we know about the residual breakdown (heat pumps closed by P4, gas boilers tightened by P4, price-version noise removed by P2+P3). The remaining residual is the genuine spec sweep target — and per-section fixes will move the probe in measurable, distinguishable amounts because there's no compensating layer to mask them, and there's no defensive type branching obscuring which input value drove which intermediate.
3. Why the prior diagnosis was wrong and how we fixed it
The prior session shipped ten slices (S-B23 → S-B31) by debugging the biggest residuals one at a time:
- PE MAE dropped substantially: 57.28 → 43.32 (−14) — real progress on the demand-side calculation.
- SAP MAE barely moved: 5.34 → 4.61 (−0.73) — diagnosed at the time as "cert-calibration absorbs multiple spec deviations".
Three slice attempts looked like they "proved" the cert-cal-absorbs-deviations diagnosis:
- Standing charges: spec Table 12 note (a) requires £92/yr gas standing charge on space + water heating. Adding it pushed SAP bias +0.98 → −2.62. Reverted.
- Cat=10 room heaters off-peak routing: Table 12a says "other direct-acting electric heating" bills 100 % high rate on 7-hour tariff. Switching cat=10 from off-peak to standard rate inverted the bias +5.88 → −6.00. Reverted.
- HW cylinder zero-loss for combi (uncommitted): Table 2 + Table 3 footers require zero storage + primary loss when efficiency comes from Table 4b. Zeroing them dropped PE MAE −6.64 but raised SAP MAE +0.39 and broke 3 of 7 golden fixtures. Reverted.
The prior agent concluded: cert-calibration absorbs Elmhurst's deviations from spec — we can't fix one without re-deriving the calibration, so do a full spec sweep first and re-derive cert-cal at the end. This diagnosis is wrong and the proposed remedy amplifies the problem.
What was actually going on
The 250k-cert corpus spans multiple SAP spec-version regimes:
- Pre-2025-03-14: certs lodged under SAP 10.1 / SAP 10.2 amendment 0 prices — mains gas ~3.48 p, standard electricity 13.19 p.
- Post-2025-03-14: certs lodged under SAP 10.2 (14-03-2025) prices — mains gas 3.64 p, standard electricity 16.49 p.
The table_12_cert_calibration prices (3.48 p / 13.19 p) are the
older spec's prices, not Elmhurst deviations from the spec. They
are an empirical "best fit" across a mixture distribution of two
price regimes, with downstream-component bugs (PCDB absence, HW
cylinder loss applied to combi, etc.) absorbed into the fit. The
table looks like compensation for assessor-software quirks because we
were never told which spec each cert was on.
Each "spec-correct fix that worsened MAE" in the failed slices above was actually correct. The MAE regressed because:
- The cert-cal prices (pre-March-2025 spec) cancelled with one set of downstream errors to produce a quasi-stable cost.
- The spec-correct fix landed → that cancellation broke → the probe MAE went up.
- But the spec-correct fix was right — what regressed was a compensating-error equilibrium, not the calculator's truth.
The prior session's "re-derive cert-cal at the end" plan would re-establish a new compensating-error equilibrium across the new bug set. It does not converge on spec-correctness.
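A toy calculation shows the mechanism. The two gas prices are the ones quoted above; the 10,000 kWh demand and the size of the demand bug are invented so that the two errors cancel exactly, which real certs will only do approximately.

```python
def annual_cost_gbp(kwh: float, price_p_per_kwh: float) -> float:
    """Annual fuel cost in pounds."""
    return kwh * price_p_per_kwh / 100.0


TRUE_KWH = 10_000.0   # illustrative demand, not measured data
SPEC_PRICE = 3.64     # SAP 10.2 (14-03-2025) mains gas price, p/kWh
STALE_PRICE = 3.48    # older price carried by the cert-cal table, p/kWh

# A demand-side bug that happens to inflate kWh by exactly the price ratio.
BUGGY_KWH = TRUE_KWH * SPEC_PRICE / STALE_PRICE

target = annual_cost_gbp(TRUE_KWH, SPEC_PRICE)        # what the cert lodged
before_fix = annual_cost_gbp(BUGGY_KWH, STALE_PRICE)  # two errors cancel
after_fix = annual_cost_gbp(TRUE_KWH, STALE_PRICE)    # demand fixed, price still stale

print(f"target={target:.2f} before={before_fix:.2f} after={after_fix:.2f}")
```

Fixing the demand bug moves the cost error from roughly zero to about £16, so the probe reports a regression even though the calculator is now closer to the spec.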
The fix (per ADR-0010)
- Stop fitting against a mixture distribution. Filter the validation corpus to a single spec-version window (Validation Cohort, inspection_date ≥ 2025-07-01). Every cert in the cohort was lodged on SAP 10.2 (14-03-2025) prices.
- Delete the cert-calibration layer. Use spec prices everywhere (domain.sap.tables.table_12). The only price-routing decision left is Table 12a fractional high-rate blending, which is a real spec feature, not a calibration.
- Build PCDB, because it dominates residual variance and the reason it was deferred (cert-cal-absorbs-PCDB) no longer holds.
- Build trace mode and BRE worked-example fixtures, so per-section verification works against single-cert intermediates instead of aggregate corpus MAE.
§2.5 turns this into the six prerequisites (P1–P6). Once they land, the section-by-section spec sweep produces clean, monotonic improvements.
4. Scope decisions (per ADR-0010)
IN scope
- SAP 10.2 (14-03-2025 amendment) is the active spec target. docs/sap-spec/sap-10-2-full-specification-2025-03-14.pdf, 199 pages.
- RdSAP 10 (10-06-2025): the cert→input mapping layer that cross-references SAP 10.2. docs/sap-spec/rdsap-10-specification-2025-06-10.pdf, 114 pages.
- PCDB integration. Moved from "Session C deferred" to P4 prerequisite (§2.5). Heat pumps and the 78 % of gas-boiler certs lodging main_heating_data_source=1 need PCDB-sourced efficiency for the calculator to be spec-correct. Data source: https://www.ncm-pcdb.org.uk; lookup keyed on main_heating_index_number; fields: seasonal efficiency, secondary efficiency, output kW, flow-temperature curve (HPs).
- All RdSAP 10 sections in document order. §1 → §19, plus Tables 27 / 28 / 29 / 30 / 31. The verification approach in §5 is unchanged; only the precondition changes: the sweep runs against a clean probe (Validation Cohort + spec prices + PCDB + trace mode).
OUT of scope
- Full SAP assessments. Full-SAP certs lodge measured/calculated U-values in walls[i].description (e.g. "Average thermal transmittance 0.18 W/m²K"). These are a separate calculation path (BS EN ISO 6946) and a different corpus. Park until the RdSAP 10 base case parity is reached. S-B24 / S-B29 attempted partial handling; those slices can stay or be reverted at your discretion when you reach §§4-7 of RdSAP and §3 of SAP 10.2.
- SAP 10.3 (13-01-2026). No SAP-10.3-lodged certs in the corpus, so it cannot be validated. Calculator targets SAP 10.2 until the corpus migrates (expected late 2026 / 2027 once BRE updates RdSAP to reference SAP 10.3). Note: table_12.py currently mixes SAP 10.2 prices with SAP 10.3 CO2 factors; this is corrected as part of P2.
- Historical-spec cert reproduction. Calculating what cert SAP would have been under SAP 10.1 / pre-March-2025 SAP 10.2 prices is not the calculator's job. Lodged Performance carries the historical value; Calculated SAP10 Performance is current-spec only. The Validation Cohort filter operationalises this: older certs are out of the validation loop, not because they're "wrong" but because they're a different spec's output.
- Re-deriving cert-cal at the end. The prior session's plan. The cert-calibration layer is deleted in P2, not re-fit.
5. The approach — section-by-section spec verification
Work through the RdSAP 10 spec in document order, starting at §1. For each section:
5.1. Read the spec section
Read the section text fully. Note every rule, table reference, and defaulting cascade.
5.2. Find the corresponding code
Map the section to the source file(s) implementing it. The current mapping (some sections are split across modules):
| RdSAP 10 section | Code location |
|---|---|
| §1 Introduction / general | n/a |
| §2 Property descriptors | datatypes/epc/domain/epc_property_data.py |
| §3 Dimensions | packages/domain/src/domain/sap/worksheet/dimensions.py |
| §4 Ventilation | packages/domain/src/domain/sap/worksheet/ventilation.py |
| §5 Construction / U-values | packages/domain/src/domain/ml/rdsap_uvalues.py + worksheet/heat_transmission.py |
| §6 Windows / doors / overshading | worksheet/solar_gains.py + rdsap/cert_to_inputs.py |
| §7 Heating systems (refers to SAP 10.2 Appendix A) | domain.ml.sap_efficiencies + rdsap/cert_to_inputs.py |
| §8 Heating controls (Table 4e) | rdsap/cert_to_inputs.py |
| §9 Heat emitters / flow temperatures | not implemented |
| §10 Space and water heating (Appendix A) | rdsap/cert_to_inputs.py |
| §11 Additional items (PV, batteries, wind, hydro, shutters) | partial in cert_to_inputs.py (PV only) |
| §12 Electricity tariff | rdsap/cert_to_inputs.py (_is_off_peak_meter, fuel routing) |
| §13 Addendum to EPCs | n/a |
| §14 Special cases (e.g. flats above commercial) | not implemented |
| §15 Improvements (recommendations) | n/a (not rating) |
| §16-19 RdSAP-specific SAP rating equations | worksheet/rating.py |
| Table 27 — Living-area fraction | rdsap/cert_to_inputs.py:_living_area_fraction |
| Table 28 — Cylinder size defaults | domain.ml.demand:_CYLINDER_VOLUME_L |
| Table 29 — Heating + HW parameters | partial in cert_to_inputs.py |
| Table 30 — Mechanical ventilation | not implemented |
| Table 31 — Data to be collected | n/a |
5.3. For each spec rule in the section, check our code
For each table, formula, footnote, exception:
- Does our code implement it?
- Does the implementation match the spec values exactly?
- Are there spec-defined edge cases / footnotes we're missing?
5.4. When a gap is found
- Write a failing unit test that asserts the spec-correct behaviour. Wherever possible, write it as an assertion on intermediate values rather than on aggregate SAP, using a BRE worked example if one covers the section.
- Implement the fix.
- Run test_bre_worked_examples.py plus the Validation Cohort probe. Note both direction and magnitude of change.
- If a BRE worked-example breaks, the new code is wrong (revert). BRE examples are spec-derived and cannot regress from a spec-correct change.
- Commit per-slice: one section → one commit. Reference the spec section in the commit message.
5.5. Sweep-time principle: worksheet-faithful structure
Each worksheet/*.py module must mirror the SAP 10.2 worksheet
structure for its section. As you verify a section, also restructure
its module so that:
- Each function name references its worksheet-line origin (e.g. heat_transfer_coefficient aligns with worksheet line (40); mean_internal_temperature aligns with worksheet line (93)).
- Compound calculations are split into one function per worksheet line where possible, which makes them easier to verify against intermediate[...] and against BRE worked-example values.
- Defensive type-handling disappears. Once P6 lands, the input is a typed enum or numeric, and branching on isinstance(x, int) is replaced by enum dispatch.
- Domain-typed inputs flow directly. SapBuildingPart.kind == BuildingPartKind.MAIN_DWELLING replaces string sniffing of identifier. The dimensions.py "unnecessarily complicated" pattern Khalim flagged is the canonical example of what not to do.
The principle applies during section-sweep slices. It is not a separate prerequisite — the refactor lands with the verification slice for the section it touches.
5.6. Use trace mode when you need it
P5 populates SapResult.intermediate: dict[str, float] with every
named SAP 10.2 worksheet variable. Each section's verification
benefits from inspecting these values per-cert. See §11 below for
the sketch.
6. What's already been done — section by section
This is your starting map. Each row says whether the section has been touched and what the current state is.
Walls / construction (§5)
- S-B23 (committed 9a509e41): Table 6 "Filled cavity" row dispatch when wall_insulation_type=2 AND wall_construction=4. Spec-anchored.
- S-B24 (committed 15613309): Parse walls[i].description for "Average thermal transmittance X W/m²K". PARK (full-SAP path).
- S-B25 (committed 6b934710): Description-based dispatch for cavity "as built, insulated (assumed)" + similar (type=4 with descriptive signal). Spec-anchored via legacy epc_wall_description_map.
- S-B26 (committed 361f9154): _insulation_bucket(0, True) → 50 fix (the "NI" thickness sentinel) + description-based override of wall_ins_present for non-cavity walls. Spec footnote (Table 6).
- S-B27 (committed 1f49fa03): Floor _insulation_bucket analog. Table 19 footnote (2) "max(50, age-band default)" when description signals retrofit.
- S-B28 (committed 25261d5c): Roof NI thickness + insulated description → §5.11.4 footnote 50mm joist row.
- S-B29 (committed 3ab09845): Floor + roof "Average thermal transmittance" parse. PARK (full-SAP path).
Still to verify in §5:
- Stone wall U-values for Scotland / Wales / NIR (Tables 7-10) — only England is fully transcribed; country overrides are partial.
- Cob U-values (§5.6) — table only, no formula implementation.
- Stone formula §5.6 / §5.7 for non-standard wall thicknesses.
- Curtain wall §5.18 — not implemented.
- Party wall U-values (Table 15): implemented in u_party_wall, verify table values.
- Thermal bridging (Table 21): implemented as global y factor, verify per-age-band values.
- §5.16 Thermal mass: Table 22 (only 100 / 250 kJ/m²K, dispatched by construction type with internal insulation). Currently we hardcode 250 (see cert_to_inputs.py:_DEFAULT_THERMAL_MASS_PARAMETER_KJ_PER_M2_K). This is wrong for timber-frame / cob / internally-insulated masonry (should be 100).
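The thermal-mass gap could be closed with a small dispatch like the one below. It encodes only what this list states (100 for timber-frame, cob and internally-insulated masonry; 250 otherwise); the Construction enum and the function name are hypothetical, and the full row set must be checked against Table 22 before use.

```python
from enum import Enum


class Construction(Enum):
    """Hypothetical coarse construction kinds; not the real cert codes."""
    MASONRY = "masonry"
    TIMBER_FRAME = "timber_frame"
    COB = "cob"


def thermal_mass_parameter_kj_per_m2_k(
    construction: Construction, internally_insulated: bool
) -> float:
    """Return 100 for the low-mass cases named above, 250 otherwise."""
    low_mass = construction in (Construction.TIMBER_FRAME, Construction.COB)
    if low_mass or internally_insulated:
        return 100.0
    return 250.0
```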
Heating systems (§§7-10, SAP Appendix A)
- S-B20 (in history): Table 11 secondary heating allocation, conditional on cert lodging secondary or being electric storage.
- Failed S-B30 (reverted): respect main_heating_fraction. Shown empirically wrong. Field is multi-main allocation, not main-vs-secondary. Spec verified against SAP 10.2 Appendix A1/A4.
- S-B31 (committed afdf297f): Table 12c DLF on heat-network main. Spec §C3.1 + Table 12c.
- Failed S-B32 (room heater off-peak routing, reverted): Table 12a says cat=10 room heaters on 7-hour tariff bill 100% high rate. Our cert-cal extends off-peak to codes 691-696. Spec-correct fix inverted bias direction; calibration was absorbing this.
- Uncommitted HW cylinder fix: spec-correct (combi → zero storage/primary loss per Table 2 + Table 3 footers) but breaks 3 golden fixtures. Decision deferred to systematic pass.
Still to verify in heating:
- Table 4a efficiency values for every code (heat pumps, storage heaters, room heaters, CPSU). The category-fallback (cat=4 → 2.30) is documented as a known limitation.
- Boiler interlock penalty (−5%) — spec §9.2.1: "The efficiency of gas and liquid fuel boilers for both space and water heating is reduced by 5% if the boiler is not interlocked for space and water heating." We don't apply this. Known gap.
- Table 4c condensing-boiler / heat-pump emitter-temperature adjustment — we don't apply this.
- Table 12a high-rate fractions for off-peak dwellings — we apply 100% off-peak or 100% standard, never fractional blending.
Hot water (§4 SAP + Appendix J)
- Storage loss factor table (Table 2): current values in domain.ml.demand:_STORAGE_LOSS_FACTOR are ~3× off from spec (verified). Known under-prediction of cylinder loss for storage systems; cancelled by over-prediction of primary loss for combi systems in aggregate.
- Primary loss formula (Table 3): implemented as 245/60 kWh by age band. Spec is a per-month formula nₘ × 14 × [{0.0091·p + 0.0245·(1-p)}·h + 0.0263] with p (pipework insulation fraction) and h (circulation hours). Known approximation.
- Combi-boiler zero-loss rule (Table 2 + Table 3 footers): currently NOT applied (the failed uncommitted slice). Adding this drops PE MAE −6.64 but raises SAP MAE +0.39.
- Appendix J Vd formula 25N + 36: currently the simple form, not the full per-component (shower / bath / other) breakdown. Useful HW demand is ~7% under spec value.
- ΔT: currently 43°C constant (55−12). Spec uses monthly Tcold and hot at 52°C, not 55°C. Per-month variance unmodelled.
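The per-month primary loss formula and the simple Vd form quoted above translate directly into code. This is a sketch: the function names are hypothetical, the 28-day February is a simplification, and the values of p and h fed in must come from the spec tables rather than from this example.

```python
DAYS_IN_MONTH = (31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)


def primary_loss_kwh_month(days: int, p: float, h: float) -> float:
    """n_m x 14 x [{0.0091 p + 0.0245 (1 - p)} h + 0.0263], in kWh.

    p is the pipework insulation fraction, h the circulation hours.
    """
    return days * 14 * ((0.0091 * p + 0.0245 * (1 - p)) * h + 0.0263)


def annual_primary_loss_kwh(p: float, hours_by_month) -> float:
    """Sum the monthly formula over twelve (days, hours) pairs."""
    return sum(
        primary_loss_kwh_month(n, p, h)
        for n, h in zip(DAYS_IN_MONTH, hours_by_month)
    )


def simple_daily_hot_water_litres(occupants: float) -> float:
    """The simple Appendix J form Vd = 25 N + 36 currently implemented."""
    return 25 * occupants + 36
```

Taking h per month keeps the function usable whether the spec assigns one value for the year or different values by season.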
Lighting (Appendix L)
- predicted_lighting_kwh in domain.ml.demand uses the 9.3 × TFA × (1 − 0.5·led_share − 0.4·cfl_share) heuristic.
- Spec is L1-L12: daylight correction, fixed-lighting capacity, top-up + portable shares, monthly profile.
- For LED-dominant home (50+ LEDs): our heuristic gives ~465 kWh, spec gives ~94 kWh. Known over-prediction by ~5× for new-build LED homes.
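The heuristic's floor explains the over-prediction: with every lamp an LED, the multiplier bottoms out at 0.5. The sketch below reproduces the formula as written above; the 100 m² floor area is an assumed value chosen because it reproduces the ~465 kWh figure.

```python
def lighting_kwh_heuristic(tfa_m2: float, led_share: float, cfl_share: float) -> float:
    """Current heuristic: 9.3 x TFA x (1 - 0.5 led_share - 0.4 cfl_share)."""
    return 9.3 * tfa_m2 * (1 - 0.5 * led_share - 0.4 * cfl_share)


# Assumed example dwelling: 100 m2, all-LED lighting.
all_led = lighting_kwh_heuristic(100.0, led_share=1.0, cfl_share=0.0)
no_low_energy = lighting_kwh_heuristic(100.0, led_share=0.0, cfl_share=0.0)
print(f"all LED: {all_led:.0f} kWh, no low-energy lamps: {no_low_energy:.0f} kWh")
```

No combination of shares can take the heuristic below half of the no-low-energy baseline, so it cannot reach the ~94 kWh the spec gives for this case.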
Internal gains (§5 SAP)
- worksheet/internal_gains.py implements metabolic + cooking + appliances + lighting (the four positive rows of Table 5).
- Missing: Water heating row (1000 × (65)ₘ / (nₘ × 24), i.e. HW losses recycled as heated-space gains) and Losses row (−40 × N for cold inflow + evaporation). Both documented in S-B23 gap list.
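The two missing rows are small enough to sketch directly from the expressions above. Function names are hypothetical; (65)m is taken as a monthly kWh figure, so the first function converts it to an average power in watts.

```python
def water_heating_gains_w(heat_gains_kwh_month: float, days_in_month: int) -> float:
    """Table 5 water heating row: 1000 x (65)m / (n_m x 24), in watts."""
    return 1000.0 * heat_gains_kwh_month / (days_in_month * 24)


def losses_gains_w(occupants: float) -> float:
    """Table 5 losses row: -40 x N, in watts (cold inflow + evaporation)."""
    return -40.0 * occupants
```

For instance, 74.4 kWh of water heating gains in a 31-day month averages 100 W, which a 2.5-occupant losses row of −100 W would offset entirely.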
Ventilation (§4 / Table 5)
- Wind-shelter factor implemented in S-B21.
- Mechanical ventilation (MVHR, MEV, PIV) — not implemented; cert rarely lodges. Spec §4.2 + Table 4g.
- Pressure-test override (worksheet lines 17-18) — not implemented.
Tariff / cost (§12 + Table 12 / 12a / 12c)
- Cert-calibration prices in domain.sap.tables.table_12_cert_calibration are an EMPIRICAL fit to Elmhurst's output. They are LOWER than the published Table 12 spec values by 4-25%. Known divergence; investigation deferred.
- Standing charges (Table 12 note (a)): NOT applied. Adding them empirically worsens MAE (calibration absorbs).
- Table 12a high-rate fractions: currently 100% off-peak for E7-eligible codes, 100% standard otherwise. No fractional blending.
- Heat network DLF (Table 12c) — applied per S-B31 only to main heating + HW from main. HW-only-from-heat-network is a separate slice.
7. The cert-calibration "tension" is dissolved (per ADR-0010)
This section originally framed cert-calibration vs spec-correctness as two end-states the calculator had to choose between. That framing is wrong (see §3 for the actual diagnosis): the cert-cal values are pre-March-2025 SAP prices, not Elmhurst deviations from SAP 10.2. Once the corpus is filtered to the Validation Cohort (P3) and the cert-cal layer is deleted (P2), the false dichotomy disappears.
What replaces this section
- One price table.
domain.sap.tables.table_12(re-labelled SAP 10.2 14-03-2025 amendment, CO2 factors corrected per P2). - One validation cohort.
inspection_date ≥ 2025-07-01, every cert lodged on the calculator's target spec version. - One verification mechanism. Trace-mode intermediates + BRE worked-example unit tests for per-section verification; Validation Cohort probe MAE for aggregate go/no-go.
Cert-software deviations from spec, if they exist at all, are expected to be small and localised. They surface as residual after the spec sweep completes against a clean probe — and at that point the question is whether to chase them at all (Elmhurst-deviation fixes have low domain value compared to spec-correctness, given the calculator's product use case is scoring counterfactuals for the MeasureApplicator chain, not reproducing historical certs).
7b. Outstanding findings to pick up during the systematic pass
The prior session identified several spec-correct fixes that were reverted because they made SAP MAE worse against the full corpus. The empirical signal that "reverted" them was version-mixture noise (see §3) plus compensating-error breakage in the 7 retired golden fixtures. Each fix below is expected to land cleanly once the six prerequisites in §2.5 are done, because:
- The Validation Cohort (P3) is on a single spec version — the price mismatch that drove the bias regression on standing charges and cat=10 routing disappears.
- The cert-cal layer is gone (P2) — no calibration to "break".
- PCDB is integrated (P4) — the heat-pump and gas-boiler residuals that dominated per-cert MAE collapse before any of these findings even matter.
- The fixtures are now BRE worked examples (P5 + §10) — they cannot be broken by spec-correct changes because they are themselves derived from the spec.
Treat each finding as a section-sweep TODO. The empirical impacts below were measured against the dirty probe (full corpus + cert-cal + no PCDB) and are not predictive of behaviour on the clean probe. Re-measure each fix against the Validation Cohort after prerequisites land.
Finding 1 — HW cylinder zero-loss rule for combi boilers
Status: spec-correct fix exists in working-tree-only form (uncommitted). Reverted at end of last session.
Spec basis:
- SAP 10.2 Table 2 footer (page 158): "In the case of a combination boiler: a) the storage loss factor is zero if the efficiency is taken from Table 4b"
- SAP 10.2 Table 3 footer (page 160): "Primary loss is set to zero for the following: Electric immersion heater, Combi boiler (including when it is part of a combined heat pump and boiler package and provides all the hot water), CPSU (including electric CPSU), Boiler and thermal store within a single casing, Separate boiler and thermal store connected by no more than 1.5 m of insulated pipework, Direct-acting electric boiler, Heat pump (not combined heat pump and boiler package with a non-combi boiler) from PCDB with hot water vessel integral to package"
The bug: our calculator currently adds storage loss (~135 kWh)
and primary loss (~245 kWh) for ALL certs with an age band lodged,
ignoring whether the dwelling has a cylinder. 67% of corpus certs
explicitly lodge has_hot_water_cylinder=False (the modal combi
boiler case) — we add 380 kWh of fictional HW losses for each.
The fix (sketch, ~10 lines):
- Add has_cylinder: bool = True keyword to predicted_hot_water_kwh in packages/domain/src/domain/ml/demand.py.
- When has_cylinder=False, set storage_loss = 0 and primary_loss = 0.
- In cert_to_inputs.py (around line 829), pass has_cylinder=epc.has_hot_water_cylinder and not is_instantaneous.
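The shape of that change, sketched on a stand-in function: the real predicted_hot_water_kwh has a different signature and derives its losses from age band, so the flat 135 / 245 kWh defaults here are just the approximate figures quoted above.

```python
def predicted_hot_water_kwh(
    useful_demand_kwh: float,
    *,
    storage_loss_kwh: float = 135.0,   # approximate figure quoted above
    primary_loss_kwh: float = 245.0,   # approximate figure quoted above
    has_cylinder: bool = True,
) -> float:
    """Stand-in for the real function: zero both losses when no cylinder."""
    if not has_cylinder:
        storage_loss_kwh = 0.0
        primary_loss_kwh = 0.0
    return useful_demand_kwh + storage_loss_kwh + primary_loss_kwh
```

Defaulting has_cylinder to True keeps every existing call site on its current behaviour until cert_to_inputs.py passes the lodged value.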
Empirical impact (measured on 300-cert probe):
- PE MAE: 43.32 → 36.68 (−6.64) ← biggest single fix found this session
- PE bias: 37.69 → 30.41 (−7.28)
- SAP MAE: 4.61 → 5.00 (+0.39, regression)
- 3 of 7 golden fixtures break
Why it was reverted: the SAP regression + broken fixtures indicate the fictional HW losses were partially compensating for OTHER bugs (likely lighting over-prediction for LED-dominant homes). The right ordering is: fix the spec-clear cases (HW cylinder, lighting per Appendix L, etc.) together, then re-derive cert-cal.
When to pick up: when you reach §4 / Appendix J during the systematic pass. Pair with the lighting Appendix L fix to avoid breaking the golden fixtures individually.
Finding 2 — Standing charges (Table 12 note (a))
Status: spec-correct, never implemented. Empirically rejected by 4-mode probe.
Spec basis: SAP 10.2 Table 12 note (a), page 190:
"For calculations including regulated energy uses only (e.g. regulation compliance, energy ratings):
- The standing charge for electricity standard tariff is omitted
- The standing charge for off-peak electricity is added to space and water heating costs where either main heating or hot water uses off-peak electricity
- The standing charge for gas fuels is added to space and water heating costs where the gas fuel is used for space heating (main or secondary) or for water heating"
The bug: our calculator never adds standing charges. Per spec, a gas-heated dwelling should have £92/yr added to the ECF numerator.
Empirical impact (4-mode probe, 300 certs):
| Mode | All certs | Gas-only |
|---|---|---|
| cert-cal, no standing (current) | MAE 4.69, bias +0.98 | MAE 4.01, bias +0.80 |
| cert-cal + gas standing | MAE 4.94, bias −2.62 | MAE 4.31, bias −3.53 |
Adding standing charges shifts SAP bias downward by ~3.6 points across all certs (+0.98 → −2.62) and by ~4.3 points on gas-only certs (+0.80 → −3.53), which is clearly the wrong direction. The cert-cal prices (3.48p gas vs spec 3.64p) implicitly absorb the standing-charge contribution.
When to pick up: when you reach §12 / Table 12. Apply alongside spec-correct unit prices (3.64p gas, 16.49p elec) and re-derive cert-cal to match Elmhurst's residual deviation pattern.
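The three rules in Table 12 note (a) reduce to two boolean conditions. The sketch below is one possible shape for that logic. The function and parameter names are hypothetical, and the standing-charge amounts are passed in by the caller rather than hard-coded, since only the £92/yr gas figure appears in this document.

```python
def standing_charge_gbp(
    *,
    gas_used_for_heating_or_hw: bool,
    off_peak_used_for_heating_or_hw: bool,
    gas_standing_gbp: float,
    off_peak_standing_gbp: float,
) -> float:
    """Standing charge to add to space and water heating costs (hypothetical helper)."""
    total = 0.0
    # The standard-tariff electricity standing charge is always omitted.
    if off_peak_used_for_heating_or_hw:
        total += off_peak_standing_gbp
    if gas_used_for_heating_or_hw:
        total += gas_standing_gbp
    return total
```

For a gas-heated dwelling on a standard electricity tariff, calling this with gas_standing_gbp=92.0 returns the £92/yr mentioned above.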
Finding 3 — Cat=10 room heaters off-peak routing
Status: spec-correct, currently bills room heaters at off-peak rate on E7 dwellings. Empirically rejected.
Spec basis: SAP 10.2 Table 12a (page 191):
"Other direct-acting electric heating (including electric secondary heating): 7-hour tariff 1.00 high rate; 10-hour tariff 0.50 high rate"
The bug: our cert-calibration (cert_calibration_e7_codes)
extends the off-peak set to include codes 691-696 (room heaters).
That's the S-B14 empirical extension — the previous agent found it
helped some specific certs. Per Table 12a it's WRONG: room heaters
on E7 should bill 100% at HIGH rate, not at low rate.
Empirical impact: switching from off-peak (5.50p cert-cal) to standard rate (13.19p) — closer to spec but still not the high rate (15.29p cert-cal) — inverted the bias from +5.88 to −6.00 without improving MAE.
The real issue: Table 12a defines FRACTIONAL blending (e.g. "90% high, 10% low" for direct-acting electric boiler on 7-hour tariff), not binary on/off-peak. Our calculator only supports binary. A proper implementation needs per-system high-rate fractions.
When to pick up: when you reach §12 / Table 12a. Implement fractional blending for all the rows of Table 12a, not just cat=10.
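Fractional blending itself is a one-line weighted average. The sketch below shows it with only the three high-rate fractions quoted in this section. The dictionary keys and function name are hypothetical labels, not existing code, and the remaining Table 12a rows would need transcribing from the spec.

```python
# High-rate fractions quoted in this section from Table 12a.
# Keys are hypothetical labels: (system kind, tariff).
HIGH_RATE_FRACTION = {
    ("other_direct_acting_electric", "7-hour"): 1.00,
    ("other_direct_acting_electric", "10-hour"): 0.50,
    ("direct_acting_electric_boiler", "7-hour"): 0.90,
}


def blended_price_p_per_kwh(
    high_rate_fraction: float, high_p: float, low_p: float
) -> float:
    """Weighted unit price for a system billed partly at high and partly at low rate."""
    if not 0.0 <= high_rate_fraction <= 1.0:
        raise ValueError("high_rate_fraction must be within [0, 1]")
    return high_rate_fraction * high_p + (1.0 - high_rate_fraction) * low_p
```

The binary behaviour the calculator supports today corresponds to fractions of exactly 0.0 or 1.0, so a per-system fraction generalises it without a separate code path.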
Finding 4 — Lighting (Appendix L proper)
Status: gap. Current code uses a 9.3 kWh/m² heuristic with simple LED/CFL reductions; spec is the L1-L12 cascade with daylight correction, fixed-lighting capacity, top-up + portable shares, monthly profile.
Spec basis: SAP 10.2 Appendix L §L1 (pages 88-90), equations L1-L12.
The bug: for a 100 m² LED-dominant home (e.g. cert 7536-3827 with 51 LEDs), our heuristic returns 465 kWh/yr; spec returns ~94 kWh/yr. Over-prediction by ~5× on LED-dominant homes (which is most modern stock).
Empirical impact (estimated):
- ~5-6 kWh/m² PEUI over-prediction for LED-dominant population
- Corpus-weighted: ~3-4 kWh/m² PEUI bias contribution
When to pick up: when you reach Appendix L. Pair with the HW cylinder fix (Finding 1) to avoid the SAP MAE regression.
Finding 5 — Internal-gains Table 5 missing rows
Status: gap. Spec Table 5 has 7 rows for internal gains; our
worksheet/internal_gains.py implements 4.
Spec basis: SAP 10.2 Table 5 (page 177).
Missing rows:
- Water heating: 1000 × (65)ₘ / (nₘ × 24) W. These are the HW losses (cylinder + distribution + primary) recycled as heated-space gains via worksheet line (65). Reduces space heating demand.
- Losses: −40 × N W. Heat to incoming cold water and evaporation. Negative contribution.
Empirical impact (estimated):
- For N=2.7: HW gains ≈+75 W, losses ≈−108 W, net ≈−33 W. Currently we miss both → our gains are 33 W too high → space heating demand too low → PE under-predicted by ~3 kWh/m² (rough).
When to pick up: when you reach §5 / Table 5. Worksheet line (65)
also needs implementation — the HW losses already exist in our calc
(see demand.py:_cylinder_storage_loss_kwh etc.), they just need
piping into internal_gains.
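The two missing Table 5 rows can be written as small pure functions. This is a sketch with hypothetical function names. The 55.8 kWh input in the usage note is an illustrative value chosen only because it reproduces the ≈75 W figure above, not a number taken from the spec or the corpus.

```python
def water_heating_gains_w(line_65_kwh_month: float, days_in_month: int) -> float:
    """Table 5 water-heating row: 1000 × (65)m / (nm × 24), in watts."""
    return 1000.0 * line_65_kwh_month / (days_in_month * 24)


def losses_gains_w(occupancy_n: float) -> float:
    """Table 5 losses row: −40 × N, in watts (always a negative contribution)."""
    return -40.0 * occupancy_n
```

With N = 2.7 the losses row gives −108 W, and a worksheet line (65) value of 55.8 kWh in a 31-day month gives 75 W, so the net of the two rows is about −33 W.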
Finding 6 — Storage-loss-factor table values are wrong
Status: gap. Affects only certs with has_hot_water_cylinder=True
(33% of corpus).
Spec basis: SAP 10.2 Table 2 (page 158).
The bug: domain.ml.demand:_STORAGE_LOSS_FACTOR values are ~3×
LOWER than spec. E.g. for 38mm foam our value is 0.0056, spec is
0.0181. Effect: we UNDER-predict cylinder storage loss by ~300 kWh
for storage systems, partly cancelling the over-prediction from
Finding 1.
When to pick up: when you reach §4 / Table 2. Fix WITH Finding 1 (combi zero-loss) so the cancellation doesn't dominate the direction.
Finding 7 — Heat-pump fallback efficiency 2.30
Status: gap that requires PCDB. See §8b.
Finding 8 — Other smaller gaps (carry forward)
- Boiler interlock −5% penalty (§9.2.1) — never applied
- Table 4c condensing boiler / HP emitter temperature adjustment — never applied
- Control-temperature adjustment from Table 4e — always 0 in code, spec varies
- Wall U-values for Scotland / Wales / NIR — only England fully transcribed
- Per-junction thermal bridging (Table R2) — global y approximation only
- Multi-main heating (main_heating_fraction ≠ 1): first main only
- Cooling §10: not implemented (rare in UK)
- FEE §11 — not implemented (new-build only)
8. Don't repeat — known dead-ends
Re-read after §3 + §7b. The entries marked ⚠️ below were classified as "dead-ends because cert-cal absorbs"; that diagnosis is wrong. They are spec-correct fixes that were measured under a noisy probe. They are now flagged as conditional dead-ends: dead only if you try them before P1–P5 land. After the prerequisites they are expected improvements, not dead-ends. See ADR-0010.
- ❌ Switching "NI" wall thickness to None alone (S-B5 in history) —
over-corrected because it routed to the (Unfilled cavity, 50mm) row
instead of the dedicated Filled cavity row. The right fix landed in
S-B23 with a WALL_INSULATION_FILLED_CAVITY dispatcher.
- ❌ Aggressive efficiency rescue for missing sap_main_heating_code (S-B5): over-corrected. The category fallback (cat=4 → 2.30) is intentionally conservative; PCDB (P4 prerequisite) supplies the real efficiency.
- ⚠️ Using SAP 10.2 spec prices for parity validation: under the dirty probe, cert-cal prices fit better. This inverts under the clean probe (P2 + P3): SAP 10.2 spec prices are correct because the Validation Cohort is on the 14-03-2025 amendment. Listed here only as a warning if you start the sweep before prerequisites land.
- ❌ Always applying 10% secondary heating — must be conditional on cert lodging or main system being electric storage (S-B20). See spec Appendix A.4.
- ❌ Respecting main_heating_fraction for secondary allocation (failed S-B30): the field is the multi-main allocation (system 1 vs system 2), not main-vs-secondary. SAP MAE 4.69 → 4.85 (worse).
- ⚠️ Switching cat=10 room heaters off off-peak (failed S-B32): spec-correct per Table 12a. The bias inversion under the dirty probe was driven by cert-cal compensating; on the clean probe this is just spec-correct. Land as part of the §12 spec sweep after prerequisites.
- ⚠️ Adding gas standing charges (4-mode probe, unimplemented) — spec-correct per Table 12 note (a). Same logic: bias drift under dirty probe is version-mixture + missing-PCDB noise, not Elmhurst deviation. Land as part of §12 spec sweep.
- ⚠️ Zeroing storage + primary loss for combi boilers (uncommitted S-B32) — spec-correct per Table 2 + Table 3 footers. SAP MAE regression was driven by the now-retired golden fixtures (§10) and cert-cal absorption. Land as part of §4 / Appendix J sweep.
9. The cert corpus and parity probe
Sample
data/ml_training/runs/2025_2026_n250000_v18a/data.parquet is the
250k-cert parquet. After P1 lands the parquet carries
inspection_date; the probe then filters to the Validation Cohort
(inspection_date ≥ 2025-07-01) plus sap_score ∈ [5, 99] and
samples 300 at seed=7 by default. Filtering rationale:
- ≤ 5 is heritage/anomaly stock (sub-3 % of corpus)
- ≥ 99 is full-SAP new-builds the parquet excludes anyway
- inspection_date ≥ 2025-07-01 ensures every cert was lodged on SAP 10.2 (14-03-2025 amendment). See CONTEXT.md / "Validation Cohort" and ADR-0010 §3.
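The filter and sampling rules above can be expressed in a few lines. The sketch below works on plain dicts so that it stays self-contained. The real probe reads the parquet, and the function names here are hypothetical, not the probe's actual API.

```python
import random
from datetime import date

COHORT_START = date(2025, 7, 1)  # Validation Cohort cut-off from ADR-0010


def in_validation_cohort(cert: dict) -> bool:
    """True if the cert passes both the date filter and the sap_score range filter."""
    inspected = cert.get("inspection_date")
    score = cert.get("sap_score")
    if inspected is None or score is None:
        return False
    return inspected >= COHORT_START and 5 <= score <= 99


def sample_cohort(certs: list[dict], n: int = 300, seed: int = 7) -> list[dict]:
    """Deterministically sample up to n certs from the filtered cohort."""
    cohort = [c for c in certs if in_validation_cohort(c)]
    rng = random.Random(seed)
    return rng.sample(cohort, min(n, len(cohort)))
```

Seeding a local random.Random rather than the global generator keeps the sample reproducible regardless of what else the process has drawn.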
Run the probe
```
python -c "
import sys
sys.path.insert(0, 'packages/domain/src')
sys.path.insert(0, '.')
sys.path.insert(0, 'services/ml_training_data/src')
from ml_training_data.sap_parity_probe import main
main(['300','7'])
"
```
What the probe shows
- Aggregate SAP MAE / RMSE / bias
- Aggregate PE MAE / RMSE / bias
- Per-end-use PEUI breakdown (space / HW / lighting / pumps)
- Stratification by main_heating_category, construction_age_band, dwelling_type
- Worst-15 residuals (SAP and PE)
Known parquet limitations
- ~0.7% of parquet certs have construction_age_band=None vs 15% in the raw bulk-zip. The parquet filters out full-SAP new-builds upstream. Don't measure full-SAP-path slices against the parquet.
- Heat-pump certs (cat=4) are under-represented and concentrated in the worst-residual tail because PCDB efficiency is unavailable.
10. Fixtures: retire the 7 cert-based golden fixtures, replace with BRE worked examples (per ADR-0010 + P5)
The 7 cert-based fixtures at
packages/domain/src/domain/sap/rdsap/tests/test_golden_fixtures.py
were locked in against the current calculator state — with cert-cal,
without PCDB, with HW cylinder loss always applied, with the
lighting heuristic, etc. They are documented in §3 / the prior
handover as containing compensating errors. Once the prerequisites
land, every spec-correct fix breaks at least one of them. They will
fight the spec sweep.
Replacement strategy
Primary regression suite: BRE worked-example fixtures.
Transcribe the worked examples from:
- SAP 10.2 spec appendices (especially Appendix R — reference values and the worked example dwelling).
- RdSAP 10 (10-06-2025) worked-example annex.
Each worked example becomes a unit test that locks per-intermediate expected values (HLP, HTC, mean internal temperature monthly, MIT, ECF, SAP score) rather than the aggregate SAP score alone. Because they are spec-derived, no spec-correct change can break them — any break is an implementation bug, unambiguously.
These tests live at
packages/domain/src/domain/sap/tests/test_bre_worked_examples.py
(new module — separate from the cert-based fixtures module).
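A worked-example test that locks intermediates rather than only the final score could be built around a small comparison helper like the one sketched here. The helper name and the tolerance are assumptions, and the numbers in the usage note are placeholders, not values transcribed from any BRE example.

```python
import math


def intermediate_mismatches(
    actual: dict[str, float],
    expected: dict[str, float],
    rel_tol: float = 1e-3,
) -> list[str]:
    """List every expected intermediate that is missing or outside tolerance."""
    problems = []
    for key, want in expected.items():
        got = actual.get(key)
        if got is None:
            problems.append(f"{key}: missing (expected {want})")
        elif not math.isclose(got, want, rel_tol=rel_tol):
            problems.append(f"{key}: expected {want}, got {got}")
    return problems
```

A test would then assert that intermediate_mismatches(result.intermediate, expected) is empty. For instance, with placeholder values of 250.0 W/K for the heat transfer coefficient over a 100.0 m² floor area, the expected heat loss parameter would be 2.5 W/m²K. Reporting every mismatch at once, instead of failing on the first, shows where in the worksheet a divergence starts.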
Cert-based fixtures retired.
The current test_golden_fixtures.py is either deleted or repurposed
as a very loose smoke-test integration suite (e.g. |SAP residual| ≤ 5) that catches catastrophic regressions only. The 7 cert JSONs
under fixtures/golden/<cert>.json can be kept on disk as reference
data, but they no longer drive go/no-go decisions in the sweep.
Optional future addition.
If/when a current Elmhurst (or Stroma / Quidos / NHER) license is available, run a handful of representative corpus certs through it and lock those outputs as a second-tier regression suite: Elmhurst-parity fixtures alongside spec-parity fixtures. Not a prerequisite.
11. Trace mode (prerequisite P5 — implementation sketch)
This section was originally labelled "recommended"; it is now
prerequisite P5 per ADR-0010. The sweep does not start until
intermediate is populated everywhere. ADR-0009 proposed:
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SapResult:
    sap_score: float
    ...
    intermediate: dict[str, float]
```
The intermediate field was never populated. Suggested implementation
for the systematic pass:
```python
intermediate = {
    # §1 dimensions
    "tfa_m2": tfa,
    "volume_m3": volume,
    "storey_count": storeys,
    # §3 heat transmission
    "walls_w_per_k": ht.walls_w_per_k,
    "roof_w_per_k": ht.roof_w_per_k,
    "floor_w_per_k": ht.floor_w_per_k,
    "party_walls_w_per_k": ht.party_walls_w_per_k,
    "windows_w_per_k": ht.windows_w_per_k,
    "doors_w_per_k": ht.doors_w_per_k,
    "thermal_bridging_w_per_k": ht.thermal_bridging_w_per_k,
    "infiltration_ach": infiltration,
    "infiltration_w_per_k": infiltration * volume * 0.33,
    "heat_transfer_coefficient_w_per_k": hlc,
    "heat_loss_parameter_w_per_m2k": hlp,
    "time_constant_h": tau_h,
    # §5 internal gains (annual averages)
    "internal_gains_annual_avg_w": ...,
    # §7 mean internal temperature (annual avg)
    "mean_internal_temp_annual_avg_c": ...,
    # §9 space heating
    "useful_space_heating_kwh_per_yr": space_heating_kwh,
    # §12 fuel costs (per end-use)
    "main_heating_cost_gbp": ...,
    "hot_water_cost_gbp": ...,
    "lighting_cost_gbp": ...,
    "pumps_fans_cost_gbp": ...,
    # §13 rating
    "ecf": ecf,
    "deflator": 0.36,
    # §14 primary energy and CO2 per end-use
    "space_heating_pe_kwh_per_m2": ...,
    "hot_water_pe_kwh_per_m2": ...,
    # ... remaining end-uses elided
}
```
Once populated, the differential debugging the reviewer recommended becomes possible: change one input field, compare deltas against an Elmhurst export.
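Differential debugging on top of the trace needs little more than a keyed diff of two intermediate dicts. The helper below is a sketch under that assumption; its name and the tolerance are hypothetical.

```python
def intermediate_deltas(
    before: dict[str, float],
    after: dict[str, float],
    abs_tol: float = 1e-9,
) -> dict[str, float]:
    """Return after - before for every shared key whose value moved."""
    deltas = {}
    for key in sorted(before.keys() & after.keys()):
        delta = after[key] - before[key]
        if abs(delta) > abs_tol:
            deltas[key] = delta
    return deltas
```

Running the calculator twice with a single input field changed and diffing the two traces shows which worksheet quantities that field actually feeds, which can then be compared against the same perturbation in a reference tool.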
12. Specific section-1 starting tasks (suggested first session)
A concrete pickup point:
Session 1 — §1 (Introduction), §2 (Property Descriptors), §3 (Dimensions)
- §1 is prose; nothing to verify.
- §2 maps to EpcPropertyData. Verify that every field RdSAP §2 enumerates is present and correctly typed on the domain object. Specifically check: dwelling_type, built_form, property_type, construction_age_band, country_code. Note that construction_age_band is per-building-part, not dwelling-level, and the primary age band drives most defaults.
- §3 maps to worksheet/dimensions.py. Verify:
  - Total floor area sum across building parts equals TFA
  - Volume calculation per storey × area × height
  - Storey count handling for extensions and room-in-roof
  - Multi-storey heat-loss-perimeter rules
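The first two §3 checks are simple sums, sketched below as a reference to test worksheet/dimensions.py against. The Storey type and function names are hypothetical, and the sketch deliberately ignores any RdSAP storey-height or room-in-roof conventions, which need checking against the spec text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Storey:
    """Hypothetical minimal storey record: floor area and internal height."""
    floor_area_m2: float
    height_m: float


def total_floor_area_m2(building_parts: list[list[Storey]]) -> float:
    """Sum floor area over every storey of every building part."""
    return sum(s.floor_area_m2 for part in building_parts for s in part)


def volume_m3(building_parts: list[list[Storey]]) -> float:
    """Sum area × height over every storey of every building part."""
    return sum(s.floor_area_m2 * s.height_m for part in building_parts for s in part)
```

For a main part with two 50 m² storeys (2.4 m and 2.6 m high) plus a single-storey 10 m² extension at 2.3 m, this gives a TFA of 110 m² and a volume of 273 m³.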
This single session should produce zero behaviour changes if §1-3 are correctly implemented, but expect to find at least one issue in §3 geometry (per the reviewer's "biggest SAP error sources" list).
Important: Session 1 only starts after all six prerequisites (P1–P6) in §2.5 have landed and the Validation Cohort probe baseline has been captured. Until then, running per-section verification produces noisy signal.
Run the BRE worked-example fixtures (P5) + Validation Cohort probe (P3) at the end of each session; expect no movement until you start hitting actual gaps.
13. Workflow recap
Phase 0 — Prerequisites (§2.5). Land P1–P6 first, in dependency order:
| Slice | Description | Depends on |
|---|---|---|
| P1 | Re-extract parquet with inspection_date | none |
| P2 | Delete cert-cal; correct table_12.py CO2 factors | none |
| P3 | Filter parity probe to Validation Cohort | P1 |
| P4 | Implement PcdbLookup | none (P2 helpful) |
| P5 | Populate SapResult.intermediate + transcribe BRE worked examples | none |
| P6 | Strict-type EpcPropertyData via codes.csv-derived enums | none |
P1, P2, P4, P5, P6 can run in parallel. P3 needs P1. Capture a Validation Cohort probe baseline once all six land — that is the new MAE starting line. Repo-wide tests stay green throughout P6 (Site Notes consumers, ML pipeline, recommendations, etc. all need the mapper updates that accompany each typing change).
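The dependency structure in the table is small enough to encode directly, which gives a mechanical check on landing order. The sketch below uses the standard-library graphlib; the PREREQUISITE_DEPS name is hypothetical, and the soft "P2 helpful" note on P4 is intentionally not encoded as a hard dependency.

```python
from graphlib import TopologicalSorter

# Each slice maps to the set of slices that must land before it.
PREREQUISITE_DEPS: dict[str, set[str]] = {
    "P1": set(),
    "P2": set(),
    "P3": {"P1"},
    "P4": set(),
    "P5": set(),
    "P6": set(),
}


def landing_order(deps: dict[str, set[str]]) -> list[str]:
    """Return one valid order in which the slices can land."""
    return list(TopologicalSorter(deps).static_order())
```

Any order returned places P1 before P3; the relative order of the other slices is unconstrained, matching the note that they can run in parallel.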
Phase 1 — Section sweep. For each RdSAP 10 section, in document order:
- Read the spec section text + cited tables.
- Identify code location(s).
- For each rule / table / footnote:
- Does our code implement it?
- Does the implementation match?
- Edge cases / fallback paths handled?
- For each gap: AAA unit test (preferring a BRE worked-example assertion on intermediate values when possible) → minimal implementation → commit.
- Apply the worksheet-faithful structure principle (§5.5) as part of this slice: name functions after worksheet lines, split compound calculations, replace any remaining defensive type-handling with typed-enum dispatch.
- After each commit: run test_bre_worked_examples.py + Validation Cohort probe. Note both deltas in the commit message.
- If a BRE worked-example breaks: the new code is wrong (revert). The worked examples are spec-derived and cannot be broken by spec-correct changes.
Stick to this. The prior session's mistake was jumping between sections based on residual-size on a dirty probe. Clean probe plus document-order discipline plus worksheet-faithful structure is what makes the sweep converge.
14. Useful references
- ADR-0010, docs/adr/0010-sap10-calculator-spec-target-and-validation.md: the binding decisions reflected in this rewrite (SAP 10.2 target, cert-cal deletion, Validation Cohort, PCDB-as-prerequisite, fixture retirement). Read first.
- ADR-0009, docs/adr/0009-deterministic-sap-calculator.md: original calculator decision rationale + Session A/B/C plan. Read for context; spec-version target / PCDB sequencing / cert-cal rationale are superseded by ADR-0010.
- Spec coverage map, docs/sap-spec/SPEC_COVERAGE.md: pre-existing coverage tracker. Update as you go.
- Parity findings, docs/sap-spec/PARITY_FINDINGS.md: empirical findings from prior sessions.
- Earlier handover, docs/sap-spec/HANDOVER_FRESH_REVIEW.md: orientation from the previous fresh-context pass.
- Reviewer feedback (informal): ChatGPT critique of the slice-by-slice approach. Key recommendations: two-layer architecture (RdSAP expansion → SAP worksheet), trace mode, golden-master methodology, differential debugging, reference traces from Elmhurst/Stroma/Quidos.
- Commit log: git log --oneline shows the slice history; each S-Bxx commit message documents the spec ref + measured impact.
15. Final note
The prior session's framing — "the cert-calibration layer absorbs Elmhurst's spec deviations; we'll re-derive it at the end" — was load-bearing on a false diagnosis. The cert-cal layer is pre-March-2025 SAP prices fit against a mixture distribution of two spec-version regimes. Once you separate the regimes (Validation Cohort) and use spec prices everywhere, the "tension" disappears.
After P1–P5 land, the section sweep is straightforward: every spec-correct fix is unambiguously the right answer, BRE worked-example fixtures lock the result, and Validation Cohort probe MAE moves monotonically downward. The fixes the prior session marked as "spec-correct but probe-regressed" become trivially landable.
Welcome to the project. Read ADR-0010, land the prerequisites in §2.5, then walk the spec in document order. The deterministic answer is in there.