Model/docs/sap-spec/HANDOVER_SYSTEMATIC_REVIEW.md
Khalim Conn-Kowlessar bb9c5ac017 docs: ADR-0010 retargets calculator to SAP 10.2; rewrite handover
Adds ADR-0010 superseding ADR-0009's spec-version target, PCDB
sequencing, and cert-calibration layer. Captures the conclusions
of a grill-with-docs session:

  1. Active spec target is SAP 10.2 (14-03-2025), not SAP 10.3 — no
     SAP-10.3-lodged certs exist in the corpus to validate against.
  2. table_12_cert_calibration is deleted (not "re-derived at the
     end"). It was pre-March-2025 spec prices fit against a mixture
     distribution of two spec-version regimes, with downstream-
     component bugs absorbed into the fit — not Elmhurst deviation.
  3. Validation Cohort: filter the corpus to inspection_date ≥
     2025-07-01 so every cert in the probe was lodged on SAP 10.2
     (14-03-2025) prices. One spec, one signal.
  4. PCDB integration is promoted from "Session C deferred" to
     prerequisite P4 — dominates residual variance on heat pumps and
     the 78% of gas-boiler certs lodging main_heating_data_source=1.
  5. Trace mode (SapResult.intermediate) and BRE worked-example
     fixtures replace the 7 cert-based golden fixtures, which
     contained compensating errors.
  6. Strict-type EpcPropertyData via codes.csv-derived canonical
     enums (P6) — the in-source motivation lives at
     dimensions.py:74-82 (Khalim's comment, included in this commit).
  7. Worksheet-faithful structure is a sweep-time principle: each
     worksheet module mirrors SAP 10.2 worksheet line numbering.

CONTEXT.md additions:
  - Refined "Calculated SAP10 Performance" and "SAP10 Calculation"
    to reference SAP 10.2 + ADR-0010.
  - New term "SAP Spec Version" — domain-meaningful because the
    same EpcPropertyData yields different sap_score under different
    spec revisions.
  - New term "Validation Cohort" — the version-locked sub-corpus.

HANDOVER_SYSTEMATIC_REVIEW.md is rewritten section-by-section to
reflect ADR-0010: §1 framing, §2 status pointer, new §2.5 with the
six prerequisites P1–P6 in dependency order, §3 diagnosis (cert-cal
was stale prices, not Elmhurst deviation), §4 scope (PCDB IN,
SAP 10.3 stays OUT), §5 approach (worksheet-faithful principle as
§5.5), §7 tension dissolved, §7b findings re-framed, §8 dead-ends
re-classified as conditional, §9 cohort filter, §10 fixture
strategy, §11 trace mode as prerequisite, §12 prereqs-first,
§13 Phase 0/Phase 1 workflow, §14 ADR-0010 reference, §15 final
note.

P2.1 (commit ac1aa56a) already lands the first ADR-0010 slice
(probe swap to spec prices).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-19 09:54:24 +00:00


# Handover — Systematic Section-by-Section RdSAP 10 / SAP 10.2 Review
**Audience:** A fresh agent picking up the deterministic SAP calculator at
`packages/domain/src/domain/sap/`. Read this first, then the spec PDFs,
then the code.
**Goal:** Match the cert software (Elmhurst / Stroma / etc.) output exactly
for RdSAP 10 / SAP 10.2 input certs. This is a **deterministic, mechanical
calculation** — not a model — so MAE should approach zero on certs whose
inputs are fully populated.
---
## 1. Critical framing — this is NOT a judgement call
The SAP/RdSAP energy assessment splits cleanly into two roles:
1. **The assessor** — a person who surveys the dwelling and lodges
measured/observed fields onto the cert (areas, perimeters,
construction codes, insulation thicknesses, fuel types, etc.).
The assessor makes NO calculation decisions.
2. **The cert software** (Elmhurst, Stroma, Quidos, NHER, ECMK) — a
deterministic implementation of the RdSAP 10 + SAP 10.2 specs. It
takes the lodged fields and produces SAP score, CO2 emissions,
primary energy (PEUI), CO2 per m², EI rating, etc.
**Our calculator is replicating role #2.** Cert software is taken to
implement the SAP 10.2 spec faithfully; the question of "where does
Elmhurst diverge from spec?" is no longer the operative one (per
ADR-0010 + §3 below). Our job is to enumerate every spec
table / formula / footnote and verify each against the published SAP
10.2 (14-03-2025) and RdSAP 10 (10-06-2025) PDFs.
There is no "assessor judgement" knob to tune. Each field on the cert
has a deterministic interpretation per the spec, and each spec table /
formula has a deterministic implementation.
---
## 2. Current state (2026-05-19)
- Branch: `ara-backend-design-prd`
- Last clean commit: `f4a8d2a0` ("tests: golden-fixture regression set — 7 currently-correct corpus certs")
- 301 tests passing
- Parity probe (300 random certs from
`data/ml_training/runs/2025_2026_n250000_v18a/data.parquet`, seed=7,
`sap_score ∈ [5, 99]`):
| Metric | Value |
|---|---|
| SAP MAE | 4.61 |
| SAP bias | +0.87 |
| PE MAE | 43.32 kWh/m² |
| PE bias | +37.69 kWh/m² |
- 7 "golden" regression certs locked in
`packages/domain/src/domain/sap/rdsap/tests/test_golden_fixtures.py`.
Tolerance: `|SAP residual| ≤ 1`, `|PE residual| ≤ 10 kWh/m²`. Known
caveat: some of these are compensating-error matches (e.g. cert
`7536-3827`'s PE matches but cost is £143 under cert's implied cost
due to multi-factor offsetting bugs). **These fixtures are retired
per ADR-0010 and §10 below — they lock buggy compensating outputs
in place and will fight the spec sweep.**
> **Read this before anything else.** [ADR-0010](../adr/0010-sap10-calculator-spec-target-and-validation.md)
> supersedes the spec-version target, the PCDB sequencing, and the
> cert-calibration layer of ADR-0009. This handover document was
> originally written under the rejected framing; §3, §4, §7, §7b,
> §10 below have been rewritten in lockstep. §2.5 lists the six
> prerequisites (P1–P6) that land **before** the section-by-section
> sweep starts.
---
## 2.5. Prerequisites before the sweep starts
Six blockers, in dependency order. The section sweep does not start
until all six are merged. Together they convert the parity probe
from a noisy mixture-distribution signal into a clean per-section
verification tool.
### P1 — Re-extract the training parquet with `inspection_date`
The 250k-cert parquet has 202 columns; **none of them are dates**.
Without `inspection_date` on each cert we cannot construct the
Validation Cohort (P3). The ETL currently drops the dates; add them
back as a non-breaking MINOR Feature Schema Version bump (per
ADR-0008). `EpcPropertyData.inspection_date` and `.registration_date`
both exist on the domain object and are populated upstream — the
parquet writer just needs to include them.
### P2 — Delete `domain.sap.tables.table_12_cert_calibration`; correct `domain.sap.tables.table_12`
Per ADR-0010 §2 and §1:
- Remove `table_12_cert_calibration.py` and every call site
(`cert_calibration_prices()`, `cert_calibration_e7_codes`, the
`PriceTable` constructor argument that defaults to it).
- Re-label `table_12.py` as `SAP 10.2 Table 12 (14-03-2025 amendment)`.
- Correct CO2 factors: mains gas 0.214 → **0.210**, standard electricity 0.086 → **0.136** (the file currently mixes SAP 10.2 prices with SAP 10.3 CO2 factors).
- Delete the misleading "+25 % shift from SAP 10.2" comment — 13.19 p
is SAP 10.1 (or SAP 10.2 amendment 0), not SAP 10.2 (14-03-2025).
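
A small regression guard can stop the stale values creeping back in after P2. The sketch below is not the real `table_12.py`: the `FuelRow` shape, the dictionary layout, and the unit labels are assumptions for illustration. Only the numbers themselves come from this handover.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FuelRow:
    """Hypothetical row shape; the real table_12.py may be laid out differently."""

    price_p_per_kwh: float
    co2_kg_per_kwh: float  # unit assumed


# SAP 10.2 Table 12 (14-03-2025 amendment) values as quoted in this handover.
TABLE_12: dict[str, FuelRow] = {
    "mains_gas": FuelRow(price_p_per_kwh=3.64, co2_kg_per_kwh=0.210),
    "standard_electricity": FuelRow(price_p_per_kwh=16.49, co2_kg_per_kwh=0.136),
}

# Values this handover attributes to other spec revisions.
STALE_CO2 = {"mains_gas": 0.214, "standard_electricity": 0.086}
STALE_PRICES = {"mains_gas": 3.48, "standard_electricity": 13.19}


def stale_entries(table: dict[str, FuelRow]) -> list[str]:
    """Return the fuels whose price or CO2 factor matches a known-stale value."""
    return [
        fuel
        for fuel, row in table.items()
        if row.co2_kg_per_kwh == STALE_CO2.get(fuel)
        or row.price_p_per_kwh == STALE_PRICES.get(fuel)
    ]
```

A unit test asserting `stale_entries(TABLE_12) == []` would then fail loudly if a later edit reintroduces a pre-amendment price or a SAP 10.3 factor.
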
### P3 — Filter the parity probe to the Validation Cohort
`Validation Cohort` is defined in `CONTEXT.md` and ADR-0010 §3:
`inspection_date ≥ 2025-07-01`. Modify
`services/ml_training_data/src/ml_training_data/sap_parity_probe.py`
to apply the filter before sampling. The probe sample size and seed
remain configurable; `sap_score ∈ [5, 99]` remains the typicality
filter on top of the cohort filter.
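
The ordering described above (cohort filter, then typicality filter, then seeded sampling) can be sketched with the standard library alone. `ProbeCert` and `sample_probe` are hypothetical names, and the real probe reads rows from the parquet rather than from a list, so treat this as an outline of the logic only.

```python
import random
from dataclasses import dataclass
from datetime import date

# Validation Cohort boundary, per ADR-0010 §3 as quoted in this handover.
COHORT_START = date(2025, 7, 1)


@dataclass(frozen=True)
class ProbeCert:
    """Hypothetical minimal row; the real probe has many more columns."""

    cert_id: str
    inspection_date: date
    sap_score: int


def in_validation_cohort(cert: ProbeCert) -> bool:
    return cert.inspection_date >= COHORT_START


def is_typical(cert: ProbeCert) -> bool:
    # Typicality filter: sap_score in [5, 99].
    return 5 <= cert.sap_score <= 99


def sample_probe(certs: list[ProbeCert], n: int, seed: int) -> list[ProbeCert]:
    """Apply both filters first, then draw a reproducible sample."""
    eligible = [c for c in certs if in_validation_cohort(c) and is_typical(c)]
    rng = random.Random(seed)
    return rng.sample(eligible, min(n, len(eligible)))
```

Filtering before sampling matters: sampling 300 certs first and filtering afterwards would shrink the probe by an unpredictable amount from run to run.
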
### P4 — Implement `PcdbLookup` (replace `NoOpPcdbLookup`)
Per ADR-0010 §4. Download boiler + heat-pump CSVs from
https://www.ncm-pcdb.org.uk. Build a lookup keyed on
`main_heating_index_number`. Surface seasonal efficiency, secondary
efficiency, output kW, and (for HPs) flow-temperature curve. ~half-day
of work per the original handover estimate. The
`Sap10Calculator.__init__(pcdb: Optional[PcdbLookup])` seam from
ADR-0009 grill outcome #1 is the integration point; no calculator-side
changes needed beyond reading `index_number` and routing PCDB-returns
to space-heating / hot-water efficiency lookups instead of Table 4a.
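
A minimal sketch of the lookup seam, assuming a `get(index_number)` method and a CSV with the three column names shown. The real `PcdbLookup` / `NoOpPcdbLookup` signatures and the PCDB file layout are not given in this document, so every name below other than those two class names is hypothetical. Secondary efficiency and the heat-pump flow-temperature curve are left out for brevity.

```python
import csv
import io
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass(frozen=True)
class PcdbRecord:
    index_number: int
    seasonal_efficiency_pct: float
    output_kw: float


class PcdbLookup(Protocol):
    def get(self, index_number: int) -> Optional[PcdbRecord]: ...


class NoOpPcdbLookup:
    """Always misses, so every cert falls through to the table default."""

    def get(self, index_number: int) -> Optional[PcdbRecord]:
        return None


class CsvPcdbLookup:
    """Dict-backed lookup built once from CSV text (column names assumed)."""

    def __init__(self, csv_text: str) -> None:
        self._rows: dict[int, PcdbRecord] = {}
        for row in csv.DictReader(io.StringIO(csv_text)):
            record = PcdbRecord(
                index_number=int(row["index_number"]),
                seasonal_efficiency_pct=float(row["seasonal_efficiency_pct"]),
                output_kw=float(row["output_kw"]),
            )
            self._rows[record.index_number] = record

    def get(self, index_number: int) -> Optional[PcdbRecord]:
        return self._rows.get(index_number)


def space_heating_efficiency_pct(
    pcdb: PcdbLookup, index_number: Optional[int], table_default_pct: float
) -> float:
    """Prefer the PCDB record; fall back to the table value on a miss."""
    if index_number is not None:
        record = pcdb.get(index_number)
        if record is not None:
            return record.seasonal_efficiency_pct
    return table_default_pct
```

Keeping the fallback in one function means the calculator behaves identically under `NoOpPcdbLookup` and under a real lookup that misses, which makes the P4 impact on the probe easy to attribute.
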
### P5 — Populate `SapResult.intermediate` + transcribe BRE worked examples
Per ADR-0010 "Verification infrastructure":
- Populate every named SAP 10.2 worksheet variable on
`SapResult.intermediate` as sketched in §11. This is mechanical —
thread the values from each worksheet module into the dict.
- Transcribe the BRE worked examples from the SAP 10.2 appendices and
RdSAP 10 worked-example annex into unit tests
(`tests/test_bre_worked_examples.py`) that lock per-intermediate
values, not aggregate SAP. These replace the retired cert fixtures.
### P6 — Strict-type `EpcPropertyData` via canonical domain enums
The current `EpcPropertyData` and its nested types carry many bare
`str` fields and `Union[int, str]` fields (the latter because the
gov API gives ints and Site Notes give strings). The defensive
type-handling cascades into the calculator (`cert_to_inputs.py`,
`dimensions.py`, etc.) — `dimensions.py:74-82` is Khalim's documented
example: `SapBuildingPart.identifier` carries main-vs-extension
information but is bare `str`, so the dimensions code defensively
iterates instead of dispatching on a typed kind.
The fix:
1. **One canonical enum per field**, union of all keys appearing
across all schema versions in
`datatypes/epc/domain/epc_codes.csv`. Hand-author the 18 enum
classes (`built_form`, `construction_age_band`, `energy_tariff`,
`glazed_area`, `glazed_type`, `heat_loss_corridor`, `main_fuel`,
`mechanical_ventilation`, `property_type`, `tenure`,
`transaction_type`, `ventilation_type`, `water_heating_fuel`,
`cylinder_insulation_thickness`, `energy_efficiency_rating`,
`improvement_description`, `improvement_summary`, `code`) plus
`BuildingPartKind` (Main Dwelling / Extension N). codes.csv is
the reference; a dedup script can optionally verify coverage but
is not a build dependency.
2. **The API mapper** parses raw ints into the canonical enum.
3. **The Site Notes mapper** parses raw strings into the canonical
enum.
4. **The domain object** (`EpcPropertyData` and nested) holds only
the canonical enums — no `Union[int, str]`, no bare `str` for
coded fields.
5. **Every consumer** (calculator, ML pipeline, recommendations,
ETL, scenario builder) reads from the typed fields.
**Constraint**: repo-wide tests must keep passing. The calculator
is one consumer; the ML pipeline, recommendations, and the Site
Notes ingestion path also consume `EpcPropertyData`. Each mapper-
layer change is paired with adapter updates that preserve the
behaviour the existing tests cover.
Pyright `strict` mode must remain clean (CLAUDE.md).
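
The two-mappers-one-enum shape can be sketched as follows. The tenure codes and labels are placeholders (the real ones come from `epc_codes.csv`), and `BuildingPartId`, `parse_building_part`, `tenure_from_api`, and `tenure_from_site_notes` are hypothetical names. Because "Extension N" carries a number, the sketch pairs the `BuildingPartKind` enum with an optional extension index rather than minting one enum member per extension.

```python
import re
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class BuildingPartKind(Enum):
    MAIN_DWELLING = "main_dwelling"
    EXTENSION = "extension"


@dataclass(frozen=True)
class BuildingPartId:
    kind: BuildingPartKind
    extension_number: Optional[int] = None


_EXTENSION_RE = re.compile(r"^extension\s+(\d+)$", re.IGNORECASE)


def parse_building_part(identifier: str) -> BuildingPartId:
    """Parse the raw identifier once, at the mapper boundary."""
    text = identifier.strip()
    if text.lower() == "main dwelling":
        return BuildingPartId(BuildingPartKind.MAIN_DWELLING)
    match = _EXTENSION_RE.match(text)
    if match:
        return BuildingPartId(BuildingPartKind.EXTENSION, int(match.group(1)))
    raise ValueError(f"unrecognised building part identifier: {identifier!r}")


class Tenure(Enum):
    OWNER_OCCUPIED = "owner_occupied"
    RENTED_PRIVATE = "rented_private"


# Placeholder codes and labels: take the real ones from epc_codes.csv.
_TENURE_BY_API_CODE: dict[int, Tenure] = {
    1: Tenure.OWNER_OCCUPIED,
    3: Tenure.RENTED_PRIVATE,
}
_TENURE_BY_SITE_NOTES_LABEL: dict[str, Tenure] = {
    "owner-occupied": Tenure.OWNER_OCCUPIED,
    "rented (private)": Tenure.RENTED_PRIVATE,
}


def tenure_from_api(code: int) -> Tenure:
    """API mapper: raw int to canonical enum (KeyError on unknown codes)."""
    return _TENURE_BY_API_CODE[code]


def tenure_from_site_notes(label: str) -> Tenure:
    """Site Notes mapper: raw string to the same canonical enum."""
    return _TENURE_BY_SITE_NOTES_LABEL[label.strip().lower()]
```

Failing loudly on unknown codes at the mapper is a deliberate choice in this sketch: it keeps `Union[int, str]` handling out of every downstream consumer.
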
### Expected outcome of P1–P6
After all six land, run the probe against the Validation Cohort. The
expected baseline MAE on the clean probe is much smaller than the
current 4.61, likely 1.5–2.5 SAP points, based on what we know about
the residual breakdown (heat pumps closed by P4, gas boilers tightened
by P4, price-version noise removed by P2+P3). The remaining residual
is the genuine spec sweep target — and per-section fixes will move
the probe in measurable, distinguishable amounts because there's no
compensating layer to mask them, and there's no defensive type
branching obscuring which input value drove which intermediate.
---
## 3. Why the prior diagnosis was wrong and how we fixed it
The prior session shipped ten slices (S-B23 → S-B31) by debugging the
biggest residuals one at a time:
- **PE MAE dropped substantially: 57.28 → 43.32 (−14)**, real progress
on the demand-side calculation.
- **SAP MAE barely moved: 5.34 → 4.61 (−0.73)**, diagnosed at the time
as "cert-calibration absorbs multiple spec deviations".
Three slice attempts looked like they "proved" the cert-cal-absorbs-
deviations diagnosis:
- **Standing charges**: spec Table 12 note (a) requires £92/yr gas
standing charge on space + water heating. Adding it pushed SAP bias
+0.98 → −2.62. Reverted.
- **Cat=10 room heaters off-peak routing**: Table 12a says "other
direct-acting electric heating" bills 100 % high rate on 7-hour
tariff. Switching cat=10 from off-peak to standard rate inverted
the bias +5.88 → −6.00. Reverted.
- **HW cylinder zero-loss for combi** (uncommitted): Table 2 + Table
3 footers require zero storage + primary loss when efficiency comes
from Table 4b. Zeroing them dropped PE MAE by 6.64 but raised SAP
MAE by 0.39 and broke 3 of 7 golden fixtures. Reverted.
The prior agent concluded: *cert-calibration absorbs Elmhurst's
deviations from spec — we can't fix one without re-deriving the
calibration, so do a full spec sweep first and re-derive cert-cal at
the end.* This diagnosis is **wrong** and the proposed remedy
amplifies the problem.
### What was actually going on
The 250k-cert corpus spans multiple SAP spec-version regimes:
- **Pre-2025-03-14**: certs lodged under SAP 10.1 / SAP 10.2 amendment
0 prices — mains gas ~3.48 p, standard electricity 13.19 p.
- **Post-2025-03-14**: certs lodged under SAP 10.2 (14-03-2025) prices
— mains gas 3.64 p, standard electricity 16.49 p.
The `table_12_cert_calibration` prices (3.48 p / 13.19 p) are **the
older spec's prices**, not Elmhurst deviations from the spec. They
are an empirical "best fit" across a mixture distribution of two
price regimes, with downstream-component bugs (PCDB absence, HW
cylinder loss applied to combi, etc.) absorbed into the fit. The
table looks like compensation for assessor-software quirks because we
were never told which spec each cert was on.
Each "spec-correct fix that worsened MAE" in the failed slices above
was actually correct. The MAE regressed because:
1. The cert-cal prices (pre-March-2025 spec) cancelled with one set
of downstream errors to produce a quasi-stable cost.
2. The spec-correct fix landed → that cancellation broke → the
probe MAE went up.
3. But the spec-correct fix was *right* — what regressed was a
compensating-error equilibrium, not the calculator's truth.
The prior session's "re-derive cert-cal at the end" plan would
re-establish a new compensating-error equilibrium across the new bug
set. It does not converge on spec-correctness.
### The fix (per ADR-0010)
1. **Stop fitting against a mixture distribution.** Filter the
validation corpus to a single spec-version window (Validation
Cohort, `inspection_date ≥ 2025-07-01`). Every cert in the cohort
was lodged on SAP 10.2 (14-03-2025) prices.
2. **Delete the cert-calibration layer.** Use spec prices everywhere
(`domain.sap.tables.table_12`). The only price-routing decision
left is Table 12a fractional high-rate blending — a real spec
feature, not a calibration.
3. **Build PCDB**, because it dominates residual variance and the
reason it was deferred (cert-cal-absorbs-PCDB) no longer holds.
4. **Build trace mode and BRE worked-example fixtures**, so
per-section verification works against single-cert intermediates
instead of aggregate corpus MAE.
These steps map onto the six prerequisites in §2.5. Once they land,
the section-by-section spec sweep is expected to produce clean,
monotonic improvements.
---
## 4. Scope decisions (per ADR-0010)
### IN scope
- **SAP 10.2 (14-03-2025 amendment)** is the active spec target.
`docs/sap-spec/sap-10-2-full-specification-2025-03-14.pdf`, 199 pages.
- **RdSAP 10 (10-06-2025)** — the cert→input mapping layer that
cross-references SAP 10.2. `docs/sap-spec/rdsap-10-specification-2025-06-10.pdf`,
114 pages.
- **PCDB integration.** Moved from "Session C deferred" to **P4
prerequisite** (§2.5). Heat pumps and the 78 % of gas-boiler certs
lodging `main_heating_data_source=1` need PCDB-sourced efficiency
for the calculator to be spec-correct. Data source:
https://www.ncm-pcdb.org.uk; lookup keyed on `main_heating_index_number`;
fields: seasonal efficiency, secondary efficiency, output kW,
flow-temperature curve (HPs).
- **All RdSAP 10 sections in document order.** §1 → §19, plus
Tables 27 / 28 / 29 / 30 / 31. The verification approach in §5 is
unchanged — only the precondition changes: the sweep runs against a
clean probe (Validation Cohort + spec prices + PCDB + trace mode).
### OUT of scope
- **Full SAP assessments.** Full-SAP certs lodge measured/calculated
U-values in `walls[i].description` (e.g.
"Average thermal transmittance 0.18 W/m²K"). These are a separate
calculation path (BS EN ISO 6946) and a different corpus. Park
until the RdSAP 10 base case parity is reached. S-B24 / S-B29
attempted partial handling; those slices can stay or be reverted at
your discretion when you reach §§4-7 of RdSAP and §3 of SAP 10.2.
- **SAP 10.3 (13-01-2026).** No SAP-10.3-lodged certs in the corpus,
so it cannot be validated. Calculator targets SAP 10.2 until the
corpus migrates (expected late 2026 / 2027 once BRE updates RdSAP
to reference SAP 10.3). Note: `table_12.py` currently mixes SAP
10.2 prices with SAP 10.3 CO2 factors — corrected as part of P2.
- **Historical-spec cert reproduction.** Calculating what cert SAP
*would have been* under SAP 10.1 / pre-March-2025 SAP 10.2 prices is
not the calculator's job. Lodged Performance carries the historical
value; Calculated SAP10 Performance is current-spec only. The
Validation Cohort filter operationalises this — older certs are
out of the validation loop, not because they're "wrong" but because
they're a different spec's output.
- **Re-deriving cert-cal at the end.** The prior session's plan. The
cert-calibration layer is deleted in P2, not re-fit.
---
## 5. The approach — section-by-section spec verification
Work through the RdSAP 10 spec **in document order**, starting at
§1. For each section:
### 5.1. Read the spec section
Read the section text fully. Note every rule, table reference, and
defaulting cascade.
### 5.2. Find the corresponding code
Map the section to the source file(s) implementing it. The current
mapping (some sections are split across modules):
| RdSAP 10 section | Code location |
|---|---|
| §1 Introduction / general | n/a |
| §2 Property descriptors | `datatypes/epc/domain/epc_property_data.py` |
| §3 Dimensions | `packages/domain/src/domain/sap/worksheet/dimensions.py` |
| §4 Ventilation | `packages/domain/src/domain/sap/worksheet/ventilation.py` |
| §5 Construction / U-values | `packages/domain/src/domain/ml/rdsap_uvalues.py` + `worksheet/heat_transmission.py` |
| §6 Windows / doors / overshading | `worksheet/solar_gains.py` + `rdsap/cert_to_inputs.py` |
| §7 Heating systems (refers to SAP 10.2 Appendix A) | `domain.ml.sap_efficiencies` + `rdsap/cert_to_inputs.py` |
| §8 Heating controls (Table 4e) | `rdsap/cert_to_inputs.py` |
| §9 Heat emitters / flow temperatures | not implemented |
| §10 Space and water heating (Appendix A) | `rdsap/cert_to_inputs.py` |
| §11 Additional items (PV, batteries, wind, hydro, shutters) | partial in `cert_to_inputs.py` (PV only) |
| §12 Electricity tariff | `rdsap/cert_to_inputs.py` (`_is_off_peak_meter`, fuel routing) |
| §13 Addendum to EPCs | n/a |
| §14 Special cases (e.g. flats above commercial) | not implemented |
| §15 Improvements (recommendations) | n/a (not rating) |
| §16-19 RdSAP-specific SAP rating equations | `worksheet/rating.py` |
| Table 27 — Living-area fraction | `rdsap/cert_to_inputs.py:_living_area_fraction` |
| Table 28 — Cylinder size defaults | `domain.ml.demand:_CYLINDER_VOLUME_L` |
| Table 29 — Heating + HW parameters | partial in `cert_to_inputs.py` |
| Table 30 — Mechanical ventilation | not implemented |
| Table 31 — Data to be collected | n/a |
### 5.3. For each spec rule in the section, check our code
For each table, formula, footnote, exception:
1. Does our code implement it?
2. Does the implementation match the spec values exactly?
3. Are there spec-defined edge cases / footnotes we're missing?
### 5.4. When a gap is found
- Write a failing unit test that asserts the spec-correct behaviour
— wherever possible, write it as an assertion on `intermediate`
values rather than on aggregate SAP, using a BRE worked example
if one covers the section.
- Implement the fix.
- Run `test_bre_worked_examples.py` plus the Validation Cohort
probe. Note both direction and magnitude of change.
- If a BRE worked-example breaks, the new code is wrong (revert).
BRE examples are spec-derived and cannot regress from a
spec-correct change.
- Commit per-slice: one section → one commit. Reference the spec
section in the commit message.
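
Asserting on intermediates rather than aggregate SAP needs a comparison helper that names the variable that drifted. The sketch below assumes `SapResult.intermediate` is a flat `dict[str, float]`, as described elsewhere in this handover; the helper name, the default tolerance, and the keys used in any test are illustrative and not taken from the codebase.

```python
import math
from typing import Optional


def intermediate_mismatches(
    actual: dict[str, float],
    expected: dict[str, float],
    rel_tol: float = 1e-3,
) -> dict[str, tuple[Optional[float], float]]:
    """Return {name: (got, want)} for every expected value that is
    missing from `actual` or differs beyond the tolerance.

    Keys present only in `actual` are ignored, so a worked example can
    lock a subset of worksheet lines.
    """
    mismatches: dict[str, tuple[Optional[float], float]] = {}
    for name, want in expected.items():
        got = actual.get(name)
        if got is None or not math.isclose(got, want, rel_tol=rel_tol, abs_tol=1e-9):
            mismatches[name] = (got, want)
    return mismatches
```

A worked-example test can then assert `intermediate_mismatches(result.intermediate, expected) == {}`, and a failure prints exactly which worksheet lines moved and by how much.
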
### 5.5. Sweep-time principle: worksheet-faithful structure
Each `worksheet/*.py` module must mirror the SAP 10.2 worksheet
structure for its section. As you verify a section, also restructure
its module so that:
1. **Each function name references its worksheet-line origin** (e.g.
`heat_transfer_coefficient` aligns with worksheet line (40);
`mean_internal_temperature` aligns with worksheet line (93)).
2. **Compound calculations are split** into one function per
worksheet line where possible — easier to verify against
`intermediate[...]` and against BRE worked-example values.
3. **Defensive type-handling disappears**. Once P6 lands, the input
is a typed enum or numeric — branching on `isinstance(x, int)` is
replaced by enum dispatch.
4. **Domain-typed inputs flow directly**. `SapBuildingPart.kind ==
BuildingPartKind.MAIN_DWELLING` replaces string sniffing of
`identifier`. The dimensions.py "unnecessarily complicated"
pattern Khalim flagged is the canonical example of what *not*
to do.
The principle applies during section-sweep slices. It is **not**
a separate prerequisite — the refactor lands with the verification
slice for the section it touches.
### 5.6. Use trace mode when you need it
P5 populates `SapResult.intermediate: dict[str, float]` with every
named SAP 10.2 worksheet variable. Each section's verification
benefits from inspecting these values per-cert. See §11 below for
the sketch.
---
## 6. What's already been done — section by section
This is your starting map. Each row says whether the section has been
touched and what the current state is.
### Walls / construction (§5)
- **S-B23 (committed `9a509e41`)**: Table 6 "Filled cavity" row dispatch
when `wall_insulation_type=2` AND `wall_construction=4`. Spec-anchored.
- **S-B24 (committed `15613309`)**: Parse `walls[i].description` for
"Average thermal transmittance X W/m²K". **PARK** — full-SAP path.
- **S-B25 (committed `6b934710`)**: Description-based dispatch for cavity
"as built, insulated (assumed)" + similar (type=4 with descriptive
signal). Spec-anchored via legacy `epc_wall_description_map`.
- **S-B26 (committed `361f9154`)**: `_insulation_bucket(0, True) → 50`
fix (the "NI" thickness sentinel) + description-based override of
`wall_ins_present` for non-cavity walls. Spec footnote (Table 6).
- **S-B27 (committed `1f49fa03`)**: Floor `_insulation_bucket` analog —
Table 19 footnote (2) "max(50, age-band default)" when description
signals retrofit.
- **S-B28 (committed `25261d5c`)**: Roof NI thickness + insulated
description → §5.11.4 footnote 50mm joist row.
- **S-B29 (committed `3ab09845`)**: Floor + roof "Average thermal
transmittance" parse. **PARK** — full-SAP path.
**Still to verify in §5**:
- Stone wall U-values for Scotland / Wales / NIR (Tables 7-10) — only
England is fully transcribed; country overrides are partial.
- Cob U-values (§5.6) — table only, no formula implementation.
- Stone formula §5.6 / §5.7 for non-standard wall thicknesses.
- Curtain wall §5.18 — not implemented.
- Party wall U-values (Table 15) — implemented in `u_party_wall`,
verify table values.
- Thermal bridging (Table 21) — implemented as global `y` factor,
verify per-age-band values.
- §5.16 Thermal mass — Table 22 (only 100 / 250 kJ/m²K, dispatched
by construction type with internal insulation). Currently we
hardcode 250 (see `cert_to_inputs.py:_DEFAULT_THERMAL_MASS_PARAMETER_KJ_PER_M2_K`).
This is wrong for timber-frame / cob / internally-insulated masonry
(should be 100).
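
The thermal-mass gap in the last bullet amounts to replacing one constant with a two-way dispatch. The sketch below encodes only what that bullet states (100 for timber-frame, cob, and internally insulated masonry; 250 otherwise). The `WallConstruction` enum is hypothetical, and the full Table 22 row set still needs checking against the spec.

```python
from enum import Enum


class WallConstruction(Enum):
    """Hypothetical construction kinds; the real enum comes from P6."""

    MASONRY = "masonry"
    TIMBER_FRAME = "timber_frame"
    COB = "cob"


LOW_THERMAL_MASS_KJ_PER_M2_K = 100.0
HIGH_THERMAL_MASS_KJ_PER_M2_K = 250.0

_LOW_MASS_CONSTRUCTIONS = frozenset(
    {WallConstruction.TIMBER_FRAME, WallConstruction.COB}
)


def thermal_mass_parameter_kj_per_m2_k(
    construction: WallConstruction, internally_insulated: bool
) -> float:
    """Pick the thermal mass parameter instead of hardcoding 250."""
    if internally_insulated or construction in _LOW_MASS_CONSTRUCTIONS:
        return LOW_THERMAL_MASS_KJ_PER_M2_K
    return HIGH_THERMAL_MASS_KJ_PER_M2_K
```
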
### Heating systems (§§7-10, SAP Appendix A)
- **S-B20 (in history)**: Table 11 secondary heating allocation,
conditional on cert lodging secondary or being electric storage.
- **Failed S-B30 (reverted)**: respect `main_heating_fraction` —
shown empirically wrong. Field is multi-main allocation, not
main-vs-secondary. Spec verified against SAP 10.2 Appendix A1/A4.
- **S-B31 (committed `afdf297f`)**: Table 12c DLF on heat-network main.
Spec §C3.1 + Table 12c.
- **Failed S-B32 (room heater off-peak routing, reverted)**: Table 12a
says cat=10 room heaters on 7-hour tariff bill 100% high rate. Our
cert-cal extends off-peak to codes 691-696. Spec-correct fix
inverted bias direction — calibration was absorbing this.
- **Uncommitted HW cylinder fix**: spec-correct (combi → zero
storage/primary loss per Table 2 + Table 3 footers) but breaks 3
golden fixtures. Decision deferred to systematic pass.
**Still to verify in heating**:
- Table 4a efficiency values for every code (heat pumps, storage
heaters, room heaters, CPSU). The category-fallback (cat=4 → 2.30)
is documented as a known limitation.
- Boiler interlock penalty (5%) — spec §9.2.1: "The efficiency of
gas and liquid fuel boilers for both space and water heating is
reduced by 5% if the boiler is not interlocked for space and water
heating." We don't apply this. Known gap.
- Table 4c condensing-boiler / heat-pump emitter-temperature
adjustment — we don't apply this.
- Table 12a high-rate fractions for off-peak dwellings — we apply
100% off-peak or 100% standard, never fractional blending.
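
Fractional blending, the piece missing from the Table 12a bullet above, is a weighted average of the two tariff rates. This sketch assumes the high-rate fraction has already been looked up for the heating system; the function name is hypothetical and no Table 12a values are reproduced here.

```python
def blended_price_p_per_kwh(
    high_rate_fraction: float, high_rate_p: float, low_rate_p: float
) -> float:
    """Effective unit price for an off-peak dwelling.

    high_rate_fraction is the share of the end use billed at the high
    rate; the remainder is billed at the low rate.
    """
    if not 0.0 <= high_rate_fraction <= 1.0:
        raise ValueError(f"fraction out of range: {high_rate_fraction}")
    return high_rate_fraction * high_rate_p + (1.0 - high_rate_fraction) * low_rate_p
```

The current all-or-nothing behaviour corresponds to calling this with a fraction of exactly 0.0 or 1.0, so swapping it in should leave existing results unchanged until real fractions are wired through.
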
### Hot water (§4 SAP + Appendix J)
- Storage loss factor table (Table 2) — current values in
`domain.ml.demand:_STORAGE_LOSS_FACTOR` are ~3× off from spec
(verified). Known under-prediction of cylinder loss for storage
systems; cancelled by over-prediction of primary loss for combi
systems in aggregate.
- Primary loss formula (Table 3) — implemented as 245/60 kWh by age
band. Spec is a per-month formula `nₘ × 14 × [{0.0091·p + 0.0245·(1-p)}·h + 0.0263]`
with `p` (pipework insulation fraction) and `h` (circulation hours).
Known approximation.
- Combi-boiler zero-loss rule (Table 2 + Table 3 footers): currently
NOT applied (the reverted uncommitted slice). Adding this drops PE
MAE by 6.64 but raises SAP MAE by 0.39.
- Appendix J Vd formula `25N + 36` — currently the simple form, not
the full per-component (shower / bath / other) breakdown. Useful
HW demand is ~7% under spec value.
- ΔT: currently a constant 43°C (55 − 12). Spec uses monthly Tcold and
hot at 52°C, not 55°C. Per-month variance unmodelled.
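
The per-month primary loss formula quoted above translates directly into code. The sketch below takes the expression exactly as written in this handover, with `p` as the insulated pipework fraction and `h` as circulation hours. Which `h` applies in which month is left to the caller, since that rule is not reproduced in this document, and the function names are hypothetical.

```python
from typing import Sequence

DAYS_IN_MONTH: tuple[int, ...] = (31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31)


def primary_loss_kwh(
    days_in_month: int, insulated_fraction: float, circulation_hours: float
) -> float:
    """n_m x 14 x [{0.0091 p + 0.0245 (1 - p)} h + 0.0263] for one month."""
    p, h = insulated_fraction, circulation_hours
    return days_in_month * 14 * ((0.0091 * p + 0.0245 * (1 - p)) * h + 0.0263)


def annual_primary_loss_kwh(
    insulated_fraction: float, hours_by_month: Sequence[float]
) -> float:
    """Sum the monthly losses; hours_by_month must have twelve entries."""
    if len(hours_by_month) != 12:
        raise ValueError("hours_by_month needs one value per month")
    return sum(
        primary_loss_kwh(days, insulated_fraction, hours)
        for days, hours in zip(DAYS_IN_MONTH, hours_by_month)
    )
```

Keeping the monthly function separate from the annual sum lines up with the worksheet-faithful principle in §5.5, because each monthly value can be exposed as its own intermediate.
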
### Lighting (Appendix L)
- `predicted_lighting_kwh` in `domain.ml.demand` uses the heuristic
`9.3 × TFA × (1 − 0.5·led_share − 0.4·cfl_share)`.
- Spec is L1-L12: daylight correction, fixed-lighting capacity, top-up
+ portable shares, monthly profile.
- For LED-dominant home (50+ LEDs): our heuristic gives ~465 kWh, spec
gives ~94 kWh. Known over-prediction by ~5× for new-build LED homes.
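
For reference, the current heuristic is small enough to write out. The sketch reads the bracket as `1 − 0.5·led_share − 0.4·cfl_share`; with that reading an all-LED dwelling of 100 m² (an assumed floor area) comes out at about 465 kWh, which is consistent with the figure quoted above. The function name is hypothetical and this is the heuristic being replaced, not the Appendix L calculation.

```python
def heuristic_lighting_kwh(tfa_m2: float, led_share: float, cfl_share: float) -> float:
    """Current heuristic: 9.3 x TFA x (1 - 0.5 led_share - 0.4 cfl_share).

    Even at led_share = 1.0 the result is only halved, which is why
    LED-dominant homes are over-predicted relative to the spec.
    """
    return 9.3 * tfa_m2 * (1.0 - 0.5 * led_share - 0.4 * cfl_share)
```
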
### Internal gains (§5 SAP)
- `worksheet/internal_gains.py` implements metabolic + cooking +
appliances + lighting (the four positive rows of Table 5).
- **Missing**: Water heating row (`1000 × (65)ₘ / (nₘ × 24)` — i.e.
HW losses recycled as heated-space gains) and Losses row (`40 × N`
for cold inflow + evaporation). Both documented in S-B23 gap list.
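
The two missing Table 5 rows are one-liners. In this sketch `(65)m` is passed in as the monthly heat gain from water heating in kWh and `N` as occupancy. The function names are hypothetical, and returning the losses row as a negative number is an assumption about how `internal_gains.py` sums its rows.

```python
def water_heating_gains_w(heat_gains_kwh_month: float, days_in_month: int) -> float:
    """Water heating row: 1000 x (65)m / (n_m x 24), converting kWh/month to W."""
    return 1000.0 * heat_gains_kwh_month / (days_in_month * 24)


def losses_w(occupancy: float) -> float:
    """Losses row: 40 x N watts, returned negative so it subtracts when summed."""
    return -40.0 * occupancy
```
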
### Ventilation (§4 / Table 5)
- Wind-shelter factor implemented in S-B21.
- Mechanical ventilation (MVHR, MEV, PIV) — not implemented; cert
rarely lodges. Spec §4.2 + Table 4g.
- Pressure-test override (worksheet lines 17-18) — not implemented.
### Tariff / cost (§12 + Table 12 / 12a / 12c)
- Cert-calibration prices in
`domain.sap.tables.table_12_cert_calibration` were originally
documented as an EMPIRICAL fit to Elmhurst's output. They are LOWER
than the published Table 12 spec values by 4-25%. Per §3 they are
pre-March-2025 spec prices, and the layer is deleted in P2.
- Standing charges (Table 12 note (a)): NOT applied. Adding them
worsened MAE on the full-corpus probe; re-measure against the
Validation Cohort (see §7b Finding 2).
- Table 12a high-rate fractions — currently 100% off-peak for E7-
eligible codes, 100% standard otherwise. No fractional blending.
- Heat network DLF (Table 12c) — applied per S-B31 only to main
heating + HW from main. HW-only-from-heat-network is a separate slice.
---
## 7. The cert-calibration "tension" is dissolved (per ADR-0010)
This section originally framed cert-calibration vs spec-correctness as
two end-states the calculator had to choose between. That framing is
wrong (see §3 for the actual diagnosis): the cert-cal values are
pre-March-2025 SAP prices, not Elmhurst deviations from SAP 10.2.
Once the corpus is filtered to the Validation Cohort (P3) and the
cert-cal layer is deleted (P2), the false dichotomy disappears.
### What replaces this section
- **One price table.** `domain.sap.tables.table_12` (re-labelled SAP
10.2 14-03-2025 amendment, CO2 factors corrected per P2).
- **One validation cohort.** `inspection_date ≥ 2025-07-01`, every
cert lodged on the calculator's target spec version.
- **One verification mechanism.** Trace-mode intermediates + BRE
worked-example unit tests for per-section verification; Validation
Cohort probe MAE for aggregate go/no-go.
Cert-software deviations from spec, if they exist at all, are
expected to be small and localised. They surface as residual after
the spec sweep completes against a clean probe — and at that point
the question is whether to chase them at all (Elmhurst-deviation
fixes have low domain value compared to spec-correctness, given the
calculator's product use case is scoring counterfactuals for the
MeasureApplicator chain, not reproducing historical certs).
---
## 7b. Outstanding findings to pick up during the systematic pass
The prior session identified several spec-correct fixes that were
reverted because they made SAP MAE worse against the **full corpus**.
The empirical signal that "reverted" them was version-mixture noise
(see §3) plus compensating-error breakage in the 7 retired golden
fixtures. Each fix below is **expected to land cleanly** once the
six prerequisites in §2.5 are done, because:
- The Validation Cohort (P3) is on a single spec version — the price
mismatch that drove the bias regression on standing charges and
cat=10 routing disappears.
- The cert-cal layer is gone (P2) — no calibration to "break".
- PCDB is integrated (P4) — the heat-pump and gas-boiler residuals
that dominated per-cert MAE collapse before any of these findings
even matter.
- The fixtures are now BRE worked examples (P5 + §10) — they cannot
be broken by spec-correct changes because they are themselves
derived from the spec.
Treat each finding as a section-sweep TODO. The empirical impacts
below were measured against the **dirty probe** (full corpus + cert-cal
+ no PCDB) and are **not predictive** of behaviour on the clean probe.
Re-measure each fix against the Validation Cohort after prerequisites
land.
### Finding 1 — HW cylinder zero-loss rule for combi boilers
**Status**: spec-correct fix exists in working-tree-only form
(uncommitted). Reverted at end of last session.
**Spec basis**:
- **SAP 10.2 Table 2 footer (page 158)**: "In the case of a
combination boiler: a) the storage loss factor is zero if the
efficiency is taken from Table 4b"
- **SAP 10.2 Table 3 footer (page 160)**: "Primary loss is set to
zero for the following: Electric immersion heater, Combi boiler
(including when it is part of a combined heat pump and boiler
package and provides all the hot water), CPSU (including electric
CPSU), Boiler and thermal store within a single casing, Separate
boiler and thermal store connected by no more than 1.5 m of
insulated pipework, Direct-acting electric boiler, Heat pump (not
combined heat pump and boiler package with a non-combi boiler)
from PCDB with hot water vessel integral to package"
**The bug**: our calculator currently adds storage loss (~135 kWh)
and primary loss (~245 kWh) for ALL certs with an age band lodged,
ignoring whether the dwelling has a cylinder. **67% of corpus certs
explicitly lodge `has_hot_water_cylinder=False`** (the modal combi
boiler case) — we add 380 kWh of fictional HW losses for each.
**The fix** (sketch, ~10 lines):
1. Add `has_cylinder: bool = True` keyword to
`predicted_hot_water_kwh` in `packages/domain/src/domain/ml/demand.py`.
2. When `has_cylinder=False`, set `storage_loss = 0` and `primary_loss = 0`.
3. In `cert_to_inputs.py` (around line 829), pass
`has_cylinder=epc.has_hot_water_cylinder and not is_instantaneous`.
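
The three steps above could look roughly like the sketch below. It is not the real `predicted_hot_water_kwh`: the function names, parameters, and default loss figures (the ~135 kWh and ~245 kWh quoted above) are illustrative assumptions, shown only to make the shape of the change concrete.

```python
def hot_water_losses_kwh(
    storage_loss_kwh: float = 135.0,  # illustrative figure quoted above
    primary_loss_kwh: float = 245.0,  # illustrative figure quoted above
    *,
    has_cylinder: bool = True,
) -> float:
    """Return storage + primary loss, zeroed when no cylinder is present."""
    if not has_cylinder:
        # No cylinder lodged: zero both losses, as step 2 describes.
        return 0.0
    return storage_loss_kwh + primary_loss_kwh


def cylinder_flag(has_hot_water_cylinder: bool, is_instantaneous: bool) -> bool:
    """Hypothetical mirror of the expression passed in step 3."""
    return has_hot_water_cylinder and not is_instantaneous
```

Keeping `has_cylinder` keyword-only with a default of `True` means existing call sites keep their current behaviour until they opt in.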
**Empirical impact** (measured on 300-cert probe):
- **PE MAE: 43.32 → 36.68 (-6.64) ← biggest single fix found this session**
- PE bias: 37.69 → 30.41 (-7.28)
- SAP MAE: 4.61 → 5.00 (+0.39, regression)
- 3 of 7 golden fixtures break
**Why it was reverted**: the SAP regression + broken fixtures indicate
the fictional HW losses were partially compensating for OTHER bugs
(likely lighting over-prediction for LED-dominant homes). The right
ordering is to fix the spec-clear cases (HW cylinder, lighting per
Appendix L, etc.) together. Per ADR-0010, cert-cal is deleted rather
than re-derived afterwards.
**When to pick up**: when you reach §4 / Appendix J during the
systematic pass. Pair with the lighting Appendix L fix to avoid
breaking the golden fixtures individually.
### Finding 2 — Standing charges (Table 12 note (a))
**Status**: spec-correct, never implemented. Empirically rejected by
4-mode probe.
**Spec basis**: SAP 10.2 Table 12 note (a), page 190:
> "For calculations including regulated energy uses only (e.g.
> regulation compliance, energy ratings):
> - The standing charge for electricity standard tariff is omitted
> - The standing charge for off-peak electricity is added to space
> and water heating costs where either main heating or hot water
> uses off-peak electricity
> - The standing charge for gas fuels is added to space and water
> heating costs where the gas fuel is used for space heating
> (main or secondary) or for water heating"
**The bug**: our calculator never adds standing charges. Per spec, a
gas-heated dwelling should have £92/yr added to the ECF numerator.
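
A minimal sketch of how note (a) could be encoded, assuming the caller already knows which fuels serve which end-uses. The function name and its boolean parameters are hypothetical. The £92 default is the gas figure quoted above, and the off-peak charge is left as a required argument because its value is not stated here and has to be read from Table 12.

```python
def regulated_standing_charge_gbp(
    *,
    gas_used_for_space_or_water: bool,
    off_peak_used_for_main_or_water: bool,
    off_peak_standing_gbp: float,      # not stated in this document; read from Table 12
    gas_standing_gbp: float = 92.0,    # figure quoted in this finding
) -> float:
    """Standing charges to add to space and water heating costs per note (a)."""
    total = 0.0
    # The standard-tariff electricity standing charge is always omitted.
    if gas_used_for_space_or_water:
        total += gas_standing_gbp
    if off_peak_used_for_main_or_water:
        total += off_peak_standing_gbp
    return total
```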
**Empirical impact** (4-mode probe, 300 certs):
| Mode | All certs | Gas-only |
|---|---|---|
| cert-cal, no standing (current) | MAE 4.69, bias +0.98 | MAE 4.01, bias +0.80 |
| cert-cal + gas standing | MAE 4.94, bias **-2.62** | MAE 4.31, bias **-3.53** |
Adding standing charges shifts SAP bias by ~3.5 points downward —
clearly the wrong direction. The cert-cal prices (3.48p gas vs spec
3.64p) implicitly absorb the standing-charge contribution.
**When to pick up**: when you reach §12 / Table 12. Apply alongside
spec-correct unit prices (3.64p gas, 16.49p elec). With cert-cal
deleted (P2), measure the result against the Validation Cohort rather
than re-deriving a calibration layer.
### Finding 3 — Cat=10 room heaters off-peak routing
**Status**: spec-correct fix identified; the current code bills room
heaters at the off-peak rate on E7 dwellings. Empirically rejected.
**Spec basis**: SAP 10.2 Table 12a (page 191):
> "Other direct-acting electric heating (including electric secondary
> heating): 7-hour tariff 1.00 high rate; 10-hour tariff 0.50 high rate"
**The bug**: our cert-calibration (`cert_calibration_e7_codes`)
extends the off-peak set to include codes 691-696 (room heaters).
That's the S-B14 empirical extension — the previous agent found it
helped some specific certs. Per Table 12a it's WRONG: room heaters
on E7 should bill 100% at HIGH rate, not at low rate.
**Empirical impact**: switching from off-peak (5.50p cert-cal) to
standard rate (13.19p), which is closer to spec but still not the
high rate (15.29p cert-cal), inverted the bias from +5.88 to -6.00
without improving MAE.
**The real issue**: Table 12a defines FRACTIONAL blending (e.g.
"90% high, 10% low" for direct-acting electric boiler on 7-hour
tariff), not binary on/off-peak. Our calculator only supports binary.
A proper implementation needs per-system high-rate fractions.
**When to pick up**: when you reach §12 / Table 12a. Implement
fractional blending for all the rows of Table 12a, not just cat=10.
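
Fractional blending could be sketched as below. The function, the dictionary, and its string keys are hypothetical; only the three fractions quoted in this finding are filled in, and a real table would need every Table 12a row transcribed from the spec.

```python
def blended_price_p_per_kwh(
    high_rate_p: float, low_rate_p: float, high_rate_fraction: float
) -> float:
    """Weight the high and low unit prices by the Table 12a high-rate fraction."""
    if not 0.0 <= high_rate_fraction <= 1.0:
        raise ValueError("high_rate_fraction must be within [0, 1]")
    return high_rate_fraction * high_rate_p + (1.0 - high_rate_fraction) * low_rate_p


# Fractions quoted in this finding only; keys are illustrative names.
HIGH_RATE_FRACTION = {
    ("other_direct_acting_electric", "7-hour"): 1.00,
    ("other_direct_acting_electric", "10-hour"): 0.50,
    ("direct_acting_electric_boiler", "7-hour"): 0.90,
}
```

With a fraction of 1.00 the blended price equals the high rate, which is the room-heater case on a 7-hour tariff. The binary on/off-peak model is then just the special case where every fraction is 0 or 1.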
### Finding 4 — Lighting (Appendix L proper)
**Status**: gap. Current code uses a 9.3 kWh/m² heuristic with simple
LED/CFL reductions; spec is the L1-L12 cascade with daylight
correction, fixed-lighting capacity, top-up + portable shares,
monthly profile.
**Spec basis**: SAP 10.2 Appendix L §L1 (pages 88-90), equations
L1-L12.
**The bug**: for a 100 m² LED-dominant home (e.g. cert 7536-3827 with
51 LEDs), our heuristic returns 465 kWh/yr; spec returns ~94 kWh/yr.
Over-prediction by ~5× on LED-dominant homes (which is most modern
stock).
**Empirical impact** (estimated):
- ~5-6 kWh/m² PEUI over-prediction for LED-dominant population
- Corpus-weighted: ~3-4 kWh/m² PEUI bias contribution
**When to pick up**: when you reach Appendix L. Pair with the HW
cylinder fix (Finding 1) to avoid the SAP MAE regression.
### Finding 5 — Internal-gains Table 5 missing rows
**Status**: gap. Spec Table 5 has 7 rows for internal gains; our
`worksheet/internal_gains.py` implements 4.
**Spec basis**: SAP 10.2 Table 5 (page 177).
**Missing rows**:
- **Water heating**: `1000 × (65)ₘ / (nₘ × 24)` W — the HW losses
(cylinder + distribution + primary) recycled as heated-space gains
via worksheet line (65). Reduces space heating demand.
- **Losses**: `-40 × N` W — heat to incoming cold water and
evaporation. Negative contribution.
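
The two missing rows are small enough to sketch directly. The function and parameter names below are assumptions, not existing code in `worksheet/internal_gains.py`; the formulas are the ones listed above.

```python
def water_heating_gains_w(line_65_kwh: float, days_in_month: int) -> float:
    """Table 5 water-heating row: 1000 × (65)m / (nm × 24), in watts."""
    return 1000.0 * line_65_kwh / (days_in_month * 24)


def losses_gains_w(occupancy_n: float) -> float:
    """Table 5 losses row: -40 × N, in watts (a negative contribution)."""
    return -40.0 * occupancy_n
```

For example, `losses_gains_w(2.7)` gives about -108 W, and a month of 31 days with 55.8 kWh on line (65) gives about 75 W of water-heating gains.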
**Empirical impact** (estimated):
- For N=2.7: HW gains ≈ +75 W, losses ≈ -108 W, net ≈ -33 W. Currently
we miss both → our gains are 33 W too high → space heating demand
too low → PE under-predicted by ~3 kWh/m² (rough).
**When to pick up**: when you reach §5 / Table 5. Worksheet line (65)
also needs implementation — the HW losses already exist in our calc
(see `demand.py:_cylinder_storage_loss_kwh` etc.), they just need
piping into internal_gains.
### Finding 6 — Storage-loss-factor table values are wrong
**Status**: gap. Affects only certs with `has_hot_water_cylinder=True`
(33% of corpus).
**Spec basis**: SAP 10.2 Table 2 (page 158).
**The bug**: `domain.ml.demand:_STORAGE_LOSS_FACTOR` values are ~3×
LOWER than spec. E.g. for 38mm foam our value is 0.0056, spec is
0.0181. Effect: we UNDER-predict cylinder storage loss by ~300 kWh
for storage systems, partly cancelling the over-prediction from
Finding 1.
**When to pick up**: when you reach §4 / Table 2. Fix WITH Finding 1
(combi zero-loss) so the cancellation doesn't dominate the
direction.
### Finding 7 — Heat-pump fallback efficiency 2.30
**Status**: gap that requires PCDB. See §8b.
### Finding 8 — Other smaller gaps (carry forward)
- Boiler interlock 5% penalty (§9.2.1) — never applied
- Table 4c condensing boiler / HP emitter temperature adjustment — never applied
- Control-temperature adjustment from Table 4e — always 0 in code, spec varies
- Wall U-values for Scotland / Wales / NIR — only England fully transcribed
- Per-junction thermal bridging (Table R2) — global y approximation only
- Multi-main heating (`main_heating_fraction` ≠ 1) — first main only
- Cooling §10 — not implemented (rare in UK)
- FEE §11 — not implemented (new-build only)
---
## 8. Don't repeat — known dead-ends
> **Re-read after §3 + §7b.** Three entries below were classified as
> "dead-ends because cert-cal absorbs" — that diagnosis is wrong.
> They are spec-correct fixes that were measured under a noisy probe.
> Now flagged as **conditional dead-ends**: dead only if you try them
> before P1-P5 land. After prerequisites: they are expected
> improvements, not dead-ends. See ADR-0010.
- ❌ **Switching "NI" wall thickness to None alone** (S-B5 in history) —
over-corrected because it routed to the (Unfilled cavity, 50mm) row
instead of the dedicated Filled cavity row. The right fix landed in
S-B23 with a `WALL_INSULATION_FILLED_CAVITY` dispatcher.
- ❌ **Aggressive efficiency rescue for missing `sap_main_heating_code`**
(S-B5) — over-corrected. The category fallback (cat=4 → 2.30) is
intentionally conservative; PCDB (P4 prerequisite) supplies the
real efficiency.
- ⚠️ **Using SAP 10.2 spec prices for parity validation** — under
the dirty probe, cert-cal prices fit better. **Inverts under the
clean probe (P2 + P3): SAP 10.2 spec prices are correct because the
Validation Cohort is on the 14-03-2025 amendment.** Listed here
only as a warning if you start the sweep before prerequisites land.
- ❌ **Always applying 10% secondary heating** — must be conditional on
cert lodging or main system being electric storage (S-B20). See
spec Appendix A.4.
- ❌ **Respecting `main_heating_fraction` for secondary allocation**
(failed S-B30) — the field is the multi-main allocation (system 1 vs
system 2), not main-vs-secondary. SAP MAE 4.69 → 4.85 (worse).
- ⚠️ **Switching cat=10 room heaters off off-peak** (failed S-B32) —
spec-correct per Table 12a. The bias inversion under the dirty
probe was driven by cert-cal compensating; on the clean probe this
is just spec-correct. Land as part of the §12 spec sweep after
prerequisites.
- ⚠️ **Adding gas standing charges** (4-mode probe, unimplemented) —
spec-correct per Table 12 note (a). Same logic: bias drift under
dirty probe is version-mixture + missing-PCDB noise, not Elmhurst
deviation. Land as part of §12 spec sweep.
- ⚠️ **Zeroing storage + primary loss for combi boilers** (uncommitted
S-B32) — spec-correct per Table 2 + Table 3 footers. SAP MAE
regression was driven by the now-retired golden fixtures (§10) and
cert-cal absorption. Land as part of §4 / Appendix J sweep.
---
## 9. The cert corpus and parity probe
### Sample
`data/ml_training/runs/2025_2026_n250000_v18a/data.parquet` is the
250k-cert parquet. **After P1 lands** the parquet carries
`inspection_date`; the probe then filters to the **Validation Cohort**
(`inspection_date ≥ 2025-07-01`) plus `sap_score ∈ [5, 99]` and
samples 300 at seed=7 by default. Filtering rationale:
- ≤ 5 is heritage/anomaly stock (under 3% of corpus)
- ≥ 99 is full-SAP new-builds the parquet excludes anyway
- `inspection_date ≥ 2025-07-01` ensures every cert was lodged on
SAP 10.2 (14-03-2025 amendment) — see [CONTEXT.md](../../CONTEXT.md)
/ "Validation Cohort" and ADR-0010 §3.
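
As a rough illustration of the filter and sampling described above, the sketch below works on plain dict rows. The row shape, the function name, and the use of `random.Random` are assumptions; the real probe may sample differently, so this will not reproduce its exact 300 certs. The score bounds are treated as inclusive, following the interval notation.

```python
import random
from datetime import date

COHORT_START = date(2025, 7, 1)  # Validation Cohort cut-off stated above


def select_probe_sample(certs, n=300, seed=7, min_score=5, max_score=99):
    """Filter to the Validation Cohort and score range, then sample n rows."""
    cohort = [
        c
        for c in certs
        if c["inspection_date"] >= COHORT_START
        and min_score <= c["sap_score"] <= max_score
    ]
    rng = random.Random(seed)  # fixed seed keeps the sample repeatable
    return rng.sample(cohort, min(n, len(cohort)))
```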
### Run the probe
```bash
python -c "
import sys
sys.path.insert(0, 'packages/domain/src')
sys.path.insert(0, '.')
sys.path.insert(0, 'services/ml_training_data/src')
from ml_training_data.sap_parity_probe import main
main(['300','7'])
"
```
### What the probe shows
- Aggregate SAP MAE / RMSE / bias
- Aggregate PE MAE / RMSE / bias
- Per-end-use PEUI breakdown (space / HW / lighting / pumps)
- Stratification by `main_heating_category`, `construction_age_band`,
`dwelling_type`
- Worst-15 residuals (SAP and PE)
### Known parquet limitations
- ~0.7% of parquet certs have `construction_age_band=None` vs 15% in
the raw bulk-zip. The parquet filters out full-SAP new-builds
upstream. Don't measure full-SAP-path slices against the parquet.
- Heat-pump certs (cat=4) are under-represented and concentrated in
the worst-residual tail because PCDB efficiency is unavailable.
---
## 10. Fixtures: retire the 7 cert-based golden fixtures, replace with BRE worked examples (per ADR-0010 + P5)
The 7 cert-based fixtures at
`packages/domain/src/domain/sap/rdsap/tests/test_golden_fixtures.py`
were locked in against the current calculator state — *with* cert-cal,
*without* PCDB, *with* HW cylinder loss always applied, *with* the
lighting heuristic, etc. They are documented in §3 / the prior
handover as containing compensating errors. Once the prerequisites
land, every spec-correct fix breaks at least one of them. They will
fight the spec sweep.
### Replacement strategy
**Primary regression suite: BRE worked-example fixtures.**
Transcribe the worked examples from:
- SAP 10.2 spec appendices (especially Appendix R — reference values
and the worked example dwelling).
- RdSAP 10 (10-06-2025) worked-example annex.
Each worked example becomes a unit test that locks **per-intermediate
expected values** (HLP, HTC, monthly mean internal temperature (MIT),
ECF, SAP score) rather than the aggregate SAP score alone. Because
they are spec-derived, no spec-correct change can break them; any
break is unambiguously an implementation bug.
These tests live at
`packages/domain/src/domain/sap/tests/test_bre_worked_examples.py`
(new module — separate from the cert-based fixtures module).
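
One possible shape for the per-intermediate assertion is a helper that reports every mismatched key at once instead of failing on the first. The helper name and the tolerance are assumptions; expected values would be transcribed from the worked examples, and none are reproduced here.

```python
import math


def intermediate_mismatches(actual, expected, rel_tol=1e-3):
    """List every expected intermediate that is missing or out of tolerance."""
    mismatches = []
    for key, want in expected.items():
        got = actual.get(key)
        if got is None or not math.isclose(got, want, rel_tol=rel_tol):
            mismatches.append(f"{key}: expected {want}, got {got}")
    return mismatches
```

A test would then assert `intermediate_mismatches(result.intermediate, expected) == []`, so a failure message names the worksheet quantities that drifted.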
**Cert-based fixtures retired.**
The current `test_golden_fixtures.py` is either deleted or repurposed
as a *very loose* smoke-test integration suite (e.g. `|SAP residual|
≤ 5`) that catches catastrophic regressions only. The 7 cert JSONs
under `fixtures/golden/<cert>.json` can be kept on disk as reference
data, but they no longer drive go/no-go decisions in the sweep.
**Optional future addition.**
If/when a current Elmhurst (or Stroma / Quidos / NHER) license is
available, run a handful of representative corpus certs through it
and lock those outputs as a second-tier regression suite:
Elmhurst-parity fixtures alongside spec-parity fixtures. Not a
prerequisite.
---
## 11. Trace mode (prerequisite P5 — implementation sketch)
This section was originally labelled "recommended"; it is now
**prerequisite P5** per ADR-0010. The sweep does not start until
`intermediate` is populated everywhere. ADR-0009 proposed:
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SapResult:
    sap_score: float
    ...  # other result fields elided
    intermediate: dict[str, float]
```
The `intermediate` field was never populated. Suggested implementation
for the systematic pass:
```python
intermediate = {
    # §1 dimensions
    "tfa_m2": tfa,
    "volume_m3": volume,
    "storey_count": storeys,
    # §3 heat transmission
    "walls_w_per_k": ht.walls_w_per_k,
    "roof_w_per_k": ht.roof_w_per_k,
    "floor_w_per_k": ht.floor_w_per_k,
    "party_walls_w_per_k": ht.party_walls_w_per_k,
    "windows_w_per_k": ht.windows_w_per_k,
    "doors_w_per_k": ht.doors_w_per_k,
    "thermal_bridging_w_per_k": ht.thermal_bridging_w_per_k,
    "infiltration_ach": infiltration,
    "infiltration_w_per_k": infiltration * volume * 0.33,
    "heat_transfer_coefficient_w_per_k": hlc,
    "heat_loss_parameter_w_per_m2k": hlp,
    "time_constant_h": tau_h,
    # §5 internal gains (annual averages)
    "internal_gains_annual_avg_w": ...,
    # §7 mean internal temperature (annual avg)
    "mean_internal_temp_annual_avg_c": ...,
    # §9 space heating
    "useful_space_heating_kwh_per_yr": space_heating_kwh,
    # §12 fuel costs (per end-use)
    "main_heating_cost_gbp": ...,
    "hot_water_cost_gbp": ...,
    "lighting_cost_gbp": ...,
    "pumps_fans_cost_gbp": ...,
    # §13 rating
    "ecf": ecf,
    "deflator": 0.36,
    # §14 primary energy and CO2 per end-use
    "space_heating_pe_kwh_per_m2": ...,
    "hot_water_pe_kwh_per_m2": ...,
    # ... remaining per-end-use entries elided
}
```
Once populated, the differential debugging the reviewer recommended
becomes possible: change one input field, compare deltas against an
Elmhurst export.
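
The differential step could be as simple as diffing two `intermediate` dicts and ranking the changes. The helper below is a hypothetical sketch, not existing code; it assumes both runs expose the same float-valued keys.

```python
def intermediate_deltas(baseline, variant):
    """Return changed intermediates as variant - baseline, largest change first."""
    deltas = {
        key: variant[key] - baseline[key]
        for key in baseline.keys() & variant.keys()
        if variant[key] != baseline[key]
    }
    # Sort so the quantity most affected by the input change is listed first.
    return dict(sorted(deltas.items(), key=lambda kv: abs(kv[1]), reverse=True))
```

Running the calculator twice with one input field changed and passing both `intermediate` dicts through this shows which worksheet quantities the field actually reaches.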
---
## 12. Specific section-1 starting tasks (suggested first session)
A concrete pickup point:
### Session 1 — §1 (Introduction), §2 (Property Descriptors), §3 (Dimensions)
- §1 is prose; nothing to verify.
- §2 maps to `EpcPropertyData`. Verify that every field RdSAP §2
enumerates is present and correctly typed on the domain object.
Specifically check: `dwelling_type`, `built_form`, `property_type`,
`construction_age_band`, `country_code`. Note that
`construction_age_band` is per-building-part, not dwelling-level,
and the primary age band drives most defaults.
- §3 maps to `worksheet/dimensions.py`. Verify:
- Total floor area sum across building parts equals TFA
- Volume calculation per storey × area × height
- Storey count handling for extensions and room-in-roof
- Multi-storey heat-loss-perimeter rules
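
The first two §3 checks can be expressed as small invariants. The `Storey` type and both functions are hypothetical stand-ins for whatever `worksheet/dimensions.py` actually exposes; the sketch only shows the arithmetic being verified (sum of areas, and area × height per storey).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Storey:
    floor_area_m2: float
    height_m: float


def total_floor_area_m2(building_parts):
    """Sum floor area over every storey of every building part."""
    return sum(s.floor_area_m2 for part in building_parts for s in part)


def volume_m3(building_parts):
    """Sum area × height over every storey of every building part."""
    return sum(s.floor_area_m2 * s.height_m for part in building_parts for s in part)
```

For a main part with storeys of 50 m² × 2.4 m and 50 m² × 2.5 m plus a 10 m² × 2.4 m extension, this gives a TFA of 110 m² and a volume of 269 m³.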
This single session should produce zero behaviour changes if §1-3 are
correctly implemented, but expect to find at least one issue in §3
geometry (per the reviewer's "biggest SAP error sources" list).
**Important:** Session 1 only starts after all five prerequisites in
§2.5 have landed and the Validation Cohort probe baseline has been
captured. Until then, running per-section verification produces noisy
signal.
Run the BRE worked-example fixtures (P5) + Validation Cohort probe
(P3) at the end of each session; expect no movement until you start
hitting actual gaps.
---
## 13. Workflow recap
**Phase 0 — Prerequisites (§2.5).** Land P1-P6 first, in dependency
order:
| | Slice | Depends on |
|---|---|---|
| P1 | Re-extract parquet with `inspection_date` | — |
| P2 | Delete cert-cal; correct `table_12.py` CO2 factors | — |
| P3 | Filter parity probe to Validation Cohort | P1 |
| P4 | Implement `PcdbLookup` | — (P2 helpful) |
| P5 | Populate `SapResult.intermediate` + transcribe BRE worked examples | — |
| P6 | Strict-type `EpcPropertyData` via codes.csv-derived enums | — |
P1, P2, P4, P5, P6 can run in parallel. P3 needs P1. Capture a
Validation Cohort probe baseline once all six land — that is the new
MAE starting line. Repo-wide tests stay green throughout P6 (Site
Notes consumers, ML pipeline, recommendations, etc. all need the
mapper updates that accompany each typing change).
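
The dependency table can be checked mechanically. The sketch below encodes only the hard dependency stated above (P3 needs P1) using the standard-library `graphlib`; the dictionary and function names are illustrative.

```python
from graphlib import TopologicalSorter

# Each prerequisite maps to the set of prerequisites it depends on.
PREREQ_DEPENDS_ON = {
    "P1": set(),
    "P2": set(),
    "P3": {"P1"},
    "P4": set(),
    "P5": set(),
    "P6": set(),
}


def landing_order(deps=PREREQ_DEPENDS_ON):
    """One valid order in which every dependency lands before its dependant."""
    return list(TopologicalSorter(deps).static_order())


def ready_now(deps, landed):
    """Prerequisites not yet landed whose dependencies have all landed."""
    return sorted(p for p, needs in deps.items() if p not in landed and needs <= landed)
```

With nothing landed, `ready_now` returns every slice except P3, which matches the parallelism described above.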
**Phase 1 — Section sweep.** For each RdSAP 10 section, in document
order:
1. Read the spec section text + cited tables.
2. Identify code location(s).
3. For each rule / table / footnote:
- Does our code implement it?
- Does the implementation match?
- Edge cases / fallback paths handled?
4. For each gap: AAA unit test (preferring a BRE worked-example
assertion on `intermediate` values when possible) → minimal
implementation → commit.
5. **Apply the worksheet-faithful structure principle** (§5.5) as
part of this slice: name functions after worksheet lines, split
compound calculations, replace any remaining defensive
type-handling with typed-enum dispatch.
6. After each commit: run `test_bre_worked_examples.py` + Validation
Cohort probe. Note both deltas in the commit message.
7. If a BRE worked-example breaks: the new code is wrong (revert).
The worked examples are spec-derived and cannot be broken by
spec-correct changes.
Stick to this. The prior session's mistake was jumping between
sections based on residual-size **on a dirty probe**. Clean probe
plus document-order discipline plus worksheet-faithful structure is
what makes the sweep converge.
---
## 14. Useful references
- **ADR-0010** `docs/adr/0010-sap10-calculator-spec-target-and-validation.md`
— the binding decisions reflected in this rewrite: SAP 10.2 target,
cert-cal deletion, Validation Cohort, PCDB-as-prerequisite, fixture
retirement. **Read first.**
- **ADR-0009** `docs/adr/0009-deterministic-sap-calculator.md` —
original calculator decision rationale + Session A/B/C plan. Read
for context; spec-version target / PCDB sequencing / cert-cal
rationale are superseded by ADR-0010.
- **Spec coverage map**
`docs/sap-spec/SPEC_COVERAGE.md` — pre-existing coverage tracker.
Update as you go.
- **Parity findings**
`docs/sap-spec/PARITY_FINDINGS.md` — empirical findings from prior
sessions.
- **Earlier handover**
`docs/sap-spec/HANDOVER_FRESH_REVIEW.md` — orientation from the
previous fresh-context pass.
- **Reviewer feedback (informal)**: ChatGPT critique of the
slice-by-slice approach. Key recommendations: two-layer architecture
(RdSAP expansion → SAP worksheet), trace mode, golden-master
methodology, differential debugging, reference traces from
Elmhurst/Stroma/Quidos.
- **Commit log** — `git log --oneline` shows the slice history; each
S-Bxx commit message documents the spec ref + measured impact.
---
## 15. Final note
The prior session's framing — *"the cert-calibration layer absorbs
Elmhurst's spec deviations; we'll re-derive it at the end"* — was
load-bearing on a false diagnosis. The cert-cal layer is
pre-March-2025 SAP prices fit against a mixture distribution of two
spec-version regimes. Once you separate the regimes (Validation
Cohort) and use spec prices everywhere, the "tension" disappears.
After P1-P5 land, the section sweep is straightforward: every
spec-correct fix is unambiguously the right answer, BRE
worked-example fixtures lock the result, and Validation Cohort probe
MAE moves monotonically downward. The fixes the prior session marked
as "spec-correct but probe-regressed" become trivially landable.
**Welcome to the project. Read ADR-0010, land the five prerequisites,
then walk the spec in document order. The deterministic answer is in
there.**