docs: handoff for expanding the real-life cert SAP-accuracy corpus

Strategy/context companion to the validate-cert-sap-accuracy skill: the
per-cert loop, how to read the gov-API-vs-Elmhurst comparison, the code->value
gotchas (immersion/cylinder/party-wall/baths/off-peak), known mapper gaps to
chase (alt-wall drop), cert-selection for coverage, guardrails (corpus gauge,
no tuning to one cert, no tolerance widening), and the current corpus state.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Jun-te Kim 2026-06-15 15:28:40 +00:00
parent 5c11fd35c8
commit e289c1449b

@ -0,0 +1,131 @@
# Handoff — Real-life cert SAP accuracy (validate → fix → expand)
**Purpose.** Grow `real_life_examples` into a trustworthy regression corpus that
validates this repo's SAP calculator against accredited **Elmhurst Energy**,
one real certificate at a time — and use each cert to *improve the mapper and
calculator* and *add test coverage*. This is the strategy/context doc; the
step-by-step procedure is the **`/validate-cert-sap-accuracy`** skill.
---
## TL;DR — the loop per cert
Run **`/validate-cert-sap-accuracy <uprn>`**. It drives:
1. `scripts/fetch_real_life_epc_sample.py <uprn>` → saves
`backend/epc_api/json_samples/real_life_examples/<schema>/uprn_<uprn>/epc.json`,
prints schema + lodged rating + our engine's SAP.
2. `/epc-to-elmhurst-rdsap-inputs <uprn>` → writes `elmhurst_inputs.md` (page-by-page
Elmhurst entry sheet with code→value mappings).
3. **You** build it in Elmhurst, export the **Summary** and **SAP-10.2 worksheet**
PDFs → save as `elmhurst_summary.pdf` / `elmhurst_worksheet.pdf` in the sample dir.
4. `scripts/compare_epc_paths.py <uprn>` → builds `EpcPropertyData` from BOTH the
gov-API json and the Elmhurst summary, deep-diffs them, runs BOTH through
`Sap10Calculator`, and prints Elmhurst's worksheet SAP.
5. **Reconcile** the field diffs to convergence (see "Reading the comparison").
6. **Pin** the agreed score: add a `RealCertExpectation` to
`tests/domain/sap10_calculator/test_real_cert_sap_accuracy.py`; the sample
dir is already the corpus entry, so the pin is what activates it.
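
As a rough illustration of the loop above, the sketch below inspects a sample directory and reports which step still needs doing, based only on the artefact filenames the steps name. The helper `next_step` and the `ARTEFACTS` list are hypothetical and are not part of the repo or the skill.

```python
from pathlib import Path

# Artefacts named in the loop, in the order the steps produce them.
# The step descriptions are paraphrased from this doc.
ARTEFACTS = [
    ("epc.json", "step 1: run scripts/fetch_real_life_epc_sample.py"),
    ("elmhurst_inputs.md", "step 2: run /epc-to-elmhurst-rdsap-inputs"),
    ("elmhurst_summary.pdf", "step 3: export the Summary PDF from Elmhurst"),
    ("elmhurst_worksheet.pdf", "step 3: export the SAP-10.2 worksheet PDF"),
]


def next_step(sample_dir: Path) -> str:
    """Return the first loop step whose output file is missing.

    Hypothetical helper: it only checks that files exist, not their content.
    """
    for filename, step in ARTEFACTS:
        if not (sample_dir / filename).is_file():
            return step
    # All four artefacts are present, so the comparison can run.
    return "step 4: run scripts/compare_epc_paths.py"
```

Calling `next_step(Path(".../uprn_<uprn>"))` on a sample dir that holds only `epc.json` would return the step 2 message. Steps 5 and 6 (reconcile, pin) are judgment calls and are deliberately left out of the check.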
## Reading the comparison (the core skill)
- **Our engine on Elmhurst inputs ≈ Elmhurst's worksheet SAP** → the *calculator*
is correct. (Proven repeatedly — it matches Elmhurst's fuel cost to the penny.)
- **gov-API SAP vs Elmhurst-PDF SAP gap** → *input* differences only. Triage each
field diff into:
- **Elmhurst data-entry error** (swapped floor dims, wrong cylinder/immersion,
missing baths, wrong postcode) → fix in Elmhurst, re-export, re-compare.
- **gov-API mapper gap** → a real per-cert-mapper fix (improve the mapper). Flag
it; **don't tune to mask it**.
- **Ground-truth question** (what the property *actually* is) → you settle it;
align both sides to the lodged data.
- Ignore cosmetic diffs: codes vs strings (tenure, region), empty `EnergyElement`
lists (the Elmhurst path stores construction in `sap_building_parts`).
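
The triage above starts from a field-by-field diff of the two `EpcPropertyData` builds. A minimal sketch of that idea, assuming both sides have been dumped to plain nested dicts and lists: `deep_diff`, `material_diffs` and `COSMETIC_FIELDS` are hypothetical names for illustration, not what `scripts/compare_epc_paths.py` actually does.

```python
from typing import Any

# Assumption: leaf field names whose diffs are cosmetic (codes vs strings),
# taken from the "ignore" bullet above.
COSMETIC_FIELDS = {"tenure", "region"}


def deep_diff(a: Any, b: Any, path: str = "") -> list[tuple[str, Any, Any]]:
    """Return (path, a_value, b_value) for every place where a and b differ."""
    if isinstance(a, dict) and isinstance(b, dict):
        diffs = []
        for key in sorted(set(a) | set(b)):
            child = f"{path}.{key}" if path else key
            diffs += deep_diff(a.get(key), b.get(key), child)
        return diffs
    if isinstance(a, list) and isinstance(b, list) and len(a) == len(b):
        diffs = []
        for i, (x, y) in enumerate(zip(a, b)):
            diffs += deep_diff(x, y, f"{path}[{i}]")
        return diffs
    # Leaves, mismatched types, and lists of different length compare whole.
    return [] if a == b else [(path, a, b)]


def material_diffs(a: Any, b: Any) -> list[tuple[str, Any, Any]]:
    """Drop diffs whose final field name is known to be cosmetic."""
    return [d for d in deep_diff(a, b)
            if d[0].split(".")[-1] not in COSMETIC_FIELDS]
```

Each surviving entry is then one triage decision: Elmhurst data-entry error, gov-API mapper gap, or ground-truth question.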
## Mental model (hard-won)
- **The calculator is essentially exact.** Fed identical inputs, it reproduces
accredited Elmhurst's result. So accuracy work is almost entirely **mapper fidelity**:
making the gov-API `EpcPropertyData` match what an assessor would key in.
- **Lodged `energy_rating_current` is NOT a clean target** for pre-SAP10 schemas
(17.x–19.0 lodge SAP-2012 ratings, a different methodology). Use **Elmhurst on
the lodged inputs** as ground truth; cite lodged only as context.
- **Pin the observed gov-API engine score**, not the lodged or Elmhurst number —
the test guards the production path. Record the Elmhurst-validated value + what
reconciled it in the comment.
## Code→value cheatsheet (the gotchas that bit us)
Full table: `.claude/skills/epc-to-elmhurst-rdsap-inputs/reference/mapping.md`. The
ones that cost us time:
| Field | Mapping | Note |
|---|---|---|
| `immersion_heating_type` | **1 = DUAL, 2 = SINGLE** | flips Table 13 eqn; swung cert 10002468137 by 4 SAP |
| `cylinder_size` | **2 = 110 L, 3 = 160 L, 4 = 210 L** | pick the litres in Elmhurst, not the label |
| `party_wall_construction` | 1=Solid (U 0), 2=cavity unfilled (0.5), 3=filled (0.2), 4/5=unknown (0.25) | code 1 ≠ "unable to determine" |
| `cylinder_insulation_type` | 1=Foam, 2=Jacket | — |
| Number of baths | `rooms_with_bath_and_or_shower + rooms_with_bath_and_mixer_shower` | Elmhurst WWHRS sub-tab, defaults to 0 |
| Off-peak fuel (`29`) | space-heat 100% low rate (correct for storage heaters); water-heat = Table 13 split | meter = Economy-7/Dual |
| `water_heating_code` 903 | Electric immersion off-peak → Elmhurst **"Water Heater"** category | not "Boiler Circulator" (901) |
| Windows (reduced-field) | area = 0.148 × TFA × band; raw U from glazing code via `u_window` (RdSAP Table 24) | not real geometry |
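
The code→value rows above can be written down as plain lookups, which makes the easy-to-flip ones (immersion, cylinder size) harder to get wrong when filling in Elmhurst. This is a sketch only: the dict and function names are hypothetical, the values are copied from the table, and `band` is assumed here to be a plain numeric multiplier, which the table does not spell out.

```python
# Values copied from the cheatsheet table; names are illustrative only.
IMMERSION_HEATING_TYPE = {1: "DUAL", 2: "SINGLE"}
CYLINDER_SIZE_LITRES = {2: 110, 3: 160, 4: 210}
CYLINDER_INSULATION_TYPE = {1: "Foam", 2: "Jacket"}
# Party-wall U-values by construction code (4 and 5 both mean "unknown").
PARTY_WALL_U_VALUE = {1: 0.0, 2: 0.5, 3: 0.2, 4: 0.25, 5: 0.25}


def number_of_baths(cert: dict) -> int:
    """Sum the two lodged room counts; a missing field counts as 0."""
    return (cert.get("rooms_with_bath_and_or_shower", 0)
            + cert.get("rooms_with_bath_and_mixer_shower", 0))


def reduced_field_window_area(tfa_m2: float, band: float) -> float:
    """Window area for reduced-field certs: 0.148 x TFA x band.

    Assumption: `band` is a numeric factor; the table does not define it.
    """
    return 0.148 * tfa_m2 * band
```

Indexing with an unlisted code (for example `CYLINDER_SIZE_LITRES[1]`) raises `KeyError`, which is the intent: an unmapped code should stop the entry sheet rather than silently default.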
## Known mapper gaps to chase (improve the mapper)
- **Lodged alt-wall dropped**: `sap_building_parts[].sap_alternative_wall_1` is
`None` on the gov-API path even when the cert lodges one (Elmhurst keeps it).
~£1 / 0.06 SAP on cert 10002468137, but real. (per-cert-mapper / Khalim's domain.)
- Add more as new certs surface them — that's the point of expanding the corpus.
## Picking certs for coverage
Maximise variety so each cert exercises new mapper/calculator paths:
- **Heating**: gas combi, gas boiler + cylinder, oil, LPG, solid fuel, heat pump
(ASHP/GSHP), storage heaters (done), electric boiler, community/heat-network.
- **Hot water**: combi, cylinder (foam/jacket), immersion (single/dual), solar HW,
WWHRS, instantaneous electric.
- **Schema**: 17.0, 17.1 (done), 18.0 (done), 19.0, 20.0.0, 21.0.0, 21.0.1.
- **Geometry**: flats (ground/mid/top floor), bungalow, extensions, room-in-roof,
conservatory, basement.
- **Tariff/region**: mains gas, off-peak electric (done), 10/18/24-hour, varied regions.
- **Tech**: PV (export/non-export), wind, FGHRS.
## Guardrails
- **RdSAP-21.0.1 corpus gauge** (`tests/infrastructure/epc_client/test_sap_accuracy_corpus.py`,
currently 66.9% within-0.5 SAP) is the broad regression net for any mapper/calc
change. **Ratchet thresholds up, never loosen.** Re-run it after every change.
- **Don't tune the mapper to one cert** — fix generically and confirm against the
gauge. A single-cert tweak that regresses the corpus is net-negative.
- **No tolerance widening** in the real-cert test — pin the observed integer SAP;
if a known engine bug blocks a cert, use `known_bug_xfail="…"` (strict xfail).
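
To make the "ratchet up, never loosen" rule concrete, here is a minimal sketch of a within-0.5 gauge and a floor check. It assumes the gauge is the fraction of certs whose engine SAP lands within 0.5 of a reference SAP; the function names and the shape of the input are hypothetical, and the real gauge in `test_sap_accuracy_corpus.py` may be computed differently.

```python
from typing import Iterable


def within_half_sap_fraction(pairs: Iterable[tuple[float, float]]) -> float:
    """Fraction of (engine_sap, reference_sap) pairs within 0.5 SAP."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("gauge needs at least one cert")
    hits = sum(1 for ours, ref in pairs if abs(ours - ref) <= 0.5)
    return hits / len(pairs)


def passes_ratchet(observed: float, floor: float) -> bool:
    """True when the observed gauge has not dropped below the floor.

    The floor may only be raised over time (e.g. 0.669 today, per this doc).
    """
    return observed >= floor
```

After a mapper change, the idea is to recompute the fraction over the whole corpus and only raise the floor when the new value is higher; a change that fixes one cert but fails `passes_ratchet` is the net-negative case described above.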
## Current corpus
| Sample | Schema | Pin | Status |
|---|---|---|---|
| `uprn_100020450179` | RdSAP-18.0 | 73 | matches lodged 73 |
| `uprn_10002468137` | RdSAP-17.1 | 61 | Elmhurst-validated (dual immersion, 110 L, 2 baths); lodged 55 = old schema |
| `uprn_10092973954` | SAP-17.1 (full SAP) | 77 | full-SAP mapper partial; pinned to observed (not lodged 83) |
## Open threads
- **Full-SAP mapper WIP** (`_sap_door_aggregates` D2 door slice) is parked in
`git stash` (`hyde-wip-before-main-merge`, `stash@{0}`) — not mine; the full-SAP
effort should `git stash show -p stash@{0}` and re-apply. Full-SAP (`SAP-Schema-*`)
support is incomplete; RdSAP schemas are the solid path.
- The off-peak water-heating (Table 13) fix and the per-cert-mapper accuracy work
landed on `main` via PR #1217 — pull `main` before starting new certs.
## Key files
| Concern | Path |
|---|---|
| Per-cert loop (procedure) | skill `validate-cert-sap-accuracy` |
| Cert → Elmhurst input sheet | skill `epc-to-elmhurst-rdsap-inputs` (+ `reference/mapping.md`) |
| Capture a cert | `scripts/fetch_real_life_epc_sample.py` |
| Compare the two paths | `scripts/compare_epc_paths.py` |
| The accuracy test (pins) | `tests/domain/sap10_calculator/test_real_cert_sap_accuracy.py` |
| Corpus regression gauge | `tests/infrastructure/epc_client/test_sap_accuracy_corpus.py` |
| Mapper (improve here) | `datatypes/epc/domain/mapper.py`, `domain/sap10_calculator/rdsap/cert_to_inputs.py` |
| Calculator | `domain/sap10_calculator/` |