diff --git a/docs/HANDOVER_REAL_LIFE_CERT_ACCURACY.md b/docs/HANDOVER_REAL_LIFE_CERT_ACCURACY.md
new file mode 100644
index 00000000..58d8ec08
--- /dev/null
+++ b/docs/HANDOVER_REAL_LIFE_CERT_ACCURACY.md
@@ -0,0 +1,131 @@
+# Handoff — Real-life cert SAP accuracy (validate → fix → expand)
+
+**Purpose.** Grow `real_life_examples` into a trustworthy regression corpus that
+validates this repo's SAP calculator against accredited **Elmhurst Energy**,
+one real certificate at a time — and use each cert to *improve the mapper and
+calculator* and *add test coverage*. This is the strategy/context doc; the
+step-by-step procedure is the **`/validate-cert-sap-accuracy`** skill.
+
+---
+
+## TL;DR — the loop per cert
+
+Run **`/validate-cert-sap-accuracy `**. It drives:
+
+1. `scripts/fetch_real_life_epc_sample.py ` → saves
+   `backend/epc_api/json_samples/real_life_examples//uprn_/epc.json`,
+   prints schema + lodged rating + our engine's SAP.
+2. `/epc-to-elmhurst-rdsap-inputs ` → writes `elmhurst_inputs.md` (page-by-page
+   Elmhurst entry sheet with code→value mappings).
+3. **You** build it in Elmhurst, export the **Summary** and **SAP-10.2 worksheet**
+   PDFs → save as `elmhurst_summary.pdf` / `elmhurst_worksheet.pdf` in the sample dir.
+4. `scripts/compare_epc_paths.py ` → builds `EpcPropertyData` from BOTH the
+   gov-API json and the Elmhurst summary, deep-diffs them, runs BOTH through
+   `Sap10Calculator`, and prints Elmhurst's worksheet SAP.
+5. **Reconcile** the field diffs to convergence (see "Reading the comparison").
+6. **Pin** the agreed score: add a `RealCertExpectation` to
+   `tests/domain/sap10_calculator/test_real_cert_sap_accuracy.py`; the sample
+   dir is already the corpus entry, so the pin is what activates it.
+
+## Reading the comparison (the core skill)
+
+- **Our engine on Elmhurst inputs ≈ Elmhurst's worksheet SAP** → the *calculator*
+  is correct. (Proven repeatedly — it matches Elmhurst's fuel cost to the penny.)
+- **gov-API SAP vs Elmhurst-PDF SAP gap** → *input* differences only. Triage each
+  field diff into:
+  - **Elmhurst data-entry error** (swapped floor dims, wrong cylinder/immersion,
+    missing baths, wrong postcode) → fix in Elmhurst, re-export, re-compare.
+  - **gov-API mapper gap** → a real per-cert-mapper fix (improve the mapper). Flag
+    it; **don't tune to mask it**.
+  - **Ground-truth question** (what the property *actually* is) → you settle it;
+    align both sides to the lodged data.
+- Ignore cosmetic diffs: codes vs strings (tenure, region), empty `EnergyElement`
+  lists (the Elmhurst path stores construction in `sap_building_parts`).
+
+## Mental model (hard-won)
+
+- **The calculator is essentially exact.** Fed identical inputs it reproduces
+  accredited Elmhurst. So accuracy work is almost entirely **mapper fidelity** —
+  making the gov-API `EpcPropertyData` match what an assessor would key in.
+- **Lodged `energy_rating_current` is NOT a clean target** for pre-SAP10 schemas
+  (17.x–19.0 lodge SAP-2012 ratings — a different methodology). Use **Elmhurst on
+  the lodged inputs** as ground truth; cite lodged only as context.
+- **Pin the observed gov-API engine score**, not the lodged or Elmhurst number —
+  the test guards the production path. Record the Elmhurst-validated value + what
+  reconciled it in the comment.
+
+## Code→value cheatsheet (the gotchas that bit us)
+
+Full table: `.claude/skills/epc-to-elmhurst-rdsap-inputs/reference/mapping.md`.
+The ones that cost us time:
+
+| Field | Mapping | Note |
+|---|---|---|
+| `immersion_heating_type` | **1 = DUAL, 2 = SINGLE** | flips Table 13 eqn; swung cert 10002468137 by 4 SAP |
+| `cylinder_size` | **2 = 110 L, 3 = 160 L, 4 = 210 L** | pick the litres in Elmhurst, not the label |
+| `party_wall_construction` | 1=Solid (U 0), 2=cavity unfilled (0.5), 3=filled (0.2), 4/5=unknown (0.25) | code 1 ≠ "unable to determine" |
+| `cylinder_insulation_type` | 1=Foam, 2=Jacket | — |
+| Number of baths | `rooms_with_bath_and_or_shower + rooms_with_bath_and_mixer_shower` | Elmhurst WWHRS sub-tab, defaults to 0 |
+| Off-peak fuel (`29`) | space-heat 100% low rate (correct for storage heaters); water-heat = Table 13 split | meter = Economy-7/Dual |
+| `water_heating_code` 903 | Electric immersion off-peak → Elmhurst **"Water Heater"** category | not "Boiler Circulator" (901) |
+| Windows (reduced-field) | area = 0.148 × TFA × band; raw U from glazing code via `u_window` (RdSAP Table 24) | not real geometry |
+
+## Known mapper gaps to chase (improve the mapper)
+
+- **Lodged alt-wall dropped** — `sap_building_parts[].sap_alternative_wall_1` is
+  `None` on the gov-API path even when the cert lodges one (Elmhurst keeps it).
+  ~£1 / 0.06 SAP on cert 10002468137, but real. (per-cert-mapper / Khalim's domain.)
+- Add more as new certs surface them — that's the point of expanding the corpus.
+
+## Picking certs for coverage
+
+Maximise variety so each cert exercises new mapper/calculator paths:
+- **Heating**: gas combi, gas boiler + cylinder, oil, LPG, solid fuel, heat pump
+  (ASHP/GSHP), storage heaters (done), electric boiler, community/heat-network.
+- **Hot water**: combi, cylinder (foam/jacket), immersion (single/dual), solar HW,
+  WWHRS, instantaneous electric.
+- **Schema**: 17.0, 17.1 (done), 18.0 (done), 19.0, 20.0.0, 21.0.0, 21.0.1.
+- **Geometry**: flats (ground/mid/top floor), bungalow, extensions, room-in-roof,
+  conservatory, basement.
+- **Tariff/region**: mains gas, off-peak electric (done), 10/18/24-hour, varied regions.
+- **Tech**: PV (export/non-export), wind, FGHRS.
+
+## Guardrails
+
+- **RdSAP-21.0.1 corpus gauge** (`tests/infrastructure/epc_client/test_sap_accuracy_corpus.py`,
+  currently 66.9% within-0.5 SAP) is the broad regression net for any mapper/calc
+  change. **Ratchet thresholds up, never loosen.** Re-run it after every change.
+- **Don't tune the mapper to one cert** — fix generically and confirm against the
+  gauge. A single-cert tweak that regresses the corpus is net-negative.
+- **No tolerance widening** in the real-cert test — pin the observed integer SAP;
+  if a known engine bug blocks a cert, use `known_bug_xfail="…"` (strict xfail).
+
+## Current corpus
+
+| Sample | Schema | Pin | Status |
+|---|---|---|---|
+| `uprn_100020450179` | RdSAP-18.0 | 73 | matches lodged 73 |
+| `uprn_10002468137` | RdSAP-17.1 | 61 | Elmhurst-validated (dual immersion, 110 L, 2 baths); lodged 55 = old schema |
+| `uprn_10092973954` | SAP-17.1 (full SAP) | 77 | full-SAP mapper partial; pinned to observed (not lodged 83) |
+
+## Open threads
+
+- **Full-SAP mapper WIP** (`_sap_door_aggregates` D2 door slice) is parked in
+  `git stash` (`hyde-wip-before-main-merge`, `stash@{0}`) — not mine; the full-SAP
+  effort should `git stash show -p stash@{0}` and re-apply. Full-SAP (`SAP-Schema-*`)
+  support is incomplete; RdSAP schemas are the solid path.
+- The off-peak water-heating (Table 13) fix and the per-cert-mapper accuracy work
+  landed on `main` via PR #1217 — pull `main` before starting new certs.
+
+## Key files
+
+| Concern | Path |
+|---|---|
+| Per-cert loop (procedure) | skill `validate-cert-sap-accuracy` |
+| Cert → Elmhurst input sheet | skill `epc-to-elmhurst-rdsap-inputs` (+ `reference/mapping.md`) |
+| Capture a cert | `scripts/fetch_real_life_epc_sample.py` |
+| Compare the two paths | `scripts/compare_epc_paths.py` |
+| The accuracy test (pins) | `tests/domain/sap10_calculator/test_real_cert_sap_accuracy.py` |
+| Corpus regression gauge | `tests/infrastructure/epc_client/test_sap_accuracy_corpus.py` |
+| Mapper (improve here) | `datatypes/epc/domain/mapper.py`, `domain/sap10_calculator/rdsap/cert_to_inputs.py` |
+| Calculator | `domain/sap10_calculator/` |