diff --git a/domain/sap10_calculator/docs/HANDOVER_API_SAMPLE_ACCURACY.md b/domain/sap10_calculator/docs/HANDOVER_API_SAMPLE_ACCURACY.md
new file mode 100644
index 00000000..7a7094c3
--- /dev/null
+++ b/domain/sap10_calculator/docs/HANDOVER_API_SAMPLE_ACCURACY.md
@@ -0,0 +1,152 @@
+# Handover — wide-scale API accuracy study + next steps
+
+Point-in-time note. Start from [`AGENT_GUIDE.md`](AGENT_GUIDE.md) for methodology, the
+1e-4 bar, the per-line debugging loop, the section helpers, and the suite command.
+
+- **Branch:** `feature/per-cert-mapper-validation`
+- **HEAD:** `9c0a373f` (S0380.225). Next slice: **S0380.226**.
+- **Baseline (§4 suite):** `tests/domain/sap10_calculator/ backend/documents_parser/tests/`
+  → green (2395 passed, 1 skipped). Pre-existing out-of-scope failures unchanged
+  (stone-§5.6 in `domain/sap10_ml/tests/`; `test_from_rdsap_schema.py::...test_total_floor_area`).
+
+---
+
+## What this study did
+
+Fetched a **random 1,000-cert sample of domestic EPCs lodged Jan–May 2026** from the
+GOV.UK EPB register (the `/api/domestic/search` date-windowed endpoint to enumerate cert
+numbers across random pages → `/api/certificate` per cert for the full schema-21 JSON), ran
+each through the **API path** (`from_api_response → cert_to_inputs → continuous SAP`), and
+compared to the lodged rounded `energy_rating_current`.
+
+**This is the first measurement of raw-API behaviour on an unbiased population** — the curated
+golden cohort (~exact) masked it.
+
+### Reproduce
+- Sampler/fetcher: `/tmp/sample_fetch_2026.py` → caches JSONs to `/tmp/epc_2026_sample/`.
+- Evaluator: `/tmp/eval_sap_accuracy.py` → per-cert CSV + summary (`% <0.5`, buckets, worst-40,
+  raise breakdown). Cluster analysis: `/tmp/analyze2.py`. (Token in `backend/.env`
+  `OPEN_EPC_API_TOKEN`; `date_end` must be < today.)
+- **These scripts are uncommitted (in /tmp).** Worth promoting to `scripts/` if this becomes
+  a recurring measurement.
+
+---
+
+## Headline (at HEAD `9c0a373f`)
+
+| metric | value |
+|---|---|
+| computed | **882 / 1000** (100 unsupported pre-21 schema; 18 still raise) |
+| **% \|err\| < 0.5** (of computed) | **41.8%** |
+| % < 1.0 / < 2.0 / < 5.0 | 54.9% / 71.9% / 87.8% |
+| median / mean \|err\| | 0.79 / ~2.4 |
+| mean signed err | +0.2 (slight over-rate) |
+
+**Accuracy is dominated by heating type** (the load-bearing cut):
+
+| main_heating_category | n | mean \|err\| | %<0.5 | status |
+|---|---|---|---|---|
+| 2 = gas boiler (PCDB-indexed) | 579 | 1.30 | 48% | the well-trodden path |
+| **7 = electric storage heaters** | 39 | **7.33** | **3%** | **broken — #1 lever** |
+| **10 = electric room heaters** | 43 | **10.26** | **9%** | **broken — #2 lever** |
+| 6 = community scheme | 38 | 2.28 | 34% | known-hard |
+| Flats (any heating) | 242 | 3.19 | 29% | geometry + communal |
+
+---
+
+## Work shipped this session (S0380.219–225)
+
+Coverage unblocked **788 → 882 computed (+94)**; one real accuracy bug fixed (+22 certs).
+
+| slice | fix | certs |
+|---|---|---|
+| S0380.219 | floor_construction 3 → "Suspended, not timber" (RdSAP 10 field 3-1) | ~44 |
+| S0380.220 | floor_construction 0 → None (Table 19 unknown; proven inert) | 37 |
+| S0380.221 | default missing `post_town` (unused metadata) | 1 |
+| S0380.222 | roof_construction 6 (thatched) + 7 (dwelling above) → None (inert) | 5 |
+| S0380.223 | `_part_geometry` early-return key contract (RR KeyError) | 5 |
+| **S0380.224** | **loose-jacket cylinder storage loss (Table 2 Note 1)** — was None'd out → zero loss | **22** (mean err +2.29 → +0.45) |
+| S0380.225 | §10.7 no-water-heating default A-F → 12mm loose jacket | 2 |
+
+**S0380.224 is only DIRECTION-validated** (the 22 certs moved toward lodged + §4/golden stayed
+green) — it has **no worksheet pin on the loose-jacket magnitude**. A worksheet with a
+loose-jacket cylinder would close that (see "What to generate" below).
+
+---
+
+## Remaining work, prioritised
+
+### A. Accuracy clusters (highest value — 80+ certs, mean err 7–10)
+1. **Electric storage heaters (cat 7, 39 certs).** Distinct cascade — off-peak tariff split,
+   charge control (2401/2402), 7-hr/24-hr charge, Table 4a efficiency, responsiveness. **No
+   worksheet currently validates this path.** Errs both directions (−27..+16).
+2. **Electric room heaters (cat 10, 43 certs).** Likewise (controls 2601/2602/2603). Worst
+   cluster by mean (10.26).
+3. **Flats (242, 29% <0.5)** and **PV (40, 28%)** — secondary.
+
+### B. Remaining raises (18 certs — all U-value / heat-loss-sensitive, NOT enum guesses)
+- **`gable_wall_type` 2 & 3 (14 certs).** RdSAP 10 **Table 4** RR walls: 0=Party (U=0.25),
+  1=Exposed (U=common wall), 2/3 = **Sheltered (U=external×R0.5)** + **Adjacent-to-heated
+  (U=0)**, code↔type order unconfirmed (schema says "not yet seen"). Needs (i) a worksheet to
+  pin which code is which + the U-values, and (ii) **calculator support** — the cascade only
+  has `gable_wall`/`gable_wall_external` kinds; Sheltered (R=0.5) and Adjacent (U=0) are new.
+  Best real example: `2818-3053-3203-2655-9204` lodges BOTH gable 2 and 3.
+- **`main_heating_category: 9` = warm air, mains gas (1 cert).** Needs §9 warm-air dispatch.
+- **`wall_insulation_thermal_conductivity` 3 (1 cert).** Verified it shifts wall U
+  (53.96→51.61 across λ) → worksheet-backed (the resolver's own discipline).
+- **`floor_heat_loss` 8 (2 certs).** Semantically unconfirmed; inert for the 2 observed
+  (non-Main bp) but potentially "heated space below" (→ should exclude the floor, a calculator
+  change). Don't guess.
+
+The clean mapper-enum raises are **exhausted** — every remaining raise changes the answer, which
+is what the strict-raise guard exists to prevent.
+
+---
+
+## ★ What to generate — the single most productive worksheet
+
+Heating is one-per-property, so one worksheet can't cover all four broken heating types. But
+**fabric is independent of heating**, so the highest-ROI single artifact bundles the #1
+accuracy cluster with the fabric that closes the gable raises and pins the loose-jacket fix.
+
+**Build (in Elmhurst, a simulated case is fine — same as the existing `simulated case N`
+worksheets) ONE property:**
+
+> **A house heated by ELECTRIC STORAGE HEATERS, with a room-in-roof and a hot-water cylinder:**
+> - **Heating:** electric storage heaters (off-peak / Economy-7 tariff), with a clear control
+>   type. *This is the load-bearing choice — it validates the 39-cert cat-7 cluster.*
+> - **Hot water:** a cylinder with a **loose-jacket** insulation (not factory foam), a stated
+>   jacket thickness, and a cylinder thermostat. *Pins S0380.224's loose-jacket storage loss
+>   (56)m at 1e-4 — currently only direction-validated.*
+> - **Room-in-roof** with **two gable walls of different types** — ideally one **"Sheltered"**
+>   and one **"Adjacent to another heated space"** (plus, if the tool allows, a Party and an
+>   Exposed gable). *Gives the Table 4 U-values for gable_wall_type 2 & 3 and disambiguates the
+>   code order — closes the 14-cert raise.*
+> - **An extension (2nd building part)** with a different floor exposure (e.g. over unheated
+>   space or "to external air"). *Exercises multi-bp geometry + floor-exposure handling.*
+
+From that single worksheet I can pin, at 1e-4: the electric-storage space-heating lines
+((210)/(211)/space-heat), the loose-jacket storage loss (56)m, the RR gable U-values (30)/(32),
+and the multi-bp fabric (27)–(37). That's **one cluster + one fix-validation + the biggest
+raise + fabric**, all in one document.
+
+**If you'd rather do two:** add a second worksheet that is identical but with **electric room
+heaters** instead of storage heaters — together they cover cat 7 + cat 10 (≈ 82 certs, the
+two worst clusters). A third for a **community-heating flat** would cover cat 6 + the flat
+geometry cluster.
+
+### Then send me, per worksheet
+The **Summary PDF** (the Elmhurst input/site-notes) + the **worksheet PDF** (the `(1)..(286)`
+ground truth). With those I run both front-ends through the cascade and pin each line ref at
+1e-4, exactly as for the `with api 3` pair (S0380.218).
+
+---
+
+## Conventions (unchanged)
+One cause = one slice = one commit; spec citation (page+line) in the message; AAA tests
+(`# Arrange / # Act / # Assert`); `abs(x - y) <= tol` (not `pytest.approx`); SAP 10.2 only; no
+tolerance widening / xfail / rel-tol. New code passes pyright strict with ZERO NEW errors
+(baseline-compare with `git stash`; mapper.py / cert_to_inputs.py / heat_transmission.py carry
+pre-existing errors — compare counts). Stage files by name (the tree has unrelated
+`pytest.ini`/`scripts/` changes that must NOT be staged).
+`Co-Authored-By: Claude Opus 4.8 `.
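
The test conventions listed in the note (Arrange / Act / Assert layout, an absolute tolerance written as `abs(x - y) <= tol` rather than `pytest.approx`) might look like the minimal sketch below. Everything named here is hypothetical: `worksheet_line_total`, the monthly inputs, and the expected value are stand-ins invented for illustration, not functions or figures from the repository or from any worksheet.

```python
# Illustrative only: a pin test in the AAA style described in the note.
# The function and all numbers are made-up placeholders.

TOL = 1e-4  # the absolute bar named in the note; never widened


def worksheet_line_total(monthly_values: list[float]) -> float:
    """Hypothetical stand-in for a calculator line ref: sums twelve monthly values."""
    return sum(monthly_values)


def test_line_total_matches_worksheet_pin() -> None:
    # Arrange
    monthly = [10.0] * 12   # placeholder inputs, not worksheet data
    expected = 120.0        # placeholder pin, not a real worksheet figure

    # Act
    actual = worksheet_line_total(monthly)

    # Assert (absolute difference, no relative tolerance, no pytest.approx)
    assert abs(actual - expected) <= TOL
```

Writing the comparison as a plain absolute difference keeps the tolerance visible at the assertion site, so any attempt to loosen it shows up directly in the diff.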