docs: handover — 1000-cert API accuracy study + next-steps + worksheet ask

Captures the wide-scale 2026-register study (41.8% <0.5, heating-driven
cluster table), the 7 slices shipped (S0380.219-225), the prioritised
remaining work (electric-heating clusters + worksheet-backed raises), and
the single highest-ROI worksheet to generate: an electric-storage-heater
house with a loose-jacket cylinder + a room-in-roof with Sheltered/
Adjacent gables + an extension — one document that validates the #1
accuracy cluster, pins the S0380.224 loose-jacket fix at 1e-4, closes the
gable_wall_type Table 4 raise, and exercises multi-bp fabric.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Khalim Conn-Kowlessar 2026-06-04 16:37:03 +00:00
parent 9c0a373f7d
commit 19ed29e13c


@@ -0,0 +1,152 @@
# Handover — wide-scale API accuracy study + next steps
Point-in-time note. Start from [`AGENT_GUIDE.md`](AGENT_GUIDE.md) for methodology, the
1e-4 bar, the per-line debugging loop, the section helpers, and the suite command.
- **Branch:** `feature/per-cert-mapper-validation`
- **HEAD:** `9c0a373f` (S0380.225). Next slice: **S0380.226**.
- **Baseline (§4 suite):** `tests/domain/sap10_calculator/ backend/documents_parser/tests/`
→ green (2395 passed, 1 skipped). Pre-existing out-of-scope failures unchanged
(stone-§5.6 in `domain/sap10_ml/tests/`; `test_from_rdsap_schema.py::...test_total_floor_area`).
---
## What this study did
Fetched a **random 1,000-cert sample of domestic EPCs lodged Jan–May 2026** from the
GOV.UK EPB register (the `/api/domestic/search` date-windowed endpoint to enumerate cert
numbers across random pages → `/api/certificate` per cert for the full schema-21 JSON), ran
each through the **API path** (`from_api_response → cert_to_inputs → continuous SAP`), and
compared to the lodged rounded `energy_rating_current`.
**This is the first measurement of raw-API behaviour on an unbiased population** — the curated
golden cohort (~exact) masked it.
### Reproduce
- Sampler/fetcher: `/tmp/sample_fetch_2026.py` → caches JSONs to `/tmp/epc_2026_sample/`.
- Evaluator: `/tmp/eval_sap_accuracy.py` → per-cert CSV + summary (`% <0.5`, buckets, worst-40,
raise breakdown). Cluster analysis: `/tmp/analyze2.py`. (The API token is `OPEN_EPC_API_TOKEN`
in `backend/.env`; `date_end` must be earlier than today.)
- **These scripts are uncommitted (in /tmp).** Worth promoting to `scripts/` if this becomes
a recurring measurement.
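
The evaluator itself is not in the repo, so the following is only a minimal sketch of the kind of summary it reports, not its actual code. It assumes each error is `computed - lodged` (consistent with a positive mean meaning over-rating); the function name `summarise` and the dictionary keys are hypothetical.

```python
from statistics import mean, median


def summarise(errors: list[float]) -> dict[str, float]:
    """Summarise signed rating errors (assumed: computed minus lodged)."""
    if not errors:
        raise ValueError("no computed certs to summarise")
    abs_errs = [abs(e) for e in errors]
    n = len(abs_errs)

    def pct_below(threshold: float) -> float:
        # Share of certs whose absolute error is strictly under the threshold.
        return 100.0 * sum(a < threshold for a in abs_errs) / n

    return {
        "n": n,
        "pct_lt_0.5": pct_below(0.5),
        "pct_lt_1.0": pct_below(1.0),
        "pct_lt_2.0": pct_below(2.0),
        "pct_lt_5.0": pct_below(5.0),
        "median_abs": median(abs_errs),
        "mean_abs": mean(abs_errs),
        "mean_signed": mean(errors),
    }
```

Certs that raise or use an unsupported schema would be excluded before calling this, so the percentages are "of computed".
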
---
## Headline (at HEAD `9c0a373f`)
| metric | value |
|---|---|
| computed | **882 / 1000** (100 unsupported pre-21 schema; 18 still raise) |
| **% \|err\| < 0.5** (of computed) | **41.8%** |
| % < 1.0 / < 2.0 / < 5.0 | 54.9% / 71.9% / 87.8% |
| median / mean \|err\| | 0.79 / ~2.4 |
| mean signed err | +0.2 (slight over-rate) |
**Accuracy is dominated by heating type** (the load-bearing cut):
| main_heating_category | n | mean \|err\| | %<0.5 | status |
|---|---|---|---|---|
| 2 = gas boiler (PCDB-indexed) | 579 | 1.30 | 48% | the well-trodden path |
| **7 = electric storage heaters** | 39 | **7.33** | **3%** | **broken — #1 lever** |
| **10 = electric room heaters** | 43 | **10.26** | **9%** | **broken — #2 lever** |
| 6 = community scheme | 38 | 2.28 | 34% | known-hard |
| Flats (any heating) | 242 | 3.19 | 29% | geometry + communal |
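
A cut like the one above can be reproduced from the per-cert results with a simple group-by. This is a hedged sketch, not the contents of `/tmp/analyze2.py`: the function name `cluster_by_category`, the `(category, signed_error)` row shape, and the output keys are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Iterable


def cluster_by_category(rows: Iterable[tuple[int, float]]) -> list[dict[str, float]]:
    """Group (main_heating_category, signed_error) rows and rank clusters."""
    groups: dict[int, list[float]] = defaultdict(list)
    for category, err in rows:
        groups[category].append(abs(err))

    table = []
    for category, abs_errs in groups.items():
        n = len(abs_errs)
        table.append(
            {
                "category": category,
                "n": n,
                "mean_abs": sum(abs_errs) / n,
                "pct_lt_0.5": 100.0 * sum(a < 0.5 for a in abs_errs) / n,
            }
        )
    # Worst cluster (largest mean absolute error) first.
    table.sort(key=lambda row: row["mean_abs"], reverse=True)
    return table
```

Sorting by mean absolute error puts the clusters with the most room for improvement at the top, but `n` still matters when deciding which cluster to tackle first.
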
---
## Work shipped this session (S0380.219–225)
Coverage unblocked **788 → 882 computed (+94)**; one real accuracy bug fixed (+22 certs).
| slice | fix | certs |
|---|---|---|
| S0380.219 | floor_construction 3 → "Suspended, not timber" (RdSAP 10 field 3-1) | ~44 |
| S0380.220 | floor_construction 0 → None (Table 19 unknown; proven inert) | 37 |
| S0380.221 | default missing `post_town` (unused metadata) | 1 |
| S0380.222 | roof_construction 6 (thatched) + 7 (dwelling above) → None (inert) | 5 |
| S0380.223 | `_part_geometry` early-return key contract (RR KeyError) | 5 |
| **S0380.224** | **loose-jacket cylinder storage loss (Table 2 Note 1)** — was None'd out → zero loss | **22** (mean err +2.29 → +0.45) |
| S0380.225 | §10.7 no-water-heating default A-F → 12mm loose jacket | 2 |
**S0380.224 is only DIRECTION-validated** (the 22 certs moved toward lodged + §4/golden stayed
green) — it has **no worksheet pin on the loose-jacket magnitude**. A worksheet with a
loose-jacket cylinder would close that (see "What to generate" below).
---
## Remaining work, prioritised
### A. Accuracy clusters (highest value: 80+ certs, mean err 7–10)
1. **Electric storage heaters (cat 7, 39 certs).** Distinct cascade — off-peak tariff split,
charge control (2401/2402), 7-hr/24-hr charge, Table 4a efficiency, responsiveness. **No
worksheet currently validates this path.** Errs in both directions (-27..+16).
2. **Electric room heaters (cat 10, 43 certs).** Likewise (controls 2601/2602/2603). Worst
cluster by mean (10.26).
3. **Flats (242, 29% <0.5)** and **PV (40, 28%)** — secondary.
### B. Remaining raises (18 certs — all U-value / heat-loss-sensitive, NOT enum guesses)
- **`gable_wall_type` 2 & 3 (14 certs).** RdSAP 10 **Table 4** RR walls: 0=Party (U=0.25),
1=Exposed (U=common wall), 2/3 = **Sheltered (U=external×R0.5)** + **Adjacent-to-heated
(U=0)**, code↔type order unconfirmed (schema says "not yet seen"). Needs (i) a worksheet to
pin which code is which + the U-values, and (ii) **calculator support** — the cascade only
has `gable_wall`/`gable_wall_external` kinds; Sheltered (R=0.5) and Adjacent (U=0) are new.
Best real example: `2818-3053-3203-2655-9204` lodges BOTH gable 2 and 3.
- **`main_heating_category: 9` = warm air, mains gas (1 cert).** Needs §9 warm-air dispatch.
- **`wall_insulation_thermal_conductivity` 3 (1 cert).** Verified it shifts wall U
(53.96→51.61 across λ) → worksheet-backed (the resolver's own discipline).
- **`floor_heat_loss` 8 (2 certs).** Semantically unconfirmed; inert for the 2 observed
(non-Main bp) but potentially "heated space below" (→ should exclude the floor, a calculator
change). Don't guess.
The clean mapper-enum raises are **exhausted** — every remaining raise changes the answer, which
is what the strict-raise guard exists to prevent.
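
To illustrate the idea behind a strict-raise guard, here is a minimal sketch under stated assumptions. It is not the project's mapper: `UnmappedEnumError`, `map_enum`, and the `GABLE_WALL_TYPE` table are hypothetical names, and the table holds only the two codes the note above treats as known.

```python
class UnmappedEnumError(ValueError):
    """Raised when a lodged code has no confirmed mapping (hypothetical name)."""


# Illustrative table only. Codes 2 and 3 are deliberately absent because
# their code-to-type order is unconfirmed.
GABLE_WALL_TYPE: dict[int, str] = {0: "party", 1: "exposed"}


def map_enum(field: str, code: int, table: dict[int, str]) -> str:
    """Return the mapped value, or raise instead of guessing a default."""
    try:
        return table[code]
    except KeyError:
        raise UnmappedEnumError(f"{field}: unmapped code {code!r}") from None
```

A cert that lodges code 2 then fails loudly and shows up in the raise breakdown, rather than silently computing a rating from a guessed U-value.
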
---
## ★ What to generate — the single most productive worksheet
Heating is one-per-property, so one worksheet can't cover all four broken heating types. But
**fabric is independent of heating**, so the highest-ROI single artifact bundles the #1
accuracy cluster with the fabric that closes the gable raises and pins the loose-jacket fix.
**Build (in Elmhurst, a simulated case is fine — same as the existing `simulated case N`
worksheets) ONE property:**
> **A house heated by ELECTRIC STORAGE HEATERS, with a room-in-roof and a hot-water cylinder:**
> - **Heating:** electric storage heaters (off-peak / Economy-7 tariff), with a clear control
> type. *This is the load-bearing choice — it validates the 39-cert cat-7 cluster.*
> - **Hot water:** a cylinder with a **loose-jacket** insulation (not factory foam), a stated
> jacket thickness, and a cylinder thermostat. *Pins S0380.224's loose-jacket storage loss
> (56)m at 1e-4 — currently only direction-validated.*
> - **Room-in-roof** with **two gable walls of different types** — ideally one **"Sheltered"**
> and one **"Adjacent to another heated space"** (plus, if the tool allows, a Party and an
> Exposed gable). *Gives the Table 4 U-values for gable_wall_type 2 & 3 and disambiguates the
> code order — closes the 14-cert raise.*
> - **An extension (2nd building part)** with a different floor exposure (e.g. over unheated
> space or "to external air"). *Exercises multi-bp geometry + floor-exposure handling.*
From that single worksheet I can pin, at 1e-4: the electric-storage space-heating lines
((210)/(211)/space-heat), the loose-jacket storage loss (56)m, the RR gable U-values (30)/(32),
and the multi-bp fabric (27)–(37). That's **one cluster + one fix-validation + the biggest
raise + fabric**, all in one document.
**If you'd rather do two:** add a second worksheet that is identical but with **electric room
heaters** instead of storage heaters — together they cover cat 7 + cat 10 (≈ 82 certs, the
two worst clusters). A third for a **community-heating flat** would cover cat 6 + the flat
geometry cluster.
### Then send me, per worksheet
The **Summary PDF** (the Elmhurst input/site-notes) + the **worksheet PDF** (the `(1)..(286)`
ground truth). With those I run both front-ends through the cascade and pin each line ref at
1e-4, exactly as for the `with api 3` pair (S0380.218).
---
## Conventions (unchanged)
One cause = one slice = one commit; spec citation (page+line) in the message; AAA tests
(`# Arrange / # Act / # Assert`); `abs(x - y) <= tol` (not `pytest.approx`); SAP 10.2 only; no
tolerance widening / xfail / rel-tol. New code passes pyright strict with ZERO NEW errors
(baseline-compare with `git stash`; mapper.py / cert_to_inputs.py / heat_transmission.py carry
pre-existing errors — compare counts). Stage files by name (the tree has unrelated
`pytest.ini`/`scripts/` changes that must NOT be staged).
`Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>`.
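
As a reminder of the test shape these conventions describe, here is a small sketch. The function `monthly_total` and the expected value are toy placeholders, not a calculator line or a worksheet figure; only the AAA layout and the absolute-tolerance assertion are the point.

```python
TOL = 1e-4  # the absolute bar; no relative tolerance, no widening


def monthly_total(daily_kwh: float, days: int) -> float:
    """Toy stand-in for a calculator line (hypothetical, for illustration)."""
    return daily_kwh * days


def test_monthly_total_matches_pinned_value() -> None:
    # Arrange
    expected = 31.0  # placeholder standing in for a worksheet line value

    # Act
    actual = monthly_total(1.0, 31)

    # Assert
    assert abs(actual - expected) <= TOL
```

Unlike `pytest.approx`, whose default tolerance is relative, the explicit `abs(x - y) <= tol` form keeps the check absolute.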