Model/docs/HANDOVER_API_ACCURACY_S2.md
Khalim Conn-Kowlessar fb350036b1 docs: session-2 API-accuracy handover (fabric+tariff fixes, worksheet path)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-06 19:13:21 +00:00

# Handover — API SAP accuracy (session 2): fabric + tariff fixes, and why we now need worksheets
**Branch:** `feature/per-cert-mapper-validation` (long-lived working branch — **NEVER PR to
main**; the user pushes/PRs when ready). **HEAD `4d1a58b8`**, local-only ahead of origin.
**READ ALSO:** `docs/HANDOVER_COST_DECOMPOSITION.md` (the decomposition method + price
calibration), and the auto-memory `project_per_cert_mapper_validation_state` (full slice log
+ deproven approaches).
## THE GOAL (unchanged, and we are FAR from it)
100% of API records with a lodged SAP must compute within **0.5 SAP** of the API's
`energy_rating_current`. `scripts/eval_api_sap_accuracy.py` headline (905 computed certs):
| metric | session-2 start | now (`4d1a58b8`) |
|--------|-----------------|------------------|
| **% \|err\| < 0.5** | 43.8% | **45.0%** |
| % \|err\| < 1.0 | | 59.4% |
| % \|err\| < 2.0 | | 77.6% |
| mean \|err\| | 2.01 | 1.757 |
| **mean signed** | -0.31 | **+0.019** |
| p99 \|err\| | | 17.2 |
| max \|err\| | | 61.4 |
**Be honest about where this is: 45% within 0.5 is poor.** The headline barely moved
(+1.2pp) across 6 fixes because each clean cause is small (10-30 certs). What DID change
decisively is the **signed bias: -0.31 → +0.02**. The systematic under-rating that defined
the sample at session start is gone; the remaining error is **bidirectional scatter**: ~55%
of certs are still >0.5 off, split across BOTH directions, and there is **no single lever
left that moves the headline by more than ~0.3pp.** Further progress is per-cause, and
increasingly needs worksheet ground truth (see "Why we need worksheets" below).
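
The headline columns can be reproduced from a list of per-cert signed errors. The sketch below is illustrative only, not the code in `scripts/eval_api_sap_accuracy.py`: it assumes the signed error is computed SAP minus lodged SAP (so positive means over-rating), and it uses a nearest-rank p99, which may differ from the script's percentile method. The function name `headline` and the sample errors are hypothetical.

```python
import math


def headline(errors: list[float]) -> dict[str, float]:
    """Summarise signed SAP errors (assumed: computed minus lodged)."""
    abs_err = sorted(abs(e) for e in errors)
    n = len(abs_err)

    def pct_within(tol: float) -> float:
        # Share of certs whose absolute error is strictly below tol.
        return 100.0 * sum(a < tol for a in abs_err) / n

    # Nearest-rank percentile: smallest value with >= 99% of data at or below it.
    p99 = abs_err[max(0, math.ceil(0.99 * n) - 1)]
    return {
        "pct_lt_0.5": pct_within(0.5),
        "pct_lt_1.0": pct_within(1.0),
        "pct_lt_2.0": pct_within(2.0),
        "mean_abs": sum(abs_err) / n,
        "mean_signed": sum(errors) / n,
        "p99_abs": p99,
        "max_abs": abs_err[-1],
    }


# Hypothetical errors for five certs, not taken from the sample.
stats = headline([-0.2, 0.3, -1.5, 0.1, 4.0])
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

A near-zero `mean_signed` alongside a large `mean_abs` is exactly the "bidirectional scatter" situation described above: the over- and under-rated certs cancel in the signed mean but not in the absolute one.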
## WHAT SHIPPED THIS SESSION (7 commits, all green, pyright net-zero)
1. `98f71d25` **decomposition tool** `scripts/decompose_api_cost_error.py` — calibrates the
consumer price from accurate gas certs (gas £0.0809, elec £0.2839/kWh), predicts each
component cost, clusters by (component × direction). **CAVEAT: it uses the STANDARD elec
price, so it MIS-FLAGS off-peak-heated certs as `heat:high`.** For electric certs compare
against the cascade's own cost intermediates (`SapResult.intermediate['main_heating_cost_gbp']`
etc.), not the decomposition.
2. `bb830741` **sloping-ceiling**: `roof_construction=8` carries `sloping_ceiling_insulation_thickness`
("100mm"); the mapper dropped it. Now fed → Table 17 col (1a). Cert 9884: -5.5 → +0.06.
3. `6b045146` **gas-boiler fuel from §14.2 mains-gas meter** (Summary/Elmhurst path) — a
Table-4b gas boiler with a SEPARATE electric immersion (§15 "Electricity") used to raise
`MissingMainFuelType`; now falls back to the "Main gas: Yes" meter flag → mains gas.
4. `3aed8f85` **floor "another dwelling below" (code 6)** — party floor, no heat loss
(mirror of the roof's "another dwelling above" override). Cert 2115: floor 47.85 → 0 W/K, error -23 → -4.
5. `a64e857b` **roof "Unknown insulation" → Table 18** (§5.11.4) — "NI"=Not Indicated
(undetermined), not zero; routes to age-band default not 2.30. Cluster mean|err| 7.8→1.8.
6. `678aa7af` **main-roof U ignores Room-in-Roof "no insulation" leak**: `_joined_descriptions`
concatenated ALL roofs[], so an RR "no insulation" contaminated the main-roof U. Now drops
"Roof room(s)" entries for the main-roof U (RR shell unaffected; golden 6035 safe).
7. `4d1a58b8` **Unknown-meter + storage/CPSU → off-peak tariff** (§12) — storage heaters
charge overnight; an Unknown (code-3) meter no longer bills their charge at standard
13.19p. `rdsap_tariff_for_cert` infers off-peak for Rule-1 CPSU/Rule-2 storage only; and
`_fuel_cost` now uses `_rdsap_tariff` (not raw `tariff_from_meter_type`). Cert 7336: -26 → -0.16.
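
The rule in slice 7 can be sketched as a small decision function. This is a simplified illustration, not the body of `rdsap_tariff_for_cert`: the enums, the `infer_tariff` name, and the `metered` argument (the tariff already resolved from a known meter type) are all hypothetical. Only the Unknown meter code 3 and the Rule-1 CPSU / Rule-2 storage restriction come from the notes above.

```python
from enum import Enum


class Tariff(Enum):
    STANDARD = "standard"
    OFF_PEAK = "off_peak"


class MainHeating(Enum):
    CPSU = "cpsu"                      # §12 Rule 1
    STORAGE = "storage"                # §12 Rule 2
    OTHER_ELECTRIC = "other_electric"  # §12 Rule 3 (e.g. codes 191, 691)
    NON_ELECTRIC = "non_electric"


UNKNOWN_METER = 3  # API meter code for "Unknown"


def infer_tariff(meter_code: int, heating: MainHeating, metered: Tariff) -> Tariff:
    """Pick the billing tariff; only Unknown meters are inferred."""
    if meter_code != UNKNOWN_METER:
        # A known meter keeps whatever tariff it already maps to.
        return metered
    if heating in (MainHeating.CPSU, MainHeating.STORAGE):
        # Storage/CPSU charge overnight, so bill their charge at off-peak.
        return Tariff.OFF_PEAK
    # Rule-3 electric on an Unknown meter deliberately stays on standard.
    return Tariff.STANDARD


print(infer_tariff(3, MainHeating.STORAGE, Tariff.STANDARD))
print(infer_tariff(3, MainHeating.OTHER_ELECTRIC, Tariff.STANDARD))
```

Keeping the Rule-3 branch on standard mirrors the decision recorded under the deproven approaches: widening the inference to all Rule-3 electric systems was net-negative.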
## DEPROVEN — do NOT retry (empirically failed this session)
- **roof `'ND'` (Not Determined) → Table 18.** `'ND'` is on ~305/905 certs and the lodged
calc genuinely uses the description's high U for many; routing all 'ND' to age-default broke
9 certs (some 0 → +15) for zero net gain. The description is load-bearing even with 'ND'.
(The narrow "**unknown**" word IS a clean signal — that's slice `a64e857b`.)
- **broad "all §12 Rule-3 electric → off-peak on Unknown meters".** Net-NEGATIVE (44.9→44.8,
bias flipped +0.16). Room-heater dwellings (code 691) over-credit when forced off-peak
(their electric-immersion HW goes off-peak). Direct-boiler 191 alone is +0.1 but requires a
191-vs-691 split that is NOT spec-grounded (both are Rule 3) — a population data-fit; left
unshipped on purpose (the user's principle: RdSAP is deterministic, no overfitting).
- **RR shell U Table-17-50mm** (from session 1, still true): golden 6035 disproves it.
## THE REMAINING CLUSTER MAP (where the error lives now)
Run `scripts/decompose_api_cost_error.py` for the live table. As of `4d1a58b8`:
| cluster | n | within 0.5 | note |
|---------|---|-----------|------|
| `heat:high` | 319 | 39% | we over-state heating energy (or off-peak mis-priced) |
| `heat:low` | 229 | 47% | we under-state heating energy |
| `hw:low` | 161 | 50% | |
| `hw:high` | 120 | 43% | |
| `balanced` | 76 | 55% | |
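
A rough picture of how a cert could land in one of these clusters is given below. It is a guess at the shape of the method, not the logic in `scripts/decompose_api_cost_error.py`: it assumes each component cost is predicted as our kWh times the calibrated price, compared with the lodged cost, and labelled by the component with the largest deviation outside a ±10% band. The `classify` function, the dictionary keys, and the example figures are hypothetical; the gas price is the calibrated value quoted earlier.

```python
GAS_PRICE_GBP_PER_KWH = 0.0809  # calibrated consumer gas price from the notes
BAND = 0.10                     # assumed tolerance around the lodged cost


def classify(our_kwh: dict[str, float],
             lodged_cost: dict[str, float],
             price: float) -> str:
    """Return '<component>:<high|low>' for the worst component, else 'balanced'."""
    label, worst = "balanced", BAND
    for component, kwh in our_kwh.items():
        lodged = lodged_cost.get(component, 0.0)
        if lodged <= 0.0:
            continue  # nothing to compare against
        # Relative deviation of our predicted cost from the lodged cost.
        deviation = kwh * price / lodged - 1.0
        if abs(deviation) > worst:
            worst = abs(deviation)
            label = f"{component}:{'high' if deviation > 0 else 'low'}"
    return label


# Hypothetical gas cert: our heating energy looks too high, hot water is fine.
over = classify({"heat": 12000.0, "hw": 2500.0},
                {"heat": 800.0, "hw": 200.0},
                GAS_PRICE_GBP_PER_KWH)
print(over)
```

Because a single price is applied to every kWh, an off-peak-heated electric cert priced at the standard rate would come out as `heat:high` under this scheme even when its energy is right, which is the caveat attached to the decomposition tool.
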
By dwelling type / system (from `_results.csv`):
- **Flats (prop 2): 283 certs, 31% within 0.5** — still the worst segment by far (houses 50%,
bungalows 59%). Signed -0.24. The fabric/tariff fixes helped but flats remain hardest.
- **Heat pumps (cat 4): 20 certs, 45% within 0.5, mean signed +1.43, mean|err| 3.81** — a
distinct OVER-rating cluster, UNTOUCHED this session. These have PCDB indices (e.g. 9472
+15.0 idx 104351, 2789 +13.4 idx 104632, 4135 +10.0 idx 106465). Likely an Appendix-N /
PCDB efficiency or HW-from-HP issue. **Good next target — it's a coherent over-rate cluster,
and HPs may be pinnable from a worksheet.**
- **Top single offenders** (see eval TOP-40): 2100 -61 (n_bps=2, electric, prop 0), 2958 +32
(single-bp electric), 0390 -29 (flat, "Flat no insulation"+ND roof; the deproven path),
2080 -25 (electric direct-boiler flat; mixed cause), 7921 -23 (gas, PCDB idx 16814).
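
Segment figures like the ones above can be recomputed from the per-cert results file with a small group-by. The sketch assumes column names `property_type` and `err` (signed error), which are hypothetical; check the real header of `_results.csv` before reusing it. The inline sample stands in for the file so the snippet runs on its own.

```python
import csv
import io
from collections import defaultdict

# Stand-in for /tmp/epc_2026_sample/_results.csv; column names are assumed.
SAMPLE = """property_type,err
2,-1.2
2,0.3
0,0.1
0,-0.4
0,2.0
"""


def segment_summary(rows, key: str = "property_type", tol: float = 0.5):
    """Group signed errors by segment; report count, % within tol, mean signed."""
    groups: dict[str, list[float]] = defaultdict(list)
    for row in rows:
        groups[row[key]].append(float(row["err"]))
    return {
        segment: {
            "n": len(errs),
            "pct_within": 100.0 * sum(abs(e) < tol for e in errs) / len(errs),
            "mean_signed": sum(errs) / len(errs),
        }
        for segment, errs in groups.items()
    }


summary = segment_summary(csv.DictReader(io.StringIO(SAMPLE)))
for segment, stats in sorted(summary.items()):
    print(segment, stats)
```

Swapping `key` for a heating-category column would give the per-system view (for example the cat-4 heat pump cluster) from the same function.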
## WHY WE NEED WORKSHEETS NOW (the user has accepted this)
The decomposition method got us the directional bias (under-rating → balanced). It is now
**exhausted for the bidirectional scatter** because:
1. For **electric/off-peak certs** the consumer-price `*_cost_current` fields diverge from the
SAP Table-12 prices the rating actually uses — the lodged total can EXCEED ours while the
lodged SAP is HIGHER. So we cannot back-calculate a reliable kWh/cost target.
2. The remaining causes (HW immersion off-peak charge-vs-on-demand split; HP Appendix-N
efficiency + HP-DHW; per-cert fabric like 2100's -61) are **sub-component values that the
±10% calibration cannot resolve** — they need a line-ref pin.
**What to generate (in priority order):** Elmhurst worksheets (P960 + Summary) for —
- **A heat pump cat-4 cert that over-rates**, e.g. `9472-3052-6202-0766-7200` (+15.0, idx
104351) or `2789-8331-7179-3314-1150` (+13.4). Pin §9b HP efficiency (Appendix N / Table
4a), the (206)/(207) seasonal eff, and HW-from-HP. This is the cleanest coherent cluster.
- **A meter-3 electric flat with electric-immersion HW**, e.g. `2474-3059-4202-4496-3200`
(-13.3, cat-2 direct-boiler 191) or `2080` (-25.5). Pin EXACTLY how RdSAP bills the
electric-immersion HW (§4 + Table 12a) and direct-acting heating on an off-peak tariff —
this resolves whether Rule-3 electric on Unknown meters should be off-peak (the unshipped
191 question) and the HW-off-peak split.
- (Optional) **2100-5421-0922-1622-3463** (-61, the worst): 2 building parts, electric; a
worksheet would localise whether it's a §3 geometry or heating blowup.
The faithful-reproduction rule still holds: **use the cert's OWN data** (its API JSON is in
`/tmp/epc_2026_sample/<cert>.json`; generate the Elmhurst worksheet from the same property),
NOT a template-edited 001431. Template edits drift (session-1 lesson).
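
Locating a cert's own API JSON is a one-liner once the cache root is resolved. The helper below is a sketch: `cert_json_path` and `load_cert` are hypothetical names, and it assumes `EPC_SAMPLE_CACHE` is an environment variable that replaces the default directory.

```python
import json
import os
from pathlib import Path

DEFAULT_CACHE = "/tmp/epc_2026_sample"


def cert_json_path(cert_id: str) -> Path:
    """Path of the cached API JSON for one cert, honouring the cache override."""
    root = Path(os.environ.get("EPC_SAMPLE_CACHE", DEFAULT_CACHE))
    return root / f"{cert_id}.json"


def load_cert(cert_id: str) -> dict:
    """Read the cert's own API record (raises FileNotFoundError if not cached)."""
    with cert_json_path(cert_id).open(encoding="utf-8") as handle:
        return json.load(handle)


# Resolve the path only; loading needs the cached sample to be present.
print(cert_json_path("9472-3052-6202-0766-7200"))
```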
## TOOLS & CONVENTIONS
- `PYTHONPATH=/workspaces/model python scripts/eval_api_sap_accuracy.py` — headline + TOP-40
+ per-cert `/tmp/epc_2026_sample/_results.csv`.
- `PYTHONPATH=/workspaces/model python scripts/decompose_api_cost_error.py` — component
clusters + `_cost_decomposition.csv` (remember the off-peak caveat above).
- Sample: ~1009 cached API JSONs at `/tmp/epc_2026_sample` (override `EPC_SAMPLE_CACHE`).
- **Conventions (non-negotiable):** one cause = one slice = one commit; **spec citation
(page+line)** in the message; AAA test headers; `abs(x-y)<=tol` not `pytest.approx`;
SAP 10.2 only; **no tolerance-widening / xfail**; pyright strict **net-zero** (baseline-
compare via `git stash`); **stage files BY NAME** (the tree carries unrelated `scripts/`
+ "sap worksheets/" changes — never `git add -A`); RdSAP is **deterministic** — every fix
must be a spec rule, not a population data-fit (the user is firm on this);
`Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>`.
- **REGRESSION after any calculator change:** `tests/domain/sap10_calculator/`,
`backend/documents_parser/tests/`, `datatypes/epc/`, and the golden fixtures (esp. **6035**).
- **Pre-existing failures to IGNORE** (fail on the stashed baseline too, NOT yours):
`test_from_rdsap_schema.py::…::test_total_floor_area`, and the 2 stone-wall U tests in
`domain/sap10_ml/tests/test_rdsap_uvalues.py` (`…stone_granite_thin_wall_age_a_120mm…`,
`…stone_sandstone…`) — likely fallout from the §5.7 wall-U slice `27375d93`; worth a
separate fix but not yours to count against net-zero.
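
The test conventions can be illustrated with a toy case. Everything here is hypothetical: `floor_heat_loss_w_per_k` is a stand-in, not the calculator's function, and "AAA headers" is read as Arrange / Act / Assert comments. The only behaviour borrowed from the notes is that floor code 6 ("another dwelling below") contributes no heat loss, and that comparisons use `abs(x-y)<=tol` rather than `pytest.approx`.

```python
PARTY_FLOOR_CODE = 6  # "another dwelling below"


def floor_heat_loss_w_per_k(u_value: float, area_m2: float, floor_code: int) -> float:
    """Toy fabric loss: U x A, except a party floor loses nothing."""
    if floor_code == PARTY_FLOOR_CODE:
        return 0.0
    return u_value * area_m2


def test_party_floor_has_no_heat_loss() -> None:
    # Arrange
    u_value = 0.5
    area_m2 = 40.0
    tol = 1e-9

    # Act
    party = floor_heat_loss_w_per_k(u_value, area_m2, floor_code=PARTY_FLOOR_CODE)
    exposed = floor_heat_loss_w_per_k(u_value, area_m2, floor_code=1)

    # Assert
    assert abs(party - 0.0) <= tol
    assert abs(exposed - 20.0) <= tol


test_party_floor_has_no_heat_loss()
```

The tolerance is fixed at the value the assertion needs and is never loosened to make a failing case pass, in line with the no-tolerance-widening rule.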
## ARCHITECTURE NOTES THAT COST TIME (so you don't re-discover them)
- The API cost path uses `inputs.fuel_cost` (the Table-32/12a **precompute**, `_fuel_cost`),
NOT the scalar `space_heating_fuel_cost_gbp_per_kwh`. `calculator.py:540` picks the
precompute when populated, ELSE the legacy scalar fields. `_fuel_cost` returns a ZERO
sentinel for any off-peak tariff → the calculator then falls back to the legacy scalar
fields (which DO carry the off-peak rate from `_space_heating_fuel_cost_gbp_per_kwh`). So a
tariff change only bites if it flips `_fuel_cost`'s tariff off STANDARD.
- `_table_12a_system_for_main` maps cat-10 room heaters → `OTHER_DIRECT_ACTING_ELECTRIC` but
leaves storage (401-409, correct: → None → 100% low rate) and **direct-boiler 191 / CPSU as
TODO** (→ None → pure low rate, which OVER-credits 191 on off-peak). Wiring 191/CPSU rows is
a prerequisite if you ever revisit Rule-3-on-Unknown.
- Fuel codes stored on `SapResult` are the RAW API enum (26 = mains gas), not Table-12 codes
— translate via `table_12.API_FUEL_TO_TABLE_12` (the decomposition script does this).
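
The precompute-versus-legacy fallback in the first note is easy to misread, so here is a minimal model of it. This is an interpretation of the description above, not the code at `calculator.py:540`: the class, field and function names are hypothetical, "populated" is assumed to mean non-zero, and the off-peak rate is a placeholder rather than a Table-12 value. The 0.1319 figure is the standard rate quoted in slice 7.

```python
from dataclasses import dataclass

ZERO_SENTINEL = 0.0


@dataclass(frozen=True)
class CostInputs:
    precomputed_gbp_per_kwh: float  # stand-in for inputs.fuel_cost
    legacy_gbp_per_kwh: float       # stand-in for the legacy scalar field


def precompute_price(tariff: str, standard_price: float) -> float:
    """Mimic _fuel_cost: a real price on standard, the zero sentinel otherwise."""
    return standard_price if tariff == "standard" else ZERO_SENTINEL


def effective_price(inputs: CostInputs) -> float:
    """Use the precompute when populated, else fall back to the legacy scalar."""
    if inputs.precomputed_gbp_per_kwh > ZERO_SENTINEL:
        return inputs.precomputed_gbp_per_kwh
    return inputs.legacy_gbp_per_kwh


STANDARD = 0.1319          # standard electricity rate from the notes
OFF_PEAK_PLACEHOLDER = 0.05  # illustrative only, not a spec value

on_standard = CostInputs(precompute_price("standard", STANDARD), STANDARD)
on_off_peak = CostInputs(precompute_price("off_peak", STANDARD), OFF_PEAK_PLACEHOLDER)
print(effective_price(on_standard), effective_price(on_off_peak))
```

In this model the off-peak rate only reaches the result through the legacy field, which is why a tariff change has no effect unless it moves the precompute off the standard tariff.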