Model/docs/HANDOVER_API_ACCURACY_S2.md
Khalim Conn-Kowlessar fb350036b1 docs: session-2 API-accuracy handover (fabric+tariff fixes, worksheet path)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-06 19:13:21 +00:00


Handover — API SAP accuracy (session 2): fabric + tariff fixes, and why we now need worksheets

Branch: feature/per-cert-mapper-validation (long-lived working branch — NEVER PR to main; the user pushes/PRs when ready). HEAD 4d1a58b8, local-only ahead of origin.

READ ALSO: docs/HANDOVER_COST_DECOMPOSITION.md (the decomposition method + price calibration), and the auto-memory project_per_cert_mapper_validation_state (full slice log + deproven approaches).

THE GOAL (unchanged, and we are FAR from it)

100% of API records with a lodged SAP must compute within 0.5 SAP of the API's energy_rating_current. scripts/eval_api_sap_accuracy.py headline (905 computed certs):

| metric | session-2 start | now (4d1a58b8) |
|---|---|---|
| % \|err\| < 0.5 | 43.8% | 45.0% |
| % \|err\| < 1.0 | | 59.4% |
| % \|err\| < 2.0 | | 77.6% |
| mean \|err\| | 2.01 | 1.757 |
| mean signed | -0.31 | +0.019 |
| p99 \|err\| | | 17.2 |
| max \|err\| | | 61.4 |

Be honest about where this is: 45% within 0.5 is poor. The headline barely moved (+1.2pp) across 6 fixes because each clean cause is small (10-30 certs). What DID change decisively is the signed bias: -0.31 → +0.02. The systematic under-rating that defined the sample at session start is gone. The remaining error is bidirectional scatter: ~55% of certs are still >0.5 off, spread across BOTH directions, and there is no single lever left that moves the headline by more than ~0.3pp. Further progress is per-cause, and increasingly needs worksheet ground truth (see "Why we need worksheets" below).
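To make the "bias gone, scatter remains" distinction concrete, here is a minimal sketch of how the headline rows can be derived from per-cert signed errors. It is not the code in scripts/eval_api_sap_accuracy.py: the function name, the sign convention (computed SAP minus lodged SAP) and the nearest-rank percentile are assumptions, and the toy data is invented.

```python
import math
from statistics import mean


def headline(signed_errors, thresholds=(0.5, 1.0, 2.0)):
    """Summarise per-cert errors (assumed: computed SAP minus lodged SAP)."""
    abs_errors = sorted(abs(e) for e in signed_errors)
    n = len(abs_errors)
    # Share of certs whose absolute error is strictly under each threshold.
    summary = {
        f"pct_within_{t}": 100.0 * sum(a < t for a in abs_errors) / n
        for t in thresholds
    }
    summary["mean_abs"] = mean(abs_errors)
    summary["mean_signed"] = mean(signed_errors)
    # Nearest-rank percentile; the real script may interpolate differently.
    summary["p99_abs"] = abs_errors[max(0, math.ceil(0.99 * n) - 1)]
    summary["max_abs"] = abs_errors[-1]
    return summary


# Toy data (not from the sample): two large opposite errors, two small ones.
toy = headline([-3.0, 3.0, -0.1, 0.1])
print(toy["mean_signed"], toy["pct_within_0.5"], toy["mean_abs"])
```

On the toy data the signed mean cancels to zero while only half the certs land within 0.5, which is the shape of the current sample: a balanced bias says nothing about the scatter.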

WHAT SHIPPED THIS SESSION (7 commits, all green, pyright net-zero)

  1. 98f71d25 decomposition tool scripts/decompose_api_cost_error.py — calibrates the consumer price from accurate gas certs (gas £0.0809, elec £0.2839/kWh), predicts each component cost, clusters by (component × direction). CAVEAT: it uses the STANDARD elec price, so it MIS-FLAGS off-peak-heated certs as heat:high. For electric certs compare against the cascade's own cost intermediates (SapResult.intermediate['main_heating_cost_gbp'] etc.), not the decomposition.
  2. bb830741 sloping-ceiling: roof_construction=8 carries sloping_ceiling_insulation_thickness ("100mm"); the mapper dropped it. Now fed → Table 17 col (1a). 9884 5.5 → +0.06.
  3. 6b045146 gas-boiler fuel from §14.2 mains-gas meter (Summary/Elmhurst path) — a Table-4b gas boiler with a SEPARATE electric immersion (§15 "Electricity") used to raise MissingMainFuelType; now falls back to the "Main gas: Yes" meter flag → mains gas.
  4. 3aed8f85 floor "another dwelling below" (code 6) — party floor, no heat loss (mirror of the roof's "another dwelling above" override). 2115 floor 47.85→0 W/K, 23→4.
  5. a64e857b roof "Unknown insulation" → Table 18 (§5.11.4) — "NI"=Not Indicated (undetermined), not zero; routes to age-band default not 2.30. Cluster mean|err| 7.8→1.8.
  6. 678aa7af main-roof U ignores Room-in-Roof "no insulation" leak: _joined_descriptions concatenated ALL roofs[], so an RR "no insulation" contaminated the main-roof U. Now drops "Roof room(s)" entries for the main-roof U (RR shell unaffected; golden 6035 safe).
  7. 4d1a58b8 Unknown-meter + storage/CPSU → off-peak tariff (§12) — storage heaters charge overnight; an Unknown (code-3) meter no longer bills their charge at standard 13.19p. rdsap_tariff_for_cert infers off-peak for Rule-1 CPSU/Rule-2 storage only; and _fuel_cost now uses _rdsap_tariff (not raw tariff_from_meter_type). 7336 26 → 0.16.
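The tariff rule in slice 7 can be sketched as below. This is an illustration only, not the body of rdsap_tariff_for_cert: the function name, the string tariff labels, the is_cpsu flag and the declared_tariff pass-through are all hypothetical. The only values taken from this handover are meter code 3 (Unknown) and storage heating codes 401-409.

```python
UNKNOWN_METER = 3                # Unknown meter code, per this handover
STORAGE_CODES = range(401, 410)  # storage heaters 401-409, per this handover


def infer_tariff(meter_code, declared_tariff, main_heating_code, is_cpsu=False):
    """Hypothetical sketch: infer off-peak only for CPSU / storage systems."""
    if meter_code != UNKNOWN_METER:
        # A known meter keeps whatever tariff it already maps to.
        return declared_tariff
    if is_cpsu or main_heating_code in STORAGE_CODES:
        # Rule-1 CPSU and Rule-2 storage charge overnight.
        return "off_peak"
    # Rule-3 systems (e.g. 191, 691) are deliberately left on standard.
    return "standard"


print(infer_tariff(3, "standard", 402))  # storage heater on an Unknown meter
print(infer_tariff(3, "standard", 191))  # direct-acting boiler, unchanged
```

Keeping Rule-3 systems on the standard tariff mirrors the deproven broad variant recorded below: forcing them off-peak was net-negative.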

DEPROVEN — do NOT retry (empirically failed this session)

  • roof 'ND' (Not Determined) → Table 18. 'ND' is on ~305/905 certs and the lodged calc genuinely uses the description's high U for many; routing all 'ND' to age-default broke 9 certs (some 0 → +15) for zero net gain. The description is load-bearing even with 'ND'. (The narrow "unknown" word IS a clean signal — that's slice a64e857b.)
  • broad "all §12 Rule-3 electric → off-peak on Unknown meters". Net-NEGATIVE (44.9→44.8, bias flipped +0.16). Room-heater dwellings (code 691) over-credit when forced off-peak (their electric-immersion HW goes off-peak). Direct-boiler 191 alone is +0.1 but requires a 191-vs-691 split that is NOT spec-grounded (both are Rule 3) — a population data-fit; left unshipped on purpose (the user's principle: RdSAP is deterministic, no overfitting).
  • RR shell U Table-17-50mm (from session 1, still true): golden 6035 disproves it.

THE REMAINING CLUSTER MAP (where the error lives now)

Run scripts/decompose_api_cost_error.py for the live table. As of 4d1a58b8:

| cluster | n | within 0.5 | note |
|---|---|---|---|
| heat:high | 319 | 39% | we over-state heating energy (or off-peak mis-priced) |
| heat:low | 229 | 47% | we under-state heating energy |
| hw:low | 161 | 50% | |
| hw:high | 120 | 43% | |
| balanced | 76 | 55% | |
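For orientation, one way a (component × direction) label like those above could be assigned is sketched here. It is an assumption about the shape of the method, not the logic in scripts/decompose_api_cost_error.py: the function name, the dict inputs, the "largest relative deviation wins" rule and the use of the ±10% calibration band as the balanced threshold are all hypothetical, and the costs are invented.

```python
def classify(predicted_costs, lodged_costs, tol=0.10):
    """Hypothetical sketch: label a cert by its most-deviating cost component."""
    worst, worst_rel = None, 0.0
    for component, lodged in lodged_costs.items():
        if lodged <= 0:
            continue  # cannot form a relative error against a zero cost
        rel = (predicted_costs[component] - lodged) / lodged
        if abs(rel) > abs(worst_rel):
            worst, worst_rel = component, rel
    if worst is None or abs(worst_rel) <= tol:
        return "balanced"
    # "high" means our predicted cost exceeds the lodged cost.
    return f"{worst}:{'high' if worst_rel > 0 else 'low'}"


print(classify({"heat": 600.0, "hw": 200.0}, {"heat": 500.0, "hw": 195.0}))
```

Whatever the exact rule, the off-peak caveat still applies: a cert priced at the standard electricity rate can be labelled heat:high when the real problem is the tariff.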

By dwelling type / system (from _results.csv):

  • Flats (prop 2): 283 certs, 31% within 0.5 — still the worst segment by far (houses 50%, bungalows 59%). Signed 0.24. The fabric/tariff fixes helped but flats remain hardest.
  • Heat pumps (cat 4): 20 certs, 45% within 0.5, mean signed +1.43, mean|err| 3.81 — a distinct OVER-rating cluster, UNTOUCHED this session. These have PCDB indices (e.g. 9472 +15.0 idx 104351, 2789 +13.4 idx 104632, 4135 +10.0 idx 106465). Likely an Appendix-N / PCDB efficiency or HW-from-HP issue. Good next target — it's a coherent over-rate cluster, and HPs may be pinnable from a worksheet.
  • Top single offenders (see eval TOP-40): 2100 61 (n_bps=2, electric, prop 0), 2958 +32 (single-bp electric), 0390 29 (flat, "Flat no insulation"+ND roof — the deproven path), 2080 25 (electric direct-boiler flat — mixed cause), 7921 23 (gas, PCDB idx 16814).

WHY WE NEED WORKSHEETS NOW (the user has accepted this)

The decomposition method got us the directional bias (under-rating → balanced). It is now exhausted for the bidirectional scatter because:

  1. For electric/off-peak certs the consumer-price *_cost_current fields diverge from the SAP Table-12 prices the rating actually uses — the lodged total can EXCEED ours while the lodged SAP is HIGHER. So we cannot back-calculate a reliable kWh/cost target.
  2. The remaining causes (HW immersion off-peak charge-vs-on-demand split; HP Appendix-N efficiency + HP-DHW; per-cert fabric like 2100's 61) are sub-component values that the ±10% calibration cannot resolve — they need a line-ref pin.

What to generate (in priority order): Elmhurst worksheets (P960 + Summary) for —

  • A heat pump cat-4 cert that over-rates, e.g. 9472-3052-6202-0766-7200 (+15.0, idx 104351) or 2789-8331-7179-3314-1150 (+13.4). Pin §9b HP efficiency (Appendix N / Table 4a), the (206)/(207) seasonal eff, and HW-from-HP. This is the cleanest coherent cluster.
  • A meter-3 electric flat with electric-immersion HW, e.g. 2474-3059-4202-4496-3200 (13.3, cat-2 direct-boiler 191) or 2080 (25.5). Pin EXACTLY how RdSAP bills the electric-immersion HW (§4 + Table 12a) and direct-acting heating on an off-peak tariff — this resolves whether Rule-3 electric on Unknown meters should be off-peak (the unshipped 191 question) and the HW-off-peak split.
  • (Optional) 2100-5421-0922-1622-3463 (61, the worst) — 2 building parts, electric; a worksheet would localise whether it's a §3 geometry or heating blowup.

The faithful-reproduction rule still holds: use the cert's OWN data (its API JSON is in /tmp/epc_2026_sample/<cert>.json; generate the Elmhurst worksheet from the same property), NOT a template-edited 001431. Template edits drift (session-1 lesson).
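A small sketch of locating a cert's own JSON before generating its worksheet. The helper names are hypothetical, and it assumes EPC_SAMPLE_CACHE is an environment variable that replaces the default /tmp/epc_2026_sample directory; check the eval script for the actual lookup.

```python
import json
import os
from pathlib import Path

DEFAULT_SAMPLE_DIR = "/tmp/epc_2026_sample"


def sample_dir():
    """Directory of cached API JSONs (assumed env-var override)."""
    return Path(os.environ.get("EPC_SAMPLE_CACHE", DEFAULT_SAMPLE_DIR))


def cert_json_path(cert_id):
    """Path to one cert's cached API record, e.g. <dir>/<cert>.json."""
    return sample_dir() / f"{cert_id}.json"


def load_cert(cert_id):
    """Read the cert's own data rather than editing a template."""
    with cert_json_path(cert_id).open(encoding="utf-8") as handle:
        return json.load(handle)


print(cert_json_path("9472-3052-6202-0766-7200"))
```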

TOOLS & CONVENTIONS

  • PYTHONPATH=/workspaces/model python scripts/eval_api_sap_accuracy.py: headline + TOP-40 + per-cert /tmp/epc_2026_sample/_results.csv.
  • PYTHONPATH=/workspaces/model python scripts/decompose_api_cost_error.py — component clusters + _cost_decomposition.csv (remember the off-peak caveat above).
  • Sample: ~1009 cached API JSONs at /tmp/epc_2026_sample (override EPC_SAMPLE_CACHE).
  • Conventions (non-negotiable): one cause = one slice = one commit; spec citation (page+line) in the message; AAA test headers; abs(x-y)<=tol not pytest.approx; SAP 10.2 only; no tolerance-widening / xfail; pyright strict net-zero (baseline-compare via git stash); stage files BY NAME (the tree carries unrelated scripts/ + "sap worksheets/" changes, so never git add -A); RdSAP is deterministic, so every fix must be a spec rule, not a population data-fit (the user is firm on this); Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>.
  • REGRESSION after any calculator change: tests/domain/sap10_calculator/, backend/documents_parser/tests/, datatypes/epc/, and the golden fixtures (esp. 6035).
  • Pre-existing failures to IGNORE (fail on the stashed baseline too, NOT yours): test_from_rdsap_schema.py::…::test_total_floor_area, and the 2 stone-wall U tests in domain/sap10_ml/tests/test_rdsap_uvalues.py (…stone_granite_thin_wall_age_a_120mm…, …stone_sandstone…) — likely fallout from the §5.7 wall-U slice 27375d93; worth a separate fix but not yours to count against net-zero.
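As a reminder of the test shape the conventions ask for, here is a minimal sketch. It assumes "AAA test headers" means Arrange / Act / Assert comment headers; the function under test is a hypothetical stand-in (not a real module in this repo) modelled on the party-floor slice, where floor code 6 means no heat loss.

```python
def floor_heat_loss_w_per_k(floor_code, area_m2, u_value):
    """Hypothetical stand-in: code 6 (another dwelling below) loses no heat."""
    if floor_code == 6:
        return 0.0
    return area_m2 * u_value


def test_party_floor_has_no_heat_loss():
    # Arrange
    floor_code, area_m2, u_value = 6, 60.0, 0.8
    tol = 1e-9

    # Act
    result = floor_heat_loss_w_per_k(floor_code, area_m2, u_value)

    # Assert: explicit absolute tolerance, not pytest.approx
    assert abs(result - 0.0) <= tol
```

The explicit abs(x-y)<=tol form keeps the tolerance visible in the diff, so any attempt to widen it shows up in review.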

ARCHITECTURE NOTES THAT COST TIME (so you don't re-discover them)

  • The API cost path uses inputs.fuel_cost (the Table-32/12a precompute, _fuel_cost), NOT the scalar space_heating_fuel_cost_gbp_per_kwh. calculator.py:540 picks the precompute when populated, ELSE the legacy scalar fields. _fuel_cost returns a ZERO sentinel for any off-peak tariff → the calculator then falls back to the legacy scalar fields (which DO carry the off-peak rate from _space_heating_fuel_cost_gbp_per_kwh). So a tariff change only bites if it flips _fuel_cost's tariff off STANDARD.
  • _table_12a_system_for_main maps cat-10 room heaters → OTHER_DIRECT_ACTING_ELECTRIC but leaves storage (401-409, correct: → None → 100% low rate) and direct-boiler 191 / CPSU as TODO (→ None → pure low rate, which OVER-credits 191 on off-peak). Wiring 191/CPSU rows is a prerequisite if you ever revisit Rule-3-on-Unknown.
  • Fuel codes stored on SapResult are the RAW API enum (26 = mains gas), not Table-12 codes — translate via table_12.API_FUEL_TO_TABLE_12 (the decomposition script does this).
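The first note above (precompute versus legacy scalar) is the one most likely to be misread, so here is a minimal sketch of the selection it describes. The function name is hypothetical, "populated" is simplified to "greater than zero", and the off-peak figure is an invented placeholder, not a Table 12 value; only the 13.19p standard rate comes from this handover.

```python
def effective_unit_price(precomputed_gbp_per_kwh, legacy_scalar_gbp_per_kwh):
    """Hypothetical sketch: use the precompute unless it is the zero sentinel."""
    if precomputed_gbp_per_kwh > 0.0:
        return precomputed_gbp_per_kwh
    # Zero sentinel (returned for off-peak tariffs): fall back to the legacy scalar.
    return legacy_scalar_gbp_per_kwh


STANDARD_GBP_PER_KWH = 0.1319          # 13.19p, from this handover
OFF_PEAK_PLACEHOLDER_GBP_PER_KWH = 0.09  # invented for illustration only

# Standard tariff: the precompute is populated and wins.
print(effective_unit_price(STANDARD_GBP_PER_KWH, OFF_PEAK_PLACEHOLDER_GBP_PER_KWH))
# Off-peak tariff: the precompute is the sentinel, so the legacy scalar is billed.
print(effective_unit_price(0.0, OFF_PEAK_PLACEHOLDER_GBP_PER_KWH))
```

This is why a tariff change only shows up in the result when it moves the precompute off the standard tariff: otherwise the first branch is taken and the legacy off-peak rate is never read.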