Model/domain/sap10_calculator/docs/HANDOVER_API_SAMPLE_ACCURACY.md
Khalim Conn-Kowlessar 6d4fa5dd3b docs: handover — fold in S0380.232-234 (case-19 PV closure) + open causes
Records the three PV slices shipped (D_PV off-peak exclusion, weighted
dwelling import price, Appendix G4 diverter), the resulting case-19 state
(SAP 50.33→51.34, rounds to lodged 51), and the two remaining case-19
causes (winter Appendix-M EPV monthly shape; fabric (33) +1.0). Adds the
`2100-5421` worst-offender diagnosis (a 352 m² uninsulated solid-wall
dwelling on the as-built-insulated-assumed roof-U front, not a flats bug).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-04 23:06:20 +00:00


Handover — wide-scale API accuracy study + next steps

Point-in-time note. Start from AGENT_GUIDE.md for methodology, the 1e-4 bar, the per-line debugging loop, the section helpers, and the suite command.

  • Branch: feature/per-cert-mapper-validation
  • HEAD: 9521d524. Next SAP slice: S0380.235.
  • Baseline (§4 suite): tests/domain/sap10_calculator/ backend/documents_parser/tests/ → green (2412 passed, 1 skipped). Pre-existing out-of-scope failures unchanged (stone-§5.6 in domain/sap10_ml/tests/; test_from_rdsap_schema.py::...test_total_floor_area).

Shipped this session (S0380.232-234 — the case-19 PV closure)

The PV diverter (the prior handover's S0380.232 ask) needed two prerequisite spec bugs fixed first; all three landed:

| slice | commit | spec | what |
| --- | --- | --- | --- |
| S0380.232 | 212b0c92 | App M1 §3a (p.93, l.5470-5476) | D_PV excludes the LOW-rate portion of an off-peak electric main: (211) is only PV-eligible where its §10a code ∈ {30,32,34,35,38}. Storage heaters on 7-hr charge wholly at low rate → fraction 0.0 → excluded. β_Jan 0.894→0.792 (ws 0.791). New _main_space_heating_high_rate_fraction. |
| S0380.233 | d4a8c02b | App M1 §6 (p.94, l.5510-5513) | PV-used-in-dwelling credited at the Table 12a ALL_OTHER_USES weighted rate (7-hr 14.311 p/kWh), not the bare low rate (5.50). Was under-crediting onsite PV on every off-peak PV cert. Delegates to _other_fuel_cost_gbp_per_kwh; STANDARD unchanged. |
| S0380.234 | 9521d524 | Appendix G4 (p.72-73) | The PV diverter. 3 layers: extractor Diverter present + schema pv_diverterpv_diverter_present flag (Elmhurst + API mappers) → _pv_diverter_monthly_kwh (SPV = export×0.8×0.9, clamp ≤ (62)+(63a), → (63b)m); cert_to_inputs recomputes (219) + PV export, β fixed pre-diverter. |

Case 19 now: SAP cont 50.33 → 51.34 (ws 51.2221; both round to lodged 51), cost (255) 1847.5→1812.3 (ws 1816.6), CO2 3331→3120 (ws 3126), (233a) dwelling 1280.6 (ws 1280.4 — the β fix pins it). The diverter formula is exact in summer (Jun SPV 186.07 = export×0.72, matches ws (63b)).
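The diverter arithmetic quoted above can be sketched in a few lines. This is a minimal illustration, not the repository's _pv_diverter_monthly_kwh: the function name, argument names and sample inputs are hypothetical, and only the 0.8 × 0.9 factor, the clamp to (62)+(63a) and the negative sign come from this note.

```python
def diverter_63b_month(
    export_kwh: float, line_62_kwh: float, line_63a_kwh: float = 0.0
) -> float:
    """Return (63b)m in kWh: the diverted PV, entered as a negative number."""
    s_pv = export_kwh * 0.8 * 0.9               # SPV = export x 0.8 x 0.9 (= export x 0.72)
    cap = max(line_62_kwh + line_63a_kwh, 0.0)  # clamp: SPV <= (62)m + (63a)m
    return -min(s_pv, cap)


# Illustrative inputs only: 258.43 kWh is back-derived from the Jun SPV of 186.07,
# and the (62) values are made up to show the unclamped and clamped branches.
unclamped = diverter_63b_month(export_kwh=258.43, line_62_kwh=300.0)
clamped = diverter_63b_month(export_kwh=258.43, line_62_kwh=100.0)
print(round(unclamped, 2), clamped)  # -186.07 -100.0
```

In summer the clamp rarely binds, which is consistent with the Jun figure matching the worksheet as a plain export × 0.72.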

The remaining +0.11 SAP on case 19 = two separate, still-open causes:

  1. Winter Appendix-M monthly EPV shape. Our annual EPV (2684.17) matches the worksheet exactly and Jun-Sep match per-month exactly, but Jan-May/Oct-Dec our EPV is ~9-11% LOW (worksheet Jan 68.2 vs ours 62.5). Back-solve: ws EPV_m = |(233a)_m| + |(63b)_m|/0.72. This under-diverts in winter → export (233b) 280.7 vs ws 184.2, and (219) 3322 vs ws 3188. A two-array PV apportionment issue (case 19 has SE + NW arrays with different overshading) — chase in §M / Appendix U monthly radiation, NOT the diverter (which is validated).
  2. Fabric (33) +1.0 W/K (ours 305.04 vs ws 304.04) — a single element off by exactly 1.0; floor=25.000 is suspiciously round. Walk the per-element §3 breakdown.

The eval headline is flat (42.9→43.0% <0.5; cat-7 5.25→4.93) — expected: the diverter is rare and the β/price effects are small on the rounded SAP. The value was pinning the worksheet-validated case 19 + fixing three real spec bugs that the curated cohort masked.

Headline now (1,000-cert 2026 API sample, HEAD f326e4eb)

| metric | value | was (handover baseline 9c0a373f) |
| --- | --- | --- |
| computed | 882 / 1000 | 882 |
| % \|err\| < 0.5 | 42.9% | 41.8% |
| % < 1 / < 2 / < 5 | 56.7% / 74.6% / 90.1% | 54.9 / 71.9 / 87.8 |
| median / mean \|err\| | 0.73 / 2.04 | 0.79 / ~2.4 |
| mean signed | 0.41 | +0.2 |

Error by heating cluster (the load-bearing cut — re-run analyse_api_sap_clusters.py):

| cluster | n | mean \|err\| | %<0.5 | note |
| --- | --- | --- | --- | --- |
| cat 2 gas boiler + PCDB | 639 | 1.27 | 49.6% | well-trodden |
| cat 2 gas, NO PCDB idx | 91 | 3.18 | 35.2% | non-PCDB Table-4b boilers |
| cat 6 community | 45 | 2.59 | 31.1% | known-hard |
| cat 7 electric storage | 40 | 5.25 | 10.0% | was 7.33 → S0380.227-229 |
| cat 10 electric room heaters | 48 | 5.26 | 16.7% | was 9.49 → S0380.230-231 (bias gone) |
| cat 4 HP + PCDB | 8 | 6.11 | 12.5% | small n, APM |
| Flats (any) | 282 | 2.57 | 30.5% | geometry / communal |
| real PV | 45 | 3.90 | 26.7% | Appendix M |

Worst individual offenders (the long tail — eval TOP 40): 2100-5421-0922-1622-3463 (60.8, our SAP negative 24.8 vs lodged 36 — a flat, 2 bps, cat-2; the single worst, likely a geometry/communal blow-up — START a per-cert dig here), 2958-8008 (+32, age 6=tiny), 9836-5829 (29.5, cat-10 tail), several cat-7/cat-10 in the 20s.

Work shipped (this session — S0380.227-231 + 3 mapper commits)

| commit | what |
| --- | --- |
| S0380.227 | dedicated DHW-only system (WHS 911) is NOT separately timed → no Table 2b ×0.9; TF (53) 0.54→0.60, (59) h=3→5 |
| S0380.228 | electric SECONDARY on off-peak bills at Table 12a OTHER_DIRECT_ACTING_ELECTRIC (1.00 high-frac), not 100% low |
| S0380.229 | dedicated water boiler/circulator (WHC 911-931) feeds cylinder via primary loop → Table 3 primary loss applies |
| S0380.230 | electric room heaters (cat 10) on off-peak → OTHER_DIRECT_ACTING_ELECTRIC (mirror of .228 for the MAIN). cat-10 9.49→7.11 |
| S0380.231 | Dual-meter electric room heaters → 10-hour tariff (RdSAP §12 Rule 3; codes 691-694,699). cat-10 7.11→5.26, bias +5.08→0.86 |
| bd25a3c7 | SY system-built vs B basement: code 6 stays system-built; basement → explicit wall_is_basement/is_basement flag. system_build is a derived property (wall type). API path post-processes via addendum. (issue #1177; see docs/PR_NOTE_system_built_basement_1177.md: field-vs-property merge landmine) |
| f326e4eb | Elmhurst path now populates roof_construction (int) via _elmhurst_roof_construction_int for cross-mapper parity (API set it, Elmhurst didn't) |

  • Toolkit (committed): scripts/fetch_2026_epc_sample.py, scripts/eval_api_sap_accuracy.py, scripts/analyse_api_sap_clusters.py. The 1,000 cached JSONs live in /tmp/epc_2026_sample/ (gitignored scratch — re-fetch with the sampler; EPC_SAMPLE_CACHE overrides the dir). Re-run the eval after any mapper/calculator change to watch the headline move.

What this study did

Fetched a random 1,000-cert sample of domestic EPCs lodged Jan-May 2026 from the GOV.UK EPB register (the /api/domestic/search date-windowed endpoint enumerates cert numbers across random pages, then /api/certificate returns the full schema-21 JSON per cert), ran each through the API path (from_api_response → cert_to_inputs → continuous SAP), and compared the result to the lodged rounded energy_rating_current.

This is the first measurement of raw-API behaviour on an unbiased population — the curated golden cohort (~exact) masked it.
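The headline metrics reported in this note (bucket percentages, median and mean |err|, mean signed error) can be reproduced from per-cert pairs along the following lines. This is a minimal sketch, not the committed evaluator: the function name, the dictionary keys and the sample pairs are hypothetical, and the sign convention (ours minus lodged, so positive means we over-rate) is an assumption taken from the "slight over-rate" wording used for a positive mean signed error.

```python
from statistics import mean, median


def summarise(pairs: list[tuple[float, int]]) -> dict[str, float]:
    """Summarise (our continuous SAP, lodged rounded SAP) pairs."""
    signed = [ours - lodged for ours, lodged in pairs]  # positive = we over-rate
    abs_err = [abs(e) for e in signed]

    def pct_below(limit: float) -> float:
        # Share of computed certs whose absolute error is strictly below `limit`.
        return 100.0 * sum(e < limit for e in abs_err) / len(abs_err)

    return {
        "pct_lt_0.5": pct_below(0.5),
        "pct_lt_1": pct_below(1.0),
        "pct_lt_2": pct_below(2.0),
        "pct_lt_5": pct_below(5.0),
        "median_abs": median(abs_err),
        "mean_abs": mean(abs_err),
        "mean_signed": mean(signed),
    }


# Made-up pairs for illustration, not rows from the 1,000-cert sample.
stats = summarise([(51.34, 51), (50.6, 51), (53.0, 51), (45.0, 51)])
print(stats["pct_lt_0.5"], stats["pct_lt_5"])  # 50.0 75.0
```

Only certs that compute belong in the list; certs that raise or use an unsupported schema are counted separately in the headline.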

Reproduce

  • Sampler/fetcher: /tmp/sample_fetch_2026.py → caches JSONs to /tmp/epc_2026_sample/.
  • Evaluator: /tmp/eval_sap_accuracy.py → per-cert CSV + summary (% <0.5, buckets, worst-40, raise breakdown). Cluster analysis: /tmp/analyze2.py. (Token in backend/.env OPEN_EPC_API_TOKEN; date_end must be < today.)
  • These scripts are uncommitted (in /tmp). Worth promoting to scripts/ if this becomes a recurring measurement.

Headline (at HEAD 9c0a373f)

| metric | value |
| --- | --- |
| computed | 882 / 1000 (100 unsupported pre-21 schema; 18 still raise) |
| % \|err\| < 0.5 (of computed) | 41.8% |
| % < 1.0 / < 2.0 / < 5.0 | 54.9% / 71.9% / 87.8% |
| median / mean \|err\| | 0.79 / ~2.4 |
| mean signed err | +0.2 (slight over-rate) |

Accuracy is dominated by heating type (the load-bearing cut):

| main_heating_category | n | mean \|err\| | %<0.5 | status |
| --- | --- | --- | --- | --- |
| 2 = gas boiler (PCDB-indexed) | 579 | 1.30 | 48% | the well-trodden path |
| 7 = electric storage heaters | 39 | 7.33 | 3% | broken: #1 lever |
| 10 = electric room heaters | 43 | 10.26 | 9% | broken: #2 lever |
| 6 = community scheme | 38 | 2.28 | 34% | known-hard |
| Flats (any heating) | 242 | 3.19 | 29% | geometry + communal |

Work shipped this session (S0380.219-225)

Coverage unblocked 788 → 882 computed (+94); one real accuracy bug fixed (+22 certs).

| slice | fix | certs |
| --- | --- | --- |
| S0380.219 | floor_construction 3 → "Suspended, not timber" (RdSAP 10 field 3-1) | ~44 |
| S0380.220 | floor_construction 0 → None (Table 19 unknown; proven inert) | 37 |
| S0380.221 | default missing post_town (unused metadata) | 1 |
| S0380.222 | roof_construction 6 (thatched) + 7 (dwelling above) → None (inert) | 5 |
| S0380.223 | _part_geometry early-return key contract (RR KeyError) | 5 |
| S0380.224 | loose-jacket cylinder storage loss (Table 2 Note 1); was None'd out → zero loss | 22 (mean err +2.29 → +0.45) |
| S0380.225 | §10.7 no-water-heating default A-F → 12mm loose jacket | 2 |
| S0380.226 | Elmhurst "Jacket" cylinder insulation → loose-jacket code 2 (Summary path) | (unblocked case 19) |

Headline at HEAD: 882 / 1000 computed, 41.8% < 0.5 (re-run the eval to refresh).


★ Active worksheet: simulated case 19 — the electric-storage-heater debug

The user generated sap worksheets/golden fixture debugging/simulated case 19/ (Summary_001431 (2).pdf + P960-0001-001431 - 2026-06-04T174437.228.pdf), purpose-built to hit the #1 cluster. It exercises electric storage heaters (SAP code 402, control 2402 auto-charge, 7-hr off-peak tariff) + a loose-jacket 210 L cylinder + WHS 911 (gas boiler for water only) + room-in-roof gables (Party + Exposed) + an alternative wall + exposed floor + electric secondary.

S0380.226 unblocked extraction (the "Jacket" label was raising). The worksheet has FOUR blocks: block 1 = rating (UK-avg region 0; cost (255)=1816.58, SAP (258)=51, TF (53)=0.60, (51)=0.0330), block 2 = demand (postcode; CO2 (272)=3125.85, PE (286)=30271.76), blocks 3/4 = the potential/improved variants. Pin the rating block for SAP/cost, the demand block for PE/CO2. Worksheet header line 116 lodges "Separate Time Control: No" (NOT in the Summary §15 PDF — only in the P960 header).

Three slices shipped (S0380.227-229) closed the +9 cluster signature; SAP cont 60.2 → 50.33 (worksheet ~51.22):

| slice | line ref | fix | SAP cont |
| --- | --- | --- | --- |
| S0380.227 | TF (53) 0.54→0.60; (59) h=3→h=5 | dedicated DHW-only system (WHS 911) is NOT separately timed → no Table 2b ×0.9 (RdSAP 10 §10.5.1). _separately_timed_dhw gated on WHC ∈ {901,902,914}. Worksheet-pins S0380.224's loose-jacket (51)=0.0330/(53)=0.60/(55)=3.4531/(56-57)Jan=107.0456 at 1e-4. | 60.2→60.1 |
| S0380.228 | cost (255) | electric SECONDARY on off-peak bills at Table 12a OTHER_DIRECT_ACTING_ELECTRIC (7-hr high-frac 1.00 = £0.1529), not the flat off-peak low (£0.0550). Worksheet (242): "1.00 × 15.29 + 0.00 × 5.50". THE primary cost driver (340). | 60.1→50.67 |
| S0380.229 | (62) 2493.30→3169.98 | dedicated water-heating boiler/circulator (WHC 911-931) feeds the cylinder via a primary loop → Table 3 row 1 primary loss applies (keyed off water_heating_code, since _water_heating_main returns the electric SPACE main). Restored the missing (59)=676.68 kWh/yr. | 50.67→50.33 |
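The off-peak pricing behind the cost (255) fix is a two-rate weighted average. The sketch below is a minimal illustration under stated assumptions: the function and constant names are hypothetical, the 15.29 and 5.50 p/kWh rates are the 7-hr figures quoted in this note, and the 0.90 fraction is an inference (it is the value that reproduces the 14.311 p/kWh ALL_OTHER_USES figure quoted for S0380.233), not a number read from Table 12a here.

```python
def weighted_price_p_per_kwh(
    high_frac: float, high_rate_p: float, low_rate_p: float
) -> float:
    """Blend the high and low tariff rates by the high-rate fraction."""
    if not 0.0 <= high_frac <= 1.0:
        raise ValueError(f"high_frac must be within [0, 1], got {high_frac}")
    return high_frac * high_rate_p + (1.0 - high_frac) * low_rate_p


HIGH_7HR_P = 15.29  # p/kWh, 7-hr high rate quoted in this note
LOW_7HR_P = 5.50    # p/kWh, 7-hr low rate quoted in this note

# Electric secondary: high-rate fraction 1.00, as in worksheet (242).
secondary = weighted_price_p_per_kwh(1.00, HIGH_7HR_P, LOW_7HR_P)
# Assumed fraction 0.90, chosen because it reproduces the quoted 14.311 p/kWh.
all_other_uses = weighted_price_p_per_kwh(0.90, HIGH_7HR_P, LOW_7HR_P)
print(secondary, round(all_other_uses, 3))  # 15.29 14.311
```

Billing the secondary at the flat low rate corresponds to a high-rate fraction of 0.00, which is the behaviour S0380.228 replaced.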

The ONE remaining case-19 cause, the PV diverter (63b), is S0380.232. Worksheet header line 124 lodges "Diverter = Yes"; Summary §19 lodges "Diverter present: Yes". Per SAP 10.2 Appendix G4 (PDF p.72-73), surplus PV is diverted to the cylinder immersion: S_PV,diverter,m = EPV,m × (1 - βm) × 0.8 × 0.9, clamped to ≤ (62)m + (63a)m, entered as a NEGATIVE (63b)m. Then (64)m = (62)m + (63a)m + (63b)m + … and (219)m = (64)m / eff. All four G4 inclusion conditions are met (PV connected to dwelling; cylinder 210 L > (43)=74.24; no solar HW; no battery). Worksheet (63b) annual ≈ 1097.67 kWh → (64) drops 3169.98 → 2072.31, (219) 4876.9 → 3188.17. It ALSO changes the PV β-split (export drops: worksheet dwelling 1280.39 / exported 184.16 vs our 1496.20 / 1187.98 with no diverter). This is a 3-layer feature (extractor Diverter present → mapper flag → calculator (63b) + β-split interaction), so implement it as one focused slice. Spec note p.5485: for the β calc, (219)m must EXCLUDE the diverter saving.
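The annual worksheet figures quoted above are mutually consistent, which the following sketch checks. It is an illustration only: the function name is hypothetical, the other terms hidden behind the "…" in (64) are omitted, (63a) is assumed to be zero because the quoted (62) and pre-diverter (64) are both 3169.98, and the 0.65 efficiency is the water_efficiency_pct value used in this note's debug recipe.

```python
def water_heating_fuel_kwh(
    line_62: float, line_63a: float, line_63b: float, efficiency: float
) -> tuple[float, float]:
    """Return ((64), (219)) from annual totals; (63b) is passed as a negative."""
    line_64 = line_62 + line_63a + line_63b  # output from the water heater
    line_219 = line_64 / efficiency          # fuel used for water heating
    return line_64, line_219


EFF = 0.65  # assumed from the debug recipe's water_efficiency_pct

before = water_heating_fuel_kwh(3169.98, 0.0, 0.0, EFF)      # no diverter
after = water_heating_fuel_kwh(3169.98, 0.0, -1097.67, EFF)  # worksheet (63b)

print(round(before[1], 1))                   # 4876.9
print(round(after[0], 2), round(after[1], 2))  # 2072.31 3188.17
```

For the β calculation the pre-diverter (219) is the relevant value, since the spec note says the diverter saving must be excluded there.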

Smaller residuals after the diverter lands: main fuel (211) ours 20250.22 vs ws 19910.30 (+340), secondary (215) 3573.57 vs 3513.58 (+60), fabric (33) +1.0 (gable/alt-wall). Current demand block: CO2 (272) 3331.04 vs 3125.85, PE (286) 31653.23 vs 30271.76 — both will drop with the diverter (less grid import).

Debug recipe (reuse the /tmp/case19*.py throwaways or rebuild):

pages → ElmhurstSiteNotesExtractor(...).extract() → from_elmhurst_site_notes
 → cert_to_inputs / cert_to_demand_inputs → calculate_sap_from_inputs
# CI._cylinder_storage_loss_override(epc, main) → (57)m; CI._primary_loss_override(epc, age) → (59)m
# CI._water_heating_worksheet_and_gains(epc=…, water_efficiency_pct=0.65, is_instantaneous=False,
#   primary_age=<band>, pcdb_record=None) → wh_result with (45)/(46)/(57)/(59)/(62)/(64)

Remaining work, prioritised

A. Accuracy clusters (highest value)

  1. PV diverter (S0380.232) — closes case 19 to 1e-4 AND helps the real-PV cluster (45 certs, mean 3.90). Fully spec'd in the case-19 section above (Appendix G4). Has a worksheet → 1e-4 bar. Do this first: it's the one open cause on a validated worksheet.
  2. Electric storage heaters (cat 7, 40 certs, mean 5.25). S0380.227-229 took it 7.33→5.25; the case-19 PV diverter will help further. Beyond that the tail is per-cert — a dedicated cat-7 worksheet (no PV, no diverter) would let you pin charge-control / responsiveness at 1e-4 instead of the ±0.5 lodged fallback.
  3. Electric room heaters (cat 10, 48 certs, mean 5.26). S0380.230-231 fixed the systematic tariff bias (mean 9.49→5.26, signed +5.08→0.86); the residual is now scattered per-cert (e.g. 9836-5829 29.5, an under-rater). A cat-10 worksheet pins the tail at 1e-4.
  4. Non-PCDB gas boilers (cat 2, no idx, 91 certs, mean 3.18) and Flats (282, mean 2.57) — the next volume levers once the electric clusters are worksheet-pinned. Flats = geometry / communal; start with the worst (2100-5421 negative SAP).
    • 2100-5421-0922-1622-3463 diagnosed (S0380.234 session): NOT a flat. It is property_type 0, a 352 m² 2-storey uninsulated solid-wall dwelling (wall_constr 3 / wall_ins 4 as-built; roof_type 4, no roof insulation). Our space-heating demand is 71,084 kWh/yr → (37)=995.93 W/K → SAP -24.8 (lodged 36), cost £14,045. This is the as-built insulated-assumed U-value front (project_as_built_insulated_assumed_bug; S0380.209 fixed walls, "roof next"): the uninsulated-roof / as-built U over-estimates demand on big old dwellings. API-only (no worksheet → ±0.5 lodged fallback only); needs a generated worksheet or a roof-U spec audit to pin. It is one outlier, not a cluster-wide flats bug.

B. Remaining raises (16 certs — all U-value / heat-loss-sensitive, NOT enum guesses)

  • gable_wall_type 2 & 3 (14 certs). RdSAP 10 Table 4 RR walls: 0=Party (U=0.25), 1=Exposed (U=common wall), 2/3 = Sheltered (U=external×R0.5) + Adjacent-to-heated (U=0), code↔type order unconfirmed (schema says "not yet seen"). Needs (i) a worksheet to pin which code is which + the U-values, and (ii) calculator support — the cascade only has gable_wall/gable_wall_external kinds; Sheltered (R=0.5) and Adjacent (U=0) are new. Best real example: 2818-3053-3203-2655-9204 lodges BOTH gable 2 and 3.
  • main_heating_category: 9 = warm air, mains gas (1 cert). Needs §9 warm-air dispatch.
  • wall_insulation_thermal_conductivity 3 (1 cert). Verified it shifts wall U (53.96→51.61 across λ) → worksheet-backed (the resolver's own discipline).
  • floor_heat_loss 8 (2 certs). Semantically unconfirmed; inert for the 2 observed (non-Main bp) but potentially "heated space below" (→ should exclude the floor, a calculator change). Don't guess.

The clean mapper-enum raises are exhausted — every remaining raise changes the answer, which is what the strict-raise guard exists to prevent.
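The strict-raise discipline described above amounts to mapping only worksheet-validated codes and failing loudly on everything else. The sketch below is a hypothetical illustration, not the repository's mapper: the exception class, the function name and the "party" / "exposed" labels are invented for the example, and only the two confirmed codes (0 = Party, 1 = Exposed) come from this note.

```python
class UnsupportedCodeError(ValueError):
    """Raised when a lodged code has no worksheet-validated mapping."""


# Only the codes this note treats as confirmed; 2 and 3 are deliberately absent.
_GABLE_WALL_KINDS: dict[int, str] = {0: "party", 1: "exposed"}


def gable_wall_kind(code: int) -> str:
    """Map a lodged gable_wall_type code, refusing to guess unknown codes."""
    try:
        return _GABLE_WALL_KINDS[code]
    except KeyError:
        raise UnsupportedCodeError(
            f"gable_wall_type {code} is not worksheet-validated"
        ) from None


print(gable_wall_kind(0))  # party
try:
    gable_wall_kind(2)
except UnsupportedCodeError as exc:
    print(exc)  # gable_wall_type 2 is not worksheet-validated
```

A raised cert drops out of the computed count instead of contributing a silently wrong SAP, which keeps the accuracy headline honest.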


★ Additional worksheets that would help most (the user will generate these on request)

The two electric clusters are now systematic-bias-free (S0380.227-231) but their TAILS sit at the ±0.5-vs-lodged fallback bar because no worksheet validates them at 1e-4. The three highest-value worksheets to ask the user for:

  1. An electric ROOM-heater dwelling (SAP code ~691, control 2601/2602/2603, Dual meter) — pins the cat-10 tail (48 certs, mean 5.26) at 1e-4. Make it PV-free + cylinder-free to isolate the space-heat path from the diverter/HW.
  2. An electric STORAGE-heater dwelling distinct from case 19 (no PV, no WHS-911) — pins the cat-7 tail (40 certs, mean 5.25): charge control (2401/2402), 7-hr vs 24-hr, responsiveness.
  3. A room-in-roof with a SHELTERED gable and an ADJACENT-TO-HEATED gable (Table 4 types beyond Party/Exposed) — closes the gable_wall_type 2/3 raise (14 certs) and pins the Sheltered (U=ext×R0.5) / Adjacent (U=0) U-values the calculator must add.

Per worksheet send BOTH the Summary PDF (input) and the P960/dr87 worksheet PDF (the (1)..(286) ground truth). Drop them in sap worksheets/golden fixture debugging/<name>/ and run the case-19 debug recipe.

The original "design one property" guidance (kept below for reference) is what case 19 was built from.

What to generate — the single most productive worksheet (reference)

Heating is one-per-property, so one worksheet can't cover all four broken heating types. But fabric is independent of heating, so the highest-ROI single artifact bundles the #1 accuracy cluster with the fabric that closes the gable raises and pins the loose-jacket fix.

Build (in Elmhurst, a simulated case is fine — same as the existing simulated case N worksheets) ONE property:

A house heated by ELECTRIC STORAGE HEATERS, with a room-in-roof and a hot-water cylinder:

  • Heating: electric storage heaters (off-peak / Economy-7 tariff), with a clear control type. This is the load-bearing choice — it validates the 39-cert cat-7 cluster.
  • Hot water: a cylinder with a loose-jacket insulation (not factory foam), a stated jacket thickness, and a cylinder thermostat. Pins S0380.224's loose-jacket storage loss (56)m at 1e-4 — currently only direction-validated.
  • Room-in-roof with two gable walls of different types — ideally one "Sheltered" and one "Adjacent to another heated space" (plus, if the tool allows, a Party and an Exposed gable). Gives the Table 4 U-values for gable_wall_type 2 & 3 and disambiguates the code order — closes the 14-cert raise.
  • An extension (2nd building part) with a different floor exposure (e.g. over unheated space or "to external air"). Exercises multi-bp geometry + floor-exposure handling.

From that single worksheet I can pin, at 1e-4: the electric-storage space-heating lines ((210)/(211)/space-heat), the loose-jacket storage loss (56)m, the RR gable U-values (30)/(32), and the multi-bp fabric (27)-(37). That's one cluster + one fix-validation + the biggest raise + fabric, all in one document.

If you'd rather do two: add a second worksheet that is identical but with electric room heaters instead of storage heaters — together they cover cat 7 + cat 10 (≈ 82 certs, the two worst clusters). A third for a community-heating flat would cover cat 6 + the flat geometry cluster.

Then send me, per worksheet

The Summary PDF (the Elmhurst input/site-notes) + the worksheet PDF (the (1)..(286) ground truth). With those I run both front-ends through the cascade and pin each line ref at 1e-4, exactly as for the with api 3 pair (S0380.218).


Conventions (unchanged)

One cause = one slice = one commit; spec citation (page+line) in the message; AAA tests (# Arrange / # Act / # Assert); abs(x - y) <= tol (not pytest.approx); SAP 10.2 only; no tolerance widening / xfail / rel-tol. New code passes pyright strict with ZERO NEW errors (baseline-compare with git stash; mapper.py / cert_to_inputs.py / heat_transmission.py carry pre-existing errors — compare counts). Stage files by name (the tree has unrelated pytest.ini/scripts/ changes that must NOT be staged). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>.
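
The test conventions above (AAA layout, absolute tolerance, no pytest.approx) look like this in practice. This is a sketch under stated assumptions: line_64_kwh is a hypothetical stand-in for whatever calculator function a slice touches, and the figures are the case-19 worksheet values quoted in this note with (63a) assumed to be zero.

```python
TOL = 1e-4  # the worksheet-pinned bar; never widened


def line_64_kwh(line_62: float, line_63a: float, line_63b: float) -> float:
    """Hypothetical function under test: (64) = (62) + (63a) + (63b)."""
    return line_62 + line_63a + line_63b


def test_line_64_includes_negative_diverter_term() -> None:
    # Arrange
    line_62, line_63a, line_63b = 3169.98, 0.0, -1097.67
    expected = 2072.31  # worksheet (64), case 19

    # Act
    actual = line_64_kwh(line_62, line_63a, line_63b)

    # Assert
    assert abs(actual - expected) <= TOL


test_line_64_includes_negative_diverter_term()
```

Using a plain absolute difference keeps the tolerance explicit in the assertion, so a reviewer can see at a glance that the 1e-4 bar has not been relaxed.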