Model/docs/sap-spec/PARITY_FINDINGS.md
Khalim Conn-Kowlessar dde8ae30fa S-B2: parity probe + first-pass findings (100-cert baseline)
Adds services/ml_training_data/src/ml_training_data/sap_parity_probe.py
— samples N certs from the v18a corpus, streams them via BulkZipReader,
runs Sap10Calculator, prints MAE/RMSE/bias + worst-N residuals. Baseline
across 100 certs: MAE 8.41, RMSE 13.98, bias -2.65, 0 errors.

docs/sap-spec/PARITY_FINDINGS.md captures the dominant failure pattern
(flats + bungalows under-predicted, 10 of the worst-15 are flats whose
floor/roof are party with neighbouring dwellings) and the priority-
ordered Session B iteration backlog (S-B-flat-surfaces first).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-18 13:59:23 +00:00


Sap10Calculator parity probe — findings as of 2026-05-18

100-cert random sample from data/ml_training/runs/2025_2026_n250000_v18a/data.parquet, filtered to cert sap-score 20-95 (typical band). 0 errors — calculator runs end-to-end on every cert.

Headline

| Metric | Value |
| --- | --- |
| MAE | 8.41 SAP-points |
| RMSE | 13.98 |
| Bias | -2.65 (slight under-prediction) |
| Within ±1 | 18.0% |
| Within ±3 | 36.0% |
| Within ±5 | 57.0% |
| Within ±10 | 84.0% |
| Worst residual | -56 SAP-points |

Session B success criterion is MAE ≤ 1.0 on the typical subset; the first pass is at roughly 8× that (8.41), which matches ADR-0009's expectation that the first run shakes out spec-interpretation gaps.
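For reference, the headline metrics can be computed from paired scores along these lines. This is a minimal sketch, not the probe's actual code: the function name `parity_metrics` is hypothetical, residuals follow the predicted - actual sign convention used in this document, and the "within ±k" bands are assumed to be inclusive.

```python
import math


def parity_metrics(actual, predicted, bands=(1, 3, 5, 10)):
    """Summarise calculator-vs-certificate agreement.

    residual = predicted - actual, so a negative bias means under-prediction.
    """
    residuals = [p - a for a, p in zip(actual, predicted)]
    n = len(residuals)
    summary = {
        "mae": sum(abs(r) for r in residuals) / n,
        "rmse": math.sqrt(sum(r * r for r in residuals) / n),
        "bias": sum(residuals) / n,
        # Signed residual with the largest magnitude.
        "worst": max(residuals, key=abs),
    }
    for k in bands:
        # Assumption: a residual of exactly +/-k counts as "within".
        summary[f"within_{k}"] = 100.0 * sum(abs(r) <= k for r in residuals) / n
    return summary


# Toy scores for illustration only; these are not taken from the corpus.
print(parity_metrics(actual=[70, 60, 80, 50], predicted=[69, 63, 70, 50]))
```

RMSE sitting well above MAE, as in the headline table, is what a handful of large outliers on top of a moderate typical error would produce.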

Dominant failure shape: flats and bungalows under-predicted

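10 of the 15 worst residuals are flats or maisonettes, and 4 more are bungalows; 13 of the 15 are under-predictions. Pattern for the flats: calculator charges floor + roof heat loss to dwellings that don't have exposed floor / roof surfaces (mid-floor flats, ground-floor flats with a party ceiling, top-floor flats with a party floor, etc.).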

Worst 15 by absolute residual (residual = predicted - actual):

| Cert | actual | predicted | residual | TFA (m²) | dwelling |
| --- | --- | --- | --- | --- | --- |
| 0320-2756-7670-2196-2035 | 78 | 22 | -56 | 57 | Semi-detached bungalow |
| 0036-1125-8600-0165-2206 | 63 | 18 | -45 | 42 | Mid-floor flat |
| 0340-2394-5510-2925-4421 | 75 | 35 | -40 | 73 | Mid-floor flat |
| 9360-2179-9590-2495-2615 | 78 | 39 | -39 | 54 | Ground-floor flat |
| 0036-0529-1500-0700-8276 | 75 | 36 | -39 | 47 | Top-floor flat |
| 0350-2182-9590-2526-7841 | 43 | 4 | -39 | 119 | Top-floor flat |
| 2148-3061-6204-0016-7204 | 81 | 44 | -37 | 67 | Mid-floor flat |
| 0800-1364-0922-4522-3963 | 71 | 37 | -34 | 70 | Detached bungalow |
| 2110-6453-5050-8205-9605 | 63 | 31 | -32 | 43 | Ground-floor maisonette |
| 2903-8339-6962-6004-0725 | 75 | 47 | -28 | 11 | Top-floor flat |
| 0320-2850-3380-2125-1661 | 70 | 48 | -22 | 45 | Semi-detached bungalow |
| 8035-9023-1500-0237-3226 | 43 | 63 | +20 | 64 | Detached bungalow |
| 9590-7751-0022-0599-3953 | 51 | 69 | +18 | 74 | Detached house |
| 2118-1198-2619-1711-7960 | 62 | 46 | -16 | 42 | Mid-floor flat |
| 3336-3822-5500-0437-9202 | 70 | 59 | -11 | 73 | Mid-floor maisonette |
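The table can be cross-checked with a short tally. The sketch below copies the actual, predicted and dwelling columns from the rows above; the `dwelling_class` grouping (flat/maisonette, bungalow, house) is an illustrative choice, not something the probe emits.

```python
from collections import Counter

# (actual, predicted, dwelling) copied from the worst-15 table.
WORST_15 = [
    (78, 22, "Semi-detached bungalow"),
    (63, 18, "Mid-floor flat"),
    (75, 35, "Mid-floor flat"),
    (78, 39, "Ground-floor flat"),
    (75, 36, "Top-floor flat"),
    (43, 4, "Top-floor flat"),
    (81, 44, "Mid-floor flat"),
    (71, 37, "Detached bungalow"),
    (63, 31, "Ground-floor maisonette"),
    (75, 47, "Top-floor flat"),
    (70, 48, "Semi-detached bungalow"),
    (43, 63, "Detached bungalow"),
    (51, 69, "Detached house"),
    (62, 46, "Mid-floor flat"),
    (70, 59, "Mid-floor maisonette"),
]


def dwelling_class(dwelling):
    """Coarse grouping of the dwelling label (illustrative only)."""
    label = dwelling.lower()
    if "flat" in label or "maisonette" in label:
        return "flat/maisonette"
    if "bungalow" in label:
        return "bungalow"
    return "house"


counts = Counter(dwelling_class(d) for _, _, d in WORST_15)
# residual = predicted - actual; negative means under-prediction.
under_predicted = sum(1 for a, p, _ in WORST_15 if p - a < 0)

print(dict(counts))
print("under-predicted:", under_predicted, "of", len(WORST_15))
```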

Session B iteration backlog (priority order)

  1. S-B-flat-surfaces — Map dwelling_type to exposed floor/roof flags. Mid/top flats lose their u_floor × ground_floor_area; mid/ground flats lose their u_roof × top_floor_area. Expected impact: closes most of the -20 to -56 residuals on flats.
  2. S-B-heating-eff-fallback — When sap_main_heating_code is None, fall back through main_heating_category + age band to a modern-condensing-boiler efficiency, not the legacy 0.80. ~28% of our 100-cert sample had a null code with category=2.
  3. S-B-electric-storage-tariff — Electric storage heaters (codes 401-409) should price space-heating fuel at Economy-7 low rate (Table 32 code 31, ~5.5 p/kWh), not the standard rate (code 30). This roughly halves the space-heating fuel cost on those certs.
  4. S-B-wall-uvalue-cascade-review — Worst non-flat residuals suggest the wall U-value cascade is too conservative for recently-built / well-insulated stock. Review domain.ml.rdsap_uvalues.u_wall against RdSAP 10 Table 5.
  5. S-B-bungalow-investigation — Bungalow residuals don't fit the flat-surfaces pattern (bungalows have full floor+roof). Hypothesis: thermal-bridging y-factor + storey-count interaction over-counts envelope. Probe specifically before deciding.
  6. S-B-pump-fan-default — We default to 130 kWh/yr; SAP 10.3 Table 4f says higher for systems with mechanical ventilation. Marginal but consistent.
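Item 1 (S-B-flat-surfaces) could look roughly like the sketch below. It assumes dwelling_type arrives as a label shaped like the ones in the worst-15 table ("Mid-floor flat", "Ground-floor maisonette", ...). `ExposedSurfaces`, `exposed_surfaces` and `floor_roof_heat_loss` are hypothetical names, not existing Sap10Calculator API; only u_floor, ground_floor_area, u_roof and top_floor_area come from the backlog text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExposedSurfaces:
    floor: bool
    roof: bool


def exposed_surfaces(dwelling_type: str) -> ExposedSurfaces:
    """Hypothetical helper: infer exposed floor/roof from a dwelling-type label."""
    label = dwelling_type.lower()
    if "flat" not in label and "maisonette" not in label:
        # Houses and bungalows keep both surfaces.
        return ExposedSurfaces(floor=True, roof=True)
    if label.startswith("ground-floor"):
        return ExposedSurfaces(floor=True, roof=False)
    if label.startswith("mid-floor"):
        return ExposedSurfaces(floor=False, roof=False)
    if label.startswith("top-floor"):
        return ExposedSurfaces(floor=False, roof=True)
    # Unknown position: charge both surfaces, i.e. the current behaviour.
    return ExposedSurfaces(floor=True, roof=True)


def floor_roof_heat_loss(dwelling_type, u_floor, ground_floor_area,
                         u_roof, top_floor_area):
    """Floor + roof fabric heat loss (W/K), counting exposed surfaces only."""
    surfaces = exposed_surfaces(dwelling_type)
    floor_loss = u_floor * ground_floor_area if surfaces.floor else 0.0
    roof_loss = u_roof * top_floor_area if surfaces.roof else 0.0
    return floor_loss + roof_loss


# Illustrative U-values and areas, not taken from any certificate.
for dwelling in ("Mid-floor flat", "Top-floor flat", "Detached house"):
    print(dwelling, floor_roof_heat_loss(dwelling, 0.5, 50.0, 0.2, 50.0))
```

Falling back to both surfaces for unrecognised labels keeps the change conservative: a dwelling the mapping does not understand is scored exactly as it is today, so the fix can only move residuals for flats and maisonettes whose position is known.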

How to reproduce

python adhoc/sap_calculator/probe_n.py            # 100 certs, seed=7
python adhoc/sap_calculator/probe_n.py 500 13     # bigger sample
python adhoc/sap_calculator/probe_worst.py        # detailed cert-by-cert dump

probe_n.py runs in ~80s. Errors: 0/100. Mapper handles every real cert shape encountered.