mirror of https://github.com/Hestia-Homes/Model.git synced 2026-06-08 11:17:27 +00:00

Khalim Conn-Kowlessar dde8ae30fa S-B2: parity probe + first-pass findings (100-cert baseline)

Adds services/ml_training_data/src/ml_training_data/sap_parity_probe.py
— samples N certs from the v18a corpus, streams them via BulkZipReader,
runs Sap10Calculator, prints MAE/RMSE/bias + worst-N residuals. Baseline
across 100 certs: MAE 8.41, RMSE 13.98, bias -2.65, 0 errors.

docs/sap-spec/PARITY_FINDINGS.md captures the dominant failure pattern
(flats + bungalows under-predicted, 10 of the worst-15 are flats whose
floor/roof are party with neighbouring dwellings) and the priority-
ordered Session B iteration backlog (S-B-flat-surfaces first).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-18 13:59:23 +00:00


Sap10Calculator parity probe — findings as of 2026-05-18

100-cert random sample from data/ml_training/runs/2025_2026_n250000_v18a/data.parquet, filtered to cert sap-score 20-95 (typical band). 0 errors — calculator runs end-to-end on every cert.
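A minimal sketch of how such a seeded, band-filtered sample could be drawn. The record layout (`sap_score` key), the function name and the inclusive treatment of the 20-95 band are assumptions for illustration; the real probe reads the parquet file and is not reproduced here.

```python
import random


def sample_typical_certs(certs, n=100, seed=7, lo=20, hi=95):
    """Draw up to n certs reproducibly from those with lo <= sap_score <= hi.

    `certs` is any list of dicts with a `sap_score` key (hypothetical layout).
    """
    typical = [c for c in certs if lo <= c["sap_score"] <= hi]
    rng = random.Random(seed)  # private RNG: global random state is untouched
    return rng.sample(typical, min(n, len(typical)))


# Synthetic stand-in corpus with scores 1..120, not real certificate data.
corpus = [{"cert": f"cert-{i}", "sap_score": i} for i in range(1, 121)]
sample = sample_typical_certs(corpus, n=10, seed=7)
```

Using a private `random.Random(seed)` keeps the draw repeatable for a given seed without affecting other code that uses the module-level generator.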

Headline

Metric	Value
MAE	8.41 SAP-points
RMSE	13.98
Bias	-2.65 (slight under-prediction)
Within ±1	18.0%
Within ±3	36.0%
Within ±5	57.0%
Within ±10	84.0%
Worst residual	-56 SAP-points

The Session B success criterion is MAE ≤ 1.0 on the typical subset. The first pass sits at 8.41, roughly 8× that, which is in line with ADR-0009's expectation that the first run shakes out spec-interpretation gaps.
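For reference, the headline metrics can be computed from paired actual/predicted scores as in the sketch below. It assumes residual = predicted − actual (the convention stated for the worst-15 table) and treats the "within ±k" bands as inclusive; the function name is hypothetical and this is not the probe's own code.

```python
import math


def parity_metrics(actual, predicted, tolerances=(1, 3, 5, 10)):
    """Summarise calculator-vs-certificate agreement in SAP points."""
    residuals = [p - a for a, p in zip(actual, predicted)]
    if not residuals:
        raise ValueError("need at least one cert to compute metrics")
    n = len(residuals)
    metrics = {
        "mae": sum(abs(r) for r in residuals) / n,
        "rmse": math.sqrt(sum(r * r for r in residuals) / n),
        "bias": sum(residuals) / n,           # negative means under-prediction
        "worst": max(residuals, key=abs),     # signed residual of largest size
    }
    for t in tolerances:
        # Percentage of certs whose absolute residual is at most t points.
        metrics[f"within_{t}"] = 100.0 * sum(abs(r) <= t for r in residuals) / n
    return metrics


# Toy input, not taken from the 100-cert sample.
m = parity_metrics(actual=[70, 60, 50, 40], predicted=[71, 57, 50, 30])
```

RMSE exceeding MAE by a wide margin, as in the headline table, indicates that a few large residuals dominate the error.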

Dominant failure shape: flats and bungalows under-predicted

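12 of the 15 worst residuals are flats (8) or bungalows (4); counting the two maisonettes with the flats, 10 are dwellings whose floor or roof is party with a neighbouring dwelling. Pattern: the calculator charges floor + roof heat loss to dwellings that don't have an exposed floor or roof (mid-floor flats, top-floor flats with a party floor, ground-floor flats with a party ceiling).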

Worst 15 (residual = predicted − actual):

Cert	Actual	Predicted	Residual	TFA (m²)	Dwelling
0320-2756-7670-2196-2035	78	22	-56	57	Semi-detached bungalow
0036-1125-8600-0165-2206	63	18	-45	42	Mid-floor flat
0340-2394-5510-2925-4421	75	35	-40	73	Mid-floor flat
9360-2179-9590-2495-2615	78	39	-39	54	Ground-floor flat
0036-0529-1500-0700-8276	75	36	-39	47	Top-floor flat
0350-2182-9590-2526-7841	43	4	-39	119	Top-floor flat
2148-3061-6204-0016-7204	81	44	-37	67	Mid-floor flat
0800-1364-0922-4522-3963	71	37	-34	70	Detached bungalow
2110-6453-5050-8205-9605	63	31	-32	43	Ground-floor maisonette
2903-8339-6962-6004-0725	75	47	-28	11	Top-floor flat
0320-2850-3380-2125-1661	70	48	-22	45	Semi-detached bungalow
8035-9023-1500-0237-3226	43	63	+20	64	Detached bungalow
9590-7751-0022-0599-3953	51	69	+18	74	Detached house
2118-1198-2619-1711-7960	62	46	-16	42	Mid-floor flat
3336-3822-5500-0437-9202	70	59	-11	73	Mid-floor maisonette
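The table can be tallied by dwelling family with a short script. The rows below are copied from the table (actual, predicted, dwelling); the `family` helper and its substring matching are an illustrative assumption, not part of the probe.

```python
from collections import Counter

# (actual, predicted, dwelling) for the worst 15, in table order.
WORST_15 = [
    (78, 22, "Semi-detached bungalow"),
    (63, 18, "Mid-floor flat"),
    (75, 35, "Mid-floor flat"),
    (78, 39, "Ground-floor flat"),
    (75, 36, "Top-floor flat"),
    (43, 4, "Top-floor flat"),
    (81, 44, "Mid-floor flat"),
    (71, 37, "Detached bungalow"),
    (63, 31, "Ground-floor maisonette"),
    (75, 47, "Top-floor flat"),
    (70, 48, "Semi-detached bungalow"),
    (43, 63, "Detached bungalow"),
    (51, 69, "Detached house"),
    (62, 46, "Mid-floor flat"),
    (70, 59, "Mid-floor maisonette"),
]


def family(dwelling):
    """Collapse a dwelling label to flat / maisonette / bungalow / house."""
    label = dwelling.lower()
    for key in ("flat", "maisonette", "bungalow"):
        if key in label:
            return key
    return "house"


counts = Counter(family(d) for _, _, d in WORST_15)
under_predicted = sum(1 for a, p, _ in WORST_15 if p - a < 0)
print(dict(counts), under_predicted)
```

On these rows the tally is 8 flats, 4 bungalows, 2 maisonettes and 1 house, with 13 of the 15 under-predicted.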

Session B iteration backlog (priority order)

1. S-B-flat-surfaces: Map dwelling_type to exposed floor/roof flags. Mid/top flats lose their u_floor × ground_floor_area term; mid/ground flats lose their u_roof × top_floor_area term. Expected impact: closes most of the −20 to −56 residuals.
2. S-B-heating-eff-fallback: When sap_main_heating_code is None, fall back through main_heating_category + age band to a modern-condensing-boiler efficiency, not the legacy 0.80. ~28% of our 100-cert sample had a null code with category=2.
3. S-B-electric-storage-tariff: Electric storage heaters (codes 401-409) should price space-heating fuel at the Economy-7 low rate (Table 32 code 31, ~5.5 p/kWh), not the standard rate (code 30). This is a 2× cost reduction on those certs.
4. S-B-wall-uvalue-cascade-review: Worst non-flat residuals suggest the wall U-value cascade is too conservative for recently-built / well-insulated stock. Review domain.ml.rdsap_uvalues.u_wall against RdSAP 10 Table 5.
5. S-B-bungalow-investigation: Bungalow residuals don't fit the flat-surfaces pattern (bungalows have full floor + roof). Hypothesis: the thermal-bridging y-factor + storey-count interaction over-counts the envelope. Probe specifically before deciding.
6. S-B-pump-fan-default: We default to 130 kWh/yr; SAP 10.3 Table 4f gives a higher figure for systems with mechanical ventilation. Marginal but consistent.
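A rough sketch of what the S-B-flat-surfaces mapping might look like. The function names, the label prefixes, and the choice to treat maisonettes like flats are assumptions for illustration; only the rule that mid/top flats drop the floor term and mid/ground flats drop the roof term comes from the backlog item.

```python
def exposed_surfaces(dwelling_type):
    """Return (floor_exposed, roof_exposed) inferred from a dwelling label."""
    label = dwelling_type.lower()
    if "flat" not in label and "maisonette" not in label:
        return True, True        # houses and bungalows: full floor + roof
    if label.startswith("mid-floor"):
        return False, False      # party floor and party ceiling
    if label.startswith("top-floor"):
        return False, True       # party floor, exposed roof
    if label.startswith("ground-floor"):
        return True, False       # exposed floor, party ceiling
    return True, True            # unknown position: keep both terms


def floor_roof_heat_loss(dwelling_type, u_floor, ground_floor_area,
                         u_roof, top_floor_area):
    """Floor + roof fabric loss in W/K, skipping surfaces that are party."""
    floor_exposed, roof_exposed = exposed_surfaces(dwelling_type)
    loss = 0.0
    if floor_exposed:
        loss += u_floor * ground_floor_area
    if roof_exposed:
        loss += u_roof * top_floor_area
    return loss


# Illustrative U-values and areas, not taken from any certificate.
mid = floor_roof_heat_loss("Mid-floor flat", 0.5, 50.0, 0.2, 50.0)
top = floor_roof_heat_loss("Top-floor flat", 0.5, 50.0, 0.2, 50.0)
bungalow = floor_roof_heat_loss("Detached bungalow", 0.5, 50.0, 0.2, 50.0)
```

Defaulting unknown positions to both surfaces exposed keeps today's behaviour for unrecognised labels, so the change can only remove heat loss where the label clearly justifies it.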

How to reproduce

python adhoc/sap_calculator/probe_n.py            # 100 certs, seed=7
python adhoc/sap_calculator/probe_n.py 500 13     # bigger sample
python adhoc/sap_calculator/probe_worst.py        # detailed cert-by-cert dump

probe_n.py runs in ~80s. Errors: 0/100. Mapper handles every real cert shape encountered.
