Model/docs/sap-spec/PARITY_FINDINGS.md
Khalim Conn-Kowlessar a1c9d2a14d Record post-P5 parity-probe baseline (2026-05-19)
100-cert probe, seed=7, sap_score window 5..99. MAE 4.29
(vs 8.41 on 2026-05-18 with the older 20..95 window — the
delta blends calculator improvements with sample-window
change, so this is logged as the post-P5 reference, not as
"P5 reduced MAE".)

P5 itself was pure trace exposure; the calculator's SAP
output should be numerically unchanged. The headline finding
from this run is primary-energy over-prediction: PE MAE
44.40 kWh/m², bias +39.66 — now the dominant signal with
SAP residuals halved. Each end-use PE contribution surfaces
on SapResult.intermediate per P5.12, so the next session
can localise the bias without re-instrumenting.
2026-05-19 16:19:01 +00:00


Sap10Calculator parity probe — findings log

100-cert sample from data/ml_training/runs/2025_2026_n250000_v18a/data.parquet. Each dated section is a separate measurement; the calculator and/or sample window evolve between them, so direct deltas are noted explicitly.

P5 baseline — 2026-05-19 (post-P5.14)

Re-run after P5 (SapResult.intermediate trace exposure, 11 slices P5.1 to P5.14). 100 certs, seed=7, sap_score window 5..99 (widened from 20..95 since the 2026-05-18 entry). 0 errors. Elapsed 71s.

| Metric | 2026-05-18 (sap 20..95) | 2026-05-19 P5 (sap 5..99) | Δ |
|---|---|---|---|
| MAE | 8.41 | 4.29 | -4.12 |
| RMSE | 13.98 | 6.83 | -7.15 |
| Bias | -2.65 | -2.15 | +0.50 |
| Within ±1 | 18.0% | 34.0% | +16 pp |
| Within ±3 | 36.0% | 62.0% | +26 pp |
| Within ±5 | 57.0% | 77.0% | +20 pp |
| Within ±10 | 84.0% | 91.0% | +7 pp |
| Worst residual | -56 | -33 | +23 |

Attribution caveat. The sample window changed (5..99 vs 20..95) — both ends were widened, so the delta blends "calculator improved" with "sample distribution shifted". The 2026-05-18 calculator state isn't reproducible from current main (the previous session's intermediate-population work landed before that probe but other intervening changes may have moved numbers too). P5 itself was pure trace exposure with one local refactor (P5.13 lifted the CO2 sum into named locals) — should be numerically neutral on SAP score, so most of the MAE drop is upstream-of-P5, not P5 itself. Treat this as the post-P5 reference baseline, not "P5 reduced MAE by 4 points."
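One way to separate the two effects named in the caveat is to re-score the same run on the window both entries share (20..95). The sketch below shows how the headline metrics could be computed and re-filtered. The function name, the `sample` pairs and the choice of keys are illustrative assumptions, not the contents of probe_n.py; residual follows the log's convention of predicted minus actual.

```python
import math


def parity_metrics(pairs, window=None):
    """Summarise residuals (predicted - actual) for (actual, predicted) SAP pairs.

    `window` is an optional inclusive (lo, hi) filter on the cert's actual score.
    """
    if window is not None:
        lo, hi = window
        pairs = [(a, p) for a, p in pairs if lo <= a <= hi]
    residuals = [p - a for a, p in pairs]
    n = len(residuals)
    if n == 0:
        raise ValueError("no certs left after filtering")
    return {
        "n": n,
        "mae": sum(abs(r) for r in residuals) / n,
        "rmse": math.sqrt(sum(r * r for r in residuals) / n),
        "bias": sum(residuals) / n,
        "within_5": sum(abs(r) <= 5 for r in residuals) / n,
        "worst": max(residuals, key=abs),  # signed residual with largest magnitude
    }


# Hypothetical (actual, predicted) pairs, for illustration only.
sample = [(78, 75), (63, 60), (12, 30), (97, 96), (43, 47)]
full = parity_metrics(sample, window=(5, 99))
common = parity_metrics(sample, window=(20, 95))
print(full["mae"], common["mae"])
```

Reporting both dictionaries side by side would show how much of a delta survives once the sample window is held fixed.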

Primary energy (kWh/m² TFA). PE MAE 44.40, PE bias +39.66 (systematic over-prediction). Cohort mean: ours 231.7 vs cert 192.0 (+40). End-use split (ours): space 168.6, HW 49.6, lighting 10.6, pumps 2.9, PV 0.0. Space heating is the dominant term, and on its own it exceeds the +40 gap to the certs, which suggests the fabric heat-loss + heating-efficiency cascade still over-counts for typical mid-band stock.
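As a quick arithmetic check on the figures in this paragraph, the end-use split can be summed and compared with the cert mean. The dictionary keys below are illustrative labels rather than the actual SapResult.intermediate key names, and the sketch assumes the end uses are simply additive (which the quoted numbers are consistent with).

```python
# Cohort means in kWh/m² TFA, copied from the paragraph above.
end_use_pe = {
    "space": 168.6,
    "hot_water": 49.6,
    "lighting": 10.6,
    "pumps": 2.9,
    "pv": 0.0,
}
cert_mean_pe = 192.0

ours_mean_pe = sum(end_use_pe.values())
gap = ours_mean_pe - cert_mean_pe
# Share of our predicted total attributable to each end use.
shares = {name: value / ours_mean_pe for name, value in end_use_pe.items()}

print(f"ours {ours_mean_pe:.1f}, cert {cert_mean_pe:.1f}, gap {gap:+.1f}")
for name, share in shares.items():
    print(f"{name:<10} {share:6.1%}")
```

The split sums to the quoted 231.7, and space heating accounts for roughly 73% of it, so even a modest relative error in that term is enough to produce the whole gap.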

PE bias stratified. Worst PE bias by main_heating_category: cat 4 (range cookers, n=1) +85; cat 10 (electric storage, n=2) +94; cat 2 (gas boilers, n=88) +42. By age band: the bias shrinks monotonically from older to newer bands (B/C/D ≈ +42 to +57, J/K/L ≈ +18 to +33). By dwelling: end-terrace bungalow +85 (n=2), mid-terrace bungalow +70 (n=3), end-terrace house +53 (n=16), mid-floor flat +7 (n=5). Mid-floor flats no longer dominate the residual; the S-B-flat-surfaces fix appears to have landed since 2026-05-18.
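The stratification above amounts to a grouped mean of (predicted - actual) PE, reported together with the group size. A minimal stdlib sketch follows; the row field names `pe_predicted` and `pe_actual`, the sample rows, and the n < 5 warning threshold are all assumptions made for illustration.

```python
from collections import defaultdict


def bias_by_group(rows, key):
    """Mean (predicted - actual) PE per group, returned as {group: (bias, n)}."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        group = row[key]
        sums[group] += row["pe_predicted"] - row["pe_actual"]
        counts[group] += 1
    return {g: (sums[g] / counts[g], counts[g]) for g in sums}


# Hypothetical rows, not taken from the probe output.
rows = [
    {"main_heating_category": 2, "pe_predicted": 230.0, "pe_actual": 190.0},
    {"main_heating_category": 2, "pe_predicted": 250.0, "pe_actual": 206.0},
    {"main_heating_category": 10, "pe_predicted": 400.0, "pe_actual": 306.0},
]
grouped = bias_by_group(rows, "main_heating_category")
for group, (bias, n) in sorted(grouped.items()):
    # Small groups are flagged so a single cert is not read as a trend.
    flag = "  (n < 5, treat as anecdotal)" if n < 5 else ""
    print(f"cat {group}: bias {bias:+.1f} (n={n}){flag}")
```

Carrying n alongside each bias keeps the n=1 and n=2 groups from being weighed the same as the n=88 gas-boiler group.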

Worst-15 SAP residuals (this run). Dominated by end-terrace houses and mid-terrace bungalows (over-prediction) plus one 32 m² mid-terrace bungalow with main_cat=10 electric-storage where actual=37, predicted=11 (residual -26; the S-B-electric-storage-tariff issue from 2026-05-18 is still open).

Next iteration priorities (P5 vantage):

  1. Primary-energy over-prediction is the dominant signal now (PE bias +40). Was visible in the 2026-05-18 data but masked behind larger SAP residuals; with SAP MAE halved, PE is the clearest target. Hypothesis: too-high space-heating PE from either over-counting fabric heat loss or too-low main-heating efficiency cascade.
  2. Electric storage tariff (S-B-electric-storage-tariff) still open — the worst single residual is on a main_cat=10 cert.
  3. Bungalow over-prediction persists (S-B-bungalow-investigation from 2026-05-18). Not flats-shape related; thermal-bridging y-factor × storey-count interaction worth probing.

P5's actual contribution: none on accuracy; full visibility on diagnosis. Every PE bias number above is now a separate diagnostic on intermediate keys — space_heating_pe_kwh_per_m2, hot_water_pe_kwh_per_m2, etc. — so the next session can localise the over-prediction without re-instrumenting.


2026-05-18 entry (historical, pre-P5 trace)

100-cert random sample, filtered to cert sap-score 20-95 (typical band). 0 errors — calculator runs end-to-end on every cert.

Headline

| Metric | Value |
|---|---|
| MAE | 8.41 SAP-points |
| RMSE | 13.98 |
| Bias | -2.65 (slight under-prediction) |
| Within ±1 | 18.0% |
| Within ±3 | 36.0% |
| Within ±5 | 57.0% |
| Within ±10 | 84.0% |
| Worst residual | -56 SAP-points |

Session B success criterion is MAE ≤ 1.0 on the typical subset; we're 8× that on the first pass, which roughly matches ADR-0009's expectation that the first run shakes out spec-interpretation gaps.

Dominant failure shape: flats and bungalows under-predicted

10 of the 15 worst residuals are flats or bungalows. Pattern: calculator charges floor + roof heat loss to dwellings that don't have exposed floor / roof surfaces (mid-floor flats, top-floor flats with party ceiling, etc.).

Worst 15 (residual = predicted - actual):

| Cert | actual | predicted | residual | TFA (m²) | dwelling |
|---|---|---|---|---|---|
| 0320-2756-7670-2196-2035 | 78 | 22 | -56 | 57 | Semi-detached bungalow |
| 0036-1125-8600-0165-2206 | 63 | 18 | -45 | 42 | Mid-floor flat |
| 0340-2394-5510-2925-4421 | 75 | 35 | -40 | 73 | Mid-floor flat |
| 9360-2179-9590-2495-2615 | 78 | 39 | -39 | 54 | Ground-floor flat |
| 0036-0529-1500-0700-8276 | 75 | 36 | -39 | 47 | Top-floor flat |
| 0350-2182-9590-2526-7841 | 43 | 4 | -39 | 119 | Top-floor flat |
| 2148-3061-6204-0016-7204 | 81 | 44 | -37 | 67 | Mid-floor flat |
| 0800-1364-0922-4522-3963 | 71 | 37 | -34 | 70 | Detached bungalow |
| 2110-6453-5050-8205-9605 | 63 | 31 | -32 | 43 | Ground-floor maisonette |
| 2903-8339-6962-6004-0725 | 75 | 47 | -28 | 11 | Top-floor flat |
| 0320-2850-3380-2125-1661 | 70 | 48 | -22 | 45 | Semi-detached bungalow |
| 8035-9023-1500-0237-3226 | 43 | 63 | +20 | 64 | Detached bungalow |
| 9590-7751-0022-0599-3953 | 51 | 69 | +18 | 74 | Detached house |
| 2118-1198-2619-1711-7960 | 62 | 46 | -16 | 42 | Mid-floor flat |
| 3336-3822-5500-0437-9202 | 70 | 59 | -11 | 73 | Mid-floor maisonette |

Session B iteration backlog (priority order)

  1. S-B-flat-surfaces — Map dwelling_type to exposed floor/roof flags. Mid/top flats lose their u_floor × ground_floor_area; mid/ground flats lose their u_roof × top_floor_area. Expected impact: closes most of the -20 to -56 residuals.
  2. S-B-heating-eff-fallback — When sap_main_heating_code is None, fall back through main_heating_category + age band to a modern-condensing-boiler efficiency, not the legacy 0.80. ~28% of our 100-cert sample had a null code with category=2.
  3. S-B-electric-storage-tariff — Electric storage heaters (codes 401-409) should price space-heating fuel at the Economy-7 low rate (Table 32 code 31, ~5.5 p/kWh), not the standard rate (code 30). This roughly halves the space-heating fuel cost on those certs.
  4. S-B-wall-uvalue-cascade-review — Worst non-flat residuals suggest the wall U-value cascade is too conservative for recently-built / well-insulated stock. Review domain.ml.rdsap_uvalues.u_wall against RdSAP 10 Table 5.
  5. S-B-bungalow-investigation — Bungalow residuals don't fit the flat-surfaces pattern (bungalows have full floor+roof). Hypothesis: thermal-bridging y-factor + storey-count interaction over-counts envelope. Probe specifically before deciding.
  6. S-B-pump-fan-default — We default to 130 kWh/yr; SAP 10.3 Table 4f says higher for systems with mechanical ventilation. Marginal but consistent.
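Backlog item 1 (S-B-flat-surfaces) can be sketched as a small mapping from the dwelling label to exposed-surface flags. This is a minimal illustration under stated assumptions: the function names are hypothetical, the labels are matched as they appear in the worst-15 table, maisonettes are treated like flats (the backlog item only names flats), and unknown positions conservatively keep both surfaces.

```python
def exposed_surfaces(dwelling_type: str) -> tuple[bool, bool]:
    """Return (has_exposed_floor, has_exposed_roof) for a dwelling_type label."""
    label = dwelling_type.lower()
    # Assumption: maisonettes follow the same floor-position rule as flats.
    is_flat = "flat" in label or "maisonette" in label
    if not is_flat:
        return True, True  # houses and bungalows keep both surfaces
    if label.startswith("mid-floor"):
        return False, False
    if label.startswith("top-floor"):
        return False, True
    if label.startswith("ground-floor"):
        return True, False
    return True, True  # unknown position: keep both rather than guess


def floor_roof_heat_loss(dwelling_type, u_floor, ground_floor_area,
                         u_roof, top_floor_area):
    """Floor + roof heat-loss term (W/K), zeroing surfaces that are not exposed."""
    has_floor, has_roof = exposed_surfaces(dwelling_type)
    floor_loss = u_floor * ground_floor_area if has_floor else 0.0
    roof_loss = u_roof * top_floor_area if has_roof else 0.0
    return floor_loss + roof_loss


print(floor_roof_heat_loss("Mid-floor flat", 0.5, 50.0, 0.3, 50.0))
print(floor_roof_heat_loss("Detached house", 0.5, 50.0, 0.3, 50.0))
```

Keeping the flag lookup separate from the heat-loss sum would let the mapping be unit-tested against every dwelling_type value seen in the sample before it touches the calculator.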

RdSAP10 Table 32 Energy Cost Deflator drift (P5 finding, 2026-05-19)

Surfaced while transcribing the SAP 10.2 worksheet into test_bre_worked_examples.py for P5. The calculator uses ENERGY_COST_DEFLATOR = 0.36 (SAP 10.2 Table 12, item (256) on worksheet page 146 of sap-10-2-full-specification-2025-03-14.pdf). The newer RdSAP 10 specification Table 32 (page 95 of rdsap-10-specification-2025-06-10.pdf) states the deflator is 0.42, noting "this table is equivalent to Table 12 in SAP10.2 specification" — i.e., it supersedes the SAP 10.2 value for RdSAP assessments.

Why it matters. The deflator scales the ECF numerator, which drives every SAP rating. Switching 0.36 → 0.42 changes every SAP rating numerically; in the linear regime the rating shifts by 16.21 × ΔECF, in the log regime by 120.5 × Δ(log10 ECF). For a typical dwelling this is roughly 2 to 4 SAP points across the cohort.
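The sensitivity described above can be explored with a short sketch. The slopes 16.21 and 120.5 and the two deflator values come from the text; the regime threshold (3.5) and the log-regime intercept (108.8) are assumptions recalled from the SAP 10.2 rating formula and should be checked against the spec before relying on the numbers. The function names are hypothetical, and ECF is assumed to scale in direct proportion to the deflator.

```python
import math

OLD_DEFLATOR = 0.36  # SAP 10.2 Table 12, per the text above
NEW_DEFLATOR = 0.42  # RdSAP 10 Table 32, per the text above
ECF_THRESHOLD = 3.5    # assumption: switch point between regimes
LOG_INTERCEPT = 108.8  # assumption: intercept of the log regime


def sap_rating(ecf: float) -> float:
    """Unrounded SAP rating from the energy cost factor."""
    if ecf >= ECF_THRESHOLD:
        return LOG_INTERCEPT - 120.5 * math.log10(ecf)
    return 100.0 - 16.21 * ecf


def rating_shift(ecf_at_old_deflator: float) -> float:
    """Rating change when only the deflator changes (ECF scales with it)."""
    ecf_new = ecf_at_old_deflator * NEW_DEFLATOR / OLD_DEFLATOR
    return sap_rating(ecf_new) - sap_rating(ecf_at_old_deflator)


for ecf in (0.8, 1.2, 2.0, 4.0):
    print(f"ECF {ecf:.1f}: rating {sap_rating(ecf):5.1f}, "
          f"shift {rating_shift(ecf):+.2f}")
```

Because 0.42 / 0.36 = 7/6, the linear-regime shift works out to -16.21 × ECF / 6, so it grows with the dwelling's ECF, while the log-regime shift is the constant -120.5 × log10(7/6).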

What it doesn't explain. The session-B residuals above are dominated by per-dwelling shape errors (mid-floor flats, bungalows) measured in tens of SAP points — a uniform 2-4 point shift won't move that needle. The deflator is a calibration call, not a model-shape fix.

Decision needed. The open question is whether the calculator targets SAP 10.2 ratings (keep 0.36, used for full SAP / new-build EPCs) or RdSAP 10 ratings (switch to 0.42, used for existing-dwelling EPCs derived from reduced data). ADR-0010 sets the target as SAP 10.3; the RdSAP 10 publication is more recent than ADR-0010's reference points and may be the operative spec for the cohort we probe against. Not changed in P5; flagged for ADR-level resolution.

How to reproduce

python adhoc/sap_calculator/probe_n.py            # 100 certs, seed=7
python adhoc/sap_calculator/probe_n.py 500 13     # bigger sample
python adhoc/sap_calculator/probe_worst.py        # detailed cert-by-cert dump

probe_n.py runs in ~80s. Errors: 0/100. Mapper handles every real cert shape encountered.