100-cert probe, seed=7, sap_score window 5..99. MAE 4.29 (vs 8.41 on 2026-05-18 with the older 20..95 window — the delta blends calculator improvements with sample-window change, so this is logged as the post-P5 reference, not as "P5 reduced MAE".) P5 itself was pure trace exposure; the calculator's SAP output should be numerically unchanged. The headline finding from this run is primary-energy over-prediction: PE MAE 44.40 kWh/m², bias +39.66 — now the dominant signal with SAP residuals halved. Each end-use PE contribution surfaces on SapResult.intermediate per P5.12, so the next session can localise the bias without re-instrumenting.
Sap10Calculator parity probe — findings log
100-cert sample from data/ml_training/runs/2025_2026_n250000_v18a/data.parquet. Each dated section is a separate measurement; the calculator and/or sample window evolve between them, so direct deltas are noted explicitly.
P5 baseline — 2026-05-19 (post-P5.14)
Re-run after P5 (SapResult.intermediate trace exposure, 11 slices P5.1–P5.14). 100 certs, seed=7, sap_score window 5..99 (widened from 20..95 since the 2026-05-18 entry). 0 errors. Elapsed 71s.
| Metric | 2026-05-18 (sap 20–95) | 2026-05-19 P5 (sap 5–99) | Δ |
|---|---|---|---|
| MAE | 8.41 | 4.29 | −4.12 |
| RMSE | 13.98 | 6.83 | −7.15 |
| Bias | −2.65 | −2.15 | +0.50 |
| Within ±1 | 18.0% | 34.0% | +16 pp |
| Within ±3 | 36.0% | 62.0% | +26 pp |
| Within ±5 | 57.0% | 77.0% | +20 pp |
| Within ±10 | 84.0% | 91.0% | +7 pp |
| Worst residual | −56 | −33 | +23 |
Attribution caveat. The sample window changed (5..99 vs 20..95) — both ends were widened, so the delta blends "calculator improved" with "sample distribution shifted". The 2026-05-18 calculator state isn't reproducible from current main (the previous session's intermediate-population work landed before that probe but other intervening changes may have moved numbers too). P5 itself was pure trace exposure with one local refactor (P5.13 lifted the CO2 sum into named locals) — should be numerically neutral on SAP score, so most of the MAE drop is upstream-of-P5, not P5 itself. Treat this as the post-P5 reference baseline, not "P5 reduced MAE by 4 points."
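For reference, a minimal sketch of how the table's metrics can be derived from per-cert scores. It assumes residual = predicted − actual (the convention used in the worst-15 table) and treats "within ±k" as inclusive; `parity_metrics` is a hypothetical helper, not the probe script's actual code.

```python
import math


def parity_metrics(actual, predicted, tolerances=(1, 3, 5, 10)):
    """Summarise SAP-score residuals (predicted - actual) for one probe run."""
    residuals = [p - a for a, p in zip(actual, predicted)]
    n = len(residuals)
    metrics = {
        "mae": sum(abs(r) for r in residuals) / n,
        "rmse": math.sqrt(sum(r * r for r in residuals) / n),
        "bias": sum(residuals) / n,  # negative means under-prediction
        "worst": max(residuals, key=abs),  # signed residual with largest magnitude
    }
    for tol in tolerances:
        hits = sum(1 for r in residuals if abs(r) <= tol)  # inclusive bound (assumption)
        metrics[f"within_{tol}"] = 100.0 * hits / n
    return metrics


# Three (actual, predicted) pairs borrowed from the 2026-05-18 worst-15 table,
# used here only as sample input.
actual = [78, 43, 51]
predicted = [22, 63, 69]
print(parity_metrics(actual, predicted))
```

Running the same function over both dated samples would make the table deltas mechanical, though it would not remove the sample-window confound described above.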
Primary energy (kWh/m² TFA). PE MAE 44.40, PE bias +39.66 (systematic over-prediction). Cohort mean: ours 231.7 vs cert 192.0 (+40). End-use split (ours): space 168.6, HW 49.6, lighting 10.6, pumps 2.9, PV 0.0. Space heating is the dominant end use: at 168.6 it is more than four times the +40 cohort gap, so it is the most likely home of the over-prediction. This suggests the fabric heat-loss + heating-efficiency cascade still over-counts for typical mid-band stock.
PE bias stratified. Worst PE bias by main_heating_category: cat 10 (electric storage, n=2) +94; cat 4 (range cookers, n=1) +85; cat 2 (gas boilers, n=88) +42. By age band: tightens monotonically from older to newer bands (B/C/D ≈ +42 to +57, J/K/L ≈ +18 to +33). By dwelling: end-terrace bungalow +85 (n=2), mid-terrace bungalow +70 (n=3), end-terrace house +53 (n=16), mid-floor flat +7 (n=5). Mid-floor flats no longer dominate the residual; the S-B-flat-surfaces fix appears to have landed since 2026-05-18.
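A stratified bias like the one above can be sketched as a group-by over per-cert rows. The row layout and the `pe_ours` / `pe_cert` field names are hypothetical, the sample values are made up, and the n < 5 cut-off for flagging thin strata is an arbitrary choice for illustration.

```python
from collections import defaultdict


def bias_by_group(rows, key):
    """Mean PE residual (ours - cert) and sample size for each value of `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row["pe_ours"] - row["pe_cert"])
    summary = {g: (sum(v) / len(v), len(v)) for g, v in groups.items()}
    # Sort so the largest positive bias comes first.
    return dict(sorted(summary.items(), key=lambda kv: kv[1][0], reverse=True))


# Illustrative rows only; not taken from the probe output.
rows = [
    {"main_heating_category": 2, "pe_ours": 230.0, "pe_cert": 190.0},
    {"main_heating_category": 2, "pe_ours": 250.0, "pe_cert": 206.0},
    {"main_heating_category": 10, "pe_ours": 300.0, "pe_cert": 206.0},
]

for group, (bias, n) in bias_by_group(rows, "main_heating_category").items():
    note = "  (small n, anecdotal)" if n < 5 else ""
    print(f"cat {group}: bias {bias:+.1f}, n={n}{note}")
```

Reporting n alongside each bias matters here: several of the worst strata in this run rest on one to three certs.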
Worst-15 SAP residuals (this run). Dominated by end-terrace houses and mid-terrace bungalows (over-prediction) plus one 32 m² mid-terrace bungalow with main_cat=10 electric-storage where actual=37, predicted=11 (−26; the S-B-electric-storage-tariff issue from 2026-05-18 is still open).
Next iteration priorities (P5 vantage):
- Primary-energy over-prediction is the dominant signal now (PE bias +40). Was visible in the 2026-05-18 data but masked behind larger SAP residuals; with SAP MAE halved, PE is the clearest target. Hypothesis: too-high space-heating PE from either over-counting fabric heat loss or too-low main-heating efficiency cascade.
- Electric storage tariff (S-B-electric-storage-tariff) still open — the worst single residual is on a main_cat=10 cert.
- Bungalow over-prediction persists (S-B-bungalow-investigation from 2026-05-18). Not flats-shape related; thermal-bridging y-factor × storey-count interaction worth probing.
P5's actual contribution: none on accuracy; full visibility on diagnosis. Every PE bias number above is now a separate diagnostic on intermediate keys — space_heating_pe_kwh_per_m2, hot_water_pe_kwh_per_m2, etc. — so the next session can localise the over-prediction without re-instrumenting.
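As a rough illustration of that localisation step, the sketch below averages end-use PE keys across a cohort. It assumes SapResult.intermediate behaves like a mapping from key name to float, which is not confirmed here; plain dicts stand in for it, only the two key names quoted above are used, and the values are invented.

```python
def mean_end_use_pe(intermediates, keys):
    """Cohort mean of each end-use PE key across per-cert trace mappings."""
    n = len(intermediates)
    return {k: sum(trace[k] for trace in intermediates) / n for k in keys}


# The two key names given in this log; any further end-use keys would be added here.
KEYS = ("space_heating_pe_kwh_per_m2", "hot_water_pe_kwh_per_m2")

# Plain dicts standing in for SapResult.intermediate (assumed mapping-like).
traces = [
    {"space_heating_pe_kwh_per_m2": 150.0, "hot_water_pe_kwh_per_m2": 50.0},
    {"space_heating_pe_kwh_per_m2": 190.0, "hot_water_pe_kwh_per_m2": 48.0},
]

means = mean_end_use_pe(traces, KEYS)
largest = max(means, key=means.get)  # end use contributing the most PE
print(means, largest)
```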
2026-05-18 entry (historical, pre-P5 trace)
100-cert random sample, filtered to cert sap-score 20-95 (typical band). 0 errors — calculator runs end-to-end on every cert.
Headline
| Metric | Value |
|---|---|
| MAE | 8.41 SAP-points |
| RMSE | 13.98 |
| Bias | -2.65 (slight under-prediction) |
| Within ±1 | 18.0% |
| Within ±3 | 36.0% |
| Within ±5 | 57.0% |
| Within ±10 | 84.0% |
| Worst residual | -56 SAP-points |
Session B success criterion is MAE ≤ 1.0 on the typical subset; we're 8× that on the first pass, which roughly matches ADR-0009's expectation that the first run shakes out spec-interpretation gaps.
Dominant failure shape: flats and bungalows under-predicted
10 of the 15 worst residuals are flats or bungalows. Pattern: calculator charges floor + roof heat loss to dwellings that don't have exposed floor / roof surfaces (mid-floor flats, top-floor flats with party ceiling, etc.).
Worst 15 (residual = predicted − actual):
| Cert | actual | predicted | residual | TFA | dwelling |
|---|---|---|---|---|---|
| 0320-2756-7670-2196-2035 | 78 | 22 | -56 | 57 | Semi-detached bungalow |
| 0036-1125-8600-0165-2206 | 63 | 18 | -45 | 42 | Mid-floor flat |
| 0340-2394-5510-2925-4421 | 75 | 35 | -40 | 73 | Mid-floor flat |
| 9360-2179-9590-2495-2615 | 78 | 39 | -39 | 54 | Ground-floor flat |
| 0036-0529-1500-0700-8276 | 75 | 36 | -39 | 47 | Top-floor flat |
| 0350-2182-9590-2526-7841 | 43 | 4 | -39 | 119 | Top-floor flat |
| 2148-3061-6204-0016-7204 | 81 | 44 | -37 | 67 | Mid-floor flat |
| 0800-1364-0922-4522-3963 | 71 | 37 | -34 | 70 | Detached bungalow |
| 2110-6453-5050-8205-9605 | 63 | 31 | -32 | 43 | Ground-floor maisonette |
| 2903-8339-6962-6004-0725 | 75 | 47 | -28 | 11 | Top-floor flat |
| 0320-2850-3380-2125-1661 | 70 | 48 | -22 | 45 | Semi-detached bungalow |
| 8035-9023-1500-0237-3226 | 43 | 63 | +20 | 64 | Detached bungalow |
| 9590-7751-0022-0599-3953 | 51 | 69 | +18 | 74 | Detached house |
| 2118-1198-2619-1711-7960 | 62 | 46 | -16 | 42 | Mid-floor flat |
| 3336-3822-5500-0437-9202 | 70 | 59 | -11 | 73 | Mid-floor maisonette |
Session B iteration backlog (priority order)
- S-B-flat-surfaces: Map dwelling_type to exposed floor/roof flags. Mid/top flats lose their u_floor × ground_floor_area; mid/ground flats lose their u_roof × top_floor_area. Expected impact: closes most of the −20 to −56 residuals.
- S-B-heating-eff-fallback: When sap_main_heating_code is None, fall back through main_heating_category + age band to a modern-condensing-boiler efficiency, not the legacy 0.80. ~28% of our 100-cert sample had a null code with category=2.
- S-B-electric-storage-tariff: Electric storage heaters (codes 401-409) should price space-heating fuel at Economy-7 low rate (Table 32 code 31, ~5.5 p/kWh), not standard rate 30. This is a 2× cost reduction on those certs.
- S-B-wall-uvalue-cascade-review: Worst non-flat residuals suggest the wall U-value cascade is too conservative for recently-built / well-insulated stock. Review domain.ml.rdsap_uvalues.u_wall against RdSAP 10 Table 5.
- S-B-bungalow-investigation: Bungalow residuals don't fit the flat-surfaces pattern (bungalows have full floor+roof). Hypothesis: thermal-bridging y-factor + storey-count interaction over-counts envelope. Probe specifically before deciding.
- S-B-pump-fan-default: We default to 130 kWh/yr; SAP 10.3 Table 4f says higher for systems with mechanical ventilation. Marginal but consistent.
RdSAP 10 Table 32 Energy Cost Deflator drift (P5 finding, 2026-05-19)
Surfaced while transcribing the SAP 10.2 worksheet into test_bre_worked_examples.py for P5. The calculator uses ENERGY_COST_DEFLATOR = 0.36 (SAP 10.2 Table 12, item (256) on worksheet page 146 of sap-10-2-full-specification-2025-03-14.pdf). The newer RdSAP 10 specification Table 32 (page 95 of rdsap-10-specification-2025-06-10.pdf) states the deflator is 0.42, noting "this table is equivalent to Table 12 in SAP10.2 specification" — i.e., it supersedes the SAP 10.2 value for RdSAP assessments.
Why it matters. The deflator scales the ECF numerator, which drives every SAP rating. Switching 0.36 → 0.42 changes every SAP rating numerically; in the linear regime the rating shifts by −16.21 × ΔECF, in the log regime by −120.5 × Δ(log10 ECF). For a typical dwelling this is roughly −2 to −4 SAP points across the cohort.
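The size of the shift depends on where a dwelling sits on the ECF scale, which the sketch below makes explicit. The slopes (16.21 and 120.5) and the two deflators come from the text above; the intercepts (100 and 108.8) and the ECF ≥ 3.5 switch point are recalled from the SAP 10.2 rating equations and should be checked against the spec. Ratings are left unrounded, and `sap_rating` / `rating_shift` are hypothetical names.

```python
import math

OLD_DEFLATOR = 0.36  # SAP 10.2 Table 12 (from the text)
NEW_DEFLATOR = 0.42  # RdSAP 10 Table 32 (from the text)


def sap_rating(ecf):
    """Unrounded SAP rating from the energy cost factor.

    Intercepts and the 3.5 threshold are assumptions to verify against the spec.
    """
    if ecf >= 3.5:
        return 108.8 - 120.5 * math.log10(ecf)  # log regime
    return 100.0 - 16.21 * ecf                  # linear regime


def rating_shift(ecf_at_old_deflator):
    """Rating change when only the deflator moves from 0.36 to 0.42."""
    ecf_new = ecf_at_old_deflator * NEW_DEFLATOR / OLD_DEFLATOR
    return sap_rating(ecf_new) - sap_rating(ecf_at_old_deflator)


for ecf in (1.0, 2.0, 4.0):
    print(f"ECF {ecf:.1f}: shift {rating_shift(ecf):+.2f} SAP points")
```

Because ECF is multiplied by 0.42/0.36, the linear-regime shift grows with ECF while the log-regime shift is the same for every dwelling; a dwelling near the switch point crosses regimes, which the function handles by evaluating each rating separately.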
What it doesn't explain. The session-B residuals above are dominated by per-dwelling shape errors (mid-floor flats, bungalows) measured in tens of SAP points — a uniform 2-4 point shift won't move that needle. The deflator is a calibration call, not a model-shape fix.
Decision needed. Should the calculator target SAP 10.2 ratings (keep 0.36, used for full SAP / new-build EPCs) or RdSAP 10 ratings (switch to 0.42, used for existing-dwelling EPCs derived from reduced data)? ADR-0010 sets the target as SAP 10.3; the RdSAP 10 publication is more recent than ADR-0010's reference points and may be the operative spec for the cohort we probe against. Not changed in P5; flagged for ADR-level resolution.
How to reproduce
python adhoc/sap_calculator/probe_n.py # 100 certs, seed=7
python adhoc/sap_calculator/probe_n.py 500 13 # bigger sample
python adhoc/sap_calculator/probe_worst.py # detailed cert-by-cert dump
probe_n.py runs in ~80s. Errors: 0/100. Mapper handles every real cert shape encountered.