Record post-P5 parity-probe baseline (2026-05-19)

100-cert probe, seed=7, sap_score window 5..99. MAE 4.29 (vs 8.41 on 2026-05-18 with the older 20..95 window — the delta blends calculator improvements with sample-window change, so this is logged as the post-P5 reference, not as "P5 reduced MAE".) P5 itself was pure trace exposure; the calculator's SAP output should be numerically unchanged. The headline finding from this run is primary-energy over-prediction: PE MAE 44.40 kWh/m², bias +39.66 — now the dominant signal with SAP residuals halved. Each end-use PE contribution surfaces on SapResult.intermediate per P5.12, so the next session can localise the bias without re-instrumenting.
2026-08-02 21:08:24 +00:00 · 2026-05-19 16:19:01 +00:00 · 2026-05-19 16:19:01 +00:00 · a1c9d2a14d
commit a1c9d2a14d
parent 411c477d09
1 changed file with 38 additions and 2 deletions
--- a/docs/sap-spec/PARITY_FINDINGS.md
+++ b/docs/sap-spec/PARITY_FINDINGS.md
@@ -1,6 +1,42 @@
-# Sap10Calculator parity probe — findings as of 2026-05-18
+# Sap10Calculator parity probe — findings log

-100-cert random sample from `data/ml_training/runs/2025_2026_n250000_v18a/data.parquet`, filtered to cert sap-score 20-95 (typical band). 0 errors — calculator runs end-to-end on every cert.
+100-cert sample from `data/ml_training/runs/2025_2026_n250000_v18a/data.parquet`. Each dated section is a separate measurement; the calculator and/or sample window evolve between them, so direct deltas are noted explicitly.
+
+## P5 baseline — 2026-05-19 (post-P5.14)
+
+Re-run after P5 (`SapResult.intermediate` trace exposure, 11 slices P5.1–P5.14). 100 certs, seed=7, `sap_score` window **5..99** (widened from 20..95 since the 2026-05-18 entry). 0 errors. Elapsed 71s.
+
+| Metric | 2026-05-18 (sap 20–95) | 2026-05-19 P5 (sap 5–99) | Δ |
+|---|---|---|---|
+| MAE | 8.41 | **4.29** | −4.12 |
+| RMSE | 13.98 | **6.83** | −7.15 |
+| Bias | −2.65 | **−2.15** | +0.50 |
+| Within ±1 | 18.0% | **34.0%** | +16 pp |
+| Within ±3 | 36.0% | **62.0%** | +26 pp |
+| Within ±5 | 57.0% | **77.0%** | +20 pp |
+| Within ±10 | 84.0% | **91.0%** | +7 pp |
+| Worst residual | −56 | **−33** | +23 |
+
+**Attribution caveat.** The sample window changed (5..99 vs 20..95) — both ends were widened, so the delta blends "calculator improved" with "sample distribution shifted". The 2026-05-18 calculator state isn't reproducible from current main (the previous session's intermediate-population work landed before that probe but other intervening changes may have moved numbers too). P5 itself was pure trace exposure with one local refactor (P5.13 lifted the CO2 sum into named locals) — should be **numerically neutral on SAP score**, so most of the MAE drop is upstream-of-P5, not P5 itself. Treat this as the **post-P5 reference baseline**, not "P5 reduced MAE by 4 points."
+
+**Primary energy (kWh/m² TFA).** PE MAE **44.40**, PE bias **+39.66** (systematic over-prediction). Cohort mean: ours 231.7 vs cert 192.0 (+40). End-use split (ours): space 168.6, HW 49.6, lighting 10.6, pumps 2.9, PV 0.0. Space heating is the largest end-use term (168.6 of 231.7, about 73%) and is about four times the +40 cohort delta, suggesting the **fabric heat-loss + heating-efficiency cascade still over-counts** for typical mid-band stock.
+
+**PE bias stratified.** Worst PE bias by main_heating_category: cat 10 (electric storage, n=2) +94; cat 4 (range cookers, n=1) +85; cat 2 (gas boilers, n=88) +42. By age band: tightens monotonically older→newer (B/C/D ≈ +42 to +57, J/K/L ≈ +18 to +33). By dwelling: end-terrace bungalow +85 (n=2), mid-terrace bungalow +70 (n=3), end-terrace house +53 (n=16), mid-floor flat +7 (n=5). Mid-floor flats no longer dominate the residual; the S-B-flat-surfaces fix appears to have landed since 2026-05-18.
+
+**Worst-15 SAP residuals (this run).** Dominated by end-terrace houses and mid-terrace bungalows (over-prediction) plus one 32 m² mid-terrace bungalow with main_cat=10 electric-storage where actual=37, predicted=11 (−26; the S-B-electric-storage-tariff issue from 2026-05-18 is still open).
+
+**Next iteration priorities (P5 vantage):**
+1. **Primary-energy over-prediction** is now the dominant signal (PE bias +40). It was visible in the 2026-05-18 data but masked by larger SAP residuals; with SAP MAE halved, PE is the clearest target. Hypothesis: space-heating PE is too high because fabric heat loss is over-counted or the main-heating efficiency cascade comes out too low.
+2. **Electric storage tariff (S-B-electric-storage-tariff)** still open: one of the worst SAP residuals (−26) is on a main_cat=10 cert.
+3. **Bungalow over-prediction** persists (S-B-bungalow-investigation from 2026-05-18). Not flats-shape related; thermal-bridging y-factor × storey-count interaction worth probing.
+
+P5's actual contribution: **none on accuracy; full visibility on diagnosis.** Each end-use PE contribution above is now exposed under its own `intermediate` key (`space_heating_pe_kwh_per_m2`, `hot_water_pe_kwh_per_m2`, etc.), so the next session can localise the over-prediction without re-instrumenting.
+
+---
+
+## 2026-05-18 entry (historical, pre-P5 trace)
+
+100-cert random sample, filtered to cert sap-score 20-95 (typical band). 0 errors — calculator runs end-to-end on every cert.

 ## Headline