100-cert probe, seed=7, sap_score window 5..99. MAE 4.29
(vs 8.41 on 2026-05-18 with the older 20..95 window — the
delta blends calculator improvements with sample-window
change, so this is logged as the post-P5 reference, not as
"P5 reduced MAE".)
P5 itself was pure trace exposure; the calculator's SAP
output should be numerically unchanged. The headline finding
from this run is primary-energy over-prediction: PE MAE
44.40 kWh/m², bias +39.66. With SAP residuals halved, this is
now the dominant signal. Each end-use PE contribution surfaces
on SapResult.intermediate per P5.12, so the next session
can localise the bias without re-instrumenting.
Closes the second half of P5 (HANDOVER_SYSTEMATIC_REVIEW §2.5):
- Adds test_bre_worked_examples.py — one comprehensive test that
locks every published SapResult.intermediate key against its
SAP 10.2 worksheet item number ((4) TFA, (33) fabric heat loss,
(39) HTC, (40) HLP, (73) gains, (93) mean internal temp, (98c)
space heating, (240e/247/250) costs, (252) PV credit, (256)
deflator, (257) ECF, (261-272) per-end-use CO2, (275-287)
primary energy per m²). All expected values are derived
independently from worksheet pages 131-148; the test passes
against the synthetic 100 m² baseline.
- Explicit caveat in module docstring: BRE-published worked
examples don't exist in any of the three SAP-spec PDFs we have
(rdSAP10, SAP10.2, SAP10.3; all grepped). The test is
spec-formula-derived, not BRE-validated. Structure stays if
BRE numbers surface later; only expected values change.
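
A minimal sketch of the key-to-worksheet-item locking described above, not
the actual test. The key names (tfa_m2, htc_w_per_k, hlp_w_per_m2k), the
sample numbers and the plain dict standing in for SapResult.intermediate are
hypothetical; only the relation (40) = (39) / (4) comes from the worksheet.

```python
import math

# Hypothetical mapping: SAP 10.2 worksheet item number -> intermediate key.
WORKSHEET_KEYS = {
    "(4)": "tfa_m2",
    "(39)": "htc_w_per_k",
    "(40)": "hlp_w_per_m2k",
}


def check_intermediate(intermediate: dict) -> list:
    """Return a list of problems; an empty list means every locked item holds."""
    problems = [
        f"{item}: missing key {key!r}"
        for item, key in WORKSHEET_KEYS.items()
        if key not in intermediate
    ]
    if problems:
        return problems

    # Worksheet relation: heat loss parameter (40) = HTC (39) / TFA (4).
    expected_hlp = intermediate["htc_w_per_k"] / intermediate["tfa_m2"]
    if not math.isclose(intermediate["hlp_w_per_m2k"], expected_hlp, rel_tol=1e-9):
        problems.append(
            f"(40): got {intermediate['hlp_w_per_m2k']}, expected {expected_hlp}"
        )
    return problems


# Hypothetical 100 m² dwelling with an HTC of 150 W/K.
baseline = {"tfa_m2": 100.0, "htc_w_per_k": 150.0, "hlp_w_per_m2k": 1.5}
baseline_problems = check_intermediate(baseline)
```

If BRE-published numbers surface later, only the expected values in such a
check would change; the item-to-key mapping stays.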
Also surfaces and documents an RdSAP10 spec drift in
PARITY_FINDINGS.md: Table 32 (page 95 of rdSAP10) gives
Energy Cost Deflator = 0.42, vs the code's 0.36 (SAP10.2 Table 12,
worksheet item (256)). Not changed in P5 — needs ADR-level
resolution on whether the calculator targets SAP10.2 (0.36) or
RdSAP10 (0.42) ratings.
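
To show why the deflator choice matters, here is a small sketch assuming
the usual worksheet form ECF (257) = (255) x (256) / ((4) + 45). The function
name and the cost and floor-area figures are hypothetical; only the two
deflator values come from the finding above.

```python
def energy_cost_factor(total_cost: float, tfa_m2: float, deflator: float) -> float:
    """ECF, assuming worksheet (257) = (255) x (256) / ((4) + 45)."""
    return deflator * total_cost / (tfa_m2 + 45.0)


# Hypothetical dwelling: total energy cost 900 per year, 100 m² floor area.
cost, tfa = 900.0, 100.0

ecf_sap102 = energy_cost_factor(cost, tfa, deflator=0.36)  # SAP10.2 Table 12
ecf_rdsap10 = energy_cost_factor(cost, tfa, deflator=0.42)  # RdSAP10 Table 32

# ECF is linear in the deflator, so the ratio is 0.42 / 0.36 for any dwelling.
ratio = ecf_rdsap10 / ecf_sap102
```

Under that assumption the RdSAP10 value raises every ECF by about 16.7%,
so the choice shifts all ratings rather than a few edge cases.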
P5 (SapResult.intermediate population + BRE worked-example
fixtures) is now complete on this branch.
Adds services/ml_training_data/src/ml_training_data/sap_parity_probe.py
— samples N certs from the v18a corpus, streams them via BulkZipReader,
runs Sap10Calculator, prints MAE/RMSE/bias + worst-N residuals. Baseline
across 100 certs: MAE 8.41, RMSE 13.98, bias -2.65, 0 errors.
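
For reference, a minimal sketch of the summary statistics the probe prints.
This is not the probe's code: residual_summary and the sample tuples are
hypothetical, and the sign convention (residual = predicted minus lodged, so
a negative bias means under-prediction) is an assumption.

```python
import math


def residual_summary(pairs, worst_n=3):
    """Summarise (cert_id, predicted, lodged) tuples as MAE/RMSE/bias + worst-N."""
    # Assumed convention: residual = predicted - lodged.
    rows = [(cert_id, predicted - lodged) for cert_id, predicted, lodged in pairs]
    if not rows:
        raise ValueError("no certificates to summarise")

    n = len(rows)
    residuals = [r for _, r in rows]
    # Rank by absolute residual, largest first.
    worst = sorted(rows, key=lambda row: abs(row[1]), reverse=True)[:worst_n]
    return {
        "mae": sum(abs(r) for r in residuals) / n,
        "rmse": math.sqrt(sum(r * r for r in residuals) / n),
        "bias": sum(residuals) / n,
        "worst": worst,
    }


# Hypothetical sample: three certificates with made-up scores.
summary = residual_summary([("a", 60, 70), ("b", 80, 78), ("c", 50, 50)])
```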
docs/sap-spec/PARITY_FINDINGS.md captures the dominant failure pattern
(flats and bungalows are under-predicted; 10 of the worst 15 are flats
whose floor/roof are party with neighbouring dwellings) and the
priority-ordered Session B iteration backlog (S-B-flat-surfaces first).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>