S-B2: parity probe + first-pass findings (100-cert baseline)

Adds services/ml_training_data/src/ml_training_data/sap_parity_probe.py — samples N certs from the v18a corpus, streams them via BulkZipReader, runs Sap10Calculator, prints MAE/RMSE/bias + worst-N residuals. Baseline across 100 certs: MAE 8.41, RMSE 13.98, bias -2.65, 0 errors. docs/sap-spec/PARITY_FINDINGS.md captures the dominant failure pattern (flats + bungalows under-predicted, 10 of the worst-15 are flats whose floor/roof are party with neighbouring dwellings) and the priority- ordered Session B iteration backlog (S-B-flat-surfaces first). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-07-27 23:35:01 +00:00 · 2026-05-18 13:59:23 +00:00 · 2026-05-18 13:59:23 +00:00 · dde8ae30fa
commit dde8ae30fa
parent 57f18a8773
2 changed files with 172 additions and 0 deletions
--- a/docs/sap-spec/PARITY_FINDINGS.md
+++ b/docs/sap-spec/PARITY_FINDINGS.md
@ -0,0 +1,61 @@
+# Sap10Calculator parity probe — findings as of 2026-05-18
+
+100-cert random sample from `data/ml_training/runs/2025_2026_n250000_v18a/data.parquet`, filtered to cert sap-score 20-95 (typical band). 0 errors — calculator runs end-to-end on every cert.
+
+## Headline
+
+| Metric | Value |
+|---|---|
+| MAE | 8.41 SAP-points |
+| RMSE | 13.98 |
+| Bias | -2.65 (slight under-prediction) |
+| Within ±1 | 18.0% |
+| Within ±3 | 36.0% |
+| Within ±5 | 57.0% |
+| Within ±10 | 84.0% |
+| Worst residual | -56 SAP-points |
+
+Session B success criterion is MAE ≤ 1.0 on the typical subset; we're 8× that on the first pass, which roughly matches ADR-0009's expectation that the first run shakes out spec-interpretation gaps.
+
+## Dominant failure shape: flats and bungalows under-predicted
+
+10 of the 15 worst residuals are flats or bungalows. **Pattern**: calculator charges floor + roof heat loss to dwellings that don't have exposed floor / roof surfaces (mid-floor flats, top-floor flats with party ceiling, etc.).
+
+Worst 15 (residual = predicted − actual):
+
+| Cert | actual | predicted | residual | TFA | dwelling |
+|---|---|---|---|---|---|
+| 0320-2756-7670-2196-2035 | 78 | 22 | -56 | 57 | Semi-detached bungalow |
+| 0036-1125-8600-0165-2206 | 63 | 18 | -45 | 42 | Mid-floor flat |
+| 0340-2394-5510-2925-4421 | 75 | 35 | -40 | 73 | Mid-floor flat |
+| 9360-2179-9590-2495-2615 | 78 | 39 | -39 | 54 | Ground-floor flat |
+| 0036-0529-1500-0700-8276 | 75 | 36 | -39 | 47 | Top-floor flat |
+| 0350-2182-9590-2526-7841 | 43 | 4 | -39 | 119 | Top-floor flat |
+| 2148-3061-6204-0016-7204 | 81 | 44 | -37 | 67 | Mid-floor flat |
+| 0800-1364-0922-4522-3963 | 71 | 37 | -34 | 70 | Detached bungalow |
+| 2110-6453-5050-8205-9605 | 63 | 31 | -32 | 43 | Ground-floor maisonette |
+| 2903-8339-6962-6004-0725 | 75 | 47 | -28 | 11 | Top-floor flat |
+| 0320-2850-3380-2125-1661 | 70 | 48 | -22 | 45 | Semi-detached bungalow |
+| 8035-9023-1500-0237-3226 | 43 | 63 | +20 | 64 | Detached bungalow |
+| 9590-7751-0022-0599-3953 | 51 | 69 | +18 | 74 | Detached house |
+| 2118-1198-2619-1711-7960 | 62 | 46 | -16 | 42 | Mid-floor flat |
+| 3336-3822-5500-0437-9202 | 70 | 59 | -11 | 73 | Mid-floor maisonette |
+
+## Session B iteration backlog (priority order)
+
+1. **S-B-flat-surfaces** — Map `dwelling_type` to exposed floor/roof flags. Mid/top flats lose their `u_floor × ground_floor_area`; mid/ground flats lose their `u_roof × top_floor_area`. Expected impact: closes most of the −20 to −56 residuals.
+2. **S-B-heating-eff-fallback** — When `sap_main_heating_code` is None, fall back through `main_heating_category` + age band to a modern-condensing-boiler efficiency, not the legacy 0.80. ~28% of our 100-cert sample had a null code with category=2.
+3. **S-B-electric-storage-tariff** — Electric storage heaters (codes 401-409) should price space-heating fuel at Economy-7 low rate (Table 32 code 31, ~5.5 p/kWh), not standard rate 30. This is a 2× cost reduction on those certs.
+4. **S-B-wall-uvalue-cascade-review** — Worst non-flat residuals suggest the wall U-value cascade is too conservative for recently-built / well-insulated stock. Review `domain.ml.rdsap_uvalues.u_wall` against RdSAP 10 Table 5.
+5. **S-B-bungalow-investigation** — Bungalow residuals don't fit the flat-surfaces pattern (bungalows have full floor+roof). Hypothesis: thermal-bridging y-factor + storey-count interaction over-counts envelope. Probe specifically before deciding.
+6. **S-B-pump-fan-default** — We default to 130 kWh/yr; SAP 10.3 Table 4f says higher for systems with mechanical ventilation. Marginal but consistent.
+
+## How to reproduce
+
+```bash
+python adhoc/sap_calculator/probe_n.py            # 100 certs, seed=7
+python adhoc/sap_calculator/probe_n.py 500 13     # bigger sample
+python adhoc/sap_calculator/probe_worst.py        # detailed cert-by-cert dump
+```
+
+`probe_n.py` runs in ~80s. Errors: 0/100. Mapper handles every real cert shape encountered.
--- a/services/ml_training_data/src/ml_training_data/sap_parity_probe.py
+++ b/services/ml_training_data/src/ml_training_data/sap_parity_probe.py
@ -0,0 +1,111 @@
+"""Sap10Calculator parity probe over N random certs from the corpus.
+
+ADR-0009 Session B exploratory tool. Loads the v18a parquet, samples N
+certs from the typical sap-score range, streams them from the bulk JSON
+ZIPs, runs the calculator, and prints the residual distribution +
+worst-N residuals for spec-iteration triage.
+
+Usage (from repo root, with the workspace venv active):
+    python -m ml_training_data.sap_parity_probe              # N=100, seed=7
+    python -m ml_training_data.sap_parity_probe 500 13       # custom N + seed
+
+Findings get written up in docs/sap-spec/PARITY_FINDINGS.md.
+"""
+from __future__ import annotations
+
+import json
+import sys
+import time
+from pathlib import Path
+from typing import Any, cast
+
+import pandas as pd
+
+from datatypes.epc.domain.mapper import EpcPropertyDataMapper
+from domain.sap.calculator import Sap10Calculator
+from ml_training_data.bulk_zip_reader import BulkZipReader
+from ml_training_data.storage import LocalStorage
+
+
+_REPO = Path(__file__).resolve().parents[4]
+_PARQUET = _REPO / "data" / "ml_training" / "runs" / "2025_2026_n250000_v18a" / "data.parquet"
+_BULK = _REPO / "data" / "ml_training" / "bulk"
+_ZIP_KEYS = ("certificates-2025.json.zip", "certificates-2026.json.zip")
+
+
+def _sample_certs(n: int, seed: int) -> dict[str, int]:
+    df = pd.read_parquet(_PARQUET, columns=["certificate_number", "sap_score"])
+    df = df[df["sap_score"].between(20, 95)]
+    s = df.sample(n, random_state=seed)
+    return dict(zip(s["certificate_number"], s["sap_score"].astype(int)))
+
+
+def main(argv: list[str] | None = None) -> None:
+    args = argv if argv is not None else sys.argv[1:]
+    n = int(args[0]) if args else 100
+    seed = int(args[1]) if len(args) > 1 else 7
+
+    targets = _sample_certs(n, seed)
+    print(f"Sampling {len(targets)} certs (seed={seed}) ...")
+    storage = LocalStorage(_BULK)
+    calc = Sap10Calculator()
+    results: list[dict[str, Any]] = []
+    errors: list[dict[str, Any]] = []
+    remaining = set(targets)
+    t0 = time.monotonic()
+    for zip_key in _ZIP_KEYS:
+        if not remaining:
+            break
+        if not storage.exists(zip_key):
+            print(f"!! missing {zip_key}", file=sys.stderr)
+            continue
+        reader = BulkZipReader(storage, zip_key)
+        for cert in reader.iter_certificates_filtered(remaining):
+            cn = cert["certificate_number"]
+            actual = targets[cn]
+            doc_field = cert.get("document")
+            document = cast(
+                dict[str, Any],
+                json.loads(doc_field) if isinstance(doc_field, str) else doc_field,
+            )
+            try:
+                epc = EpcPropertyDataMapper.from_api_response(document)
+                result = calc.calculate(epc)
+                results.append({
+                    "cert": cn,
+                    "actual": actual,
+                    "predicted": result.sap_score,
+                    "residual": result.sap_score - actual,
+                    "ecf": round(result.ecf, 3),
+                    "tfa": epc.total_floor_area_m2,
+                    "ext": epc.extensions_count,
+                    "dwelling": epc.dwelling_type,
+                })
+            except Exception as e:  # noqa: BLE001 — exploratory probe
+                errors.append({"cert": cn, "actual": actual, "error": f"{type(e).__name__}: {e}"})
+            remaining.discard(cn)
+    elapsed = time.monotonic() - t0
+    df = pd.DataFrame(results)
+    print(f"\nelapsed {elapsed:.1f}s; calculated={len(results)}, errored={len(errors)}, not_found={len(remaining)}")
+    if not df.empty:
+        df["abs_resid"] = df["residual"].abs()
+        print(f"\nMAE: {df['residual'].abs().mean():.2f}")
+        print(f"RMSE: {((df['residual'] ** 2).mean()) ** 0.5:.2f}")
+        print(f"bias: {df['residual'].mean():.2f}")
+        for thr in (1, 3, 5, 10):
+            pct = (df["abs_resid"] <= thr).mean() * 100
+            print(f"within ±{thr}: {pct:.1f}%")
+        print("\nresidual distribution:")
+        print(df["residual"].describe(percentiles=[0.05, 0.1, 0.25, 0.5, 0.75, 0.9, 0.95]))
+        print("\nworst 15 by |residual|:")
+        print(df.nlargest(15, "abs_resid")[
+            ["cert", "actual", "predicted", "residual", "ecf", "tfa", "ext", "dwelling"]
+        ].to_string(index=False))
+    if errors:
+        print("\nerrors:")
+        for e in errors[:10]:
+            print(" ", e)
+
+
+if __name__ == "__main__":
+    main()