From 136f149d4603c06f3bac34f625957582f7340d27 Mon Sep 17 00:00:00 2001
From: Khalim Conn-Kowlessar <kconnkowlessar@gmail.com>
Date: Mon, 18 May 2026 20:38:22 +0000
Subject: [PATCH] tooling: widen parity probe sap_score range to (5, 99)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The previous inclusive bound (20, 95) dropped full-SAP new-builds scoring
above 95 (sap_score 90+ certs carry the dramatic wall U-value gap) and
deepest-tail heritage certs (sap_score < 20). Widen it so the sample
reflects the populations where the calculator's biggest spec gaps live.
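
For reference, a minimal sketch of what the bound change does at the
edges. The scores below are toy values, not real certificates; the only
behaviour relied on is that pandas `Series.between` is inclusive on both
ends by default.

```python
import pandas as pd

# Toy sap_score values chosen to sit on and around both sets of bounds.
scores = pd.Series([3, 5, 19, 20, 50, 95, 96, 99, 100], name="sap_score")

# Series.between(left, right) keeps left <= x <= right by default.
old_kept = scores[scores.between(20, 95)].tolist()
new_kept = scores[scores.between(5, 99)].tolist()

# Scores that only the widened range admits.
newly_included = sorted(set(new_kept) - set(old_kept))

print("old:", old_kept)
print("new:", new_kept)
print("newly included:", newly_included)
```

Scores of exactly 20 and 95 were already sampled under the old bound;
the widening admits 5 to 19 and 96 to 99.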

New baseline at 300 certs, seed=7:
  SAP MAE 5.34 → 4.59 (-0.75)
  PE MAE  48.99 → 46.78 (-2.21)
  PE bias 42.07 → 41.78 (-0.29)
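
A rough sketch of how MAE and bias figures like these could be derived,
assuming bias means the mean signed error (predicted minus lodged). The
helper name `parity_metrics`, the column names, and the toy values are
hypothetical and are not taken from sap_parity_probe.py.

```python
import pandas as pd


def parity_metrics(predicted: pd.Series, lodged: pd.Series) -> dict[str, float]:
    """Return mean absolute error and mean signed error (bias)."""
    err = predicted - lodged  # assumed sign convention: positive = over-prediction
    return {"mae": float(err.abs().mean()), "bias": float(err.mean())}


# Toy data only; real inputs would be calculator output vs lodged values.
toy = pd.DataFrame({"lodged": [60, 72, 85], "predicted": [63, 70, 91]})
metrics = parity_metrics(toy["predicted"], toy["lodged"])
print(metrics)
```

Under that convention, a bias close to the MAE (as with PE here) would
indicate errors that are mostly in one direction.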

Note: only ~0.7% of certs in the v18a parquet have age_band=None, while
15% of certs in the raw bulk zip are full-SAP "Average thermal
transmittance" certs. Something upstream of the parquet is filtering
them out; to be chased in separate work. Until then, parity-probe MAE
will understate the true corpus impact of slices that target full-SAP
certs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
---
 .../src/ml_training_data/sap_parity_probe.py                | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/services/ml_training_data/src/ml_training_data/sap_parity_probe.py b/services/ml_training_data/src/ml_training_data/sap_parity_probe.py
index 6a331f82..35e6dcfa 100644
--- a/services/ml_training_data/src/ml_training_data/sap_parity_probe.py
+++ b/services/ml_training_data/src/ml_training_data/sap_parity_probe.py
@@ -37,7 +37,11 @@ _ZIP_KEYS = ("certificates-2025.json.zip", "certificates-2026.json.zip")
 
 def _sample_certs(n: int, seed: int) -> dict[str, int]:
     df = pd.read_parquet(_PARQUET, columns=["certificate_number", "sap_score"])
-    df = df[df["sap_score"].between(20, 95)]
+    # Wide range so the sample includes full-SAP new-builds (sap_score 90+)
+    # and the deepest-tail heritage/anomaly certs (sap_score < 20). The
+    # earlier inclusive `between(20, 95)` dropped scores above 95 and below
+    # 20, where the calculator's biggest spec gaps tend to live.
+    df = df[df["sap_score"].between(5, 99)]
     s = df.sample(n, random_state=seed)
     return dict(zip(s["certificate_number"], s["sap_score"].astype(int)))