mirror of
https://github.com/Hestia-Homes/Model.git
synced 2026-06-08 11:17:27 +00:00
tooling: widen parity probe sap_score range to (5, 99)
Previous bound (20, 95) excluded full-SAP new-builds (sap_score 90+, which
carry the dramatic wall U-value gap) and deepest-tail heritage certs
(sap_score ≤ 20). Widening so the sample reflects the populations where the
calculator's biggest spec gaps live.

New baseline at 300 certs, seed=7:

  SAP MAE   5.34 → 4.59   (-0.75)
  PE MAE   48.99 → 46.78  (-2.21)
  PE bias  42.07 → 41.78  (-0.29)

Note: the v18a parquet only contains ~0.7% certs with age_band=None, while
the raw bulk zip has 15% full-SAP "Average thermal transmittance" certs.
The parquet is filtering them somewhere upstream; to be chased in separate
work. Until then, parity-probe MAE will under-show the true corpus impact of
slices that target full-SAP certs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
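For readers unfamiliar with the two kinds of figure quoted in the message, the sketch below shows one common way to compute a mean absolute error and a signed bias. It is only an illustration: `parity_metrics`, `lodged`, and `computed` are hypothetical names, the sign convention (computed minus lodged) is an assumption the commit does not state, and the numbers are made up rather than taken from the probe.

```python
from statistics import fmean


def parity_metrics(lodged, computed):
    """Return MAE and signed bias of computed values against lodged ones.

    Hypothetical helper: the sign convention (computed - lodged) is an
    assumption, not something the commit specifies.
    """
    if len(lodged) != len(computed):
        raise ValueError("lodged and computed must have the same length")
    errors = [c - l for c, l in zip(computed, lodged)]
    return {
        "mae": fmean([abs(e) for e in errors]),  # average size of the miss
        "bias": fmean(errors),                   # average direction of the miss
    }


# Made-up values, purely to exercise the function.
metrics = parity_metrics(lodged=[72, 85, 14, 93], computed=[70, 88, 20, 85])
```

MAE and bias answer different questions: a bias close to the MAE, as with the PE figures above, suggests the errors mostly point the same way, whereas a small bias with a large MAE would suggest errors that cancel out.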
parent 9a509e4102
commit 136f149d46
1 changed file with 5 additions and 1 deletion
@@ -37,7 +37,11 @@ _ZIP_KEYS = ("certificates-2025.json.zip", "certificates-2026.json.zip")
 def _sample_certs(n: int, seed: int) -> dict[str, int]:
     df = pd.read_parquet(_PARQUET, columns=["certificate_number", "sap_score"])
-    df = df[df["sap_score"].between(20, 95)]
+    # Wide range so the sample includes full-SAP new-builds (sap_score 90+)
+    # and the deepest-tail heritage/anomaly certs (sap_score ≤ 20). Earlier
+    # `between(20, 95)` excluded the populations where the calculator's
+    # biggest spec gaps tend to live.
+    df = df[df["sap_score"].between(5, 99)]
     s = df.sample(n, random_state=seed)
     return dict(zip(s["certificate_number"], s["sap_score"].astype(int)))
 
 
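To make the effect of the bound change concrete, here is a small sketch on a synthetic frame. It assumes nothing about the real parquet beyond the two column names used in the diff; the certificate numbers and scores are invented. It relies on two documented pandas behaviours: `Series.between` is inclusive on both ends by default, and `DataFrame.sample` with a fixed `random_state` is repeatable for a given frame.

```python
import pandas as pd

# Synthetic stand-in for the parquet; every value here is made up.
df = pd.DataFrame({
    "certificate_number": [f"C{i}" for i in range(8)],
    "sap_score": [3, 5, 20, 50, 95, 97, 99, 100],
})

# Both endpoints are included, so the old filter kept 20 and 95 themselves.
old = df[df["sap_score"].between(20, 95)]
new = df[df["sap_score"].between(5, 99)]

old_scores = old["sap_score"].tolist()
new_scores = new["sap_score"].tolist()
gained = sorted(set(new_scores) - set(old_scores))  # scores newly eligible

# The same seed on the same frame yields the same draw.
a = new.sample(3, random_state=7)
b = new.sample(3, random_state=7)
same_draw = a["certificate_number"].tolist() == b["certificate_number"].tolist()
```

One consequence worth keeping in mind when reading the before/after figures: because the filtered frame changes, seed=7 no longer selects the same certificates, so the two baselines are computed on different samples rather than on one sample scored twice. Also note that `sample(n)` raises a `ValueError` if `n` exceeds the number of filtered rows, unless `replace=True` is passed.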