ADR-0010 P2: cert-calibration layer is deleted, the probe uses
SAP_10_2_SPEC_PRICES (already defined in cert_to_inputs.py). Extracts
a pure predict_sap_for_cert(cert_document, *, prices) -> int helper
out of main()'s inline pipeline so the spec-prices path is unit-
testable in isolation; the helper is also reusable for P3's cohort-
filtered probe variant.
The pinned regression value (SAP=67 for cert 6035-7729 under spec
prices, vs the cert's lodged SAP of 73 under cert-cal prices) lives
in services/ml_training_data/tests/unit/test_sap_parity_probe.py.
It will drift as P4 (PCDB) and the section sweep land their fixes;
that's expected.
cert_calibration_prices is still imported by test_golden_fixtures.py
and the table_12_cert_calibration module is intact. P2.2/P2.3 retire
those.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Previous bound (20, 95) excluded full-SAP new-builds (sap_score 90+,
which carry the dramatic wall U-value gap) and deepest-tail heritage
certs (sap_score ≤ 20). Widening so the sample reflects the
populations where the calculator's biggest spec gaps live.
New baseline at 300 certs, seed=7:
SAP MAE 5.34 → 4.59 (-0.75)
PE MAE 48.99 → 46.78 (-2.21)
PE bias 42.07 → 41.78 (-0.29)
Note: the v18a parquet only contains ~0.7% certs with age_band=None,
while the raw bulk zip has 15% full-SAP "Average thermal transmittance"
certs. The parquet is filtering them somewhere upstream — to be chased
in separate work. Until then, parity-probe MAE will under-show the true
corpus impact of slices that target full-SAP certs.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds primary-energy breakdown (space heating, hot water, lighting,
pumps, PV) per cert plus stratified bias reports by main_heating_
category, construction_age_band, and dwelling_type. Used to localise
the +51 kWh/m² PEUI bias to envelope-side over-prediction on pre-1996
fabric, which the bare SAP-residual ranking didn't surface.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Wires SAP 10.2 Table 12 "Primary energy factor" column into Table 12
helpers and onto CalculatorInputs as three per-end-use factors (space
heating, hot water, other). calculate_sap_from_inputs now emits
primary_energy_kwh_per_yr and primary_energy_kwh_per_m2 on SapResult,
matching the cert's `energy_consumption_current` field (PEUI).
Triggered by a decomposition that revealed I'd been comparing our
delivered energy to the cert's primary energy — apples to oranges.
With proper primary-energy comparison the actual finding is:
300-cert primary-energy diff (cert calibration prices):
energy MAE: 57.3 kWh/m²
energy bias: +51.6 (we over-predict by ~50%)
energy P50: +49.5
This is a much bigger systemic bug than the SAP MAE 5.34 suggested.
Closing it requires investigating either (a) demand model
over-prediction, (b) HW losses, (c) PEF values per fuel, or (d) cert
reporting convention differences. Targeted for the next context.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Separates the SAP-spec source of truth from the empirical cert-
calibration prices. cert_to_inputs() now accepts a `prices: PriceTable`
parameter defaulting to SAP_10_2_SPEC_PRICES (3.64 gas, 16.49 elec,
9.40 7h-low — verbatim from SAP 10.2 §12.2 / Table 12). Parity probe
passes the empirical cert_calibration_prices() factory from
domain.sap.tables.table_12_cert_calibration which carries the lower
prices that match the cert assessor software's actual output (3.48,
13.19, 5.50).
This split is documented in both table modules: cert calibration is
explicitly NOT spec-correct, it just matches observed cert behaviour
for parity testing.
100-cert parity probe with cert-calibration prices:
MAE 6.66 → 4.99 (recovered from spec-price regression; also -0.41
from absolute baseline thanks to other S-B fixes)
RMSE 10.29 → 7.13
bias -4.66 → -1.03
within ±1: 20% → 23%
within ±3: 38% → 47%
within ±5: 63% → 67%
within ±10: 82% → 93%
Session-B progress overall (S-B2 baseline → here): MAE 8.41 → 4.99,
within ±1 doubled (10% → 23%).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds services/ml_training_data/src/ml_training_data/sap_parity_probe.py
— samples N certs from the v18a corpus, streams them via BulkZipReader,
runs Sap10Calculator, prints MAE/RMSE/bias + worst-N residuals. Baseline
across 100 certs: MAE 8.41, RMSE 13.98, bias -2.65, 0 errors.
docs/sap-spec/PARITY_FINDINGS.md captures the dominant failure pattern
(flats + bungalows under-predicted, 10 of the worst-15 are flats whose
floor/roof are party with neighbouring dwellings) and the priority-
ordered Session B iteration backlog (S-B-flat-surfaces first).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
train_baseline now returns mae + rmse alongside mape/smape/r2. MAE is the
user-facing metric ("predicted SAP within N points"); RMSE the quadratic
counterpart. Both come straight from sklearn.
New sample_weight_fn parameter: callable(y_train) -> per-row weights.
Threads into LGBMRegressor.fit's sample_weight argument. Default None
preserves existing behaviour.
Default tail strategy exposed as low_sap_tail_weight(y, threshold=58,
weight=3): 3x weight where SAP < 58. Threshold picked from slice 16h's
per-decile residuals — decile 0 (SAP 1-58) carries 17% MAPE vs <5% body.
Three TDD tracers, all AAA.
250k retrain showed objective='mape' loses ~0.6 percentage points of
global sap_score MAPE (3.92% with regression vs 4.50% with mape) and
~0.7 pts on peui_ucl. The mape objective over-weights the low-SAP tail
(weight ~1/y) and drags the body MAPE up by more than it gains in the
tail.
Body MAPE on v16 features is already strong (2.38% on deciles 1-8); the
remaining tail bias at decile 0 (SAP<58, +3.1 bias) needs a different
fix -- sample weights or stratified loss -- queued as slice 16i.
Per ADR-0008: the v15 baseline reports MAPE but optimises MSE, which
under-weights tail rows. Switching to objective='mape' applies gradient
proportional to 1/|y| and lets the model focus where MAPE penalises.
Targets co2_emissions, space_heating_kwh, hot_water_kwh, and peui_raw
retain the default 'regression' objective (some rows have ~zero CO2 from
heavy PV; MAPE objective destabilises near zero).
Sample weights deferred to slice 16i if slice 16h's per-decile residuals
still show tail bias after the objective switch.
Adds `_per_decile_residuals` and writes `residuals_<target>.json` next to
metrics.json. Buckets test-set rows by deciles of the true target value;
each bucket carries count + MAPE + MAE + mean residual + true_min/max.
Lets us tell whether errors concentrate in the tails of the true distribution
(e.g. SAP<40 / SAP>85) vs the mid-band — which the global MAPE alone hides.
Baseline for slice 16's MAPE-improvement ablations.
Previously kept the full list of EpcPropertyData in memory before calling
EpcMlTransform.to_rows. For the 25k slice that's ~30 MB; for the 580k
full-2026 corpus it OOM-killed the process silently. Now: parse cert ->
to_row -> append dict -> drop EpcPropertyData reference, so memory is
O(row-dict * n) instead of O(EpcPropertyData * n). Same end-of-frame
post-processing (categorical casts, column-order pin).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two production fixes surfaced by the live run:
- mapper.from_rdsap_schema_21_0_1 now sets the three ML target scalars
(energy_rating_current, co2_emissions_current, energy_consumption_current).
They were silently None for every cert before, leaving the only labels as
the kWh fields from renewable_heat_incentive.
- train_baseline coerces object-dtype columns to numeric (None -> NaN) and
drops rows with null target per fit, so LightGBM accepts the frame.
E2E on 500 real certs (~1s):
sap_score R^2=0.604 MAPE=0.084
co2_emissions R^2=0.813 MAPE=0.130
peui_raw R^2=0.979 MAPE=0.026
space_heating_kwh R^2=0.823 MAPE=0.213
hot_water_kwh R^2=0.519 MAPE=0.115
peui_ucl excluded: UCL correction still needs wiring.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Bulk entries are NDJSON of wrapper records, not a JSON array. Each wrapper
carries certificate_number, assessment_type, and a stringified document with
the actual EPC schema payload. Filter to RdSAP, unwrap document, then map.
remote_bulk_fetcher: per-entry presigned-URL refresh (30s S3 TTL).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
ijson use_float fixes Decimal/float coercion when streaming JSON.
pyright extraPaths so the new pkg type-checks against domna-domain.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>