Per ADR-0008: the v15 baseline reports MAPE but optimises MSE, which
under-weights tail rows. Switching to objective='mape' applies gradient
proportional to 1/|y| and lets the model focus where MAPE penalises.
Targets co2_emissions, space_heating_kwh, hot_water_kwh, and peui_raw
retain the default 'regression' objective (some rows have ~zero CO2 from
heavy PV; MAPE objective destabilises near zero).
Sample weights deferred to slice 16i if slice 16h's per-decile residuals
still show tail bias after the objective switch.
Adds `_per_decile_residuals` and writes `residuals_<target>.json` next to
metrics.json. Buckets test-set rows by deciles of the true target value;
each bucket carries count + MAPE + MAE + mean residual + true_min/max.
Lets us tell whether errors concentrate in the tails of the true distribution
(e.g. SAP<40 / SAP>85) vs the mid-band — which the global MAPE alone hides.
Baseline for slice 16's MAPE-improvement ablations.
Previously kept the full list of EpcPropertyData in memory before calling
EpcMlTransform.to_rows. For the 25k slice that's ~30 MB; for the 580k
full-2026 corpus it OOM-killed the process silently. Now: parse cert ->
to_row -> append dict -> drop EpcPropertyData reference, so memory is
O(row-dict * n) instead of O(EpcPropertyData * n). Same end-of-frame
post-processing (categorical casts, column-order pin).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two production fixes surfaced by the live run:
- mapper.from_rdsap_schema_21_0_1 now sets the three ML target scalars
(energy_rating_current, co2_emissions_current, energy_consumption_current).
They were silently None for every cert before, leaving the only labels as
the kWh fields from renewable_heat_incentive.
- train_baseline coerces object-dtype columns to numeric (None -> NaN) and
drops rows with null target per fit, so LightGBM accepts the frame.
E2E on 500 real certs (~1s):
sap_score R^2=0.604 MAPE=0.084
co2_emissions R^2=0.813 MAPE=0.130
peui_raw R^2=0.979 MAPE=0.026
space_heating_kwh R^2=0.823 MAPE=0.213
hot_water_kwh R^2=0.519 MAPE=0.115
peui_ucl excluded: UCL correction still needs wiring.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Bulk entries are NDJSON of wrapper records, not a JSON array. Each wrapper
carries certificate_number, assessment_type, and a stringified document with
the actual EPC schema payload. Filter to RdSAP, unwrap document, then map.
remote_bulk_fetcher: per-entry presigned-URL refresh (30s S3 TTL).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
ijson use_float fixes Decimal/float coercion when streaming JSON.
pyright extraPaths so the new pkg type-checks against domna-domain.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>