Commit graph

14 commits

Author SHA1 Message Date
Khalim Conn-Kowlessar
700ff4640c slice 16g: LightGBM objective=mape for sap_score + peui_ucl
Per ADR-0008: the v15 baseline reports MAPE but optimises MSE, which
under-weights tail rows. Switching to objective='mape' applies gradient
proportional to 1/|y| and lets the model focus where MAPE penalises.

Targets co2_emissions, space_heating_kwh, hot_water_kwh, and peui_raw
retain the default 'regression' objective (some rows have ~zero CO2 from
heavy PV; MAPE objective destabilises near zero).

Sample weights deferred to slice 16i if slice 16h's per-decile residuals
still show tail bias after the objective switch.
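
The per-target objective split described above could be expressed roughly as follows. This is a minimal sketch, not the code in this commit: `lgbm_params` and `mape_gradient` are hypothetical names, and only the target names and the `'mape'` / `'regression'` objective strings come from the message. The gradient helper shows the textbook derivative of the absolute percentage error; LightGBM's own implementation may treat small `|y|` differently.

```python
# Hypothetical helper: choose the LightGBM objective per target.
# Target names and objective strings are taken from the commit message;
# the function names and structure are assumptions for illustration.
MAPE_TARGETS = {"sap_score", "peui_ucl"}


def lgbm_params(target: str) -> dict:
    """Return the objective setting for one target."""
    objective = "mape" if target in MAPE_TARGETS else "regression"
    return {"objective": objective}


def mape_gradient(pred: float, y: float) -> float:
    """Textbook gradient of |y - pred| / |y| with respect to pred.

    The magnitude is 1/|y|, so small-|y| rows pull harder than under MSE.
    It is undefined at y == 0, which is why near-zero targets are risky.
    """
    if y == 0:
        raise ZeroDivisionError("MAPE gradient is undefined for y == 0")
    sign = (pred > y) - (pred < y)
    return sign / abs(y)
```

For example, `lgbm_params("co2_emissions")` keeps `'regression'`, matching the note that near-zero CO2 rows destabilise a MAPE objective.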
2026-05-17 12:06:13 +00:00
Khalim Conn-Kowlessar
fd8d71eb05 slice 15e: per-decile residuals reporting in train_baseline
Adds `_per_decile_residuals` and writes `residuals_<target>.json` next to
metrics.json. Buckets test-set rows by deciles of the true target value;
each bucket carries count + MAPE + MAE + mean residual + true_min/max.

Lets us tell whether errors concentrate in the tails of the true distribution
(e.g. SAP<40 / SAP>85) vs the mid-band — which the global MAPE alone hides.
Baseline for slice 16's MAPE-improvement ablations.
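
A rough sketch of the bucketing described above, using only the standard library. It is not the actual `_per_decile_residuals`: the function name, the key names, the equal-count bucketing by rank, and the residual sign convention (prediction minus truth) are all assumptions.

```python
from statistics import mean


def per_decile_residuals(y_true, y_pred, n_buckets=10):
    """Bucket rows by rank of the true value and summarise the errors.

    Assumption: buckets are equal-count slices of the rows sorted by the
    true target, and residual = prediction - truth.
    """
    pairs = sorted(zip(y_true, y_pred), key=lambda pair: pair[0])
    n = len(pairs)
    buckets = []
    for b in range(n_buckets):
        chunk = pairs[b * n // n_buckets:(b + 1) * n // n_buckets]
        if not chunk:
            continue  # fewer rows than buckets
        residuals = [p - t for t, p in chunk]
        # MAPE is only defined for rows with a non-zero true value.
        pct = [abs(p - t) / abs(t) for t, p in chunk if t != 0]
        buckets.append({
            "decile": b + 1,
            "count": len(chunk),
            "mape": mean(pct) if pct else None,
            "mae": mean(abs(r) for r in residuals),
            "mean_residual": mean(residuals),
            "true_min": chunk[0][0],
            "true_max": chunk[-1][0],
        })
    return buckets
```

A consistently positive or negative `mean_residual` in the first or last bucket would be the tail bias this report is meant to expose.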
2026-05-17 11:18:40 +00:00
Khalim Conn-Kowlessar
a1f89b6033 slice 15c: stream build_features so 500k+ cert runs fit memory
Previously kept the full list of EpcPropertyData in memory before calling
EpcMlTransform.to_rows. For the 25k slice that's ~30 MB; for the 580k
full-2026 corpus it OOM-killed the process silently. Now: parse cert ->
to_row -> append dict -> drop EpcPropertyData reference, so memory is
O(row-dict * n) instead of O(EpcPropertyData * n). Same end-of-frame
post-processing (categorical casts, column-order pin).
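
The streaming pattern above can be sketched with a generator. The names `stream_rows`, `parse`, and `to_row` are placeholders for illustration and are not the real `build_features` or `EpcMlTransform` API.

```python
from typing import Any, Callable, Iterable, Iterator


def stream_rows(
    raw_certs: Iterable[Any],
    parse: Callable[[Any], Any],
    to_row: Callable[[Any], dict],
) -> Iterator[dict]:
    """Yield one flat row dict per cert without retaining parsed objects."""
    for raw in raw_certs:
        cert = parse(raw)   # heavy object, rebound on every iteration
        yield to_row(cert)  # only the small dict leaves this function


def build_rows(raw_certs, parse, to_row) -> list:
    # Peak memory is roughly the row dicts plus a single parsed cert.
    return list(stream_rows(raw_certs, parse, to_row))
```

Because the parsed object is rebound on each iteration, at most one of them is alive at a time, which is the property the commit relies on.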

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 00:36:53 +00:00
Khalim Conn-Kowlessar
c496f345f8 slice 14l: bigger-run fixes — UCL guard, PV Measurement coercion, sMAPE
Three changes surfaced by the 25k 2026 run:
- transform._peui_ucl returns None for non-positive raw PEUI (net-exporters).
  apply_ucl_correction would otherwise raise ValueError on negative input.
- PhotovoltaicArray scalars (peak_power, pitch, orientation, overshading)
  now accept Measurement | int | float in the schema; mapper coerces via
  _measurement_value.
- train_baseline reports sMAPE alongside MAPE — handles zero-actual rows
  (e.g. co2_emissions for net-zero certs) where MAPE explodes.
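
For reference, one common sMAPE definition is sketched below. Whether `train_baseline` uses this exact form is an assumption: the denominator here is the mean of `|y|` and `|ŷ|`, and a row where both are zero contributes zero error.

```python
def smape(y_true, y_pred) -> float:
    """Symmetric MAPE with denominator (|y| + |y_hat|) / 2.

    Assumption: rows where truth and prediction are both zero count as
    zero error rather than being dropped.
    """
    terms = []
    for t, p in zip(y_true, y_pred):
        denom = (abs(t) + abs(p)) / 2
        terms.append(0.0 if denom == 0 else abs(p - t) / denom)
    return sum(terms) / len(terms)
```

With a true value of 0 and a prediction of 10, plain MAPE divides by zero, while this form gives a bounded term of 2.0.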

Results at N=25,000 RdSAP 2026 certs (~32s end-to-end):
  sap_score          MAPE=0.064  sMAPE=0.054  R^2=0.762
  co2_emissions      sMAPE=0.140  R^2=0.890
  peui_raw           MAPE=0.126  sMAPE=0.120  R^2=0.714
  peui_ucl           MAPE=0.114  sMAPE=0.108  R^2=0.736
  space_heating_kwh  MAPE=0.167  sMAPE=0.157  R^2=0.915
  hot_water_kwh      MAPE=0.089  sMAPE=0.086  R^2=0.737

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 21:15:37 +00:00
Khalim Conn-Kowlessar
8fddd25b9a slice 14k: E2E pipeline runs on real 2026 RdSAP certs
Two production fixes surfaced by the live run:
- mapper.from_rdsap_schema_21_0_1 now sets the three ML target scalars
  (energy_rating_current, co2_emissions_current, energy_consumption_current).
  They were silently None for every cert before, leaving the only labels as
  the kWh fields from renewable_heat_incentive.
- train_baseline coerces object-dtype columns to numeric (None -> NaN) and
  drops rows with null target per fit, so LightGBM accepts the frame.

E2E on 500 real certs (~1s):
  sap_score             R^2=0.604  MAPE=0.084
  co2_emissions         R^2=0.813  MAPE=0.130
  peui_raw              R^2=0.979  MAPE=0.026
  space_heating_kwh     R^2=0.823  MAPE=0.213
  hot_water_kwh         R^2=0.519  MAPE=0.115

peui_ucl excluded: UCL correction still needs wiring.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 20:47:41 +00:00
Khalim Conn-Kowlessar
611c07de94 slice 14h: handle real bulk-JSON shape (NDJSON wrappers + document payload)
Bulk entries are NDJSON of wrapper records, not a JSON array. Each wrapper
carries certificate_number, assessment_type, and a stringified document with
the actual EPC schema payload. Filter to RdSAP, unwrap document, then map.
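
The filter-and-unwrap step might look roughly like this. The field names `certificate_number`, `assessment_type`, and `document` come from the message above; the function name and the literal value `"RdSAP"` for `assessment_type` are assumptions.

```python
import json
from typing import Iterable, Iterator, Tuple


def iter_rdsap_documents(lines: Iterable[str]) -> Iterator[Tuple[str, dict]]:
    """Yield (certificate_number, payload) for each RdSAP wrapper record."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        wrapper = json.loads(line)
        # Assumed literal; the real bulk data may spell this differently.
        if wrapper.get("assessment_type") != "RdSAP":
            continue
        # The payload is a JSON string inside the wrapper, so decode twice.
        yield wrapper["certificate_number"], json.loads(wrapper["document"])
```

Reading line by line, rather than calling `json.load` on the whole file, is what makes the NDJSON shape workable for large entries.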

remote_bulk_fetcher: per-entry presigned-URL refresh (30s S3 TTL).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 19:45:52 +00:00
Khalim Conn-Kowlessar
9eb70cede1 slice 14g: remote_bulk_fetcher extracts ZIP entries via HTTP Range (no full download)
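
One way to read a single ZIP entry without downloading the archive is to hand `zipfile` a seekable file-like object that fetches byte ranges on demand. The sketch below is an illustration under assumptions, not the `remote_bulk_fetcher` code: `RangeReader`, `read_entry`, and `fetch_range` are hypothetical, `fetch_range(start, end)` stands in for an HTTP GET with a `Range: bytes=start-end` header, and the archive size is assumed to be known up front (for example from `Content-Length`).

```python
import io
import zipfile
from typing import Callable


class RangeReader(io.RawIOBase):
    """Seekable read-only file whose bytes come from fetch_range(start, end).

    fetch_range is assumed to return the inclusive byte range [start, end],
    which is how an HTTP Range request behaves.
    """

    def __init__(self, size: int, fetch_range: Callable[[int, int], bytes]):
        super().__init__()
        self._size = size
        self._fetch = fetch_range
        self._pos = 0

    def readable(self) -> bool:
        return True

    def seekable(self) -> bool:
        return True

    def tell(self) -> int:
        return self._pos

    def seek(self, offset: int, whence: int = io.SEEK_SET) -> int:
        base = {io.SEEK_SET: 0, io.SEEK_CUR: self._pos,
                io.SEEK_END: self._size}[whence]
        if base + offset < 0:
            raise OSError("negative seek position")
        self._pos = base + offset
        return self._pos

    def readinto(self, buf) -> int:
        if self._pos >= self._size or len(buf) == 0:
            return 0
        last = min(self._pos + len(buf), self._size) - 1
        data = self._fetch(self._pos, last)  # one range request per read
        buf[:len(data)] = data
        self._pos += len(data)
        return len(data)


def read_entry(size: int, fetch_range: Callable[[int, int], bytes],
               name: str) -> bytes:
    """Read one archive member, fetching only the byte ranges zipfile asks for."""
    with zipfile.ZipFile(RangeReader(size, fetch_range)) as zf:
        return zf.read(name)
```

`zipfile` first reads the end-of-central-directory record and the central directory, then seeks to the requested member, so only those ranges are fetched. A production version would likely add buffering and retries.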
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 19:16:52 +00:00
Khalim Conn-Kowlessar
b676e05d49 slice 14f: train_baseline fits LightGBM per target, emits MAPE/R^2 + importance
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 18:47:49 +00:00
Khalim Conn-Kowlessar
23ba2ef271 slice 14e: write_training_dataset emits parquet + schema.json + manifest.json
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 18:43:31 +00:00
Khalim Conn-Kowlessar
20fd55d5a1 slice 14d: build_features wires bulk reader -> mapper -> EpcMlTransform
ijson use_float fixes Decimal/float coercion when streaming JSON.
pyright extraPaths so the new pkg type-checks against domna-domain.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 18:38:41 +00:00
Khalim Conn-Kowlessar
0ff9d546b8 slice 14c: BulkZipReader streams certs from gov bulk JSON ZIP
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 18:27:24 +00:00
Khalim Conn-Kowlessar
7a6c8b4f24 slice 14b: Storage protocol + LocalStorage impl
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 17:52:54 +00:00
Khalim Conn-Kowlessar
eb42cb88a1 slice 14a: ml_training_data pkg + sample.py (CSV filter + random sample)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 17:39:43 +00:00
Khalim Conn-Kowlessar
dfe9e3ddbe added potential file scaffolding:
2026-05-15 10:56:53 +00:00