Model/docs/adr/0007-kwh-as-ml-target.md
2026-05-16 14:15:56 +00:00

57 lines
6.5 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# Space heating and hot water kWh are ML targets; UCL is folded into training labels
**Status: Accepted.** Supersedes [ADR-0006](0006-deterministic-kwh-no-baseline-ml.md).
The EPC ML Transform predicts **six targets**: `sap_score`, `co2_emissions`, `peui_raw`, `peui_ucl`, `space_heating_kwh`, `hot_water_kwh`. Two of these (`space_heating_kwh`, `hot_water_kwh`) were explicitly excluded from ML by ADR-0006. We reverse that decision for two independent reasons, the second of which was the deciding factor.
## Why baseline kWh becomes an ML target
The premise of ADR-0006 was that baseline kWh has no clean source in the gov data and must be derived deterministically from SAP physics + UCL correction. That premise no longer holds:
1. The New EPC API exposes `renewable_heat_incentive.space_heating_existing_dwelling` and `renewable_heat_incentive.water_heating` directly as integers (kWh/yr delivered) on every SAP10 certificate. For a SAP10-baseline property, baseline kWh is a lookup, not a derivation — no SAP-physics port required.
2. **But** for the *Rebaselining* path (pre-SAP10 EPCs being scored against SAP10 methodology) and for *post-measure* impact prediction (the state after a measure is installed), no recorded kWh exists. The choice there is: derive deterministically (the ADR-0006 stance), or predict via ML alongside SAP / carbon / heat. Reason (2) below resolves this in favour of ML.
## Why UCL is folded into training labels rather than applied at runtime
The UCL per-band correction (Few et al. 2023) is a piecewise-linear function of PEUI keyed on EPC band. Applied at runtime, post-prediction, it produces a **discontinuity at band boundaries**: when a simulated package of measures pushes a property from band D into band C, the per-band slope/intercept switches discontinuously, and the UCL-adjusted kWh can move in the opposite direction to the underlying PEUI prediction. This was observed in practice on the legacy `model_engine`.
Folding UCL into the training labels — i.e. computing UCL-corrected PEUI per training row using the row's recorded band, then fitting the model on the corrected target — means the trained model emits metered-equivalent PEUI directly. There is no per-band switching at inference. The discontinuity disappears. The model learns a smooth function over the feature space.
The same logic motivates ML prediction of space heating and hot water kWh post-measure: deterministic derivation from a SAP-delta would reintroduce a similar band-boundary artefact at every step where heating efficiency or fuel changes. A single ML model emitting kWh directly is smooth across measure transitions.
## Scope of the reversal
| Quantity | ADR-0006 stance | ADR-0007 stance |
|---|---|---|
| Baseline SAP / carbon / heat demand | ML (unchanged) | ML (unchanged) |
| Baseline PEUI (`peui_raw`) | Read from EPC; UCL-corrected at runtime | Read from EPC at baseline; ML target with UCL-corrected variant (`peui_ucl`) at training time |
| Baseline space heating kWh | Deterministic from SAP physics + UCL | Read from EPC for SAP10 baselines; ML for Rebaselining + post-measure |
| Baseline hot water kWh | Deterministic from SAP physics + UCL | Read from EPC for SAP10 baselines; ML for Rebaselining + post-measure |
| Post-measure space heating kWh delta | Derived from SAP delta + heating fuel/COP | ML target (predicted directly post-measure) |
| Post-measure hot water kWh delta | Derived from SAP delta | ML target (predicted directly post-measure) |
| Fuel split, bills | Deterministic from kWh × Fuel Rates (unchanged) | Deterministic from kWh × Fuel Rates (unchanged) |
| Carbon factors → CO2 emissions | Deterministic from kWh × Carbon Factors (unchanged at runtime) | Deterministic from kWh × Carbon Factors (unchanged at runtime); ML target also separately for Rebaselining |
| UCL correction application point | Runtime, post-prediction, per band | Training time, folded into PEUI labels per row's recorded band |
## Dual PEUI training targets
We train two PEUI variants — `peui_raw` (the EPC's `energy_consumption_current` directly) and `peui_ucl` (the same value with the row's recorded-band UCL correction pre-applied). At v0.1.0 we compare both empirically. The variant with better held-out MAPE wins; the loser is dropped at v0.2.0.
## Label coupling, not classical leakage
The UCL transform uses the row's recorded SAP-derived band to compute the PEUI label, and SAP score is itself an ML target. This couples the two targets at the label level. It is **not** classical leakage (the band is not in the feature set; the model never reads it as input). The PEUI prediction is independent of the SAP prediction at inference. We accept the coupling as the price of avoiding the band-boundary discontinuity, consistent with our explicit "park target-independence" decision — the six targets are predicted independently and small cross-target inconsistencies are tolerated for v1.
Practical safeguard: `energy_rating_current` and any other SAP-score-derived field (e.g. `current_energy_efficiency_band`) are **excluded from the feature set** in the EPC ML Transform, to avoid an entirely separate target-leakage path on the SAP prediction.
## Consequences
- `EpcEnergyDerivationService` is no longer the source of baseline kWh. Its remaining job is the deterministic step from kWh + Fuel Rates → fuel split + bills, and kWh + Carbon Factors → CO2 emissions. UCL is removed from its runtime path; the `AnnualBillSavings.adjust_energy_to_metered` port that ADR-0006 anticipated does not happen — UCL moves into the training-side EPC ML Transform.
- The EPC ML Transform owns both feature definitions *and* the per-row UCL label transformation. It is the single artefact tying SAP-band semantics into the training data; cross-repo consumers (AutoGluon) see only post-transform parquet.
- `FuelRatesRepo`, `CarbonFactorsRepo`, and `HeatingSystemAssumptionsRepo` survive but their `HeatingSystemAssumptionsRepo` consumers shrink — the SAP-physics-decomposition path that ADR-0006 envisaged is unused.
- Adding more ML targets later (lighting kWh, appliance kWh, cooking kWh) becomes a feature-additive change rather than an architectural one — the precedent of "kWh as ML target" is now established.
## What this ADR does not change
- Per-recommendation **cost** delta is still deterministic, from kWh delta × current Fuel Rates.
- Bills surfaced to the UI are always current-rate, never pinned to EPC inspection-date rates.
- `EpcEnergyDerivationService` is preserved as the bills/fuel-split service; only its responsibility shrinks.