diff --git a/docs/adr/0008-physics-as-feature.md b/docs/adr/0008-physics-as-feature.md
new file mode 100644
index 00000000..8307c3f6
--- /dev/null
+++ b/docs/adr/0008-physics-as-feature.md
@@ -0,0 +1,109 @@
+# Physics-derived features in the EPC ML Transform; v16.0.0 schema bump
+
+**Status: Accepted.** Extends the physics-coupling pattern from [ADR-0007](0007-kwh-as-ml-target.md) — which folded the UCL band correction into training *labels* — to the *feature* side: the EPC ML Transform v16.0.0 ships engineered features that reproduce parts of the SAP10.2 worksheet (envelope conduction, heating seasonal efficiency, fuel-cost ECF) and feeds them to the model alongside the raw cert fields.
+
+The motivating problem is that the v15.x baseline reaches MAPE 3.8% on `sap_score` and tails (SAP<40, SAP>85) carry disproportionate error. The model has access to the raw inputs that drive SAP — wall construction, age band, heating-system code, areas — but composes them into a SAP score from scratch via tree splits. We close that gap by giving the model the same intermediate quantities the SAP10.2 calculator uses internally.
+
+## Why physics-as-feature is not classical leakage
+
+In the Rebaselining use case (see CONTEXT.md and ADR-0007), the model approximates the SAP10.2 calculator. The labels (`sap_score`, kWh targets, CO2) are outputs of that calculator computed by approved assessors. The features include the physical inputs to it. Engineering features that reproduce the calculator's internal quantities — envelope heat loss, seasonal efficiency, predicted fuel cost, log10(ECF) — is not classical leakage because:
+
+1. None of these features reads the label. They read cert fabric/heating fields the SAP10.2 calculator also reads.
+2. At inference time we have those same cert fields (from Site Notes or from the public EPC + Landlord Overrides). We do not have the SAP score itself.
+3. The physics features expose intermediate calculation results to the model so it does not have to rediscover them via tree splits. This is the feature-side analogue of the label-side coupling already accepted in ADR-0007.
+
+The tautology bound is therefore the SAP10.2 worksheet itself: a feature that computes a quantity also present on the worksheet is acceptable; a feature that reads the EPC's recorded SAP score (`energy_rating_current`) is not. That latter exclusion is preserved from ADR-0007.
+
+## Depth of physics: "Mid", not "Deep"
+
+Three points on the spectrum were considered:
+
+- **Shallow** — only the raw building-physics intermediates (envelope heat loss W/K, seasonal efficiency, predicted kWh). Model learns kWh→cost→SAP unaided.
+- **Mid** — Shallow plus the cost reconstruction (`predicted_total_fuel_cost_gbp`, `predicted_ecf`, `predicted_log10_ecf`). Model still has to apply the piecewise SAP rating transform.
+- **Deep** — Mid plus `predicted_sap_score` with the SAP10.2 §20.1 piecewise log/linear formula pre-applied. Model learns residual only.
+
+We accept **Mid**. Reasons:
+
+1. The piecewise SAP rating constants (`SAP = 117 − 121·log10(ECF)` if ECF≥3.5 else `100 − 13.95·ECF`, deflator 0.42) are BRE's, version-bound to SAP10.2. Baking them into a feature means a future SAP10.3 release requires re-deriving features and re-training. Baking them only into the model's learned transform keeps the data layer SAP-version-agnostic.
+2. `predicted_log10_ecf` is monotonic with `sap_score` by construction. Tree-based models fit monotonic transforms with a small number of splits. The kink at ECF=3.5 is one extra split. We give up almost nothing in accuracy.
+3. `predicted_sap_score` would clip at high-ECF properties (the log term can push SAP < 1; the formula expects a clamp). `predicted_log10_ecf` has no such pathology.
+4. We can escalate to Deep in a later slice if Mid leaves residual MAPE above target.
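
As a quick check on reasons 2 and 3, the sketch below applies the rating formula quoted in reason 1 to a handful of ECF values. `sap_from_ecf` is a hypothetical helper written for this illustration, not a transform feature (Mid leaves this step to the model), and it omits the clamp on purpose so the unclamped behaviour is visible.

```python
import math


def sap_from_ecf(ecf: float) -> float:
    """Unclamped SAP rating from the energy cost factor (hypothetical helper).

    Uses only the constants quoted in reason 1 above.
    """
    if ecf >= 3.5:
        # Logarithmic branch for high-cost dwellings.
        return 117.0 - 121.0 * math.log10(ecf)
    # Linear branch for low-cost dwellings.
    return 100.0 - 13.95 * ecf


# Illustrative ECF values, chosen to straddle the kink at 3.5.
ecfs = [0.5, 1.0, 2.0, 3.5, 5.0, 8.0, 10.0]
saps = [sap_from_ecf(e) for e in ecfs]

# SAP is a decreasing function of ECF, hence of log10(ECF).
is_monotonic = all(a > b for a, b in zip(saps, saps[1:]))

# Gap between the two branches evaluated at the kink.
kink_gap = abs((117.0 - 121.0 * math.log10(3.5)) - (100.0 - 13.95 * 3.5))
```

Under these assumptions SAP falls strictly as ECF rises, the two branches nearly meet at ECF = 3.5, and ECF = 10 yields an unclamped rating of −4, which is the clamp pathology that `predicted_log10_ecf` avoids.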
+
+## Cost reconstruction scope: heating + DHW + lighting
+
+Total cost in the SAP rating sums: space heating, DHW, lighting, pumps/fans, secondary heating, minus PV credit. We include the first three; we omit the rest:
+
+- Pumps/fans and secondary heating contribute small (~2–5%) bias that is approximately constant across rows. Tree models learn a constant offset trivially.
+- PV credit requires a monthly solar simulation (Tables 6, 6d, 6e) — multi-day implementation surface. PV-heavy properties (a small fraction of the high-SAP tail) get a small per-row bias the model can mostly absorb via the PV-fabric features already in v15.x.
+- Lighting cost share varies materially by heating fuel and floor area; omitting it would create a fuel-mix-conditional bias that is harder to learn. So lighting goes in.
+
+If a future slice (17+) shows the high-SAP tail is still poor after Mid + Lighting lands, the PV monthly simulation gets its own slice.
+
+## Heat-demand approximation: crude annual
+
+`predicted_space_heating_kwh` and `predicted_hot_water_kwh` are computed as:
+
+- `space ≈ envelope_heat_loss_w_per_k × HDH_region × 0.001 / efficiency_main`, where `HDH_region` is heating degree hours per year per SAP region (~22 rows, ~53,000 K·h/yr for the UK average).
+- `hot_water ≈ 4.18 × Vd × (55 − 12) × 365 / 3600 / efficiency_water`, with `Vd = 25 × N_occupants + 36` and `N_occupants` defaulted from total floor area per SAP10.2 Appendix J. The `/ 3600` converts kJ to kWh.
+
+We deliberately do not port SAP10.2's monthly heat balance with solar/internal gains and utilization factors. The crude calculation has 10–30% per-row bias driven by row-correlated factors (solar gains, infiltration, occupancy). The model already sees those factors directly — envelope_heat_loss, region, occupancy proxies — so it can learn the bias as a band-conditional correction without re-deriving the underlying physics.
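
The two annual approximations can be sketched as below. This is a minimal illustration, not the transform's code: the function and parameter names are assumptions, `n_occupants` is passed in rather than derived from floor area, and the hot-water line divides by 3600 because 4.18 kJ/(kg·K) × litres × K × days yields kJ.

```python
SPECIFIC_HEAT_WATER_KJ_PER_KG_K = 4.18
KJ_PER_KWH = 3600.0


def predicted_space_heating_kwh(
    envelope_heat_loss_w_per_k: float,
    hdh_region: float,
    efficiency_main: float,
) -> float:
    """Crude annual delivered energy for space heating (hypothetical sketch)."""
    # W/K x K.h/yr = Wh/yr of useful heat; x 0.001 converts Wh to kWh.
    useful_kwh = envelope_heat_loss_w_per_k * hdh_region * 0.001
    # Dividing by seasonal efficiency turns useful heat into delivered fuel.
    return useful_kwh / efficiency_main


def predicted_hot_water_kwh(
    n_occupants: float,
    efficiency_water: float,
    t_hot_c: float = 55.0,
    t_cold_c: float = 12.0,
) -> float:
    """Crude annual delivered energy for hot water (hypothetical sketch)."""
    # Daily hot-water volume in litres, from the occupancy formula in the text.
    vd_litres_per_day = 25.0 * n_occupants + 36.0
    # kJ per year needed to lift that volume from cold to hot temperature.
    useful_kj = (
        SPECIFIC_HEAT_WATER_KJ_PER_KG_K
        * vd_litres_per_day
        * (t_hot_c - t_cold_c)
        * 365.0
    )
    return useful_kj / KJ_PER_KWH / efficiency_water


# Example inputs are illustrative only, not corpus statistics.
space_kwh = predicted_space_heating_kwh(250.0, 53_000.0, 0.9)
dhw_kwh = predicted_hot_water_kwh(n_occupants=2.0, efficiency_water=1.0)
```

Neither function applies gains or utilization factors, which is exactly the simplification the paragraph above accepts.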
+If slice 16h's per-decile residuals (see ADR-0007 baseline + slice 15e tooling) show the crude approximation underperforming, the SAP §3 utilization-factor refinement gets its own slice.
+
+## Default U-value imputation: cascade
+
+U-value lookups (Tables 6–10 walls, 16/17/18 roofs, 19+EN ISO 13370 floors, 20 upper floors, 24 windows, 26 doors, 21 thermal-bridging factor) are wrapped in helpers that cascade-default missing fields the same way RdSAP10 §6 does:
+
+1. Use the cert value if known.
+2. Fall back to the age-band-typical construction (e.g. cavity for ≥1930, solid brick for pre-1930).
+3. Fall back to country-typical.
+4. Final fallback: a mid-band default (1.5 W/m²K for walls).
+
+`envelope_heat_loss_w_per_k` is therefore never null. The information about "this row had sparse fabric data" is already encoded in the correlated null pattern on the raw fabric features that survive into v16.
+
+## Extensions: sum-over-all, expose extension_1 only
+
+`envelope_heat_loss_w_per_k` sums over the main dwelling and every extension (`extension_1`, `extension_2`, `extension_3+`) regardless of how many are present, using each part's own age band and construction. The 250k corpus has:
+
+| Building parts | Share | Per-extension feature support |
+|---|---|---|
+| 1 (main only) | 63.0% | — |
+| 2 (main + extension_1) | 25.3% | `extension_1_*` populated |
+| 3+ | 11.7% | aggregate captures, no per-part visibility |
+
+So `extension_1_*` (renamed from v15.x `secondary_dwelling_*`) fires on 37% of certs and is worth carrying as discrete features. `extension_2_*` would fire on only 11.7% and adds clutter; we drop it. Any heat-loss contribution from extension_2+ flows through the `envelope_heat_loss_w_per_k` aggregate.
+
+## v16.0.0: a MAJOR feature-schema bump
+
+Per [ADR-0007](0007-kwh-as-ml-target.md) versioning policy: removing or renaming columns is MAJOR. Slice 16f renames every `secondary_dwelling_*` column to `extension_1_*`.
+The new physics features (envelope_heat_loss, predicted_*, predicted_ecf, predicted_log10_ecf, etc.) are MINOR additions on their own but ride with the rename in one cut. Result: v15.x → v16.0.0.
+
+### Cross-repo cutover
+
+The scoring lambda's tag must match the transform version. The AutoGluon training repo references the v15.x parquet schema. v16.0.0 lands as a coordinated deploy:
+
+1. Slices 16a–h ship in this repo; v16 parquet generated locally.
+2. AutoGluon repo updates column references (`secondary_dwelling_*` → `extension_1_*`; consume new physics columns).
+3. New model artifact tagged v16.0.0.
+4. Scoring lambda deployed with v16.0.0 tag concurrent with the new artifact.
+5. v15 lambda retired.
+
+Until step 4, the live v15 lambda continues serving v15 features against the v15 model. There is no intermediate production state where one component is v16 and another v15.
+
+## Tail-error treatment: LightGBM objective switch, not sample weights
+
+Slice 16g switches the `sap_score` and `peui_ucl` LightGBM objective from the default `regression` (MSE) to `mape`. The reasoning is that the v15.x training loop reports MAPE while optimising MSE — a known mismatch that under-weights tail rows (a 2-point error at SAP=20 contributes the same squared loss as the same error at SAP=80 but is 4× more visible in MAPE). The `mape` objective applies gradient ∝ 1/|y|, directly compensating.
+
+Sample-weight schemes (band-bucket reweighting) are deferred. If slice 16h's per-decile residuals show the tails still problematic after the objective switch, weights layer in as 16i. The `co2_emissions` target retains the MSE default because some rows have ~zero CO2 (heavy PV); the `mape` objective destabilises near zero. Per-target objective is configured at training time, not baked into the transform.
+
+## Consequences
+
+- The EPC ML Transform owns more domain logic.
+  It now contains the RdSAP10 U-value tables (Tables 6–10, 15–20, 24, 26), the SAP10.2 efficiency lookup (Table 4a), and the Table 32 fuel-price map. These are versioned with the transform; an upstream SAP/RdSAP revision is a transform bump.
+- This repo and the AutoGluon training repo are tightly coupled at parquet column names. Renames are MAJOR bumps with the cutover discipline above. Adding columns is MINOR.
+- `predicted_log10_ecf` is approximately monotonic with `sap_score` by construction. Downstream consumers should not treat it as an independent signal.
+- The physics features are deterministic given cert fields. If two rows have identical fabric+heating+geometry, their `envelope_heat_loss_w_per_k`, `predicted_total_fuel_cost_gbp`, and `predicted_log10_ecf` are identical. The model's residual must therefore explain SAP differences arising from non-deterministic cert calculator nuance (assessor variability, rounding, solar/utilization factors we did not port).
+- A SAP10.3 release would invalidate the SAP10.2 fuel prices, efficiencies, and rating-formula constants used here. Treat such a release as a transform MAJOR bump with new lookup tables, not a hot-fix.
+
+## What this ADR does not change
+
+- The set of ML targets remains the six from ADR-0007: `sap_score`, `co2_emissions`, `peui_raw`, `peui_ucl`, `space_heating_kwh`, `hot_water_kwh`. The new features ride alongside the existing v15 features; nothing in the target set moves.
+- `energy_rating_current` and any SAP-band-derived field remain excluded from features per ADR-0007.
+- The `EpcEnergyDerivationService` runtime path is unaffected. Bills and fuel splits remain deterministic from kWh × current Fuel Rates.
+- The 250k 2025+2026 SAP10 RdSAP corpus continues to be the training set; v16 is a column-schema change, not a data-source change.