docs: ADR-0008 physics-as-feature + v16.0.0 schema bump

Captures the slice-16 plan decisions before code lands:
- Mid-physics: predicted_ecf + predicted_log10_ecf, NOT predicted_sap_score
- Cost scope: heating + DHW + lighting (no PV/pumps/secondary)
- Crude annual heat-demand calc (HLC * HDH / efficiency)
- Cascade-defaulting U-value imputation
- envelope_heat_loss_w_per_k sums all parts; extension_1 only as discrete features (88% null drops extension_2)
- v16.0.0 MAJOR bump (rename secondary_dwelling_* -> extension_1_*); coordinated cutover with AutoGluon repo + scoring lambda
- LightGBM objective="mape" for sap_score+peui_ucl in 16g; sample weights deferred
Khalim Conn-Kowlessar 2026-05-17 11:20:40 +00:00
parent fd8d71eb05
commit f61d74a327


@ -0,0 +1,109 @@
# Physics-derived features in the EPC ML Transform; v16.0.0 schema bump
**Status: Accepted.** Extends the physics-coupling pattern from [ADR-0007](0007-kwh-as-ml-target.md) — which folded the UCL band correction into training *labels* — to the *feature* side: the EPC ML Transform v16.0.0 ships engineered features that reproduce parts of the SAP10.2 worksheet (envelope conduction, heating seasonal efficiency, fuel-cost ECF) and feeds them to the model alongside the raw cert fields.
The motivating problem is that the v15.x baseline reaches MAPE 3.8% on `sap_score`, with the tails (SAP<40, SAP>85) carrying disproportionate error. The model has access to the raw inputs that drive SAP (wall construction, age band, heating-system code, areas) but has to compose them into a SAP score from scratch via tree splits. We close that gap by giving the model the same intermediate quantities the SAP10.2 calculator uses internally.
## Why physics-as-feature is not classical leakage
In the Rebaselining use case (see CONTEXT.md and ADR-0007), the model approximates the SAP10.2 calculator. The labels (`sap_score`, kWh targets, CO2) are outputs of that calculator computed by approved assessors. The features include the physical inputs to it. Engineering features that reproduce the calculator's internal quantities — envelope heat loss, seasonal efficiency, predicted fuel cost, log10(ECF) — is not classical leakage because:
1. None of these features reads the label. They read cert fabric/heating fields the SAP10.2 calculator also reads.
2. At inference time we have those same cert fields (from Site Notes or from the public EPC + Landlord Overrides). We do not have the SAP score itself.
3. The physics features expose intermediate calculation results to the model so it does not have to rediscover them via tree splits. This is the feature-side analogue of the label-side coupling already accepted in ADR-0007.
The tautology bound is therefore the SAP10.2 worksheet itself: a feature that computes a quantity also present on the worksheet is acceptable; a feature that reads the EPC's recorded SAP score (`energy_rating_current`) is not. That latter exclusion is preserved from ADR-0007.
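The boundary above can be expressed as a column-level guard. This is a minimal sketch, not the transform's actual check: `leaked_columns` and the `features` list are hypothetical names, and the excluded set contains only the fields this ADR names explicitly.

```python
# Labels listed in this ADR (the six ML targets) and the recorded EPC rating.
# None of these may appear in the feature set.
ML_TARGETS = {
    "sap_score", "co2_emissions", "peui_raw",
    "peui_ucl", "space_heating_kwh", "hot_water_kwh",
}
EXCLUDED_FIELDS = {"energy_rating_current"}


def leaked_columns(feature_columns):
    """Return feature names that read a label or the recorded SAP score."""
    return sorted(set(feature_columns) & (ML_TARGETS | EXCLUDED_FIELDS))


# Physics features are fine: they are computed from cert fabric/heating fields.
features = [
    "envelope_heat_loss_w_per_k",
    "predicted_space_heating_kwh",
    "predicted_log10_ecf",
]
print(leaked_columns(features))
print(leaked_columns(features + ["energy_rating_current"]))
```

Note that `predicted_space_heating_kwh` passes while `space_heating_kwh` would not: the former is derived from inputs, the latter is the calculator's output.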
## Depth of physics: "Mid", not "Deep"
Three points on the spectrum were considered:
- **Shallow** — only the raw building-physics intermediates (envelope heat loss W/K, seasonal efficiency, predicted kWh). Model learns kWh→cost→SAP unaided.
- **Mid** — Shallow plus the cost reconstruction (`predicted_total_fuel_cost_gbp`, `predicted_ecf`, `predicted_log10_ecf`). Model still has to apply the piecewise SAP rating transform.
- **Deep** — Mid plus `predicted_sap_score` with the SAP10.2 §20.1 piecewise log/linear formula pre-applied. Model learns residual only.
We accept **Mid**. Reasons:
1. The piecewise SAP rating constants (`SAP = 117 − 121·log10(ECF)` if ECF≥3.5 else `100 − 13.95·ECF`, deflator 0.42) are BRE's, version-bound to SAP10.2. Baking them into a feature means a future SAP10.3 release requires re-deriving features and re-training. Baking them only into the model's learned transform keeps the data layer SAP-version-agnostic.
2. `predicted_log10_ecf` is monotonic with `sap_score` by construction. Tree-based models fit monotonic transforms with a small number of splits. The kink at ECF=3.5 is one extra split. We give up almost nothing in accuracy.
3. `predicted_sap_score` would clip at high-ECF properties (the log term can push SAP < 1; the formula expects a clamp). `predicted_log10_ecf` has no such pathology.
4. We can escalate to Deep in a later slice if Mid leaves residual MAPE above target.
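To illustrate reasons 2 and 3, the sketch below evaluates the piecewise rating transform quoted in reason 1. It is for illustration only: under Mid this function is deliberately not shipped as a feature, and `sap_rating_from_ecf` is a hypothetical name.

```python
import math


def sap_rating_from_ecf(ecf: float) -> float:
    """Piecewise rating transform as quoted in this ADR, without any clamp."""
    if ecf >= 3.5:
        return 117.0 - 121.0 * math.log10(ecf)
    return 100.0 - 13.95 * ecf


ecfs = [0.5, 1.0, 2.0, 3.5, 5.0, 10.0, 20.0]
ratings = [sap_rating_from_ecf(e) for e in ecfs]

# The rating falls monotonically as ECF (and so log10(ECF)) rises.
for ecf, rating in zip(ecfs, ratings):
    print(f"ECF={ecf:5.1f}  log10={math.log10(ecf):6.3f}  rating={rating:7.2f}")

# The two branches nearly meet at the ECF = 3.5 kink.
gap_at_kink = abs((100.0 - 13.95 * 3.5) - sap_rating_from_ecf(3.5))
```

The unclamped rating goes below 1 at high ECF, which is the clipping pathology a `predicted_sap_score` feature would have to handle and `predicted_log10_ecf` avoids.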
## Cost reconstruction scope: heating + DHW + lighting
Total cost in the SAP rating sums: space heating, DHW, lighting, pumps/fans, secondary heating, minus PV credit. We include the first three; we omit the rest:
- Pumps/fans and secondary heating contribute a small (~2–5%) bias that is approximately constant across rows. Tree models learn a constant offset trivially.
- PV credit requires a monthly solar simulation (Tables 6, 6d, 6e) — multi-day implementation surface. PV-heavy properties (a small fraction of the high-SAP tail) get a small per-row bias the model can mostly absorb via the PV-fabric features already in v15.x.
- Lighting cost share varies materially by heating fuel and floor area; omitting it would create a fuel-mix-conditional bias that is harder to learn. So lighting goes in.
If a future slice (17+) shows the high-SAP tail still bad after Mid + Lighting lands, the PV monthly simulation gets its own slice.
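A minimal sketch of the three-component cost reconstruction follows. It assumes the usual SAP worksheet form `ECF = deflator × total cost / (TFA + 45)`, which this ADR does not spell out, and takes the deflator value as quoted earlier in this document. The function name, argument names, and the fuel prices in the usage example are hypothetical, not Table 32 values.

```python
import math

ENERGY_COST_DEFLATOR = 0.42  # value quoted in this ADR


def reconstruct_cost_features(
    space_kwh: float,
    dhw_kwh: float,
    lighting_kwh: float,
    heat_price_gbp_per_kwh: float,
    dhw_price_gbp_per_kwh: float,
    elec_price_gbp_per_kwh: float,
    total_floor_area_m2: float,
) -> dict:
    """Heating + DHW + lighting only; pumps/fans, secondary and PV are omitted."""
    cost = (
        space_kwh * heat_price_gbp_per_kwh
        + dhw_kwh * dhw_price_gbp_per_kwh
        + lighting_kwh * elec_price_gbp_per_kwh
    )
    # Assumed SAP worksheet form for the energy cost factor.
    ecf = ENERGY_COST_DEFLATOR * cost / (total_floor_area_m2 + 45.0)
    return {
        "predicted_total_fuel_cost_gbp": cost,
        "predicted_ecf": ecf,
        "predicted_log10_ecf": math.log10(ecf),
    }


# Hypothetical prices (GBP/kWh), chosen only to keep the arithmetic readable.
row = reconstruct_cost_features(10_000, 2_000, 500, 0.05, 0.05, 0.20, 95.0)
print(row)
```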
## Heat-demand approximation: crude annual
`predicted_space_heating_kwh` and `predicted_hot_water_kwh` are computed as:
- `space ≈ envelope_heat_loss_w_per_k × HDH_region × 0.001 / efficiency_main`, where `HDH_region` is heating degree hours per year per SAP region (~22 rows, ~53,000 K·h/yr for the UK average).
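- `hot_water ≈ 4.18 × Vd × (55 − 12) × 365 / 3600 / efficiency_water`, with `Vd = 25 × N_occupants + 36` litres/day and `N_occupants` defaulted from total floor area per SAP10.2 Appendix J. The factor 4.18 is the specific heat of water in kJ/(kg·K), so the division by 3600 converts kJ to kWh.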
We deliberately do not port SAP10.2's monthly heat balance with solar/internal gains and utilization factors. The crude calculation has 10–30% per-row bias driven by row-correlated factors (solar gains, infiltration, occupancy). The model already sees those factors directly (envelope_heat_loss, region, occupancy proxies), so it can learn the bias as a band-conditional correction without re-deriving the underlying physics. If slice 16h's per-decile residuals (see ADR-0007 baseline + slice 15e tooling) show the crude approximation underperforming, the SAP §3 utilization-factor refinement gets its own slice.
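The two crude annual formulas can be sketched as below. The function names, the efficiencies, and the occupancy in the usage example are hypothetical. The hot-water energy is converted from kJ to kWh by dividing by 3600, which is an assumption about the intended units given that 4.18 is in kJ/(kg·K).

```python
def predicted_space_heating_kwh(
    envelope_heat_loss_w_per_k: float,
    hdh_region_k_h: float,
    efficiency_main: float,
) -> float:
    """W/K x K.h gives Wh; the 0.001 factor converts to kWh."""
    return envelope_heat_loss_w_per_k * hdh_region_k_h * 0.001 / efficiency_main


def predicted_hot_water_kwh(n_occupants: float, efficiency_water: float) -> float:
    """Annual energy to heat the daily hot-water volume from 12 C to 55 C."""
    vd_litres_per_day = 25.0 * n_occupants + 36.0
    kj_per_year = 4.18 * vd_litres_per_day * (55.0 - 12.0) * 365.0
    return kj_per_year / 3600.0 / efficiency_water  # 1 kWh = 3600 kJ


# Hypothetical dwelling: 250 W/K, the ~53,000 K.h/yr UK-average figure from the
# text, an 85% efficient main system, 2.5 occupants, 80% efficient water heating.
space = predicted_space_heating_kwh(250.0, 53_000.0, 0.85)
water = predicted_hot_water_kwh(2.5, 0.80)
print(round(space), round(water))
```

Both results are delivered-energy estimates before any gains or utilization correction, so they are expected to sit above the worksheet values by a row-dependent margin.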
## Default U-value imputation: cascade
U-value lookups (Tables 6–10 walls, 16/17/18 roofs, 19+EN ISO 13370 floors, 20 upper floors, 24 windows, 26 doors, 21 thermal-bridging factor) are wrapped in helpers that cascade-default missing fields the same way RdSAP10 §6 does:
1. Use the cert value if known.
2. Fall back to the age-band-typical construction (e.g. cavity for ≥1930, solid brick for pre-1930).
3. Fall back to country-typical.
4. Final fallback: a mid-band default (1.5 W/m²K for walls).
`envelope_heat_loss_w_per_k` is therefore never null. The information about "this row had sparse fabric data" is already encoded in the correlated null pattern on the raw fabric features that survive into v16.
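The cascade above might look like the following for walls. This is a sketch only: every lookup value except the 1.5 W/m²K final fallback is a hypothetical placeholder rather than an RdSAP10 table entry, and the function name, age-band keys, and country key are illustrative.

```python
from typing import Optional

# Hypothetical placeholder lookups; the real values come from the RdSAP10 tables.
AGE_BAND_TYPICAL_WALL = {"pre-1930": "solid_brick", "1930-or-later": "cavity"}
U_BY_CONSTRUCTION = {"solid_brick": 1.7, "cavity": 1.0}
COUNTRY_TYPICAL_WALL_U = {"England": 1.6}
MID_BAND_WALL_U = 1.5  # final fallback stated in this ADR


def wall_u_value(
    cert_u: Optional[float],
    age_band: Optional[str],
    country: Optional[str],
) -> float:
    """Resolve a wall U-value through the four-step cascade; never returns None."""
    if cert_u is not None:                      # 1. cert value if known
        return cert_u
    construction = AGE_BAND_TYPICAL_WALL.get(age_band)
    if construction in U_BY_CONSTRUCTION:       # 2. age-band-typical construction
        return U_BY_CONSTRUCTION[construction]
    if country in COUNTRY_TYPICAL_WALL_U:       # 3. country-typical
        return COUNTRY_TYPICAL_WALL_U[country]
    return MID_BAND_WALL_U                      # 4. mid-band default


print(wall_u_value(0.3, "pre-1930", "England"))
print(wall_u_value(None, "pre-1930", "England"))
print(wall_u_value(None, None, "England"))
print(wall_u_value(None, None, None))
```

Because the last step always returns a number, any sum built on these helpers is non-null by construction.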
## Extensions: sum-over-all, expose extension_1 only
`envelope_heat_loss_w_per_k` sums over the main dwelling and every extension (`extension_1`, `extension_2`, `extension_3+`) regardless of how many are present, using each part's own age band and construction. The 250k corpus has:
| Building parts | Share | Per-extension feature support |
|---|---|---|
| 1 (main only) | 63.0% | — |
| 2 (main + extension_1) | 25.3% | `extension_1_*` populated |
| 3+ | 11.7% | aggregate captures, no per-part visibility |
So `extension_1_*` (renamed from v15.x `secondary_dwelling_*`) fires on 37% of certs and is worth carrying as discrete features. `extension_2_*` would fire on only 11.7% and adds clutter; we drop it. Any heat-loss contribution from extension_2+ flows through the `envelope_heat_loss_w_per_k` aggregate.
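A sketch of the split between the aggregate and the discrete features, under simplifying assumptions: each part is reduced to a list of (U-value, area) pairs, thermal bridging is left out, and `BuildingPart`, `extension_1_features`, and the `extension_1_age_band` column are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class BuildingPart:
    name: str                      # "main", "extension_1", "extension_2", ...
    age_band: str
    elements: list = field(default_factory=list)  # (U-value W/m2K, area m2) pairs


def envelope_heat_loss_w_per_k(parts) -> float:
    """Sum U x A over every element of every part, however many parts exist."""
    return sum(u * area for part in parts for u, area in part.elements)


def extension_1_features(parts) -> dict:
    """Expose discrete features for extension_1 only; later parts stay aggregate-only."""
    ext1 = next((p for p in parts if p.name == "extension_1"), None)
    return {"extension_1_age_band": ext1.age_band if ext1 else None}


parts = [
    BuildingPart("main", "pre-1930", [(1.5, 80.0), (0.3, 50.0)]),
    BuildingPart("extension_1", "1930-or-later", [(1.0, 20.0)]),
    BuildingPart("extension_2", "1930-or-later", [(2.0, 10.0)]),
]
total = envelope_heat_loss_w_per_k(parts)
discrete = extension_1_features(parts)
print(total, discrete)
```

Here `extension_2` contributes 20 W/K to the aggregate but produces no columns of its own.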
## v16.0.0: a MAJOR feature-schema bump
Per [ADR-0007](0007-kwh-as-ml-target.md) versioning policy: removing or renaming columns is MAJOR. Slice 16f renames every `secondary_dwelling_*` column to `extension_1_*`. The new physics features (envelope_heat_loss, predicted_*, predicted_ecf, predicted_log10_ecf, etc.) are MINOR additions on their own but ride with the rename in one cut. Result: v15.x → v16.0.0.
### Cross-repo cutover
The scoring lambda's tag must match the transform version. The AutoGluon training repo references the v15.x parquet schema. v16.0.0 lands as a coordinated deploy:
1. Slices 16a–h ship in this repo; v16 parquet generated locally.
2. AutoGluon repo updates column references (`secondary_dwelling_*` → `extension_1_*`; consume new physics columns).
3. New model artifact tagged v16.0.0.
4. Scoring lambda deployed with v16.0.0 tag concurrent with the new artifact.
5. v15 lambda retired.
Until step 4, the live v15 lambda continues serving v15 features against the v15 model. There is no intermediate state where one component is v16 and another v15.
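The rename and the no-mixed-versions rule can be sketched as two small helpers. Both function names and the example column names are hypothetical; only the two prefixes and the version strings come from this ADR.

```python
OLD_PREFIX = "secondary_dwelling_"
NEW_PREFIX = "extension_1_"


def to_v16_columns(v15_columns):
    """Apply the slice 16f prefix rename; other columns pass through unchanged."""
    return [
        NEW_PREFIX + col[len(OLD_PREFIX):] if col.startswith(OLD_PREFIX) else col
        for col in v15_columns
    ]


def cutover_is_consistent(transform_version, model_tag, lambda_tag) -> bool:
    """True only when transform, model artifact and scoring lambda share one version."""
    return transform_version == model_tag == lambda_tag


# Hypothetical v15 column names for illustration.
v15 = ["total_floor_area", "secondary_dwelling_age_band", "secondary_dwelling_wall_type"]
v16 = to_v16_columns(v15)
print(v16)
print(cutover_is_consistent("v16.0.0", "v16.0.0", "v16.0.0"))
print(cutover_is_consistent("v16.0.0", "v16.0.0", "v15.2.0"))
```

A check of this kind would fail at any point where one component has moved to v16 while another is still on v15, which is the state the deploy order is meant to rule out.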
## Tail-error treatment: LightGBM objective switch, not sample weights
Slice 16g switches the `sap_score` and `peui_ucl` LightGBM objective from the default `regression` (MSE) to `mape`. The reasoning is that the v15.x training loop reports MAPE while optimising MSE — a known mismatch that under-weights tail rows (a 2-point error at SAP=20 contributes the same squared loss as the same error at SAP=80 but is 4× more visible in MAPE). The `mape` objective applies gradient ∝ 1/|y|, directly compensating.
Sample-weight schemes (band-bucket reweighting) are deferred. If slice 16h's per-decile residuals show the tails still problematic after the objective switch, weights layer in as 16i. The `co2_emissions` target retains the MSE default because some rows have ~zero CO2 (heavy PV); the `mape` objective destabilises near zero. Per-target objective is configured at training time, not baked into the transform.
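The sketch below shows a per-target objective map and checks the tail-weighting arithmetic from the paragraph above. `TARGET_OBJECTIVES` and `lgbm_params` are hypothetical names; `"mape"` and `"regression"` are LightGBM objective strings. Falling back to `"regression"` for targets this section does not mention is an assumption.

```python
# Per-target objective, chosen at training time rather than in the transform.
TARGET_OBJECTIVES = {
    "sap_score": "mape",
    "peui_ucl": "mape",
    "co2_emissions": "regression",  # MSE default kept: labels can be ~zero
}


def lgbm_params(target: str) -> dict:
    """Return the objective entry for a target (assumed MSE default otherwise)."""
    return {"objective": TARGET_OBJECTIVES.get(target, "regression")}


def squared_error(error: float) -> float:
    return error ** 2


def absolute_percentage_error(y_true: float, error: float) -> float:
    return abs(error) / abs(y_true)


# A 2-point error costs the same under MSE at SAP=20 and SAP=80 ...
mse_ratio = squared_error(2.0) / squared_error(2.0)
# ... but is 10% vs 2.5% in percentage terms.
ape_ratio = absolute_percentage_error(20.0, 2.0) / absolute_percentage_error(80.0, 2.0)
print(lgbm_params("sap_score"), mse_ratio, ape_ratio)
```

The same 1/|y| scaling is what makes the objective unstable when a label is close to zero, which is why `co2_emissions` stays on the MSE default.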
## Consequences
- The EPC ML Transform owns more domain logic. It now contains the RdSAP10 U-value tables (Tables 6–10, 15–20, 24, 26), the SAP10.2 efficiency lookup (Table 4a), and the Table 32 fuel-price map. These are versioned with the transform; an upstream SAP/RdSAP revision is a transform bump.
- The training repo (this repo) and the AutoGluon repo are tightly coupled at parquet column names. Renames are MAJOR bumps with the cutover discipline above. Adding columns is MINOR.
- `predicted_log10_ecf` is approximately monotonic with `sap_score` by construction. Downstream consumers should not treat it as an independent signal.
- The physics features are deterministic given cert fields. If two rows have identical fabric+heating+geometry, their `envelope_heat_loss_w_per_k`, `predicted_total_fuel_cost_gbp`, and `predicted_log10_ecf` are identical. The model's residual must therefore explain SAP differences arising from non-deterministic cert calculator nuance (assessor variability, rounding, solar/utilization factors we did not port).
- A SAP10.3 release would invalidate the SAP10.2 fuel prices, efficiencies, and rating-formula constants used here. Treat such a release as a transform MAJOR bump with new lookup tables, not a hot-fix.
## What this ADR does not change
- The set of ML targets remains the six from ADR-0007: `sap_score`, `co2_emissions`, `peui_raw`, `peui_ucl`, `space_heating_kwh`, `hot_water_kwh`. The new features ride alongside the existing v15 features; nothing in the target set moves.
- `energy_rating_current` and any SAP-band-derived field remain excluded from features per ADR-0007.
- The `EpcEnergyDerivationService` runtime path is unaffected. Bills and fuel splits remain deterministic from kWh × current Fuel Rates.
- The 250k 2025+2026 SAP10 RdSAP corpus continues to be the training set; v16 is a column-schema change, not a data-source change.