docs: ADR-0008 physics-as-feature + v16.0.0 schema bump

Captures the slice-16 plan decisions before code lands: - Mid-physics: predicted_ecf + predicted_log10_ecf, NOT predicted_sap_score - Cost scope: heating + DHW + lighting (no PV/pumps/secondary) - Crude annual heat-demand calc (HLC * HDH / efficiency) - Cascade-defaulting U-value imputation - envelope_heat_loss_w_per_k sums all parts; extension_1 only as discrete features (88% null drops extension_2) - v16.0.0 MAJOR bump (rename secondary_dwelling_* -> extension_1_*); coordinated cutover with AutoGluon repo + scoring lambda - LightGBM objective="mape" for sap_score+peui_ucl in 16g; sample weights deferred
2026-07-27 23:35:01 +00:00 · 2026-05-17 11:20:40 +00:00 · 2026-05-17 11:20:40 +00:00 · f61d74a327
commit f61d74a327
parent fd8d71eb05
1 changed files with 109 additions and 0 deletions
--- a/docs/adr/0008-physics-as-feature.md
+++ b/docs/adr/0008-physics-as-feature.md
@ -0,0 +1,109 @@
+# Physics-derived features in the EPC ML Transform; v16.0.0 schema bump
+
+**Status: Accepted.** Extends the physics-coupling pattern from [ADR-0007](0007-kwh-as-ml-target.md) — which folded the UCL band correction into training *labels* — to the *feature* side: the EPC ML Transform v16.0.0 ships engineered features that reproduce parts of the SAP10.2 worksheet (envelope conduction, heating seasonal efficiency, fuel-cost ECF) and feeds them to the model alongside the raw cert fields.
+
+The motivating problem is that the v15.x baseline reaches MAPE 3.8% on `sap_score` and tails (SAP<40, SAP>85) carry disproportionate error. The model has access to the raw inputs that drive SAP — wall construction, age band, heating-system code, areas — but composes them into a SAP score from scratch via tree splits. We close that gap by giving the model the same intermediate quantities the SAP10.2 calculator uses internally.
+
+## Why physics-as-feature is not classical leakage
+
+In the Rebaselining use case (see CONTEXT.md and ADR-0007), the model approximates the SAP10.2 calculator. The labels (`sap_score`, kWh targets, CO2) are outputs of that calculator computed by approved assessors. The features include the physical inputs to it. Engineering features that reproduce the calculator's internal quantities — envelope heat loss, seasonal efficiency, predicted fuel cost, log10(ECF) — is not classical leakage because:
+
+1. None of these features reads the label. They read cert fabric/heating fields the SAP10.2 calculator also reads.
+2. At inference time we have those same cert fields (from Site Notes or from the public EPC + Landlord Overrides). We do not have the SAP score itself.
+3. The physics features expose intermediate calculation results to the model so it does not have to rediscover them via tree splits. This is the feature-side analogue of the label-side coupling already accepted in ADR-0007.
+
+The tautology bound is therefore the SAP10.2 worksheet itself: a feature that computes a quantity also present on the worksheet is acceptable; a feature that reads the EPC's recorded SAP score (`energy_rating_current`) is not. That latter exclusion is preserved from ADR-0007.
+
+## Depth of physics: "Mid", not "Deep"
+
+Three points on the spectrum were considered:
+
+- **Shallow** — only the raw building-physics intermediates (envelope heat loss W/K, seasonal efficiency, predicted kWh). Model learns kWh→cost→SAP unaided.
+- **Mid** — Shallow plus the cost reconstruction (`predicted_total_fuel_cost_gbp`, `predicted_ecf`, `predicted_log10_ecf`). Model still has to apply the piecewise SAP rating transform.
+- **Deep** — Mid plus `predicted_sap_score` with the SAP10.2 §20.1 piecewise log/linear formula pre-applied. Model learns residual only.
+
+We accept **Mid**. Reasons:
+
+1. The piecewise SAP rating constants (`SAP = 117 − 121·log10(ECF)` if ECF≥3.5 else `100 − 13.95·ECF`, deflator 0.42) are BRE's, version-bound to SAP10.2. Baking them into a feature means a future SAP10.3 release requires re-deriving features and re-training. Baking them only into the model's learned transform keeps the data layer SAP-version-agnostic.
+2. `predicted_log10_ecf` is monotonic with `sap_score` by construction. Tree-based models fit monotonic transforms with a small number of splits. The kink at ECF=3.5 is one extra split. We give up almost nothing in accuracy.
+3. `predicted_sap_score` would clip at high-ECF properties (the log term can push SAP < 1; the formula expects a clamp). `predicted_log10_ecf` has no such pathology.
+4. We can escalate to Deep in a later slice if Mid leaves residual MAPE above target.
+
+## Cost reconstruction scope: heating + DHW + lighting
+
+Total cost in the SAP rating sums: space heating, DHW, lighting, pumps/fans, secondary heating, minus PV credit. We include the first three; we omit the rest:
+
+- Pumps/fans and secondary heating contribute small (~2–5%) bias that is approximately constant across rows. Tree models learn a constant offset trivially.
+- PV credit requires a monthly solar simulation (Tables 6, 6d, 6e) — multi-day implementation surface. PV-heavy properties (a small fraction of the high-SAP tail) get a small per-row bias the model can mostly absorb via the PV-fabric features already in v15.x.
+- Lighting cost share varies materially by heating fuel and floor area; omitting it would create a fuel-mix-conditional bias that is harder to learn. So lighting goes in.
+
+If a future slice (17+) shows the high-SAP tail still bad after Mid + Lighting lands, the PV monthly simulation gets its own slice.
+
+## Heat-demand approximation: crude annual
+
+`predicted_space_heating_kwh` and `predicted_hot_water_kwh` are computed as:
+
+- `space ≈ envelope_heat_loss_w_per_k × HDH_region × 0.001 / efficiency_main`, where `HDH_region` is heating degree hours per year per SAP region (~22 rows, ~53,000 K·h/yr for the UK average).
+- `hot_water ≈ 4.18 × Vd × (55 − 12) × 365 × 0.001 / efficiency_water`, with `Vd = 25 × N_occupants + 36` and `N_occupants` defaulted from total floor area per SAP10.2 Appendix J.
+
+We deliberately do not port SAP10.2's monthly heat balance with solar/internal gains and utilization factors. The crude calculation has 10–30% per-row bias driven by row-correlated factors (solar gains, infiltration, occupancy). The model already sees those factors directly — envelope_heat_loss, region, occupancy proxies — so it can learn the bias as a band-conditional correction without re-deriving the underlying physics. If slice 16h's per-decile residuals (see ADR-0007 baseline + slice 15e tooling) show the crude approximation underperforming, the SAP §3 utilization-factor refinement gets its own slice.
+
+## Default U-value imputation: cascade
+
+U-value lookups (Tables 6–10 walls, 16/17/18 roofs, 19+EN ISO 13370 floors, 20 upper floors, 24 windows, 26 doors, 21 thermal-bridging factor) are wrapped in helpers that cascade-default missing fields the same way RdSAP10 §6 does:
+
+1. Use the cert value if known.
+2. Fall back to the age-band-typical construction (e.g. cavity for ≥1930, solid brick for pre-1930).
+3. Fall back to country-typical.
+4. Final fallback: a mid-band default (1.5 W/m²K for walls).
+
+`envelope_heat_loss_w_per_k` is therefore never null. The information about "this row had sparse fabric data" is already encoded in the correlated null pattern on the raw fabric features that survive into v16.
+
+## Extensions: sum-over-all, expose extension_1 only
+
+`envelope_heat_loss_w_per_k` sums over the main dwelling and every extension (`extension_1`, `extension_2`, `extension_3+`) regardless of how many are present, using each part's own age band and construction. The 250k corpus has:
+
+| Building parts | Share | Per-extension feature support |
+|---|---|---|
+| 1 (main only) | 63.0% | — |
+| 2 (main + extension_1) | 25.3% | `extension_1_*` populated |
+| 3+ | 11.7% | aggregate captures, no per-part visibility |
+
+So `extension_1_*` (renamed from v15.x `secondary_dwelling_*`) fires on 37% of certs and is worth carrying as discrete features. `extension_2_*` would fire on only 11.7% and adds clutter; we drop it. Any heat-loss contribution from extension_2+ flows through the `envelope_heat_loss_w_per_k` aggregate.
+
+## v16.0.0: a MAJOR feature-schema bump
+
+Per [ADR-0007](0007-kwh-as-ml-target.md) versioning policy: removing or renaming columns is MAJOR. Slice 16f renames every `secondary_dwelling_*` column to `extension_1_*`. The new physics features (envelope_heat_loss, predicted_*, predicted_ecf, predicted_log10_ecf, etc.) are MINOR additions on their own but ride with the rename in one cut. Result: v15.x → v16.0.0.
+
+### Cross-repo cutover
+
+The scoring lambda's tag must match the transform version. The AutoGluon training repo references the v15.x parquet schema. v16.0.0 lands as a coordinated deploy:
+
+1. Slice 16a–h ships in this repo; v16 parquet generated locally.
+2. AutoGluon repo updates column references (`secondary_dwelling_*` → `extension_1_*`; consume new physics columns).
+3. New model artifact tagged v16.0.0.
+4. Scoring lambda deployed with v16.0.0 tag concurrent with the new artifact.
+5. v15 lambda retired.
+
+Until step 4, the live v15 lambda continues serving v15 features against the v15 model. There is no intermediate state where one component is v16 and another v15.
+
+## Tail-error treatment: LightGBM objective switch, not sample weights
+
+Slice 16g switches the `sap_score` and `peui_ucl` LightGBM objective from the default `regression` (MSE) to `mape`. The reasoning is that the v15.x training loop reports MAPE while optimising MSE — a known mismatch that under-weights tail rows (a 2-point error at SAP=20 contributes the same squared loss as the same error at SAP=80 but is 4× more visible in MAPE). The `mape` objective applies gradient ∝ 1/|y|, directly compensating.
+
+Sample-weight schemes (band-bucket reweighting) are deferred. If slice 16h's per-decile residuals show the tails still problematic after the objective switch, weights layer in as 16i. The `co2_emissions` target retains the MSE default because some rows have ~zero CO2 (heavy PV); the `mape` objective destabilises near zero. Per-target objective is configured at training time, not baked into the transform.
+
+## Consequences
+
+- The EPC ML Transform owns more domain logic. It now contains the RdSAP10 U-value tables (Tables 6–10, 15–20, 24, 26), the SAP10.2 efficiency lookup (Table 4a), and the Table 32 fuel-price map. These are versioned with the transform; an upstream SAP/RdSAP revision is a transform bump.
+- The training repo (this repo) and the AutoGluon repo are tightly coupled at parquet column names. Renames are MAJOR bumps with the cutover discipline above. Adding columns is MINOR.
+- `predicted_log10_ecf` is approximately monotonic with `sap_score` by construction. Down-stream consumers should not treat it as an independent signal.
+- The physics features are deterministic given cert fields. If two rows have identical fabric+heating+geometry, their `envelope_heat_loss_w_per_k`, `predicted_total_fuel_cost_gbp`, and `predicted_log10_ecf` are identical. The model's residual must therefore explain SAP differences arising from non-deterministic cert calculator nuance (assessor variability, rounding, solar/utilization factors we did not port).
+- A SAP10.3 release would invalidate the SAP10.2 fuel prices, efficiencies, and rating-formula constants used here. Treat such a release as a transform MAJOR bump with new lookup tables, not a hot-fix.
+
+## What this ADR does not change
+
+- The set of ML targets remains the six from ADR-0007: `sap_score`, `co2_emissions`, `peui_raw`, `peui_ucl`, `space_heating_kwh`, `hot_water_kwh`. The new features ride alongside the existing v15 features; nothing in the target set moves.
+- `energy_rating_current` and any SAP-band-derived field remain excluded from features per ADR-0007.
+- The `EpcEnergyDerivationService` runtime path is unaffected. Bills and fuel splits remain deterministic from kWh × current Fuel Rates.
+- The 250k 2025+2026 SAP10 RdSAP corpus continues to be the training set; v16 is a column-schema change, not a data-source change.