mirror of
https://github.com/Hestia-Homes/Model.git
synced 2026-06-08 11:17:27 +00:00
docs: ADR-0008 physics-as-feature + v16.0.0 schema bump
Captures the slice-16 plan decisions before code lands:

- Mid-physics: predicted_ecf + predicted_log10_ecf, NOT predicted_sap_score
- Cost scope: heating + DHW + lighting (no PV/pumps/secondary)
- Crude annual heat-demand calc (HLC * HDH / efficiency)
- Cascade-defaulting U-value imputation
- envelope_heat_loss_w_per_k sums all parts; extension_1 only as discrete features (88% null drops extension_2)
- v16.0.0 MAJOR bump (rename secondary_dwelling_* -> extension_1_*); coordinated cutover with AutoGluon repo + scoring lambda
- LightGBM objective="mape" for sap_score+peui_ucl in 16g; sample weights deferred
This commit is contained in:
parent
fd8d71eb05
commit
f61d74a327
1 changed files with 109 additions and 0 deletions
109
docs/adr/0008-physics-as-feature.md
Normal file
@@ -0,0 +1,109 @@
# Physics-derived features in the EPC ML Transform; v16.0.0 schema bump

**Status: Accepted.** Extends the physics-coupling pattern from [ADR-0007](0007-kwh-as-ml-target.md), which folded the UCL band correction into training *labels*, to the *feature* side: the EPC ML Transform v16.0.0 ships engineered features that reproduce parts of the SAP10.2 worksheet (envelope conduction, heating seasonal efficiency, fuel-cost ECF) and feeds them to the model alongside the raw cert fields.

The motivating problem is that the v15.x baseline reaches MAPE 3.8% on `sap_score`, with the tails (SAP<40, SAP>85) carrying disproportionate error. The model has access to the raw inputs that drive SAP (wall construction, age band, heating-system code, areas) but has to compose them into a SAP score from scratch via tree splits. We close that gap by giving the model the same intermediate quantities the SAP10.2 calculator uses internally.

## Why physics-as-feature is not classical leakage

In the Rebaselining use case (see CONTEXT.md and ADR-0007), the model approximates the SAP10.2 calculator. The labels (`sap_score`, kWh targets, CO2) are outputs of that calculator computed by approved assessors. The features include the physical inputs to it. Engineering features that reproduce the calculator's internal quantities (envelope heat loss, seasonal efficiency, predicted fuel cost, log10(ECF)) is not classical leakage because:

1. None of these features reads the label. They read cert fabric/heating fields the SAP10.2 calculator also reads.
2. At inference time we have those same cert fields (from Site Notes or from the public EPC + Landlord Overrides). We do not have the SAP score itself.
3. The physics features expose intermediate calculation results to the model so it does not have to rediscover them via tree splits. This is the feature-side analogue of the label-side coupling already accepted in ADR-0007.

The tautology bound is therefore the SAP10.2 worksheet itself: a feature that computes a quantity also present on the worksheet is acceptable; a feature that reads the EPC's recorded SAP score (`energy_rating_current`) is not. That latter exclusion is preserved from ADR-0007.

## Depth of physics: "Mid", not "Deep"

Three points on the spectrum were considered:

- **Shallow**: only the raw building-physics intermediates (envelope heat loss W/K, seasonal efficiency, predicted kWh). Model learns kWh→cost→SAP unaided.
- **Mid**: Shallow plus the cost reconstruction (`predicted_total_fuel_cost_gbp`, `predicted_ecf`, `predicted_log10_ecf`). Model still has to apply the piecewise SAP rating transform.
- **Deep**: Mid plus `predicted_sap_score` with the SAP10.2 §20.1 piecewise log/linear formula pre-applied. Model learns residual only.

We accept **Mid**. Reasons:

1. The piecewise SAP rating constants (`SAP = 117 − 121·log10(ECF)` if ECF≥3.5 else `100 − 13.95·ECF`, deflator 0.42) are BRE's, version-bound to SAP10.2. Baking them into a feature means a future SAP10.3 release requires re-deriving features and re-training. Baking them only into the model's learned transform keeps the data layer SAP-version-agnostic.
2. The SAP rating is a monotonically decreasing function of log10(ECF) by construction, so `predicted_log10_ecf` tracks `sap_score` approximately monotonically. Tree-based models fit monotonic transforms with a small number of splits, and the kink at ECF=3.5 costs one extra split. We give up almost nothing in accuracy.
3. `predicted_sap_score` would clip at high-ECF properties (the log term can push SAP < 1; the formula expects a clamp). `predicted_log10_ecf` has no such pathology.
4. We can escalate to Deep in a later slice if Mid leaves residual MAPE above target.

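As a rough illustration of where the Mid boundary sits, the sketch below computes the two ECF features and, separately, the piecewise rating that Mid leaves for the model to learn. The rating constants and the 0.42 deflator are the ones quoted in this ADR. The ECF definition (deflated cost divided by floor area plus 45 m²) is an assumption taken from the SAP worksheet rather than from this ADR, and the function names are hypothetical.

```python
import math

SAP_DEFLATOR = 0.42      # deflator quoted in this ADR
AREA_OFFSET_M2 = 45.0    # assumption: ECF denominator is (total floor area + 45)


def predicted_ecf(total_fuel_cost_gbp: float, floor_area_m2: float) -> float:
    """Energy cost factor from the reconstructed annual fuel cost."""
    return SAP_DEFLATOR * total_fuel_cost_gbp / (floor_area_m2 + AREA_OFFSET_M2)


def predicted_log10_ecf(total_fuel_cost_gbp: float, floor_area_m2: float) -> float:
    """log10(ECF); NaN when the reconstructed cost is not positive."""
    ecf = predicted_ecf(total_fuel_cost_gbp, floor_area_m2)
    return math.log10(ecf) if ecf > 0 else math.nan


def sap_rating_from_ecf(ecf: float) -> float:
    """Piecewise rating transform. Shown for illustration only:
    under Mid this is NOT shipped as a feature."""
    if ecf >= 3.5:
        return 117.0 - 121.0 * math.log10(ecf)
    return 100.0 - 13.95 * ecf
```

The two branches of the rating nearly meet at ECF=3.5, which is why the kink costs the model so little to learn.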
## Cost reconstruction scope: heating + DHW + lighting

Total cost in the SAP rating sums: space heating, DHW, lighting, pumps/fans, secondary heating, minus PV credit. We include the first three; we omit the rest:

- Pumps/fans and secondary heating contribute small (~2–5%) bias that is approximately constant across rows. Tree models learn a constant offset trivially.
- PV credit requires a monthly solar simulation (Tables 6, 6d, 6e), which is a multi-day implementation surface. PV-heavy properties (a small fraction of the high-SAP tail) get a small per-row bias the model can mostly absorb via the PV-fabric features already in v15.x.
- Lighting cost share varies materially by heating fuel and floor area; omitting it would create a fuel-mix-conditional bias that is harder to learn. So lighting goes in.

If a future slice (17+) shows the high-SAP tail is still poor after Mid + lighting lands, the PV monthly simulation gets its own slice.

## Heat-demand approximation: crude annual

`predicted_space_heating_kwh` and `predicted_hot_water_kwh` are computed as:

- `space ≈ envelope_heat_loss_w_per_k × HDH_region × 0.001 / efficiency_main`, where `HDH_region` is heating degree hours per year per SAP region (a lookup of ~22 rows; ~53,000 K·h/yr for the UK average). The 0.001 factor converts Wh to kWh.
- `hot_water ≈ 4.18 × Vd × (55 − 12) × 365 / 3600 / efficiency_water`, with `Vd = 25 × N_occupants + 36` litres/day and `N_occupants` defaulted from total floor area per SAP10.2 Appendix J. Here 4.18 kJ/(litre·K) is the specific heat of water, and dividing by 3600 converts kJ to kWh.

We deliberately do not port SAP10.2's monthly heat balance with solar/internal gains and utilization factors. The crude calculation has 10–30% per-row bias driven by row-correlated factors (solar gains, infiltration, occupancy). The model already sees those factors directly (`envelope_heat_loss_w_per_k`, region, occupancy proxies), so it can learn the bias as a band-conditional correction without re-deriving the underlying physics. If slice 16h's per-decile residuals (see ADR-0007 baseline + slice 15e tooling) show the crude approximation underperforming, the SAP §3 utilization-factor refinement gets its own slice.

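The two crude annual formulas in this section can be sketched as below. This is a minimal illustration, not the transform's code: the function names mirror the feature names, the occupancy figure is passed in rather than derived from floor area, and the hot-water energy is converted from kJ to kWh by dividing by 3600.

```python
KJ_PER_KWH = 3600.0
WATER_SPECIFIC_HEAT_KJ_PER_L_K = 4.18


def predicted_space_heating_kwh(
    envelope_heat_loss_w_per_k: float,
    hdh_region_k_h: float,
    efficiency_main: float,
) -> float:
    """Annual space-heating fuel demand: W/K x K.h = Wh, x 0.001 = kWh."""
    return envelope_heat_loss_w_per_k * hdh_region_k_h * 0.001 / efficiency_main


def predicted_hot_water_kwh(
    n_occupants: float,
    efficiency_water: float,
    t_hot_c: float = 55.0,
    t_cold_c: float = 12.0,
) -> float:
    """Annual hot-water fuel demand from daily volume Vd = 25 N + 36 litres."""
    vd_litres_per_day = 25.0 * n_occupants + 36.0
    energy_kj = (
        WATER_SPECIFIC_HEAT_KJ_PER_L_K * vd_litres_per_day * (t_hot_c - t_cold_c) * 365.0
    )
    return energy_kj / KJ_PER_KWH / efficiency_water


# Example inputs are made up for illustration.
space_kwh = predicted_space_heating_kwh(250.0, 53_000.0, 0.85)
water_kwh = predicted_hot_water_kwh(n_occupants=2.5, efficiency_water=0.80)
```

With a 250 W/K envelope and the UK-average degree hours, the space-heating term lands in the mid five figures of kWh before any gains are credited, which is the kind of systematic overshoot the model is expected to correct.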
## Default U-value imputation: cascade

U-value lookups (Tables 6–10 walls, 16/17/18 roofs, 19+EN ISO 13370 floors, 20 upper floors, 24 windows, 26 doors, 21 thermal-bridging factor) are wrapped in helpers that cascade-default missing fields the same way RdSAP10 §6 does:

1. Use the cert value if known.
2. Fall back to the age-band-typical construction (e.g. cavity for ≥1930, solid brick for pre-1930).
3. Fall back to country-typical.
4. Final fallback: a mid-band default (1.5 W/m²K for walls).

`envelope_heat_loss_w_per_k` is therefore never null. The information about "this row had sparse fabric data" is already encoded in the correlated null pattern on the raw fabric features that survive into v16.

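A minimal sketch of the four-step cascade for walls follows. The 1930 cavity/solid-brick split and the 1.5 W/m²K final fallback come from this ADR; everything else is hypothetical. In particular the U-values and the country-typical map are illustrative placeholders, not RdSAP table entries, and `resolve_wall_u` is an invented name.

```python
from typing import Optional, Tuple

MID_BAND_WALL_U = 1.5  # final fallback from this ADR, W/m2K

# Illustrative placeholders only; NOT values from the RdSAP tables.
PLACEHOLDER_WALL_U = {"cavity": 1.6, "solid_brick": 2.1}
PLACEHOLDER_COUNTRY_TYPICAL = {"England": "cavity"}


def resolve_wall_u(
    cert_construction: Optional[str],
    build_year: Optional[int],
    country: Optional[str],
) -> Tuple[float, str]:
    """Return (u_value, source) so the result is never null."""
    construction, source = cert_construction, "cert"
    # Step 2: age-band-typical construction.
    if construction is None and build_year is not None:
        construction = "cavity" if build_year >= 1930 else "solid_brick"
        source = "age_band"
    # Step 3: country-typical construction.
    if construction is None and country in PLACEHOLDER_COUNTRY_TYPICAL:
        construction = PLACEHOLDER_COUNTRY_TYPICAL[country]
        source = "country"
    # Step 4: mid-band default when nothing resolves to a known construction.
    u_value = PLACEHOLDER_WALL_U.get(construction)
    if u_value is None:
        return MID_BAND_WALL_U, "mid_band_default"
    return u_value, source
```

Returning the source alongside the value is a design choice of this sketch only; the ADR relies on the null pattern of the raw fabric features to carry that information instead.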
## Extensions: sum-over-all, expose extension_1 only

`envelope_heat_loss_w_per_k` sums over the main dwelling and every extension (`extension_1`, `extension_2`, `extension_3+`) regardless of how many are present, using each part's own age band and construction. The 250k corpus has:

| Building parts | Share | Per-extension feature support |
|---|---|---|
| 1 (main only) | 63.0% | none |
| 2 (main + extension_1) | 25.3% | `extension_1_*` populated |
| 3+ | 11.7% | aggregate captures, no per-part visibility |

So `extension_1_*` (renamed from v15.x `secondary_dwelling_*`) fires on 37% of certs and is worth carrying as discrete features. `extension_2_*` would fire on only 11.7% and adds clutter; we drop it. Any heat-loss contribution from extension_2+ flows through the `envelope_heat_loss_w_per_k` aggregate.

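The sum-over-all-parts rule can be sketched as a plain area-weighted sum of U-values. The dataclasses and example numbers below are hypothetical, and thermal bridging is left out for brevity; the point is only that an `extension_2` with no discrete features of its own still contributes to the aggregate.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Element:
    area_m2: float
    u_value_w_per_m2k: float  # already resolved, e.g. by cascade defaulting


@dataclass
class BuildingPart:
    name: str
    elements: List[Element]


def envelope_heat_loss_w_per_k(parts: List[BuildingPart]) -> float:
    """Sum of U x A over every element of every building part."""
    return sum(
        element.area_m2 * element.u_value_w_per_m2k
        for part in parts
        for element in part.elements
    )


# Made-up dwelling with a main part and two extensions.
dwelling = [
    BuildingPart("main", [Element(80.0, 1.5), Element(40.0, 0.4)]),
    BuildingPart("extension_1", [Element(20.0, 0.5)]),
    BuildingPart("extension_2", [Element(10.0, 2.0)]),
]
total_w_per_k = envelope_heat_loss_w_per_k(dwelling)
```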
## v16.0.0: a MAJOR feature-schema bump

Per [ADR-0007](0007-kwh-as-ml-target.md) versioning policy: removing or renaming columns is MAJOR. Slice 16f renames every `secondary_dwelling_*` column to `extension_1_*`. The new physics features (`envelope_heat_loss_w_per_k`, `predicted_*`, `predicted_ecf`, `predicted_log10_ecf`, etc.) are MINOR additions on their own but ride with the rename in one cut. Result: v15.x → v16.0.0.

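The versioning rule above (removed or renamed columns are MAJOR, added columns are MINOR) can be expressed as a small check over column sets. This is a sketch under that reading of the policy; `classify_schema_bump` and the example column names are hypothetical.

```python
from typing import Iterable


def classify_schema_bump(old_columns: Iterable[str], new_columns: Iterable[str]) -> str:
    """Classify a feature-schema change by comparing column names."""
    old, new = set(old_columns), set(new_columns)
    if old - new:
        # A rename looks like one removal plus one addition, so it is MAJOR.
        return "MAJOR"
    if new - old:
        return "MINOR"
    return "NONE"


# Hypothetical column names for illustration.
v15 = ["total_floor_area", "secondary_dwelling_age_band"]
v16 = ["total_floor_area", "extension_1_age_band", "predicted_log10_ecf"]
bump = classify_schema_bump(v15, v16)
```

A check like this could run in CI against the previous parquet schema, so that a rename cannot ship under a MINOR tag by accident.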
### Cross-repo cutover

The scoring lambda's tag must match the transform version. The AutoGluon training repo references the v15.x parquet schema. v16.0.0 lands as a coordinated deploy:

1. Slice 16a–h ships in this repo; v16 parquet generated locally.
2. AutoGluon repo updates column references (`secondary_dwelling_*` → `extension_1_*`; consume new physics columns).
3. New model artifact tagged v16.0.0.
4. Scoring lambda deployed with v16.0.0 tag concurrent with the new artifact.
5. v15 lambda retired.

Until step 4, the live v15 lambda continues serving v15 features against the v15 model. There is no intermediate state where one component is v16 and another v15.

## Tail-error treatment: LightGBM objective switch, not sample weights

Slice 16g switches the `sap_score` and `peui_ucl` LightGBM objective from the default `regression` (MSE) to `mape`. The reasoning is that the v15.x training loop reports MAPE while optimising MSE, a known mismatch that under-weights tail rows (a 2-point error at SAP=20 contributes the same squared loss as the same error at SAP=80 but is 4× more visible in MAPE). The `mape` objective applies gradient ∝ 1/|y|, directly compensating.

Sample-weight schemes (band-bucket reweighting) are deferred. If slice 16h's per-decile residuals show the tails still problematic after the objective switch, weights layer in as 16i. The `co2_emissions` target retains the MSE default because some rows have ~zero CO2 (heavy PV); the `mape` objective destabilises near zero. Per-target objective is configured at training time, not baked into the transform.

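The MSE/MAPE mismatch and the per-target objective choice can be illustrated with the sketch below. The `mape` and `regression` objective names are LightGBM's; the mapping covers only the three targets this section discusses, and falling back to `regression` for any other target is an assumption of this sketch, as are the function names.

```python
# Per-target objective, chosen at training time (not part of the transform).
OBJECTIVE_BY_TARGET = {
    "sap_score": "mape",
    "peui_ucl": "mape",
    "co2_emissions": "regression",  # MSE default; some rows have ~zero CO2
}


def lightgbm_params(target: str) -> dict:
    """Training params for one target; unknown targets fall back to MSE (assumption)."""
    return {"objective": OBJECTIVE_BY_TARGET.get(target, "regression")}


def squared_error(y_true: float, y_pred: float) -> float:
    return (y_pred - y_true) ** 2


def abs_pct_error(y_true: float, y_pred: float) -> float:
    return abs(y_pred - y_true) / abs(y_true)


# The same 2-point miss at SAP=20 and SAP=80:
same_squared_loss = squared_error(20.0, 22.0) == squared_error(80.0, 82.0)
ape_ratio = abs_pct_error(20.0, 22.0) / abs_pct_error(80.0, 82.0)
```

The squared losses are identical while the percentage error at SAP=20 is four times larger, which is the tail under-weighting the objective switch is meant to address.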
## Consequences

- The EPC ML Transform owns more domain logic. It now contains the RdSAP10 U-value tables (Tables 6–10, 15–20, 24, 26), the SAP10.2 efficiency lookup (Table 4a), and the Table 32 fuel-price map. These are versioned with the transform; an upstream SAP/RdSAP revision is a transform bump.
- The training repo (this repo) and the AutoGluon repo are tightly coupled at parquet column names. Renames are MAJOR bumps with the cutover discipline above. Adding columns is MINOR.
- `predicted_log10_ecf` is approximately monotonic with `sap_score` by construction. Downstream consumers should not treat it as an independent signal.
- The physics features are deterministic given cert fields. If two rows have identical fabric+heating+geometry, their `envelope_heat_loss_w_per_k`, `predicted_total_fuel_cost_gbp`, and `predicted_log10_ecf` are identical. The model's residual must therefore explain SAP differences arising from non-deterministic cert calculator nuance (assessor variability, rounding, solar/utilization factors we did not port).
- A SAP10.3 release would invalidate the SAP10.2 fuel prices, efficiencies, and rating-formula constants used here. Treat such a release as a transform MAJOR bump with new lookup tables, not a hot-fix.

## What this ADR does not change

- The set of ML targets remains the six from ADR-0007: `sap_score`, `co2_emissions`, `peui_raw`, `peui_ucl`, `space_heating_kwh`, `hot_water_kwh`. The new features ride alongside the existing v15 features; nothing in the target set moves.
- `energy_rating_current` and any SAP-band-derived field remain excluded from features per ADR-0007.
- The `EpcEnergyDerivationService` runtime path is unaffected. Bills and fuel splits remain deterministic from kWh × current Fuel Rates.
- The 250k 2025+2026 SAP10 RdSAP corpus continues to be the training set; v16 is a column-schema change, not a data-source change.