scaffolding for ml pipeline

Khalim Conn-Kowlessar 2026-05-16 14:15:56 +00:00
parent dfe9e3ddbe
commit 611ff24eb6
14 changed files with 295 additions and 10 deletions

1
.idea/.name generated Normal file

@@ -0,0 +1 @@
AGENTS.md

14
.idea/webResources.xml generated Normal file

@@ -0,0 +1,14 @@
<?xml version="1.0" encoding="UTF-8"?>
<project version="4">
<component name="WebResourcesPaths">
<contentEntries>
<entry url="file://$PROJECT_DIR$">
<entryData>
<resourceRoots>
<path value="file://$PROJECT_DIR$" />
</resourceRoots>
</entryData>
</entry>
</contentEntries>
</component>
</project>


@@ -82,11 +82,11 @@ The EpcPropertyData scored by the modelling pipeline for a single Property, deri
_Avoid_: modelling EPC, working EPC, resolved EPC, derived EPC
**Rebaselining**:
Re-predicting a Property's SAP, carbon emissions, and heat demand via ML so the modelling pipeline scores it against the current SAP10 methodology. Triggered when either (a) the Effective EPC was lodged under a pre-SAP10 schema (`sap_version < 10.0`), so the recorded scores reflect a superseded methodology, or (b) Site Notes / Landlord Overrides changed the physical state of the Property (walls / heating / windows / etc.) so the lodged scores no longer reflect what's installed. Both triggers may fire together. Produces Effective Performance; Lodged Performance is preserved unchanged. Does not include kWh — that is always derived deterministically by EPC Energy Derivation.
Re-predicting a Property's SAP score, CO2 emissions, Primary Energy Intensity, space heating kWh, and hot water kWh via ML so the modelling pipeline scores it against the current SAP10 methodology. Triggered when either (a) the Effective EPC was lodged under a pre-SAP10 schema (`sap_version < 10.0`), so the recorded scores reflect a superseded methodology, or (b) Site Notes / Landlord Overrides changed the physical state of the Property (walls / heating / windows / etc.) so the lodged scores no longer reflect what's installed. Both triggers may fire together. Produces Effective Performance; Lodged Performance is preserved unchanged. Space heating and hot water kWh are included as ML targets per ADR-0007; see [[epc-ml-transform]].
_Avoid_: re-scoring, re-prediction, performance recomputation, refresh (for cache-freshness)
**Baseline Performance**:
A Property's current performance aggregate, holding both Lodged Performance and Effective Performance plus annual kWh / fuel split / bills derived from the Effective EPC. Persisted as one row; surfaced as one block in the UI.
A Property's current performance aggregate, holding both Lodged Performance and Effective Performance plus annual space heating kWh, hot water kWh, fuel split, and bills derived from the Effective EPC — kWh values come from the EPC's recorded fields for SAP10 baselines or from ML when Rebaselining fires; bills are derived deterministically from kWh × current Fuel Rates. Persisted as one row; surfaced as one block in the UI.
_Avoid_: baseline predictions, predicted baseline, rebaselined values
**Lodged Performance**:
@@ -98,17 +98,39 @@ The SAP / EPC Band / carbon emissions / heat demand the modelling pipeline actua
_Avoid_: modelled performance, rebaselined performance (only correct when rebaselining ran), scored values
**EPC Energy Derivation**:
The deterministic process that derives a Property's annual kWh, fuel split across heating, hot water, lighting, appliances and cooking, and bills from the Effective EPC — applying a UCL Correction for known EPC over/under-prediction and deducing fuel type from the SAP heating fields. No ML.
_Avoid_: kWh prediction, baseline kWh, energy estimation
The process that derives a Property's fuel split and annual bills from its space heating kWh and hot water kWh values plus the heating fuel deduced from SAP fields. kWh values themselves come from the EPC's recorded fields (`renewable_heat_incentive.space_heating_existing_dwelling` and `.water_heating`) for SAP10 baselines, or from ML prediction when Rebaselining fires or when scoring a post-measure state. Bills are computed deterministically from delivered kWh × current Fuel Rates + standing charges + SEG credits. The UCL Correction is no longer applied at runtime — it is folded into ML training labels (see [[epc-ml-transform]] and ADR-0007).
_Avoid_: kWh prediction (kWh is now an ML target — see Rebaselining), baseline kWh, energy estimation
**UCL Correction**:
The per-band linear correction (Few et al. 2023, _Energy & Buildings_ 288 113024) applied to EPC-modelled total primary energy use intensity to align it with metered consumption. Calibrated against gas-heated, non-PV homes in England and Wales rated under SAP 2012; the current implementation extrapolates it to all properties (open question §15.14).
The per-band linear correction (Few et al. 2023, _Energy & Buildings_ 288 113024) that aligns EPC-modelled Primary Energy Intensity with metered consumption. Folded into ML training labels at fit time (per ADR-0007) rather than applied at runtime — the trained model emits metered-equivalent PEUI directly, avoiding the discontinuities at EPC band boundaries that arose when the per-band linear correction was applied post-prediction. Calibrated against gas-heated, non-PV homes in England and Wales rated under SAP 2012; the current implementation extrapolates it to all properties (open question §15.14).
_Avoid_: UCL adjustment, energy correction, metered correction
**EPC Anomaly Flag**:
A per-field indicator that a Property's value for an EPC field differs significantly from Comparable Properties; advisory only — surfaces in the UI to prompt user review, does not block modelling.
_Avoid_: outlier, mismatch, divergence flag
### ML training
**EPC ML Transform**:
The versioned class at `packages/domain/src/domain/ml/transform.py` that maps an EpcPropertyData to a fixed-width row of features + targets. The single ML-data contract between this repo and the AutoGluon training repo. Owns the windows compression, building-parts compression, Top-N Code Taxonomy, and UCL folding decisions. Each version is tagged on the deployed scoring lambda; a mismatch is a deploy-time fail.
_Avoid_: feature builder, ML mapper, EPC vectoriser
**Feature Schema Version**:
The semver version of the EPC ML Transform (e.g. `0.1.0`), included in the parquet output path and the deployed scoring lambda's tag. MAJOR bump when columns are removed or renamed; MINOR when optional columns are added; PATCH for non-behavioural fixes.
_Avoid_: transform version, schema version (overloaded with the SAP RdSAP schema version on EPCs), model version
**Primary Energy Intensity** (**PEUI**):
A Property's total annual primary energy use per square metre of floor area (kWh/m²/yr), the SAP10 quantity recorded as `energy_consumption_current` on the EPC. Covers all end uses (heating, hot water, lighting, appliances, cooking) weighted by SAP primary energy factors per fuel. The quantity the UCL Correction aligns to metered consumption.
_Avoid_: heat demand (which colloquially means the building's space heating thermal requirement — a distinct concept), energy demand, total energy use, kWh per square metre
**PV Capacity Source**:
A flag on the EPC ML Transform feature set indicating whether a Property's PV capacity is `measured` (from `sap_energy_source.photovoltaic_supply[].peak_power`), `estimated_from_roof_area` (the `percent_roof_area` fallback used when the surveyor could not confirm array configuration), or `none` (no PV present). Lets the model weight the correct capacity signal per property.
_Avoid_: PV source, PV configuration type, solar source
**Top-N Code Taxonomy**:
The empirical top-N SAP code list (covering ~95% of mass on the training sample) committed by the EPC ML Transform for each list-aggregated categorical field (`wall_construction`, `glazing_type`, `frame_material`, etc.). Rare codes go into a per-field `_other` bucket. The taxonomy is locked at each Feature Schema Version; changes warrant a MINOR bump (adding) or MAJOR bump (removing codes).
_Avoid_: code list, code dictionary, vocab
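A minimal sketch of how a Top-N Code Taxonomy could be derived and applied, assuming codes arrive as plain strings. The function names, the 0.95 default, and the sample codes are illustrative assumptions, not the EPC ML Transform's actual implementation.

```python
from collections import Counter


def build_top_n_taxonomy(codes: list[str], coverage: float = 0.95) -> list[str]:
    """Return the most frequent codes that together reach `coverage` of the sample."""
    counts = Counter(codes)
    total = sum(counts.values())
    kept: list[str] = []
    covered = 0
    # Walk codes from most to least frequent until the coverage target is met.
    for code, count in counts.most_common():
        if total == 0 or covered / total >= coverage:
            break
        kept.append(code)
        covered += count
    return kept


def bucket_code(code: str, taxonomy: list[str], field: str) -> str:
    """Keep a code that is in the taxonomy; send anything else to `<field>_other`."""
    return code if code in taxonomy else f"{field}_other"


# Hypothetical sample for one field: three common codes and two rare ones.
sample = ["cavity"] * 60 + ["solid_brick"] * 30 + ["timber"] * 6 + ["cob"] * 2 + ["park_home"] * 2
taxonomy = build_top_n_taxonomy(sample)
```

Because the kept list is committed per Feature Schema Version, a later sample that would add or drop a code changes the taxonomy only through a version bump.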
### Reference data
**Fuel Rates**:
@@ -214,8 +236,8 @@ _Avoid_: API key, auth token, secret
- A **UPRN** identifies a physical dwelling permanently; it does not change when the property changes owner — but each portfolio gets its own **Property** keyed against it.
- When a **Property** has both **Site Notes** and a public **EPC**, the newer of the two derives the **Effective EPC**. **Landlord Overrides** apply only when the **EPC** is the source — never when **Site Notes** are.
- A Property's **Baseline Performance** holds two halves: **Lodged Performance** (the gov register's SAP / band / carbon / heat) and **Effective Performance** (what the modelling pipeline scored against). The two are equal unless **Rebaselining** fires.
- **Rebaselining** produces **Effective Performance** by ML re-prediction when either (a) the Effective EPC was lodged under a pre-SAP10 schema, or (b) the Effective EPC's physical state diverges from the lodged EPC. **Lodged Performance** is never overwritten.
- **EPC Energy Derivation** contributes the annual kWh, fuel split, and bills on every Property unconditionally, reading current **Fuel Rates** and **Carbon Factors** from their respective repos.
- **Rebaselining** produces **Effective Performance** by ML re-prediction across SAP score, CO2 emissions, Primary Energy Intensity, space heating kWh, and hot water kWh, when either (a) the Effective EPC was lodged under a pre-SAP10 schema, or (b) the Effective EPC's physical state diverges from the lodged EPC. **Lodged Performance** is never overwritten.
- **EPC Energy Derivation** derives **fuel split** and **bills** from kWh values (sourced from the EPC's `renewable_heat_incentive` fields for baseline SAP10 properties, or from ML when Rebaselining fires), reading current **Fuel Rates** and **Carbon Factors** from their respective repos.
- The **EPC Prediction Service** uses **Comparable Properties** for both gap-filling and producing **EPC Anomaly Flags**.
- A **Scenario** carries one or more ordered **Scenario Phases**. Triggering the model against N Scenarios produces N **Plans** per Property; each Plan carries an ordered list of **Plan Phases** matching the Scenario's shape.
- Each **Plan Phase** holds its **Optimised Package**, the ending state snapshot, and any **Rolled-over Options** that flow as candidates into the next Plan Phase. A single-phase Scenario is one Scenario Phase with all measure types allowed; the same machinery handles it.
@@ -227,7 +249,7 @@ _Avoid_: API key, auth token, secret
> **Dev:** "A landlord uploads a corrected boiler for one of their properties. What happens?"
>
> **Domain expert:** "That's a **Landlord Override** on the heating fields. Save it against the **Property**. The **Effective EPC** has changed, so **Rebaselining** runs to re-predict SAP / carbon / heat, and **EPC Energy Derivation** re-runs to update kWh / bills based on the new fuel deduction. With fresh **Baseline Performance** we regenerate **Recommendations**."
> **Domain expert:** "That's a **Landlord Override** on the heating fields. Save it against the **Property**. The **Effective EPC** has changed, so **Rebaselining** runs to re-predict SAP / carbon / PEUI / space heating kWh / hot water kWh, and **EPC Energy Derivation** re-runs to update the fuel split and bills based on the new kWh values and fuel deduction. With fresh **Baseline Performance** we regenerate **Recommendations**."
> **Dev:** "What if the same Property also has Site Notes?"
>


@@ -0,0 +1,13 @@
# `BaselinePerformance` stores both lodged and effective values
A Property's current performance has two states we care about: the rating that was lodged on the government register (the "lodged" SAP / band / carbon / heat) and the rating produced by the modelling pipeline against the current Effective EPC (the "effective" values, which may have been rebaselined by ML when the EPC was pre-SAP10 or when Landlord Overrides / Site Notes changed physical state). We considered storing a single set of values — the rebaselined-if-needed-otherwise-lodged figures — and rejected that. Both are stored as a pair on every `BaselinePerformance`, equal when no rebaselining trigger fires.
The pair lets the FE show "this is what the gov register says vs this is the SAP10-equivalent we modelled against" side by side without a second query, and keeps the audit trail clean: a user looking at a property's plan can see exactly which figure drove the recommendation pipeline. Storing only one set forces a downstream consumer to recompute the missing one from raw EPC fields when it needs both, which is the kind of derivation creep we want to keep out of the FE.
The cost is a wider row + the discipline that **every** `BaselinePerformance` populates both halves, even when they're equal. Annual kWh, fuel split and bills are not paired — they are always derived deterministically by `EpcEnergyDerivationService` against the Effective state, because the EPC's recorded cost fields use fuel rates pinned to the inspection date and the UCL correction depends on the modelled band.
## Consequences
- Schema migration: `property_details_epc` (or its successor) carries 8 fields instead of 4 for the SAP-equivalent block.
- Reversing this means rewriting every consumer that has learned to read both values. Hard to roll back once the FE depends on the pair.
- The rebaseline trigger has two reasons (`pre_sap10` and `physical_state_changed`) that can fire together (`both`); store the reason alongside so we know *why* a property was rebaselined when debugging.
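The trigger and its stored reason can be sketched as below. The three reason values and the `sap_version < 10.0` test come from the text; the enum, the function name, and the `physical_state_changed` flag as a ready-made boolean are assumptions for illustration.

```python
from enum import Enum
from typing import Optional


class RebaselineReason(Enum):
    PRE_SAP10 = "pre_sap10"
    PHYSICAL_STATE_CHANGED = "physical_state_changed"
    BOTH = "both"


def rebaseline_reason(
    sap_version: float, physical_state_changed: bool
) -> Optional[RebaselineReason]:
    """Return why Rebaselining should run, or None when no trigger fires."""
    pre_sap10 = sap_version < 10.0
    if pre_sap10 and physical_state_changed:
        return RebaselineReason.BOTH
    if pre_sap10:
        return RebaselineReason.PRE_SAP10
    if physical_state_changed:
        return RebaselineReason.PHYSICAL_STATE_CHANGED
    # No trigger: Effective Performance equals Lodged Performance.
    return None
```

Returning `None` rather than a fourth enum member keeps the "no rebaselining" case out of the stored reason column.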


@@ -0,0 +1,14 @@
# Multi-phase scenarios with per-phase recompute against rolling state
The Scenario aggregate becomes ordered phases: each phase has a measure-type allowlist, an optional budget, and an optional goal. The `ModellingPipeline` walks the phases in order; for each phase it (1) generates candidate recommendations restricted to the phase's measure types, (2) re-runs `ImpactPredictionService` against the **rolling** Effective EPC state (baseline for phase 1; post-phase-1 for phase 2; etc.), (3) optimises within the phase's budget/goal, (4) applies the selected package and rolls the state forward. We considered scoring all measures once against the baseline and slicing the scored list by phase, and rejected that.
Per-phase recompute makes phase ordering load-bearing in the optimisation, not decorative. Installing fabric measures before a heat pump materially changes the heat pump's SAP impact; a single-pass-against-baseline pipeline forces that fact into the optimiser as a hard rule rather than a derived effect, and any cross-measure interaction we don't know to encode becomes silent error. The cost is ML calls scaling with `N_phases × N_scenarios × N_candidate_measures` per property — multi-phase scenarios pay their own ML bill, single-phase scenarios cost the same as today (the loop body runs once).
A single-phase Scenario is `phases: [<one ScenarioPhase>]` with all measure types allowed and the full budget on it. There is no special-case path for single-phase — the pipeline always loops. This avoids two code paths and lets the FE evolve from single-phase to multi-phase without rewiring the backend.
## Consequences
- `Plan` carries `phases: list[PlanPhase]` rather than a flat `OptimisedPackage`. Every consumer of plan output (FE, exports, downstream reports) reads phases.
- The optimiser must accept rolling-state input rather than only baseline state — a generalisation of today's single-shot pass.
- ML cost can be controlled at the scenario layer: keeping a scenario single-phase is the lever for "score once, optimise once" if cost becomes a problem.
- Open future change: SAP impact of a measure is not strictly additive even within a phase. The current per-measure scoring + linear optimisation approximates this. A future iteration may pre-define candidate packages and ML-score whole packages, accepting combinatorial cost for accuracy. Track in PRD §15.
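The four-step loop described in this ADR can be sketched as follows. This is a toy under stated assumptions: the Effective EPC state is reduced to a set of installed measure names, the optimiser is replaced by a greedy pick, and `Measure`, `run_phases`, `toy_score`, and every number are hypothetical rather than the real `ModellingPipeline` or `ImpactPredictionService` interfaces.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Measure:
    name: str
    measure_type: str
    cost: float


@dataclass(frozen=True)
class ScenarioPhase:
    allowed_types: frozenset[str]
    budget: Optional[float] = None


@dataclass(frozen=True)
class PlanPhase:
    selected: tuple[str, ...]
    sap_gain: float
    ending_state: frozenset[str]


def run_phases(
    baseline_state: frozenset[str],
    phases: list[ScenarioPhase],
    candidates: list[Measure],
    score: Callable[[frozenset[str], Measure], float],
) -> list[PlanPhase]:
    state = baseline_state
    plan: list[PlanPhase] = []
    for phase in phases:
        # (1) Candidates restricted to the phase's measure types.
        eligible = [
            m for m in candidates
            if m.measure_type in phase.allowed_types and m.name not in state
        ]
        # (2) Score each candidate against the rolling state, not the baseline.
        scored = sorted(
            ((score(state, m), m) for m in eligible),
            key=lambda pair: pair[0],
            reverse=True,
        )
        # (3) Greedy stand-in for the optimiser: best gain first, within budget.
        remaining = phase.budget
        chosen: list[Measure] = []
        gain_total = 0.0
        for gain, m in scored:
            if gain <= 0 or (remaining is not None and m.cost > remaining):
                continue
            chosen.append(m)
            gain_total += gain
            if remaining is not None:
                remaining -= m.cost
        # (4) Apply the selected package and roll the state forward.
        state = state | frozenset(m.name for m in chosen)
        plan.append(PlanPhase(tuple(m.name for m in chosen), gain_total, state))
    return plan


def toy_score(state: frozenset[str], measure: Measure) -> float:
    """Made-up SAP gains: the heat pump scores higher once walls are insulated."""
    if measure.name == "heat_pump":
        return 12.0 if "wall_insulation" in state else 5.0
    return 8.0


candidates = [
    Measure("wall_insulation", "fabric", 9000.0),
    Measure("heat_pump", "heating", 12000.0),
]
plan = run_phases(
    frozenset(),
    [ScenarioPhase(frozenset({"fabric"})), ScenarioPhase(frozenset({"heating"}))],
    candidates,
    toy_score,
)
```

With fabric first, the heat pump is scored against the post-insulation state; passing a single phase that allows both types runs the same loop once and scores both measures against the baseline.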


@@ -0,0 +1,23 @@
# Baseline kWh and bills are deterministic — no ML on the kWh side
**Status: Superseded by [ADR-0007](0007-kwh-as-ml-target.md).** The premise here — that baseline kWh can be derived from SAP physics alone — held when the gov EPC API did not expose per-end-use kWh. The New EPC API exposes `renewable_heat_incentive.space_heating_existing_dwelling` and `.water_heating` directly, removing the need for ML on the *baseline* side; meanwhile *post-measure* kWh prediction is reintroduced as an ML target to avoid per-band UCL discontinuities at measure-application time. See ADR-0007 for the replacement design.
---
Annual kWh, fuel split, and bills are produced by `EpcEnergyDerivationService` via SAP physics + UCL per-band correction (Few et al. 2023) + per-fuel rates from `FuelRatesRepo`. There is no ML lambda on the kWh path — neither for baseline derivation nor for per-recommendation kWh impact. We considered keeping a kWh ML lambda (the current `model_engine` has two — one pre-recommendation, one post-optimisation) and rejected both.
The forcing facts:
1. The new gov EPC API exposes `energy_consumption_current` (kWh/m², primary) and per-end-use cost fields for the regulated portion of energy use. The decomposition into heating / hot water / lighting that the gov website displays is computed downstream from SAP — SAP itself defines the proportional split deterministically given heating + hot water fuel codes and floor area.
2. The EPC's recorded cost fields use fuel rates pinned to the inspection date, so we discard them and recompute bills from delivered kWh × current `FuelRatesRepo` rate + standing charges + SEG credits.
3. The UCL correction (Few et al.) is an empirical correction on **total annual PEUI**, not on heating-vs-hot-water split — but applied per-band, post-decomposition. The existing `AnnualBillSavings.adjust_energy_to_metered` already ports the per-band gradients/intercepts from Table 3 of the paper.
4. Per-recommendation kWh delta is derivable from the SAP delta predicted by `ImpactPredictionService` + heating-system fuel + COP — no separate ML call needed.
ML is reserved for SAP / carbon / heat demand — the quantities where the physical model is partial and the ML lambda earns its keep. The kWh pipeline is fully deterministic and reproducible, which makes it unit-testable against fakes without an ML lambda, and lets us refresh bills without re-running ML (a fuel-rate update or a new Defra carbon factor publishes new bill figures without touching the modelling lambdas).
## Consequences
- The pre-recommendation kWh ML lambda (`KWH_MODEL_PREFIXES` in [model_api.py](../../backend/ml_models/api.py)) is retired — no consumer in the new pipeline.
- `EpcEnergyDerivationService` becomes a fat deterministic service: SAP physics + UCL + FuelRates lookup + primary-to-delivered conversion. Long but readable.
- Site Notes have no `energy_consumption_current` field (PasHub does not produce one). The deterministic SAP-physics path handles this case naturally — same code, different source of regulated PEUI.
- UCL paper scope (gas-heated, no PV, England + Wales, SAP 2012+) is silently extrapolated to all properties by the current code. Whether to keep silent extrapolation or stratify (no correction for non-gas / PV) is flagged for the per-service grill.
- Adding back a kWh ML lambda later is a real change, not a config tweak — flag it as an ADR if proposed.


@@ -0,0 +1,57 @@
# Space heating and hot water kWh are ML targets; UCL is folded into training labels
**Status: Accepted.** Supersedes [ADR-0006](0006-deterministic-kwh-no-baseline-ml.md).
The EPC ML Transform predicts **six targets**: `sap_score`, `co2_emissions`, `peui_raw`, `peui_ucl`, `space_heating_kwh`, `hot_water_kwh`. Two of these (`space_heating_kwh`, `hot_water_kwh`) were explicitly excluded from ML by ADR-0006. We reverse that decision for two independent reasons, the second of which was the deciding factor.
## Why baseline kWh becomes an ML target
The premise of ADR-0006 was that baseline kWh has no clean source in the gov data and must be derived deterministically from SAP physics + UCL correction. That premise no longer holds:
1. The New EPC API exposes `renewable_heat_incentive.space_heating_existing_dwelling` and `renewable_heat_incentive.water_heating` directly as integers (kWh/yr delivered) on every SAP10 certificate. For a SAP10-baseline property, baseline kWh is a lookup, not a derivation — no SAP-physics port required.
2. **But** for the *Rebaselining* path (pre-SAP10 EPCs being scored against SAP10 methodology) and for *post-measure* impact prediction (the state after a measure is installed), no recorded kWh exists. The choice there is: derive deterministically (the ADR-0006 stance), or predict via ML alongside SAP / carbon / heat. The second reason, set out in the next section, resolves this in favour of ML.
## Why UCL is folded into training labels rather than applied at runtime
The UCL per-band correction (Few et al. 2023) is a piecewise-linear function of PEUI keyed on EPC band. Applied at runtime, post-prediction, it produces a **discontinuity at band boundaries**: when a simulated package of measures pushes a property from band D into band C, the per-band slope/intercept switches discontinuously, and the UCL-adjusted kWh can move in the opposite direction to the underlying PEUI prediction. This was observed in practice on the legacy `model_engine`.
Folding UCL into the training labels — i.e. computing UCL-corrected PEUI per training row using the row's recorded band, then fitting the model on the corrected target — means the trained model emits metered-equivalent PEUI directly. There is no per-band switching at inference. The discontinuity disappears. The model learns a smooth function over the feature space.
The same logic motivates ML prediction of space heating and hot water kWh post-measure: deterministic derivation from a SAP-delta would reintroduce a similar band-boundary artefact at every step where heating efficiency or fuel changes. A single ML model emitting kWh directly is smooth across measure transitions.
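The band-boundary effect can be illustrated with a short sketch. The slope and intercept values below are invented so the arithmetic is easy to follow; they are not the Table 3 coefficients from Few et al. 2023, and the function and variable names are hypothetical.

```python
# Hypothetical (slope, intercept) per EPC band. These are NOT the published
# coefficients; they are chosen only to make the discontinuity visible.
HYPOTHETICAL_UCL_COEFFS: dict[str, tuple[float, float]] = {
    "C": (0.75, 50.0),
    "D": (0.5, 80.0),
}


def ucl_corrected_peui(peui: float, band: str) -> float:
    """Apply the per-band linear correction to a PEUI value (kWh/m2/yr)."""
    slope, intercept = HYPOTHETICAL_UCL_COEFFS[band]
    return slope * peui + intercept


# Runtime application: a package lowers raw PEUI and moves the band from D to C.
before = ucl_corrected_peui(210.0, "D")  # 0.5 * 210 + 80 = 185.0
after = ucl_corrected_peui(200.0, "C")   # 0.75 * 200 + 50 = 200.0
raw_fell_but_corrected_rose = 200.0 < 210.0 and after > before

# Training-time folding: each row's label is corrected once, using the band
# recorded on that row, and the model is then fitted on the corrected label.
training_rows = [
    {"peui_raw": 210.0, "band": "D"},
    {"peui_raw": 200.0, "band": "C"},
]
peui_ucl_labels = [ucl_corrected_peui(r["peui_raw"], r["band"]) for r in training_rows]
```

In the runtime case the switch of coefficients happens between two predictions for the same property, which is where the reversal shows up; in the folded case the band is only ever read from recorded training rows.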
## Scope of the reversal
| Quantity | ADR-0006 stance | ADR-0007 stance |
|---|---|---|
| Baseline SAP / carbon / heat demand | ML (unchanged) | ML (unchanged) |
| Baseline PEUI (`peui_raw`) | Read from EPC; UCL-corrected at runtime | Read from EPC at baseline; ML target with UCL-corrected variant (`peui_ucl`) at training time |
| Baseline space heating kWh | Deterministic from SAP physics + UCL | Read from EPC for SAP10 baselines; ML for Rebaselining + post-measure |
| Baseline hot water kWh | Deterministic from SAP physics + UCL | Read from EPC for SAP10 baselines; ML for Rebaselining + post-measure |
| Post-measure space heating kWh delta | Derived from SAP delta + heating fuel/COP | ML target (predicted directly post-measure) |
| Post-measure hot water kWh delta | Derived from SAP delta | ML target (predicted directly post-measure) |
| Fuel split, bills | Deterministic from kWh × Fuel Rates (unchanged) | Deterministic from kWh × Fuel Rates (unchanged) |
| Carbon factors → CO2 emissions | Deterministic from kWh × Carbon Factors (unchanged at runtime) | Deterministic from kWh × Carbon Factors (unchanged at runtime); ML target also separately for Rebaselining |
| UCL correction application point | Runtime, post-prediction, per band | Training time, folded into PEUI labels per row's recorded band |
## Dual PEUI training targets
We train two PEUI variants — `peui_raw` (the EPC's `energy_consumption_current` directly) and `peui_ucl` (the same value with the row's recorded-band UCL correction pre-applied). At v0.1.0 we compare both empirically. The variant with better held-out MAPE wins; the loser is dropped at v0.2.0.
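A minimal sketch of the held-out comparison, assuming each variant is evaluated against its own label and that no label is zero. The source does not specify the evaluation protocol beyond "held-out MAPE"; the numbers are invented and the outcome shown carries no meaning.

```python
def mape(actual: list[float], predicted: list[float]) -> float:
    """Mean absolute percentage error as a fraction (0.1 means 10%)."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("actual and predicted must be non-empty and of equal length")
    return sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)


# Invented held-out (labels, predictions) per target variant.
held_out = {
    "peui_raw": ([200.0, 250.0], [220.0, 225.0]),
    "peui_ucl": ([160.0, 200.0], [168.0, 190.0]),
}
scores = {name: mape(actual, predicted) for name, (actual, predicted) in held_out.items()}
# The variant with the lower held-out MAPE is the one that would be kept.
winner = min(scores, key=scores.get)
```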
## Label coupling, not classical leakage
The UCL transform uses the row's recorded SAP-derived band to compute the PEUI label, and SAP score is itself an ML target. This couples the two targets at the label level. It is **not** classical leakage (the band is not in the feature set; the model never reads it as input). The PEUI prediction is independent of the SAP prediction at inference. We accept the coupling as the price of avoiding the band-boundary discontinuity, consistent with our explicit "park target-independence" decision — the six targets are predicted independently and small cross-target inconsistencies are tolerated for v1.
Practical safeguard: `energy_rating_current` and any other SAP-score-derived field (e.g. `current_energy_efficiency_band`) are **excluded from the feature set** in the EPC ML Transform, to avoid an entirely separate target-leakage path on the SAP prediction.
## Consequences
- `EpcEnergyDerivationService` is no longer the source of baseline kWh. Its remaining job is the deterministic step from kWh + Fuel Rates → fuel split + bills, and kWh + Carbon Factors → CO2 emissions. UCL is removed from its runtime path; the `AnnualBillSavings.adjust_energy_to_metered` port that ADR-0006 anticipated does not happen — UCL moves into the training-side EPC ML Transform.
- The EPC ML Transform owns both feature definitions *and* the per-row UCL label transformation. It is the single artefact tying SAP-band semantics into the training data; cross-repo consumers (AutoGluon) see only post-transform parquet.
- `FuelRatesRepo`, `CarbonFactorsRepo`, and `HeatingSystemAssumptionsRepo` all survive, but `HeatingSystemAssumptionsRepo` loses consumers: the SAP-physics-decomposition path that ADR-0006 envisaged is unused.
- Adding more ML targets later (lighting kWh, appliance kWh, cooking kWh) becomes a feature-additive change rather than an architectural one — the precedent of "kWh as ML target" is now established.
## What this ADR does not change
- Per-recommendation **cost** delta is still deterministic, from kWh delta × current Fuel Rates.
- Bills surfaced to the UI are always current-rate, never pinned to EPC inspection-date rates.
- `EpcEnergyDerivationService` is preserved as the bills/fuel-split service; only its responsibility shrinks.
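The deterministic bills step that remains can be sketched as delivered kWh × current rate + standing charges, less SEG credits. The `FuelRate` type, the function names, and all rates below are hypothetical placeholders, not the `FuelRatesRepo` interface or real tariffs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FuelRate:
    unit_rate_gbp_per_kwh: float
    standing_charge_gbp_per_year: float


def annual_bill(
    kwh_by_fuel: dict[str, float],
    rates: dict[str, FuelRate],
    seg_credit_gbp: float = 0.0,
) -> float:
    """Bill in GBP/yr from delivered kWh per fuel at current rates."""
    total = 0.0
    for fuel, kwh in kwh_by_fuel.items():
        rate = rates[fuel]
        total += kwh * rate.unit_rate_gbp_per_kwh + rate.standing_charge_gbp_per_year
    return total - seg_credit_gbp


def cost_delta(kwh_delta: float, rate: FuelRate) -> float:
    """Per-recommendation cost change: kWh delta times the current unit rate."""
    return kwh_delta * rate.unit_rate_gbp_per_kwh


# Made-up rates: 0.0625 GBP/kWh for gas with a 100 GBP/yr standing charge.
rates = {"gas": FuelRate(0.0625, 100.0)}
# 8000 kWh space heating plus 2000 kWh hot water, both on gas.
bill = annual_bill({"gas": 8000.0 + 2000.0}, rates, seg_credit_gbp=25.0)
saving = cost_delta(-1600.0, rates["gas"])
```

Because only the rates change when tariffs are updated, the bill can be refreshed without re-running any ML prediction.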

0
etl/__init__.py Normal file

@@ -0,0 +1,5 @@
"""ML training-data transform — maps EpcPropertyData to ML features + targets.
The single ML-data contract between this repo and the AutoGluon training repo.
See [[epc-ml-transform]] in CONTEXT.md and docs/adr/0007-kwh-as-ml-target.md.
"""


@@ -0,0 +1,25 @@
"""Schema dataclasses for EpcMlTransform — the cross-repo ML data contract.
Consumed by the AutoGluon training repo (and by anything else that reads the
transform's parquet output) to know each column's dtype, nullability, and meaning.
"""
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnSpec:
    """Specification of a single column in the EPC ML training dataset."""

    dtype: type
    nullable: bool = True
    description: str = ""


@dataclass(frozen=True)
class TransformSchema:
    """The cross-repo ML data contract emitted by EpcMlTransform.schema()."""

    transform_version: str
    feature_columns: dict[str, ColumnSpec]
    target_columns: dict[str, ColumnSpec]


@@ -0,0 +1,33 @@
"""Tests for EpcMlTransform v0.1.0 — schema-contract surface."""
from domain.ml.schema import ColumnSpec, TransformSchema
from domain.ml.transform import EpcMlTransform

_EXPECTED_TARGET_DTYPES: dict[str, type] = {
    "sap_score": int,
    "co2_emissions": float,
    "peui_raw": int,
    "peui_ucl": float,
    "space_heating_kwh": float,
    "hot_water_kwh": float,
}


def test_transform_advertises_version_and_target_columns() -> None:
    # Arrange
    transform = EpcMlTransform()

    # Act
    schema = transform.schema()

    # Assert
    assert isinstance(schema, TransformSchema)
    assert schema.transform_version == "0.1.0"
    assert schema.transform_version == EpcMlTransform.VERSION
    assert set(schema.target_columns.keys()) == set(_EXPECTED_TARGET_DTYPES.keys())
    for target_name, expected_dtype in _EXPECTED_TARGET_DTYPES.items():
        column = schema.target_columns[target_name]
        assert isinstance(column, ColumnSpec)
        assert column.dtype is expected_dtype
    assert schema.feature_columns == {}


@@ -0,0 +1,78 @@
"""EpcMlTransform — maps EpcPropertyData to ML-ready feature/target columns.
The single ML-data contract between this repo and the AutoGluon training repo.
Versioned semver-style: MAJOR on removing/renaming columns, MINOR on adding.
At v0.1.0 only the schema contract is implemented; there are no feature columns yet.
Features are added incrementally per subsequent vertical slices.
See docs/adr/0007-kwh-as-ml-target.md for the target set and rationale.
"""
from domain.ml.schema import ColumnSpec, TransformSchema

_TARGET_COLUMNS: dict[str, ColumnSpec] = {
    "sap_score": ColumnSpec(
        dtype=int,
        nullable=False,
        description="SAP10 energy rating, from `energy_rating_current` on the EPC.",
    ),
    "co2_emissions": ColumnSpec(
        dtype=float,
        nullable=False,
        description="Annual CO2 emissions in tonnes/yr, from `co2_emissions_current`.",
    ),
    "peui_raw": ColumnSpec(
        dtype=int,
        nullable=False,
        description=(
            "Primary energy intensity (kWh/m2/yr), from `energy_consumption_current`, "
            "untransformed."
        ),
    ),
    "peui_ucl": ColumnSpec(
        dtype=float,
        nullable=False,
        description=(
            "Primary energy intensity (kWh/m2/yr) with Few et al. 2023 per-band UCL "
            "correction folded into the training label (ADR-0007)."
        ),
    ),
    "space_heating_kwh": ColumnSpec(
        dtype=float,
        nullable=False,
        description=(
            "Annual space heating delivered kWh, from "
            "`renewable_heat_incentive.space_heating_existing_dwelling`."
        ),
    ),
    "hot_water_kwh": ColumnSpec(
        dtype=float,
        nullable=False,
        description=(
            "Annual hot water delivered kWh, from `renewable_heat_incentive.water_heating`."
        ),
    ),
}


class EpcMlTransform:
    """Maps an EpcPropertyData to a fixed-width row of ML features + targets.

    Version 0.1.0: schema contract only; feature columns are added in subsequent slices.
    """

    VERSION: str = "0.1.0"

    def schema(self) -> TransformSchema:
        """The cross-repo ML data contract.

        Returns the column manifest the AutoGluon repo reads to know which
        columns are features, which are targets, and their dtypes.
        """
        return TransformSchema(
            transform_version=self.VERSION,
            feature_columns={},
            target_columns=dict(_TARGET_COLUMNS),
        )
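The semver rule in the module docstring (MAJOR on removing or renaming columns, MINOR on adding) could be checked mechanically along these lines. This is a sketch only: `classify_bump` and `bump` are hypothetical helpers, not part of `EpcMlTransform`, and the sketch treats every added column as optional.

```python
def classify_bump(old_columns: set[str], new_columns: set[str]) -> str:
    """Classify a schema change by comparing column names."""
    removed = old_columns - new_columns
    added = new_columns - old_columns
    if removed:
        # A rename shows up as one removal plus one addition, so it lands here.
        return "MAJOR"
    if added:
        return "MINOR"
    return "PATCH"


def bump(version: str, level: str) -> str:
    """Return the next semver string for the given bump level."""
    major, minor, patch = (int(part) for part in version.split("."))
    if level == "MAJOR":
        return f"{major + 1}.0.0"
    if level == "MINOR":
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"


# Hypothetical change: one feature column added to an otherwise unchanged schema.
old = {"sap_score", "peui_raw"}
new = {"sap_score", "peui_raw", "total_floor_area"}
next_version = bump("0.1.0", classify_bump(old, new))
```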


@@ -1,8 +1,8 @@
[pytest]
pythonpath = .
pythonpath = . packages/domain/src
log_cli = true
log_cli_level = INFO
addopts = --cov-report term-missing --cov=etl/epc --cov=recommendations --cov=backend --cov=etl/epc_clean --cov=etl/spatial
testpaths = recommendations/tests backend/tests etl/epc/tests etl/epc_clean/tests etl/spatial/tests backend/condition/tests backend/address2UPRN/tests backend/onboarders/tests backend/categorisation/tests backend/export/tests etl/hubspot/tests datatypes/epc/schema/tests datatypes/epc/surveys/tests datatypes/epc/domain/tests backend/ecmk_fetcher/tests/ backend/pashub_fetcher/tests backend/documents_parser/tests backend/magic_plan/tests datatypes/magicplan/api/tests datatypes/magicplan/domain/tests backend/app/db/functions/tests
testpaths = recommendations/tests backend/tests etl/epc/tests etl/epc_clean/tests etl/spatial/tests backend/condition/tests backend/address2UPRN/tests backend/onboarders/tests backend/categorisation/tests backend/export/tests etl/hubspot/tests datatypes/epc/schema/tests datatypes/epc/surveys/tests datatypes/epc/domain/tests backend/ecmk_fetcher/tests/ backend/pashub_fetcher/tests backend/documents_parser/tests backend/magic_plan/tests datatypes/magicplan/api/tests datatypes/magicplan/domain/tests backend/app/db/functions/tests packages/domain/src/domain/ml/tests
markers =
    integration: mark a test as an integration test