mirror of
https://github.com/Hestia-Homes/Model.git
synced 2026-06-30 13:10:47 +00:00
Merge pull request #1238 from Hestia-Homes/feature/epc-prediction
Feature/epc prediction
commit 90bed458f4
332 changed files with 5804 additions and 42 deletions
18
CONTEXT.md
@@ -62,9 +62,13 @@ A structured dataclass (`domain.addresses.user_address.UserAddress`) capturing a
_Avoid_: user input, raw address, user_inputed_address
**Comparable Properties**:
The reference cohort matched to a target Property by both geographic proximity (postcode prefix / UPRN range) and physical similarity (property type, built form, age band); used by the EPC Prediction Service for gap-filling and anomaly detection.
The reference cohort matched to a target Property, used by **EPC Prediction** for gap-filling and anomaly detection. Selected by a **filter-then-relax ladder**: hard filters on identity (property type, built form) and any known **Landlord Override** (e.g. a known solid-brick wall) while at least *k* remain, widening the geographic scope (postcode → postcode-prefix) or demoting a known field to a weight when sparse. Survivors are weighted for prediction by **geographic proximity × recency × physical similarity** — closer, newer (newer EPCs are higher quality), more-similar comparables count more. **All cert vintages are kept** as source evidence — a building's physical components (wall / roof / floor / heating fuel / age) are agnostic of the SAP methodology that rated them, so a pre-SAP10 neighbour is valid evidence; recency is a graduated **weight**, never a hard drop, and matters most for the one component that genuinely goes stale (the heating system, when a boiler is replaced). Only the **validation target** is restricted to SAP 10.2 — see [[validation-cohort]] / [[sap-spec-version]] — because performance can only be checked against a same-spec lodged figure.
_Avoid_: neighbours, similar properties, peer set
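
The filter-then-relax ladder described for **Comparable Properties** can be sketched as follows. This is a minimal illustration under stated assumptions, not the repository's `select_comparables`: the `Candidate` record, the `min_k` default, and the order in which the rungs relax are all hypothetical. The weighting step that follows selection is left out.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record; the field names are illustrative only."""
    postcode: str
    property_type: str
    built_form: str
    wall_construction: Optional[str] = None


def outward_code(postcode: str) -> str:
    """Postcode prefix, e.g. 'AB1' for 'AB1 2CD'."""
    return postcode.split()[0]


def select_cohort(
    target: Candidate,
    candidates: List[Candidate],
    known_wall: Optional[str] = None,
    min_k: int = 3,
) -> Tuple[List[Candidate], Tuple[str, bool]]:
    # Identity filters are hard and never relaxed.
    identity = [
        c for c in candidates
        if c.property_type == target.property_type
        and c.built_form == target.built_form
    ]
    # Rungs, strictest first: (geographic scope, is the known wall a hard filter?).
    # The ordering is an assumption made for this sketch.
    rungs = [("postcode", True), ("prefix", True), ("postcode", False), ("prefix", False)]
    survivors: List[Candidate] = []
    for scope, wall_is_hard in rungs:
        survivors = [
            c for c in identity
            if (
                c.postcode == target.postcode
                if scope == "postcode"
                else outward_code(c.postcode) == outward_code(target.postcode)
            )
            and (not wall_is_hard or known_wall is None or c.wall_construction == known_wall)
        ]
        if len(survivors) >= min_k:
            return survivors, (scope, wall_is_hard)
    # Sparse everywhere: return the loosest rung and let the caller decide.
    return survivors, rungs[-1]
```

Returning the rung alongside the cohort lets a caller see how far the ladder had to relax, which is useful when a demoted field should then be applied as a weight instead of a filter.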
**EPC Prediction**:
Producing a Property's `EpcPropertyData` picture from its **Comparable Properties** when it has no EPC (~30% of UK homes, typically long-tenure). **Deterministic** neighbour synthesis (k-NN-style — *not* ML; no trained model): take the cohort **mode** for the homogeneous categoricals (wall / roof / floor construction + insulation, construction age band), copy a single representative comparable's **structure** wholesale (building parts, per-window dimensions + orientations, floor dimensions) so the picture stays internally consistent for the calculator, then apply **Landlord Overrides** and the known inputs on top. The result is scored through **SAP10 Calculation** like any other **Effective EPC**, so a predicted Property flows through Rebaselining, Bill Derivation, and Modelling unchanged — held in a **distinct predicted-EPC slot** that coexists with any lodged EPC (so provenance is structural and the UI can flag it; see ADR-0031). A **known property type is required** — the hard cohort filter (a flat is never sized from houses) — supplied by a **Landlord Override** (or, later, an Ordnance Survey lookup); a Property whose property type is genuinely unknown is **gated out**, never predicted from a mixed-type cohort and never given a national default. The same cohort machinery also produces **EPC Anomaly Flags** for Properties that *do* have an EPC. A future learned-weighting refinement is possible but separate, as with the calculator's ML residual head.
_Avoid_: EpcPredictionService (no "service" suffix — name the operation), ML prediction (it is deterministic), EPC estimation
### Survey documents
**Ventilation Audit**:
@@ -88,15 +92,15 @@ _Avoid_: patches (deprecated), corrections, manual EPC, edits
### Modelling
**Effective EPC**:
The assembled `EpcPropertyData` picture the modelling pipeline scores for a single Property. Assembled from whichever source applies: Site Notes alone; or the public EPC with **Landlord Overrides** applied; or — when the EPC is **old** — its schema re-mapped to current via **Reduced-Field Synthesis** (deterministic, from the cert plus calibrated coefficients — no neighbour data); or — when there is **no EPC** — components **estimated from surrounding properties** (a separate neighbour-prediction ML mechanism, not yet implemented). Carries source-derived physical fields and originally recorded performance values; the performance scored from this picture is held separately in **Baseline Performance**.
The assembled `EpcPropertyData` picture the modelling pipeline scores for a single Property. Assembled from whichever source applies: Site Notes alone; or the public EPC with **Landlord Overrides** applied; or — when the EPC is **old** — its schema re-mapped to current via **Reduced-Field Synthesis** (deterministic, from the cert plus calibrated coefficients — no neighbour data); or — when there is **no EPC** — components **estimated from surrounding properties** via **EPC Prediction** (deterministic neighbour synthesis). Carries source-derived physical fields and originally recorded performance values; the performance scored from this picture is held separately in **Baseline Performance**.
_Avoid_: modelling EPC, working EPC, resolved EPC, derived EPC
**Rebaselining**:
Establishing a Property's **Effective Performance** (SAP score, EPC Band, CO2, Primary Energy Intensity, space-heating & hot-water kWh) by **assembling the Effective EPC picture and scoring it** through **SAP10 Calculation** (the deterministic `Sap10Calculator`, which superseded the old ML-API rebaseliner; an ML residual head over the calculator is future — ADR-0009/0013). The *assembly* is the substance: apply **Landlord Overrides** (e.g. boiler → ASHP, wall insulated) as a simulation on the `EpcPropertyData`; re-map an old-schema EPC to current via **Reduced-Field Synthesis** (deterministic, cert-only); estimate components from surrounding properties when there is no EPC (neighbour-prediction gap-fill — a separate ML mechanism, not yet implemented). The calculator is the **scoring engine at the tail**, not the whole of Rebaselining — so its call lives inside the Rebaseliner, after assembly. Triggered whenever the assembled picture differs from the lodged record: (a) the EPC was lodged under a methodology the calculator supersedes (`sap_version < 10.2`), (b) Overrides / Site Notes changed the physical state (walls / heating / windows / etc.), or (c) the picture is estimated or remapped rather than a real current EPC. Produces Effective Performance; Lodged Performance is preserved unchanged. The same single scoring also yields the per-end-use kWh that **Bill Derivation** prices — one scoring, two products. kWh is an ML target per ADR-0007 — see [[epc-ml-transform]].
Establishing a Property's **Effective Performance** (SAP score, EPC Band, CO2, Primary Energy Intensity, space-heating & hot-water kWh) by **assembling the Effective EPC picture and scoring it** through **SAP10 Calculation** (the deterministic `Sap10Calculator`, which superseded the old ML-API rebaseliner; an ML residual head over the calculator is future — ADR-0009/0013). The *assembly* is the substance: apply **Landlord Overrides** (e.g. boiler → ASHP, wall insulated) as a simulation on the `EpcPropertyData`; re-map an old-schema EPC to current via **Reduced-Field Synthesis** (deterministic, cert-only); estimate components from surrounding properties when there is no EPC (**EPC Prediction** — deterministic neighbour gap-fill). The calculator is the **scoring engine at the tail**, not the whole of Rebaselining — so its call lives inside the Rebaseliner, after assembly. Triggered whenever the assembled picture differs from the lodged record: (a) the EPC was lodged under a methodology the calculator supersedes (`sap_version < 10.2`), (b) Overrides / Site Notes changed the physical state (walls / heating / windows / etc.), or (c) the picture is estimated or remapped rather than a real current EPC. Produces Effective Performance; Lodged Performance is preserved unchanged. The same single scoring also yields the per-end-use kWh that **Bill Derivation** prices — one scoring, two products. kWh is an ML target per ADR-0007 — see [[epc-ml-transform]].
_Avoid_: re-scoring, re-prediction, performance recomputation, refresh (for cache-freshness)
**Reduced-Field Synthesis**:
Deterministically translating an **old / reduced-data EPC schema** into the current `EpcPropertyData`, synthesising the *measured* fields the target expects from the source's *reduced or categorical* fields, using only the cert itself plus fixed calibrated coefficients — never neighbour data. Used when re-mapping a **pre-SAP10** cert (e.g. `RdSAP-Schema-20.0.0`) as part of assembling the **Effective EPC**: e.g. a glazing-area *band* + floor area → window m²; bath/shower *room counts* → bath and shower counts. A *best attempt* with no ground truth to validate against (per the **Validation Cohort** rule, a pre-SAP10 cert has no same-spec lodged figure to check), so each synthesis assumption is recorded explicitly in code and tests to keep it debuggable. Distinct from **neighbour-prediction gap-fill** (ML estimation of genuinely-absent fields from surrounding properties — the no-EPC path, a separate mechanism not yet implemented) and from the calculator's own RdSAP Table-5 defaulting in `cert_to_inputs` (which expands `EpcPropertyData` into the full SAP input set downstream).
Deterministically translating an **old / reduced-data EPC schema** into the current `EpcPropertyData`, synthesising the *measured* fields the target expects from the source's *reduced or categorical* fields, using only the cert itself plus fixed calibrated coefficients — never neighbour data. Used when re-mapping a **pre-SAP10** cert (e.g. `RdSAP-Schema-20.0.0`) as part of assembling the **Effective EPC**: e.g. a glazing-area *band* + floor area → window m²; bath/shower *room counts* → bath and shower counts. A *best attempt* with no ground truth to validate against (per the **Validation Cohort** rule, a pre-SAP10 cert has no same-spec lodged figure to check), so each synthesis assumption is recorded explicitly in code and tests to keep it debuggable. Distinct from **EPC Prediction** (deterministic neighbour estimation of genuinely-absent fields from surrounding properties — the no-EPC path) and from the calculator's own RdSAP Table-5 defaulting in `cert_to_inputs` (which expands `EpcPropertyData` into the full SAP input set downstream).
_Avoid_: gap-fill (means the neighbour-ML path), reduced-data expansion (overloaded with the calculator's Table-5 step), remapping (the schema-translation part only)
**Baseline Performance**:
@@ -125,6 +129,10 @@ _Avoid_: SAP version (ambiguous with the `sap_version` field on the cert, which

**Validation Cohort**:
The subset of corpus certs used to validate **SAP10 Calculation** against **Lodged Performance**, filtered to certs lodged after the calculator's target **SAP Spec Version** rolled out in commercial assessor software — currently `inspection_date ≥ 2025-07-01` (a buffer past 14-03-2025 to allow vendor rollout). Smaller than the full corpus but each cert is comparable under the same spec, so probe MAE is a clean signal of calculator-vs-spec correctness rather than spec-version mixture noise. ADR-0010.
**Component Accuracy**:
The primary, **calculator-independent** measure of **EPC Prediction** quality: how closely the predicted `EpcPropertyData` *components* (heating fuel + category + controls, hot water, wall / roof / floor construction + insulation, age band, glazing, doors, floor area + geometry) match the actual ones, scored by leave-one-out over a held-out target. Categoricals score as a classification hit-rate, numerics as a residual. Load-bearing principle: **predict the components well and correct SAP / carbon / PE fall out once calculator gaps close** — so Component Accuracy is what prediction is tuned against, while `calc(predicted)` vs API-lodged SAP / carbon / PE is a secondary, **calculator-floored** end-to-end check. The held-out target must be a **SAP 10.2** cert (`sap_version == 10.2`) — the only vintage with full-fidelity lodged components — but the source **Comparable Properties** cohort keeps all vintages. Never validated by `calc(predicted)` vs `calc(actual)`: that cancels (and so hides) calculator error against a circular ground truth.
_Avoid_: prediction accuracy (vague), SAP accuracy (that is the calculator-floored end-to-end check, not the primary signal)
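
A minimal sketch of leave-one-out **Component Accuracy** scoring, assuming each cert is a plain dict and using a deliberately naive predictor (cohort mode for categoricals, cohort median for numerics). The function name and dict keys are hypothetical. The real predictor is richer, but the scoring shape is the point: only SAP 10.2 certs are held out as targets, every vintage stays in the source cohort, and no calculator is involved.

```python
from collections import Counter
from statistics import median


def leave_one_out_accuracy(cohort, categorical_fields, numeric_fields):
    """Score a naive mode/median predictor by holding out one target at a time."""
    hits = {f: 0 for f in categorical_fields}
    abs_errors = {f: [] for f in numeric_fields}
    scored = 0
    for i, target in enumerate(cohort):
        # Only SAP 10.2 certs are valid validation targets.
        if target["sap_version"] != 10.2:
            continue
        # The source cohort keeps all vintages, minus the held-out target.
        rest = cohort[:i] + cohort[i + 1:]
        scored += 1
        for f in categorical_fields:
            predicted = Counter(r[f] for r in rest).most_common(1)[0][0]
            hits[f] += predicted == target[f]
        for f in numeric_fields:
            predicted = median(r[f] for r in rest)
            abs_errors[f].append(abs(predicted - target[f]))
    hit_rate = {f: hits[f] / scored for f in categorical_fields}
    mae = {f: sum(errs) / scored for f, errs in abs_errors.items()}
    return hit_rate, mae
```

Categoricals come back as a hit-rate and numerics as a mean absolute residual, matching the two score types named in the definition.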
_Avoid_: parity cohort, validation set, corpus sample
**Measure Application**:
@@ -361,7 +369,7 @@ _Avoid_: API key, auth token, secret
- A Property's **Baseline Performance** holds two halves: **Lodged Performance** (the gov register's SAP / band / carbon / heat) and **Effective Performance** (what the modelling pipeline scored against). The two are equal unless **Rebaselining** fires.
- **Rebaselining** produces **Effective Performance** by ML re-prediction across SAP score, CO2 emissions, Primary Energy Intensity, space heating kWh, and hot water kWh, when either (a) the Effective EPC was lodged under a pre-SAP10 schema, or (b) the Effective EPC's physical state diverges from the lodged EPC. **Lodged Performance** is never overwritten.
- **Bill Derivation** derives **fuel split** and **bills** from kWh values (sourced from the EPC's `renewable_heat_incentive` fields for baseline SAP10 properties, or from ML when Rebaselining fires), reading current **Fuel Rates** and **Carbon Factors** from their respective repos.
- The **EPC Prediction Service** uses **Comparable Properties** for both gap-filling and producing **EPC Anomaly Flags**.
- **EPC Prediction** uses **Comparable Properties** for both gap-filling (the no-EPC path) and producing **EPC Anomaly Flags** (the has-EPC path).
- Triggering the model against N **Scenarios** produces N **Plans** per Property. Each **Plan** holds one **Optimised Package** — its selected **Plan Measures** — plus the Property's post-retrofit figures.
- A **Scenario Snapshot** is pinned at trigger time per (task, scenario) so mid-run edits to the live Scenario do not affect an in-flight modelling job.
- A **Recommendation** references one **Measure Type** and carries property-specific cost and impact.
@@ -10,6 +10,7 @@ from sqlmodel import Session
from applications.ara_first_run.ara_first_run_trigger_body import (
    AraFirstRunTriggerBody,
)
from domain.epc_prediction.epc_prediction import EpcPrediction
from domain.property_baseline.calculator_rebaseliner import CalculatorRebaseliner
from domain.sap10_calculator.calculator import Sap10Calculator
from infrastructure.postgres.config import PostgresConfig
@@ -17,8 +18,10 @@ from infrastructure.postgres.engine import make_engine
from orchestration.property_baseline_orchestrator import PropertyBaselineOrchestrator
from orchestration.ara_first_run_pipeline import AraFirstRunPipeline
from orchestration.ingestion_orchestrator import (
    ComparablesRepo,
    EpcFetcher,
    IngestionOrchestrator,
    PredictionAttributesReader,
    SolarFetcher,
)
from orchestration.modelling_orchestrator import ModellingOrchestrator
@@ -65,12 +68,23 @@ def build_first_run_pipeline(
    epc_fetcher: EpcFetcher,
    geospatial_repo: GeospatialRepository,
    solar_fetcher: SolarFetcher,
    comparables_repo: Optional[ComparablesRepo] = None,
    prediction_attributes_reader: Optional[PredictionAttributesReader] = None,
) -> AraFirstRunPipeline:
    """Compose the real three-stage pipeline on a Unit-of-Work factory.

    Each stage opens its own unit(s) and commits per batch (ADR-0012); the
    handler no longer holds a session. The source clients are passed in because
    their config is not settled — see ``_source_clients_from_env``.

    EPC Prediction gap-fill (ADR-0031) is the predictor itself (pure) plus two
    injected collaborators: the postcode-cohort source and the Landlord-Override
    attributes reader. Both default to None, so the feature is **off** until they
    are supplied — an EPC-less Property is then predicted into its predicted slot.
    The cohort repo is injected (not built here) because its EPC client is the
    same source client whose wiring is still pending; the attributes reader is the
    `property_overrides` read adapter built separately. Until both are passed,
    ingestion behaves exactly as before.
    """
    return AraFirstRunPipeline(
        ingestion=IngestionOrchestrator(
@@ -78,6 +92,9 @@ def build_first_run_pipeline(
            epc_fetcher=epc_fetcher,
            geospatial_repo=geospatial_repo,
            solar_fetcher=solar_fetcher,
            comparables_repo=comparables_repo,
            prediction_attributes_reader=prediction_attributes_reader,
            epc_prediction=EpcPrediction(),
        ),
        baseline=PropertyBaselineOrchestrator(
            unit_of_work=unit_of_work,
144
docs/HANDOVER_EPC_PREDICTION.md
Normal file
@@ -0,0 +1,144 @@
# EPC Prediction — handover

Branch `feature/epc-prediction` @ `d8f015fb` (37 ahead of `origin/main`; local-only,
not pushed). Tree clean. All ranked backlog (#1222–1228) closed.

## What this is
Deterministic **neighbour synthesis** that predicts a structured `EpcPropertyData`
for an EPC-less UK home from its postcode-cohort of neighbours, so it flows through
the modelling pipeline. NOT ML. Validation methodology + harness are built; the work
is a measurable accuracy backlog.

## READ FIRST (hold the full state)
- Memory `project_epc_prediction` — the spine: design, every commit, metrics, the
  open fronts, gotchas. Read it first.
- `docs/adr/0029-…` (design, 6 forks) and `docs/adr/0030-…component-first.md`
  (validation methodology — internalise: predict components, SAP/carbon/PE are a
  calculator-floored *secondary* guard).
- Memory `feedback_per_component_best_method` — THE load-bearing principle this
  session established (see below).
- Convention memories: `feedback_aaa_test_convention`,
  `feedback_abs_diff_over_pytest_approx`, `feedback_commit_per_slice`,
  `feedback_bigger_slices_for_uniform_work`.

## The methodology (ADR-0030)
- **Component Accuracy is the PRIMARY signal** — predicted vs API-actual components,
  calculator-free. SAP/CO₂/PE vs lodged is SECONDARY and calculator-floored.
- Source cohort keeps ALL cert vintages; only held-out validation TARGETS are
  SAP 10.2 (`sap_version == 10.2`).
- The committed **Tier-1 gate** (`tests/domain/epc_prediction/test_component_accuracy_gate.py`)
  runs the calculator-free scorer over the frozen anonymised fixture
  (`tests/fixtures/epc_prediction/`, 36 SAP-10.2 targets) and asserts per-component
  ratchet floors. Deterministic → exact. **Tighten-only**: when you improve a
  component, bump its floor in the same commit. A *mapper or fixture change*
  re-baselines floors (not a regression) — document it.
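
The tighten-only ratchet can be sketched as two small helpers. This is an illustration only: the function names are hypothetical, and it assumes the gate compares each measured component accuracy against a committed floor; it is not the content of `test_component_accuracy_gate.py`.

```python
def check_ratchet(measured, floors):
    """Return the components whose measured accuracy fell below the committed floor."""
    return sorted(
        name for name, floor in floors.items()
        if measured.get(name, 0.0) < floor
    )


def tightened(measured, floors):
    """Tighten-only update: a floor may rise to the measured value, never fall."""
    return {
        name: max(floor, measured.get(name, floor))
        for name, floor in floors.items()
    }
```

A gate test would assert that `check_ratchet(...)` is empty, and a commit that improves a component would ship the `tightened(...)` floors alongside the change.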

## THE PRINCIPLE that drove this session
**Give each component its own best-fit synthesis method; never force one global
mechanism on all of them.** Validated head-to-head on the harness:
- Permanent fabric categoricals (wall, age) → **physical-similarity-weighted mode**
  (size×age toward cohort centre).
- Time-varying components (roof insulation, glazing) → **recency-weighted mode**.
- Coherence-coupled cluster (heating) → **coherent whole-cluster donor**, NEVER
  field-moded.
- Point-estimate scalar (floor area) → **cohort median** (MAD-minimising).
- Geo-varying components (age, wall, floor, glazing) → additionally **geo-proximity
  weighted**; roof showed no geo signal → excluded.

All live in `domain/epc_prediction/epc_prediction.py` as composable weight vectors
(`_similarity_weights` × `_recency_weights` × `_geo_weights`, combined via `_combine`,
fed to `_weighted_mode`).
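
As a rough illustration of the composable-weights idea, the sketch below multiplies per-comparable weight vectors element-wise and takes the category with the largest total weight. The public names `combine` and `weighted_mode` are stand-ins; the private helpers in `epc_prediction.py` may have different signatures and tie-break rules.

```python
from collections import defaultdict


def combine(*weight_vectors):
    """Element-wise product of equal-length weight vectors (one weight per comparable)."""
    if not weight_vectors:
        raise ValueError("need at least one weight vector")
    combined = [1.0] * len(weight_vectors[0])
    for vector in weight_vectors:
        if len(vector) != len(combined):
            raise ValueError("weight vectors must have the same length")
        combined = [a * b for a, b in zip(combined, vector)]
    return combined


def weighted_mode(values, weights):
    """Category with the largest summed weight; ties go to the first category seen."""
    totals = defaultdict(float)
    for value, weight in zip(values, weights):
        totals[value] += weight
    return max(totals, key=totals.get)
```

Because each signal is just a vector, a component opts in or out of a signal by including or omitting that vector in the `combine` call, which is how roof can skip geo weighting while age uses it.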

## Closed this session (#1222 was done before; #1223–1228 this session)
- **#1226** per-prediction confidence (`PredictionConfidence`, compute-only;
  agreement strongly predicts correctness, r=0.582).
- **#1224** physical-similarity-weighted categorical mode (wall_insul/roof/floor +1–3pp).
- **#1223** per-component, NOT a global recency template: floor-area→cohort median +
  glazing→recency mode. (A global recency template was rejected — it disturbed the
  coherence-coupled heating cluster.)
- **#1225** coherent heating donor (modal signature = fuel+category+cylinder, recency
  tie-break). Biggest SAP lever: control 66→74%, SAP MAE 7.08→6.00 pre-merge.
- **#1228** PEI investigation — DISPROVED the unit-bug hypothesis (calc/lodged ratio
  1.06); reframed as calc floor + prediction-sensitivity. Report now surfaces CO₂/PEI
  calc floors. (Open calc-branch remnant; largely closed by the main merge — see below.)
- **#1227** geo-proximity weighting — grilled, signal-checked (STRONG GO, esp. age),
  built per-component. Batch `GeospatialRepository.coordinates_for_uprns`, coords
  threaded onto `Comparable`/`PredictionTarget`, haversine kernel (`_GEO_SCALE_KM=0.1`,
  gate-safe optimum). Intra-postcode lift modest (cohort = 1 postcode); the bigger
  prize is cross-postcode expansion (deferred, needs dense corpus).
- **Corpus grown 40→150 postcodes** (`6e9f8312`); roof-insulation ±1 reporting.
- **Merged `origin/main`** (96 commits of calculator/mapper gap fixes, `0b2827e9`).
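
For the geo-proximity weighting closed under #1227, a haversine distance feeding a distance-decay kernel might look like the sketch below. The 0.1 km scale mirrors the `_GEO_SCALE_KM=0.1` quoted above; the exponential kernel shape and both function names are assumptions, since the handover does not say which decay the real kernel uses.

```python
import math

GEO_SCALE_KM = 0.1  # mirrors `_GEO_SCALE_KM`; the kernel shape below is assumed
EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def geo_weight(distance_km, scale_km=GEO_SCALE_KM):
    """Exponential decay: weight 1.0 at zero distance, 1/e at one scale length."""
    return math.exp(-distance_km / scale_km)
```

With a 100 m scale the weight falls off quickly, which fits the note that the intra-postcode lift is modest and the larger payoff needs cross-postcode cohorts.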

## Current metrics (post-merge, 150-pc corpus, 514 SAP-10.2 targets)
Component Accuracy (calculator-free): wall 91.2, wall_insul 79.0, age 57.2 (±1 84.7),
roof_construction 78.2, floor_construction 79.6, heating_fuel 96.9, heating_category
95.7, heating_control 73.9, water_fuel 96.3, water_code 95.3, has_cylinder 89.7,
cylinder_insul 52.4, secondary 42.0, roof_insul 49.3 (±1 53.7), floor_insul 94.7,
room_in_roof 96.5, glazing 67.3, pv 98.8, solar 99.8.

Floor area: **MAE 10.48 m² / MAPE 13.2% / typical (median actual) 61 m²** (cohort
median, unweighted).

End-to-end vs lodged (SECONDARY, calculator-floored):
SAP pred MAE 6.25 / **calc floor 0.95** (was 1.57 pre-merge, orig 3.25 — the calc
fixes nearly validated the calculator, so the gap is now almost all prediction);
CO₂ 0.61 / floor 0.18; PEI 39.6 / floor 13.7.

## Key files
- `domain/epc_prediction/epc_prediction.py` — `EpcPrediction.predict`: median floor
  area + per-component weighted modes + glazing + heating donor + overrides.
- `domain/epc_prediction/comparable_properties.py` — `select_comparables` ladder;
  `Comparable`/`PredictionTarget` (carry `coordinates`).
- `domain/epc_prediction/prediction_comparison.py` — `compare_prediction` (25 signals).
- `domain/epc_prediction/validation.py` — `iter_predictions` + `evaluate_component_accuracy`
  (one scorer, calculator-free).
- `harness/epc_prediction_corpus.py` — `load_corpus` (+ `_coordinates.json` sidecar),
  `load_coordinates`, `anonymise_payload`.
- `repositories/geospatial/` — `GeospatialRepository.coordinates_for_uprns` (batch).
- `scripts/validate_epc_prediction.py` (full report), `build_epc_prediction_fixture.py`,
  `fetch_epc_prediction_corpus.py`, `fetch_corpus_coordinates.py`.

## Open fronts (ranked)
1. **Geo-weighted floor-area median** — measured quick win: MAE 10.48→**9.77**,
   MAPE 13.2→12.2%. Swap `_median_floor_area` for a geo-weighted median (reuse
   `_geo_weights`); gate-check + ratchet the floor_area ceiling. Smallest next slice.
2. **Cross-postcode geo expansion** — the real geo payoff (distance-weighted cohort
   beyond the single postcode). Needs a *densely-sampled* corpus (current 150 are
   scattered, so a target's true geo-neighbours aren't in-corpus). Design grilled;
   build a dense corpus first.
3. **Slice-5 production wiring** — `ComparableProperties` repo + the
   `ModellingOrchestrator` owning the EPC *estimation* + distance calcs (a deliberate
   shift from ADR-0029, which put the fallback in Ingestion). WRITE AN ADR when this
   lands (it reverses where the fallback lives). Add a provenance marker
   (`EpcPropertyData` has no predicted/source field yet).
4. Weak components with headroom only via NEW signals: age 57% / roof_insul 49%
   (method-exhausted — confirmed recency/similarity/plain all tie-or-worse);
   cylinder_insul / secondary are tiny-n.
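
Open front 1 swaps the plain cohort median for a geo-weighted one. A weighted median can be sketched as below, assuming the weights come from something like `_geo_weights`. The function name is hypothetical, and the lower-weighted-median convention (return an observed value, no interpolation) is a choice made for this sketch, so on an even, equally weighted cohort it returns the lower middle value where `statistics.median` would average the two.

```python
def weighted_median(values, weights):
    """Lower weighted median: smallest value whose cumulative weight reaches half the total."""
    pairs = sorted((v, w) for v, w in zip(values, weights) if w > 0)
    if not pairs:
        raise ValueError("need at least one positive weight")
    half = sum(w for _, w in pairs) / 2.0
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= half:
            return value
    return pairs[-1][0]  # unreachable for finite weights; kept as a safe fallback
```

With uniform weights on an odd-sized cohort this matches the unweighted median, so the existing behaviour is the special case where every comparable is equally near.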

## How to run
- Token + S3 creds: `set -a; . backend/.env; set +a` (AWS creds mounted at `~/.aws`).
- Tests: `PYTHONPATH=. python -m pytest tests/domain/epc_prediction tests/harness/test_epc_prediction_corpus.py tests/repositories/geospatial -o addopts="" -p no:cacheprovider -q`
- Full report: `PYTHONPATH=. python scripts/validate_epc_prediction.py` (corpus
  `/tmp/epc_prediction_corpus`).
- Gate is just a pytest test (deterministic, calculator-free).
- pyright strict, zero new errors, on every touched file.

## In-flight / gotchas
- **Corpus lives in `/tmp/epc_prediction_corpus`** (gitignored; 150 pc / 3719 certs +
  `_coordinates.json`). Backed up to `/workspaces/home/epc_prediction_corpus_backup`
  (persistent host mount — survives container rebuild; `/tmp` does NOT). Coords backup
  at `/workspaces/home/epc_prediction_corpus_coords_backup.json`. If `/tmp` is wiped,
  restore from the backup before running the full report.
- **Coordinates**: OS Open-UPRN parquet is `DATA_BUCKET/spatial/` (boto3 — s3fs NOT
  installed; read via `get_object`→BytesIO; `boto3.client` needs
  `# pyright: ignore[reportUnknownMemberType, reportUnknownVariableType]`). The cert
  payload carries `uprn` (the join key). The committed fixture ships `_coordinates.json`
  (OGL OS OpenData) so the gate exercises geo without S3.
- **NEVER commit** the API token, `/tmp` corpus, or the coords cache. The
  `tests/fixtures/epc_prediction` one is anonymised + intentional.
- Conventions: AAA test headers; `abs(x-y) <= tol` not `pytest.approx`; commit per
  slice (stage by name, watch untracked); ADR-cite in commit messages; class is
  `EpcPrediction` (no "Service").
- Per-item workflow: implement TDD red→green on this branch → run the harness →
  record before/after → ratchet gate floors → `gh issue comment` impact → close.
- The merge is **local, not pushed** — push only if asked.
- Update memory `project_epc_prediction` as state changes.
99
docs/HANDOVER_EPC_PREDICTION_WIRING.md
Normal file
@@ -0,0 +1,99 @@
# EPC Prediction — production wiring handover (for Jun-te)

The EPC Prediction **gap-fill** is wired end-to-end behind seams, with one real
dependency stubbed: reading an EPC-less Property's resolved Landlord Overrides.
This note is what's needed to finish it once your `property_overrides` read path
lands. Design is **ADR-0031**; terms in **CONTEXT.md** (EPC Prediction, Effective
EPC, EPC Anomaly Flag).

## What's already built (slices 5a–5e, all on `feature/epc-prediction`)

- **5a** `Property.predicted_epc` slot + a `"predicted"` `source_path` /
  `effective_epc` branch — used only when there's no lodged EPC and no Site Notes
  (a real source always wins).
- **5b** `ComparablePropertiesRepository.candidates_for(postcode)` +
  `EpcComparablePropertiesRepository` adapter (postcode search → per-cert fetch →
  batched UPRN→coords). Composes with `EpcClientService` + `GeospatialS3Repository`.
- **5c** EPC store `source` discriminator (`lodged` | `predicted`) so the two
  coexist per property; `get_predicted_for_property` / `_for_properties`;
  `PropertyPostgresRepository` hydrates `predicted_epc`. **Needs a DB migration —
  see `docs/MIGRATION_NOTE_predicted_epc_source.md`.**
- **5d** `build_prediction_target(identity, coords, attributes)` + the eligibility
  **gate** (unknown `property_type` → not predicted). Override attributes come
  through the `PredictionTargetAttributesReader` port (the stub).
- **5e** `IngestionOrchestrator` wiring: when the three prediction collaborators
  are injected, an EPC-less Property is predicted from its cohort and persisted to
  the predicted slot. The collaborators are **optional** — unwired, ingestion is
  unchanged.

## Your part — three things

### 1. Implement `PredictionTargetAttributesReader` (the stub)

`repositories/property/prediction_target_attributes_reader.py` defines the port:
`attributes_for(property_id) -> PredictionTargetAttributes` (property_type,
built_form, wall_construction). Build the adapter as a read over the
`property_overrides` fact layer (the finaliser writes it via
`PropertyOverrideRepository.upsert_all`; you're adding the read side).

**Code-space gotcha.** `select_comparables` filters
`comparable.epc.property_type == target.property_type`, and the cohort EPCs carry
gov **API codes** (e.g. `"0"`/`"2"`). Landlord Overrides resolve to enum *value*
strings (e.g. `"House"`). Your adapter must map override value → the API-code
space, or `property_type` will never match and every cohort comes back empty.
Same for `built_form`. (`domain/epc/property_type.py`, `built_form_type.py` are
the enums; `datatypes/epc/domain/epc_codes.csv` has the code table.)
`property_type` unresolved → return `PredictionTargetAttributes(property_type=None)`
so the gate skips the Property.
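
The code-space mapping in the gotcha above could be sketched like this. Everything here is hypothetical: `TargetAttributes` stands in for `PredictionTargetAttributes`, the override dict stands in for whatever the `property_overrides` read returns, and the code tables are passed in because the real value-to-code pairs live in `epc_codes.csv`. The sketch shows only the translate-or-`None` behaviour, not the repository read.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class TargetAttributes:
    """Hypothetical stand-in for `PredictionTargetAttributes`."""
    property_type: Optional[str]
    built_form: Optional[str] = None


def to_api_code(override_value: Optional[str], code_table: Mapping[str, str]) -> Optional[str]:
    """Translate an override enum value into the cohort's API-code space.

    Returns None when the value is missing or has no known code, so the
    eligibility gate can skip the Property instead of matching nothing.
    """
    if override_value is None:
        return None
    return code_table.get(override_value.strip().lower())


def attributes_from_overrides(
    overrides: Mapping[str, str],
    property_type_codes: Mapping[str, str],
    built_form_codes: Mapping[str, str],
) -> TargetAttributes:
    return TargetAttributes(
        property_type=to_api_code(overrides.get("property_type"), property_type_codes),
        built_form=to_api_code(overrides.get("built_form"), built_form_codes),
    )
```

Keying the tables on a normalised (lower-cased, stripped) value is a defensive choice for this sketch; if the enum values are already canonical, an exact-match lookup would do.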
|
||||
|
||||
### 2. Run the Drizzle migration
|
||||
|
||||
`epc_property.source` column — full spec in
|
||||
`docs/MIGRATION_NOTE_predicted_epc_source.md` (column + default `'lodged'` +
|
||||
relax any `property_id` uniqueness to `(property_id, source)`).
|
||||
|
||||
### 3. Pass the two collaborators at the composition root
|
||||
|
||||
This is now wired: `build_first_run_pipeline` (in `applications/ara_first_run/handler.py`)
|
||||
already constructs `epc_prediction=EpcPrediction()` and accepts the other two as
|
||||
optional params that it threads into the `IngestionOrchestrator`. So the on-switch
|
||||
is just supplying them once they exist:
|
||||
|
||||
```python
|
||||
build_first_run_pipeline(
|
||||
...,
|
||||
comparables_repo=EpcComparablePropertiesRepository(epc_client, geospatial_repo),
|
||||
prediction_attributes_reader=<your property_overrides adapter>, # task #1
|
||||
)
|
||||
```
|
||||
|
||||
`epc_client` is the same EPC source client behind `epc_fetcher` (the concrete
|
||||
`EpcClientService` exposes `search_by_postcode` + `get_by_certificate_number`),
|
||||
so build it alongside the other source clients in `_source_clients_from_env`
|
||||
(pending #1136). Until **both** are passed, ingestion ignores prediction — no
|
||||
orchestrator or handler edits needed, just the two arguments.
|
||||
|
||||
## One open item — Validation Cohort exclusion
|
||||
|
||||
A predicted-source Property has **no real lodged record**, so it must not be
|
||||
scored as if it did (CONTEXT: Validation Cohort; ADR-0031 dec-3). There is **no
|
||||
Validation-Cohort code path today** to exclude it from — when one is built (or in
|
||||
any QA that compares `calc(effective_epc)` vs lodged), exclude on the structural
|
||||
signal:
|
||||
|
||||
```python
|
||||
if prop.source_path == "predicted":
    continue  # predicted EPC: no ground truth to validate against
|
||||
```
|
||||
|
||||
Note too: `PropertyBaselinePerformance.lodged` is derived from `effective_epc`
|
||||
regardless of source (`property_baseline_orchestrator` → `lodged_performance`), so
|
||||
for a predicted Property that "lodged" is synthesised, not real. Decide whether
|
||||
baseline should null/flag it for predicted properties when this lands.
|
||||
|
||||
## Anomaly dual-use (later, not now)
|
||||
|
||||
Slice-5 is gap-fill only (`epc is None`). The slot model already supports
|
||||
predicting for *every* Property to compare predicted vs lodged (**EPC Anomaly
|
||||
Flags**) — see ADR-0031 dec-4. Reuses the same `ComparableProperties` repo + the
|
||||
predicted slot.
|
||||
51
docs/MIGRATION_NOTE_predicted_epc_source.md
Normal file
|
|
@ -0,0 +1,51 @@
|
|||
# Migration note — `epc_property.source` (predicted-EPC slot)
|
||||
|
||||
**For the team to action before merging the EPC Prediction production-wiring
|
||||
branch.** The model-side code is done and tested against the SQLModel-built test
|
||||
DB; the **production Drizzle schema needs a matching column** that this repo does
|
||||
not own.
|
||||
|
||||
## What changed in code (this branch)
|
||||
|
||||
Per **ADR-0031**, a Property can now hold a **lodged** EPC and a **predicted** EPC
|
||||
(EPC Prediction gap-fill) at the same time. The two are distinguished by a new
|
||||
`source` discriminator on the `epc_property` row:
|
||||
|
||||
- `infrastructure/postgres/epc_property_table.py` — `EpcPropertyModel` gains
|
||||
`source: str = Field(default="lodged")`.
|
||||
- `repositories/epc/epc_postgres_repository.py` — `save(..., source="lodged")`
|
||||
writes it; `_delete_for_property` is now per-source (idempotency no longer wipes
|
||||
the other slot); `get_for_property` / `get_for_properties` filter `source =
|
||||
'lodged'`; new `get_predicted_for_property` / `get_predicted_for_properties`
|
||||
read `source = 'predicted'`.
|
||||
|
||||
The test database is built from the SQLModel mirrors via `create_all`, so tests
|
||||
already exercise the column. **Production is not** — hence this note.
|
||||
|
||||
## Required Drizzle migration
|
||||
|
||||
On the `epc_property` table:
|
||||
|
||||
1. **Add column** `source` — `text` (or your enum), **NOT NULL**, **default
|
||||
`'lodged'`**. The default backfills every existing row as a real EPC, which is
|
||||
correct (all current rows are lodged).
|
||||
2. **Relax any single-row-per-property uniqueness.** If a unique constraint /
|
||||
index exists on `epc_property(property_id)`, it must become
|
||||
**`(property_id, source)`** — a property may now have one `lodged` row and one
|
||||
`predicted` row. (Verify whether such a constraint exists; the SQLModel mirror
|
||||
has none, but the production schema may.)
|
||||
3. **Recommended index** `(property_id, source)` — every predicted/lodged read
|
||||
filters on both columns.
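
The three steps can be tried out locally with a rough sketch like the one below. It uses an in-memory SQLite database as a stand-in, which is an assumption made purely for illustration: the production change is a Drizzle migration against Postgres, and the table here has only the columns needed to show the behaviour. The index name is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE epc_property (id INTEGER PRIMARY KEY, property_id INTEGER NOT NULL)"
)
conn.execute("INSERT INTO epc_property (property_id) VALUES (1)")  # pre-existing row

# Step 1: add the discriminator; the default backfills existing rows as lodged.
conn.execute(
    "ALTER TABLE epc_property ADD COLUMN source TEXT NOT NULL DEFAULT 'lodged'"
)

# Steps 2 and 3: uniqueness (and the read index) is per (property_id, source).
conn.execute(
    "CREATE UNIQUE INDEX epc_property_property_source "
    "ON epc_property (property_id, source)"
)

# A predicted row can now sit alongside the lodged one.
conn.execute("INSERT INTO epc_property (property_id, source) VALUES (1, 'predicted')")

rows = conn.execute(
    "SELECT source FROM epc_property WHERE property_id = 1 ORDER BY source"
).fetchall()


def second_lodged_row_is_rejected() -> bool:
    """True when a duplicate (property_id, 'lodged') row violates the index."""
    try:
        conn.execute(
            "INSERT INTO epc_property (property_id, source) VALUES (1, 'lodged')"
        )
    except sqlite3.IntegrityError:
        return True
    return False
```

Whether production wants the composite index to be unique depends on the check in step 2; the sketch assumes one row per slot per Property.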
|
||||
|
||||
## Allowed values
|
||||
|
||||
`source ∈ {'lodged', 'predicted'}` (see `EpcSource` in
|
||||
`repositories/epc/epc_repository.py`). No other values are written.
|
||||
|
||||
## Why
|
||||
|
||||
ADR-0031: predicted EPCs are stored in their own slot rather than overwriting the
|
||||
lodged `epc`, so (a) provenance is structural — the Validation Cohort excludes
|
||||
predicted-sourced Properties and the UI flags them — and (b) lodged + predicted
|
||||
coexist, which the planned **EPC Anomaly Flags** feature needs (compare a
|
||||
Property's lodged EPC against its predicted one).
|
||||
50
docs/adr/0029-epc-prediction-from-comparable-properties.md
Normal file
|
|
@ -0,0 +1,50 @@
|
|||
# EPC Prediction from Comparable Properties
|
||||
|
||||
~30% of UK homes (typically long-tenure) have no EPC. **EPC Prediction** produces a Property's `EpcPropertyData` picture from its **Comparable Properties** so an EPC-less Property flows through the rest of the pipeline (Rebaselining, Bill Derivation, Modelling) unchanged. This records the load-bearing design decisions taken in a grill-with-docs session.
|
||||
|
||||
## Status
|
||||
|
||||
Accepted (design). Implementation pending.
|
||||
|
||||
## Decisions
|
||||
|
||||
### 1. Predict the physical picture, score it with our calculator — never a SAP scalar
|
||||
|
||||
Prediction outputs a structured `EpcPropertyData` (building parts, windows, floor dimensions, construction + insulation, age band); SAP / CO2 / PEI / per-end-use kWh come from running `Sap10Calculator` on it. This is the same "assemble a picture, score once" mechanic as every other **Effective EPC** path (Landlord Overrides, Reduced-Field Synthesis), so a predicted Property is fully usable downstream (bills, measures, optimiser) — a directly-aggregated SAP scalar (legacy `SearchEpc`) would be a dead-end number. It also makes the component-classification accuracy metrics meaningful and keeps errors traceable to a wrong component rather than an opaque regression.
|
||||
|
||||
### 2. Deterministic neighbour synthesis, not ML
|
||||
|
||||
No trained model, no learned weights, no fit pipeline: filter a cohort, take categorical modes, copy a representative template, apply overrides. CONTEXT's prior "ML mechanism" framing is corrected — calling it ML invited the wrong architecture (training data, model artifacts, retraining). A future *learned-weighting* refinement is possible but separate, mirroring the calculator's flagged-future ML residual head. The domain class is `EpcPrediction` (no "Service" suffix, per the `BillDerivation` convention).
|
||||
|
||||
### 3. Fetch-phase fallback, behind a domain service + a cohort repository port
|
||||
|
||||
A pure **`EpcPrediction`** domain service (cohort of comparable `EpcPropertyData` in → predicted `EpcPropertyData` out) sits behind a **`ComparableProperties`** repository port that owns the cohort IO (postcode search → per-cert fetch, cached). It wires into `IngestionOrchestrator._fetch`: when `epc_fetcher.get_by_uprn` returns `None`, fetch the cohort and predict, persisting the picture **marked as predicted** (so the UI flags it and the Validation Cohort excludes it). Baseline and Modelling are untouched. Chosen over a fetcher-decorator (hides a heavy cohort fetch behind `get_by_uprn`) and a dedicated stage (a whole stage for "fill the gap when absent", duplicating IO ingestion already does). The heavy cohort IO stays visible in the no-unit IO phase.
|
||||
|
||||
### 4. Hybrid synthesis: cohort-mode categoricals + a coherent structural template
|
||||
|
||||
You cannot average a list of windows (counts differ; a mean orientation is meaningless) or building parts. So:
|
||||
- **Homogeneous categoricals** (wall / roof / floor construction + insulation, age band) → cohort **mode** (robust to one oddball; drives the classification-rate metrics).
|
||||
- **Structure & geometry** (building parts, per-window dimensions + orientations, floor dimensions) → copied wholesale from a **single representative comparable** chosen to be consistent with those modes and closest on geo + size (internally consistent for the calculator; drives the window-area / building-parts / floor-area residual metrics).
|
||||
- **Landlord Overrides** and the known inputs are applied **on top**.
|
||||
|
||||
Rejected: field-by-field aggregation (legacy — incoherent, may not score sensibly) and single-nearest-neighbour copied wholesale (one atypical neighbour sets the categoricals → weaker classification).
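
The hybrid shape can be sketched as follows. This is a simplified illustration, not the `EpcPrediction` implementation: `Neighbour` is a hypothetical flattened stand-in for a comparable's `EpcPropertyData`, only two categoricals are shown, the modes are unweighted, and the template tie-break uses distance alone where the decision above also considers size.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Neighbour:  # hypothetical stand-in for one comparable
    wall: str
    age_band: str
    windows: Tuple[float, ...]  # per-window areas: copied, never averaged
    distance_m: float


CATEGORICALS = ("wall", "age_band")


def synthesise(
    cohort: List[Neighbour], overrides: Optional[Dict[str, object]] = None
) -> Dict[str, object]:
    # Homogeneous categoricals: take the cohort mode of each field.
    modes = {
        f: Counter(getattr(n, f) for n in cohort).most_common(1)[0][0]
        for f in CATEGORICALS
    }

    def agreement(n: Neighbour) -> int:
        return sum(getattr(n, f) == modes[f] for f in CATEGORICALS)

    # Structure and geometry: copy from the single comparable that agrees
    # with the most modes, nearest first among equals.
    template = min(cohort, key=lambda n: (-agreement(n), n.distance_m))

    picture: Dict[str, object] = {**modes, "windows": template.windows}
    picture.update(overrides or {})  # known inputs are applied on top
    return picture
```

Because the window list comes from one real neighbour, it stays internally consistent even when the categoricals were decided by the wider cohort.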
|
||||
|
||||
### 5. Cohort selection: filter-then-relax ladder, weighted by geo × recency × similarity
|
||||
|
||||
Selection hard-filters on identity (property type, built form) and any **known Landlord Override** (e.g. solid brick — the mixed-street border case) **while ≥ k comparables remain**, widening the geographic scope (postcode → postcode-prefix) or demoting a known to a strong weight when sparse. Survivors are weighted by **geographic proximity** (true coordinates via `GeospatialRepository`, not the legacy house-number proxy) **× recency** (newer EPCs are higher quality) **× physical similarity**; ~~pre-SAP10 / very old certs are dropped~~ (amended by [ADR-0030](0030-epc-prediction-validation-is-sap-version-aware-and-component-first.md): all vintages are kept — components are methodology-agnostic — with recency as a graduated weight; only the *validation target* must be SAP 10.2). So a known field acts twice: upstream on cohort selection, and again as an override on the final picture.
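
As a rough illustration of the weighting, the sketch below multiplies the three factors. The decision fixes only the product and the requirement that recency is graduated; the exponential distance decay, the half-life form, and both default constants are assumptions chosen for the example, and `comparable_weight` is a hypothetical name.

```python
import math
from datetime import date


def comparable_weight(
    distance_m: float,
    registered: date,
    similarity: float,
    *,
    today: date,
    distance_scale_m: float = 200.0,  # assumed decay length, not from the ADR
    half_life_years: float = 10.0,  # assumed recency half-life, not from the ADR
) -> float:
    """Weight = geographic proximity x recency x physical similarity."""
    geo = math.exp(-distance_m / distance_scale_m)
    age_years = (today - registered).days / 365.25
    # Graduated decay: an old cert counts for less but is never dropped.
    recency = 0.5 ** (age_years / half_life_years)
    return geo * recency * similarity
```

With these assumed forms, a cert lodged twenty years ago at the same distance and similarity carries a quarter of the weight of one lodged today, and still contributes evidence.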
|
||||
|
||||
### 6. Dual use: gap-fill (no EPC) and anomaly flags (has EPC)
|
||||
|
||||
The same cohort + comparison machinery produces **EPC Anomaly Flags** for Properties that *do* have an EPC (e.g. "all neighbours are 1930s; this lodges 1950 — correct?") — advisory, surfaced for user review. The no-EPC gap-fill lands first; the always-on anomaly-flag wiring is a follow-on increment.
|
||||
|
||||
## Validation
|
||||
|
||||
> **Superseded by [ADR-0030](0030-epc-prediction-validation-is-sap-version-aware-and-component-first.md).** The SAP-version mixing in the cohort makes the lodged-SAP comparisons below (esp. the neighbour-mean baseline) invalid; validation is now component-first over SAP-10.2-only targets. The frozen-corpus leave-one-out shape stands.
|
||||
|
||||
A **frozen postcode-clustered corpus** (a one-off fetch caches N postcodes × all their certs as `EpcPropertyData`) backs an offline, deterministic, repeatable **leave-one-out** harness over thousands of properties: drop a property with an EPC from its own cohort, predict it, compare predicted vs actual. Metrics: **classification rate** on wall / roof / floor construction + insulation and construction age band; **residuals** on SAP, total window area + window count, building-parts count, total floor area. SAP is reported three ways to attribute error — predicted-then-calculated vs (a) lodged SAP (end-to-end), (b) calculator-on-actual-components (isolates prediction error), (c) a direct neighbour-mean SAP baseline (proves predict-then-calculate beats naïve averaging).
|
||||
|
||||
## Open (implementation-level)
|
||||
|
||||
- **Provenance marker** on the picture (predicted vs real) — exact representation TBD; needed for the UI flag and Validation Cohort exclusion.
|
||||
- **No-cohort fallback** when zero comparables survive even after relaxing (low-confidence national property-type + age defaults, or skip-with-flag).
|
||||
- **Confidence signal** (cohort size + agreement) carried for the UI and anomaly thresholds.
|
||||
|
|
@ -0,0 +1,53 @@
|
|||
# EPC Prediction validation is SAP-version-aware and component-first
|
||||
|
||||
**Status: Accepted.** Supersedes the **Validation** section of [ADR-0029](0029-epc-prediction-from-comparable-properties.md) and amends its decision 5 (cohort selection). All other ADR-0029 decisions stand (predict a picture and score it; deterministic neighbour synthesis; fetch-phase fallback; hybrid mode + template synthesis; dual gap-fill / anomaly use).
|
||||
|
||||
## Why this ADR exists
|
||||
|
||||
ADR-0029's validation rested on a three-way SAP comparison, including a **neighbour-mean-of-lodged-SAP baseline** that predict-then-calculate was meant to beat. A second-order problem was invisible when that was written: the gov EPC register spans **multiple SAP spec versions**, and a property's neighbours are mostly *older* certs. In our development corpus only **~16% of certs are SAP 10.2** (`sap_version == 10.2`, schema 21.0.0 / 21.0.1); the rest were lodged under SAP 2012 (RdSAP 9.x). The same dwelling scores a *different* SAP under different spec versions, so:
|
||||
|
||||
- **Averaging neighbours' lodged SAP is invalid** — it blends 2012 and 10.2 numbers to estimate a 10.2 target. The ADR-0029 "baseline" was never a fair comparator; on the real corpus it appeared to *beat* prediction purely as an artifact of this mixing. It is removed.
|
||||
- **Comparing our calculator's output to a neighbour's lodged figure is only meaningful within a same-spec cohort** — the existing **SAP Spec Version** / **Validation Cohort** rule (ADR-0010) already said this for calculator validation; it applies equally here.
|
||||
|
||||
Separately, measuring `calc(predicted)` against the held-out cert's **lodged** SAP conflates two unrelated errors: the *prediction* error and the calculator's own **API-path residual** (~3 SAP on random gov-API certs today — a known, *separate* workstream, since the calculator pins at 1e-4 only on the Elmhurst worksheets). A perfect prediction still scores ~3 off lodged. So lodged-SAP error is the wrong thing to tune prediction against.
|
||||
|
||||
## Decisions
|
||||
|
||||
### 1. The source cohort keeps all cert vintages; only the validation **target** is SAP 10.2
|
||||
|
||||
A building's physical **components** (wall / roof / floor / heating fuel / age band) are agnostic of the SAP methodology that rated them — a pre-SAP10 neighbour is valid *evidence* about the street. Dropping pre-SAP10 certs from the cohort (ADR-0029 decision 5) would discard ~84% of neighbours and gut prediction. So: **all vintages stay in the Comparable Properties cohort**, with recency as a graduated *weight* (never a hard drop), mattering most for the one component that genuinely goes stale — the heating system, when a boiler is replaced. Only the held-out **validation target** is restricted to `sap_version == 10.2`, the only vintage with full-fidelity lodged components to check against. (Target selection uses the API `sap_version` field directly, not the `inspection_date ≥ 2025-07-01` proxy.)
|
||||
|
||||
### 2. **Component Accuracy** is the primary, calculator-independent signal
|
||||
|
||||
Prediction is tuned against how closely the predicted `EpcPropertyData` *components* match the actual ones — **not** against any SAP score. Scored by leave-one-out over a 10.2 target: categoricals as a classification hit-rate (with `None` = not-applicable excluded from the denominator), numerics as a residual. Coverage spans the SAP-load-bearing components, led by **heating** (the proven dominant lever — ablating heating to the actual cuts the SAP error from ~7 to ~4.5):
|
||||
|
||||
- **Heating** — main fuel, main category, main control, water-heating fuel/code, has-cylinder, cylinder insulation, secondary heating
|
||||
- **Fabric** — wall / roof / floor construction + insulation, age band (plus a **±1-band** rate, since adjacent bands ≈ same U-value), room-in-roof
|
||||
- **Glazing** — modal glazing type; window count + total-area residuals
|
||||
- **Counts / geometry** — door count, building-parts count, floor area
|
||||
- **Renewables** — PV presence, solar water heating
|
||||
|
||||
Load-bearing principle: **predict the components well and correct SAP / carbon / PE fall out once calculator gaps close.** Component Accuracy makes progress even while the calculator moves underneath us.
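
The two categorical scores described above could be computed along these lines. This is a sketch under stated assumptions, not the `validation.py` code: it treats a `None` *actual* as the not-applicable case, represents age bands as ordinal integers, and both function names are hypothetical.

```python
from typing import Optional, Sequence


def hit_rate(
    predicted: Sequence[Optional[str]], actual: Sequence[Optional[str]]
) -> Optional[float]:
    """Classification hit-rate over targets where the component applies.

    Pairs whose actual value is None are left out of the denominator.
    """
    scored = [(p, a) for p, a in zip(predicted, actual) if a is not None]
    if not scored:
        return None  # nothing applicable to score
    return sum(p == a for p, a in scored) / len(scored)


def within_one_band_rate(predicted: Sequence[int], actual: Sequence[int]) -> float:
    """Age-band rate that also counts an adjacent band as a hit."""
    return sum(abs(p - a) <= 1 for p, a in zip(predicted, actual)) / len(actual)
```

Neither function touches the calculator, which is the point of the signal: it can be tracked while the calculator changes.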
|
||||
|
||||
### 3. `calc(predicted)` vs **API-lodged** SAP / carbon / PE is a secondary, calculator-floored check — and two comparisons are rejected
|
||||
|
||||
The end-to-end number — does the predicted picture score like the official 10.2 EPC — is reported but **not** the thing we drive to zero: it is floored by the calculator's API-path residual and improves as *both* prediction and the calculator workstream land. Carbon and PE are *more* version-sensitive than SAP (grid factors shifted sharply between SAP 2012 and 10.2), so they too are compared only on 10.2 targets.
|
||||
|
||||
Rejected:
|
||||
- **`calc(predicted)` vs `calc(actual)`** — cancels (and so *hides*) calculator error against a *circular* ground truth (our own calculator); a systematically wrong prediction in the calculator's blind spot would score perfectly. Not a validation signal; at most an internal attribution diagnostic.
|
||||
- **neighbour-mean-of-lodged-SAP baseline** — mixes SAP versions (see above).
|
||||
|
||||
No synthetic SAP-weighted Component Accuracy index: weighting components by SAP impact reintroduces the calculator. The per-component table stays flat; the end-to-end MAE *is* the holistic rollup.
|
||||
|
||||
### 4. Three validation tiers, one shared scorer
|
||||
|
||||
- **Tier 1 — committed CI gate.** A small, **anonymized**, frozen fixture under `tests/fixtures/` (addresses hashed — the predictor uses address only as a dedup key — `post_town` dropped; postcode + component fields retained; gov data is OGL). A **ratcheting regression gate**: each per-component floor / residual ceiling is the currently-measured value and only ever *tightens* (honouring the repo's no-tolerance-widening ethos); a regression fails the build. End-to-end SAP / carbon / PE thresholds are loose and explicitly **calculator-floored** — gross-regression guards, not targets. Gates when the fixture is present; skips with a message otherwise.
|
||||
- **Tier 1.5 — S3-hosted scale run (near-term).** A few-thousand-cert anonymised corpus stored in **S3** rather than committed to git (too large to commit, but far more statistically stable than the 36-target gate fixture). The integration test pulls it to a temp dir and runs the *same* `load_corpus` + `evaluate_component_accuracy`, then reports / asserts. This is the immediate realization of "validate at scale" — it reuses the committed-fixture machinery wholesale (only the data *source* differs) and needs no bulk-export streaming. Its numbers re-baseline the Tier-1 floors.
|
||||
- **Tier 2 — offline national battle-test (deferred).** Built on `harness/epc_bulk.py` (streams the gov **bulk export** via HTTP range requests, filtered by `sap_version`) and `harness/cohort.py` (offline sweep that **captures per-cert raises** instead of aborting). Streams the register and **buckets by postcode** — because bulk is the *whole* register, every postcode is dense, giving national breadth *and* dense cohorts at once. Over tens of thousands of 10.2 targets it emits the Component Accuracy table, the end-to-end MAE, **and a failure taxonomy** (unsupported-schema, mapper raise, calculator raise, no-cohort, no-10.2-target) — the battle-test half. Not committed, not CI-gated; its numbers periodically **re-baseline the Tier-1 floors**.
|
||||
|
||||
All tiers run the *same* `evaluate_component_accuracy` / `compare_prediction` logic over `load_corpus` — one scorer, three data sources (committed fixture, S3 corpus, bulk stream).
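
The ratcheting behaviour of the Tier-1 gate can be illustrated with a small sketch. The function names and the dict-of-floors representation are assumptions; only the hit-rate floors are shown, and residual ceilings would ratchet in the opposite direction.

```python
from typing import Dict, List


def regressions(measured: Dict[str, float], floors: Dict[str, float]) -> List[str]:
    """Components whose measured hit-rate fell below the committed floor."""
    return sorted(
        name for name, floor in floors.items() if measured.get(name, 0.0) < floor
    )


def tighten(measured: Dict[str, float], floors: Dict[str, float]) -> Dict[str, float]:
    """Re-baseline: a floor may rise to the measured value but never drops."""
    return {
        name: max(floor, measured.get(name, floor)) for name, floor in floors.items()
    }
```

A CI test would fail when `regressions(...)` is non-empty, and a re-baseline from a larger corpus would commit the output of `tighten(...)`.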
|
||||
|
||||
## Consequences
|
||||
|
||||
- ADR-0029's "Validation" section and its decision-5 clause "pre-SAP10 / very old certs are dropped" are superseded by the above. The CONTEXT terms **Comparable Properties** (all-vintage source) and **Component Accuracy** (new) are updated to match.
|
||||
- The Tier-1 fixture is the first committed gov-EPC fixture sized for *statistical* stability rather than worksheet-exact pinning — a deliberate departure from the repo's 1e-4 pin convention, justified by prediction's irreducible error.
|
||||
87
docs/adr/0031-epc-prediction-production-wiring.md
Normal file
|
|
@ -0,0 +1,87 @@
|
|||
# EPC Prediction production wiring
|
||||
|
||||
ADR-0029 settled *how* EPC Prediction synthesises a Property's `EpcPropertyData`
|
||||
from its **Comparable Properties**; this records *how it wires into the running
|
||||
pipeline* — where estimation runs, how a predicted EPC is stored and told apart
|
||||
from a lodged one, and which Properties are eligible. Resolved in a
|
||||
grill-with-docs session after the algorithm + validation harness were built and
|
||||
the accuracy backlog (#1222–1228) closed.
|
||||
|
||||
## Status
|
||||
|
||||
Accepted (design). Refines ADR-0029 decision 3. Implementation pending (slice 5).
|
||||
|
||||
## Decisions
|
||||
|
||||
### 1. Estimation runs in Ingestion — the #1227 "shift to Modelling" is dropped
|
||||
|
||||
The cohort fetch + predict happens in `IngestionOrchestrator`, when
|
||||
`epc_fetcher.get_by_uprn` returns `None` — upholding ADR-0029 decision 3. A
|
||||
design note from issue #1227 had proposed moving estimation (and its distance
|
||||
calcs) into `ModellingOrchestrator`; that is reversed here. Ingestion is already
|
||||
the EPC-acquisition phase and *already resolves the Property's coordinates*
|
||||
(`spatial`), so it can run the geo-weighted prediction with no new IO surface;
|
||||
the First Run stages communicate **only through persisted state** (the pipeline
|
||||
threads just `property_ids`, each stage reloads the `Property`), so a prediction
|
||||
produced in Modelling would either have to be persisted there anyway or
|
||||
recomputed every run. No rationale for the Modelling shift survived review.
|
||||
Baseline and Modelling stay untouched — they read a populated `effective_epc`.
|
||||
|
||||
### 2. The predicted EPC is persisted in a distinct slot, never overwriting the lodged one
|
||||
|
||||
Because stages communicate via persisted state, the prediction **must be saved**
|
||||
for Modelling to see it — in-memory-only would never reach stage 3. It is stored
|
||||
as a **distinct predicted-EPC slot** on the Property (the EPC table reused with a
|
||||
`source` discriminator — `lodged` / `predicted`), so a lodged EPC and a predicted
|
||||
EPC can **coexist** on one Property. Coexistence is load-bearing: it is what lets
|
||||
the same cohort machinery produce **EPC Anomaly Flags** for Properties that *do*
|
||||
have an EPC (the dual-use named in ADR-0029), and it means a later real-EPC fetch
|
||||
fills its own slot without the predicted one muddying provenance. Rejected: a
|
||||
single EPC slot with an `is_predicted` flag — it cannot hold both, so it forecloses
|
||||
anomaly detection and makes "lodged later arrives" ambiguous.
|
||||
|
||||
### 3. Provenance is structural — on the Property, not on `EpcPropertyData`
|
||||
|
||||
`EpcPropertyData` gains no `predicted` / `source` field. Which slot the picture
|
||||
came from *is* its provenance. `Property.effective_epc` / `source_path` gain a
|
||||
`"predicted"` branch, used only when there is no lodged EPC **and** no Site Notes
|
||||
(the existing precedence is unchanged; a real source always wins). The
|
||||
**Validation Cohort** then excludes any Property whose `effective_epc` resolves
|
||||
via the predicted slot — it has no same-spec lodged ground truth — and the UI
|
||||
flags it as predicted. Keeping `EpcPropertyData` clean means every downstream
|
||||
consumer (calculator, generators, bill derivation) is unchanged and oblivious to
|
||||
how the picture was sourced, exactly as for Landlord-Override and Reduced-Field
|
||||
pictures.
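
The precedence rule reads roughly as the sketch below. It is not the `Property` aggregate: the function name is hypothetical, and the `"real"` return value is a placeholder, since the existing precedence between lodged and Site Notes sources is unchanged and not modelled here.

```python
from typing import Optional


def resolve_source_path(
    has_lodged: bool, has_site_notes: bool, has_predicted: bool
) -> Optional[str]:
    """Return "predicted" only when no real source exists."""
    if has_lodged or has_site_notes:
        # Placeholder label: the existing precedence among real sources
        # still applies and is deliberately left out of this sketch.
        return "real"
    return "predicted" if has_predicted else None
```

A Validation Cohort filter or a UI flag would then key off the `"predicted"` result alone, with no field needed on `EpcPropertyData`.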
|
||||
|
||||
### 4. Slice 5 is gap-fill only; always-predict (anomaly) is a follow-on
|
||||
|
||||
Prediction runs only when `epc is None`. Predicting for *every* Property to
|
||||
compare against its lodged EPC (EPC Anomaly Flags) is real and the slot model
|
||||
supports it, but it triples the ingestion cohort IO and needs its own
|
||||
comparison + divergence-threshold + UI surface — so it does not ride in on the
|
||||
wiring slice. The predicted-EPC slot and the `ComparableProperties` repository
|
||||
this slice introduces are exactly what the anomaly capability reuses.
|
||||
|
||||
### 5. A known property type is required — eligibility is gated, never defaulted
|
||||
|
||||
A `PredictionTarget` needs `postcode` (from `PropertyIdentity`), `coordinates`
|
||||
(geospatial), and `property_type` + `built_form` + `wall_construction` from
|
||||
**Landlord Overrides**. `property_type` is the **hard** cohort filter (a flat
|
||||
must not be sized from houses), so it is a **required input**: a Property whose
|
||||
property type is genuinely unknown is **gated out** before prediction — flagged
|
||||
un-predictable, never predicted from a mixed-type cohort and never given a
|
||||
national default. (An Ordnance Survey `postcode_search` source can supply
|
||||
property type more broadly than landlord input does; wiring it is a later
|
||||
enhancement that widens the eligible population — out of scope here.) The
|
||||
`ComparableProperties` repository port deferred by ADR-0029 is built in this
|
||||
slice: it owns the cohort IO (postcode search → per-cert fetch → UPRN→coordinate
|
||||
resolution) and returns candidate `Comparable`s for the domain
|
||||
`select_comparables` to filter.
|
||||
|
||||
## Consequences
|
||||
|
||||
- A schema change: the EPC store gains a `source` discriminator (or equivalent),
|
||||
and the Property repository a `get_predicted_for_property` read.
|
||||
- Slice 5's reach is bounded to EPC-less Properties with a landlord-supplied
|
||||
property type. The `postcode_search` integration is the lever to broaden it.
|
||||
- `EpcPropertyData` stays unchanged, so no downstream consumer is touched.
|
||||
69
domain/epc_prediction/README.md
Normal file
|
|
@ -0,0 +1,69 @@
|
|||
# EPC Prediction
|
||||
|
||||
Predict a structured `EpcPropertyData` for an **EPC-less** UK home from its
|
||||
postcode neighbours, so it flows through the rest of the pipeline (Baseline, Bill
|
||||
Derivation, Modelling) exactly like a home that has an EPC. It is **deterministic
|
||||
neighbour synthesis** — cohort modes + a coherent template + per-component
|
||||
weighting — **not ML**. ~30% of UK homes (typically long-tenure) have no EPC.
|
||||
|
||||
- **Design**: [ADR-0029](../../docs/adr/0029-epc-prediction-from-comparable-properties.md) (algorithm),
|
||||
[ADR-0030](../../docs/adr/0030-epc-prediction-validation-is-sap-version-aware-and-component-first.md) (validation),
|
||||
[ADR-0031](../../docs/adr/0031-epc-prediction-production-wiring.md) (production wiring).
|
||||
- **Glossary**: see *EPC Prediction*, *Comparable Properties*, *Component
|
||||
Accuracy*, *EPC Anomaly Flag* in [CONTEXT.md](../../CONTEXT.md).
|
||||
|
||||
## The flow (gap-fill)
|
||||
|
||||
```
|
||||
Ingestion: a Property has no lodged EPC (epc_fetcher.get_by_uprn → None)
|
||||
│
|
||||
├─ resolve its attributes (property_type/built_form/wall) from Landlord Overrides
|
||||
│ └─ property_type unknown? → GATED OUT, not predicted (no national defaults)
|
||||
├─ build a PredictionTarget (postcode + coordinates + attributes)
|
||||
├─ ComparableProperties repo: fetch the postcode cohort (search → per-cert → coords)
|
||||
├─ select_comparables(): filter to the reference cohort (type-hard, built-form-soft)
|
||||
├─ EpcPrediction.predict(): synthesise the picture (modes + template + donor + weights)
|
||||
└─ persist to the Property's PREDICTED slot (source = "predicted")
|
||||
│
|
||||
Modelling/Baseline: Property.effective_epc returns the predicted picture
|
||||
(source_path == "predicted"), scored like any other Effective EPC.
|
||||
```
|
||||
|
||||
A lodged EPC always wins — prediction is last-resort gap-fill.
|
||||
|
||||
## Where the pieces live
|
||||
|
||||
| Concern | File |
|
||||
|---|---|
|
||||
| Synthesis (modes, template, heating donor, geo/recency/similarity weights) | `epc_prediction.py` |
|
||||
| Cohort selection (filter-then-relax ladder) | `comparable_properties.py` |
|
||||
| Target assembly + eligibility gate | `prediction_target.py` |
|
||||
| Cohort IO port + EPC-API/geospatial adapter | `repositories/comparable_properties/` |
|
||||
| Predicted-EPC persistence (`source` discriminator) | `repositories/epc/` |
|
||||
| `predicted` source path on the aggregate | `domain/property/property.py` |
|
||||
| Ingestion wiring (gate → predict → persist) | `orchestration/ingestion_orchestrator.py` |
|
||||
| Validation (leave-one-out, component-first) + ratcheting gate | `validation.py`, `tests/domain/epc_prediction/test_component_accuracy_gate.py` |
|
||||
|
||||
## See it run
|
||||
|
||||
`tests/e2e/test_epc_prediction_e2e.py` — the whole flow against the real DB +
|
||||
repos, only the external HTTP clients faked. Start there.
|
||||
|
||||
## Status
|
||||
|
||||
Algorithm + validation: **built**. Production gap-fill wiring: **built behind
|
||||
seams** (slices 5a–5e). Two things finish it — a DB migration and the
|
||||
`property_overrides` read adapter — see
|
||||
[the wiring handover](../../docs/HANDOVER_EPC_PREDICTION_WIRING.md) and
|
||||
[the migration note](../../docs/MIGRATION_NOTE_predicted_epc_source.md).
|
||||
**EPC Anomaly Flags** (predict for *every* home, compare to lodged) is the
|
||||
designed next step the storage already supports.
|
||||
|
||||
## Run the tests
|
||||
|
||||
```bash
|
||||
PYTHONPATH=. python -m pytest tests/e2e/test_epc_prediction_e2e.py \
|
||||
tests/domain/epc_prediction tests/orchestration/test_ingestion_prediction.py \
|
||||
tests/repositories/comparable_properties tests/repositories/epc/test_epc_predicted_slot.py \
|
||||
-o addopts="" -q
|
||||
```
|
||||
0
domain/epc_prediction/__init__.py
Normal file
126
domain/epc_prediction/comparable_properties.py
Normal file
|
|
@ -0,0 +1,126 @@
|
|||
"""Comparable Properties selection for EPC Prediction (ADR-0029).
|
||||
|
||||
Given a `PredictionTarget` (the known inputs for an EPC-less Property) and the
|
||||
raw postcode cohort of candidate `ComparableProperty` objects, `select_comparables`
|
||||
chooses the reference cohort EPC Prediction synthesises from. Pure domain logic —
|
||||
the cohort IO (postcode search → per-cert fetch) lives behind a repository port.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from dataclasses import dataclass
|
||||
from datetime import date
|
||||
from typing import Callable, Optional
|
||||
|
||||
from datatypes.epc.domain.epc_property_data import EpcPropertyData
|
||||
from domain.epc_prediction.prediction_target import PredictionTarget
|
||||
from domain.geospatial.coordinates import Coordinates
|
||||
|
||||
# Default floor on the cohort: a conditioning filter (built form, a known
|
||||
# override) is applied only while at least this many comparables survive it,
|
||||
# else it is relaxed (ADR-0029 filter-then-relax ladder).
|
||||
_DEFAULT_MINIMUM_COHORT = 5
|
||||
|
||||
|
||||
@dataclass(frozen=True)
class ComparableProperty:
    """One candidate neighbour: its structured `EpcPropertyData` picture plus the
    register metadata not carried on the cert (identity for leave-one-out
    exclusion; recency + address for weighting + re-lodgement dedup)."""

    epc: EpcPropertyData
    certificate_number: str
    address: Optional[str] = None
    registration_date: Optional[date] = None
    # Resolved from the neighbour's UPRN at the boundary (the harness / modelling
    # orchestrator), so the pure predictor can weight by physical distance to the
    # target without an IO dependency. None when no UPRN/coordinate is available.
    coordinates: Optional[Coordinates] = None


@dataclass(frozen=True)
class ComparableProperties:
    """The selected reference cohort for a `PredictionTarget`."""

    members: tuple[ComparableProperty, ...]
|
||||
|
||||
|
||||
def _maybe_filter(
    cohort: list[ComparableProperty],
    predicate: Callable[[ComparableProperty], bool],
    *,
    active: bool,
    minimum_cohort: int,
) -> list[ComparableProperty]:
    """Apply a conditioning filter only while it leaves at least
    `minimum_cohort` comparables; otherwise relax it (keep the pre-filter
    cohort). This is the filter-then-relax ladder (ADR-0029)."""
    if not active:
        return cohort
    filtered = [c for c in cohort if predicate(c)]
    return filtered if len(filtered) >= minimum_cohort else cohort
|
||||
|
||||
|
||||
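The relax rule in `_maybe_filter` can be seen in isolation. The sketch below restates it on plain integers; the name `maybe_filter` and the toy cohort are illustrative only and are not part of the module. It assumes the same semantics: a filter is kept only when at least `minimum_cohort` items survive it.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def maybe_filter(
    cohort: list[T],
    predicate: Callable[[T], bool],
    *,
    active: bool,
    minimum_cohort: int,
) -> list[T]:
    """Illustrative stand-in for `_maybe_filter`, generic over the item type."""
    if not active:
        return cohort
    filtered = [item for item in cohort if predicate(item)]
    # Keep the filter only when enough items survive it; otherwise relax.
    return filtered if len(filtered) >= minimum_cohort else cohort


cohort = [1, 2, 3, 4, 5, 6, 7, 8]
# Six items survive, so the filter is kept.
kept = maybe_filter(cohort, lambda n: n <= 6, active=True, minimum_cohort=5)
# Only two items survive, below the floor of five, so the filter is relaxed.
relaxed = maybe_filter(cohort, lambda n: n > 6, active=True, minimum_cohort=5)
```

Relaxing returns the pre-filter cohort unchanged, so a sparse postcode degrades to a broader but still usable cohort instead of an empty one.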
def select_comparables(
    target: PredictionTarget,
    candidates: list[ComparableProperty],
    *,
    minimum_cohort: int = _DEFAULT_MINIMUM_COHORT,
) -> ComparableProperties:
    """Select the Comparable Properties for `target` from the raw postcode
    cohort. The register lists every historical lodgement, so first dedupe each
    address to its latest cert (one comparable per real neighbour); then property
    type is an always-hard filter (a flat is never a comparable for a house),
    and built form and a known wall construction are conditioning filters on
    the relax ladder."""
    cohort = _dedupe_to_latest_per_address(candidates)
    cohort = [
        c for c in cohort if c.epc.property_type == target.property_type
    ]
    cohort = _maybe_filter(
        cohort,
        lambda c: c.epc.built_form == target.built_form,
        active=target.built_form is not None,
        minimum_cohort=minimum_cohort,
    )
    cohort = _maybe_filter(
        cohort,
        lambda c: _main_wall_construction(c) == target.wall_construction,
        active=target.wall_construction is not None,
        minimum_cohort=minimum_cohort,
    )
    return ComparableProperties(members=tuple(cohort))


def _dedupe_to_latest_per_address(
    candidates: list[ComparableProperty],
) -> list[ComparableProperty]:
    """Collapse the register's re-lodgements: keep one comparable per address,
    the latest by registration date (ties broken by certificate number, for
    determinism), so a re-lodged neighbour does not count more than once.
    Candidates with no address are passed through untouched (each is its own
    neighbour) and appended after the deduplicated ones. Addressed candidates
    keep the order in which each address first appears."""
    latest: dict[str, ComparableProperty] = {}
    passthrough: list[ComparableProperty] = []
    for c in candidates:
        if c.address is None:
            passthrough.append(c)
            continue
        incumbent = latest.get(c.address)
        if incumbent is None or _recency_key(c) > _recency_key(incumbent):
            latest[c.address] = c
    return list(latest.values()) + passthrough


def _recency_key(comparable: ComparableProperty) -> tuple[date, str]:
    """Sort key making the most recent (then highest cert number) win. A missing
    registration date sorts oldest."""
    return (
        comparable.registration_date or date.min,
        comparable.certificate_number,
    )


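A small worked example may help with the ordering the dedup step produces. This is a sketch only: `Cert`, `recency_key` and `dedupe_to_latest` are hypothetical stand-ins that carry just the register metadata, assuming the same rule as `_dedupe_to_latest_per_address` and `_recency_key`.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Cert:
    """Hypothetical stand-in for `ComparableProperty` (register metadata only)."""

    certificate_number: str
    address: Optional[str] = None
    registration_date: Optional[date] = None


def recency_key(cert: Cert) -> tuple[date, str]:
    # A missing date sorts oldest; the cert number breaks exact-date ties.
    return (cert.registration_date or date.min, cert.certificate_number)


def dedupe_to_latest(certs: list[Cert]) -> list[Cert]:
    latest: dict[str, Cert] = {}
    passthrough: list[Cert] = []
    for cert in certs:
        if cert.address is None:
            passthrough.append(cert)
            continue
        incumbent = latest.get(cert.address)
        if incumbent is None or recency_key(cert) > recency_key(incumbent):
            latest[cert.address] = cert
    return list(latest.values()) + passthrough


certs = [
    Cert("0001", "1 High St", date(2015, 3, 1)),
    Cert("0002", None, date(2020, 1, 1)),
    Cert("0003", "1 High St", date(2022, 6, 1)),  # re-lodgement, newer
    Cert("0004", "2 High St", None),
]
survivors = [c.certificate_number for c in dedupe_to_latest(certs)]
```

The newer re-lodgement replaces the older cert in the slot where its address first appeared, and the address-less cert moves to the end of the list.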
def _main_wall_construction(comparable: ComparableProperty) -> object:
    """The main building part's wall construction, or None when no part is
    lodged."""
    parts = comparable.epc.sap_building_parts
    return parts[0].wall_construction if parts else None
546
domain/epc_prediction/epc_prediction.py
Normal file
@ -0,0 +1,546 @@
"""EPC Prediction synthesis (ADR-0029).

`EpcPrediction.predict` turns the selected `ComparableProperties` into a
predicted `EpcPropertyData`: copy a coherent representative template's structure
(building parts, windows, geometry), set the homogeneous categoricals to the
weighted cohort mode (similarity, geo-proximity or recency, depending on the
component), then apply Landlord Overrides on top. Pure domain logic:
deterministic neighbour synthesis, not ML.
"""

from __future__ import annotations

import copy
import math
import statistics
from collections import Counter, defaultdict
from dataclasses import dataclass
from datetime import date
from typing import Callable, Iterable, Optional, Union

from datatypes.epc.domain.epc_property_data import (
    EpcPropertyData,
    MainHeatingDetail,
    SapBuildingPart,
)
from domain.epc_prediction.comparable_properties import (
    ComparableProperty,
    ComparableProperties,
)
from domain.epc_prediction.prediction_target import PredictionTarget
from domain.geospatial.coordinates import Coordinates


@dataclass(frozen=True)
class PredictionConfidence:
    """A compute-only confidence signal for a prediction (ADR-0029 open item).

    `cohort_size` is the number of Comparable Properties the prediction drew on;
    `component_agreement` maps a homogeneous component to the cohort's
    *agreement*: the modal value's share (0..1) of the neighbours that lodge
    one. A small or split cohort flags a component downstream may want to treat
    cautiously (e.g. the per-dwelling fields with a low accuracy ceiling).
    Surfacing / persisting this is a separate HITL follow-up; here it is
    computed only.
    """

    cohort_size: int
    component_agreement: dict[str, float]

    def agreement(self, component: str) -> Optional[float]:
        """The cohort's modal-value share for a component, or None when no
        neighbour lodges one (it was not applicable)."""
        return self.component_agreement.get(component)


class EpcPrediction:
    """Synthesises a predicted `EpcPropertyData` from Comparable Properties."""

    def predict(
        self, target: PredictionTarget, comparables: ComparableProperties
    ) -> EpcPropertyData:
        """Predict the target's EPC picture: copy a representative template's
        structure (coherent for the calculator), set the predicted floor area to
        the geo-weighted cohort median (the best point estimate of the target's
        size, decoupled from the one template's own area), set the homogeneous
        categoricals and the glazing type to their weighted cohort modes, copy
        the heating cluster from a coherent donor, then apply the known Landlord
        Overrides on top."""
        template: ComparableProperty = self._template(comparables)
        predicted: EpcPropertyData = copy.deepcopy(template.epc)
        predicted.total_floor_area_m2 = _geo_weighted_floor_area(
            comparables.members, target.coordinates
        )
        self._apply_categorical_modes(predicted, comparables, target.coordinates)
        self._apply_glazing_mode(predicted, comparables, target.coordinates)
        self._apply_heating_donor(predicted, comparables)
        self._apply_overrides(predicted, target)
        return predicted

    @staticmethod
    def _apply_heating_donor(
        predicted: EpcPropertyData, comparables: ComparableProperties
    ) -> None:
        """Replace the structural template's heating with a coherent donor's whole
        `SapHeating` cluster (ADR-0029; issue #1225). Heating sub-fields can't be
        field-moded without breaking system coherence (e.g. a fuel that doesn't
        match the emitter), so the cluster is copied as a unit from a single
        neighbour: the one matching the cohort's modal heating *signature* (main
        fuel + category + cylinder presence), the most recent among those matches
        (a recent cert reflects the current system). This makes the predicted
        heating both representative and internally coherent, rather than whatever
        the size-representative template happened to carry. No donor (no neighbour
        lodges a main heating system) leaves the template's heating in place."""
        donor = _heating_donor(comparables.members)
        if donor is None:
            return
        predicted.sap_heating = copy.deepcopy(donor.epc.sap_heating)
        predicted.has_hot_water_cylinder = donor.epc.has_hot_water_cylinder
        predicted.solar_water_heating = donor.epc.solar_water_heating

    @staticmethod
    def _apply_glazing_mode(
        predicted: EpcPropertyData,
        comparables: ComparableProperties,
        target_coordinates: Optional[Coordinates],
    ) -> None:
        """Set every window's glazing type to the recency- and geo-weighted cohort
        mode. Glazing is retrofitted over a dwelling's life (single → double), so
        a recent neighbour reflects the current state (recency, like roof
        insulation); it also varies geographically (retrofit waves by street), so
        a nearer neighbour counts for more. NOT the plain mode (which regressed)
        or the template copy. The window geometry (size, count) is left on the
        template; only the glazing categorical moves."""
        members = comparables.members
        weights = _combine(
            _recency_weights(members), _geo_weights(target_coordinates, members)
        )
        glazing = _weighted_mode(
            (_comparable_modal_glazing(c) for c in members), weights
        )
        if glazing is None:
            return
        for window in predicted.sap_windows:
            window.glazing_type = glazing

    def confidence(
        self, comparables: ComparableProperties
    ) -> PredictionConfidence:
        """Compute the per-prediction confidence from the cohort: its size plus,
        for each homogeneous categorical, the modal value's share among the
        neighbours that lodge one (ADR-0029). Compute-only: it never alters the
        prediction, only annotates how much the cohort agreed."""
        members: tuple[ComparableProperty, ...] = comparables.members
        agreement: dict[str, float] = {}
        for attr in _MAIN_PART_CATEGORICALS:
            share: Optional[float] = _modal_share(
                _main_part_attr(c, attr) for c in members
            )
            if share is not None:
                agreement[attr] = share
        for attr in _FLOOR_DIM_CATEGORICALS:
            floor_share: Optional[float] = _modal_share(
                _main_floor_attr(c, attr) for c in members
            )
            if floor_share is not None:
                agreement[attr] = floor_share
        return PredictionConfidence(
            cohort_size=len(members), component_agreement=agreement
        )

    @staticmethod
    def _template(comparables: ComparableProperties) -> ComparableProperty:
        """The representative comparable whose structure seeds the prediction:
        the member whose floor area is closest to the cohort median. A single
        neighbour's geometry is copied wholesale, so a size-representative
        template keeps the prediction off the cohort's size outliers (ADR-0029
        decision 4: closest on size)."""
        members: tuple[ComparableProperty, ...] = comparables.members
        median_area: float = statistics.median(
            c.epc.total_floor_area_m2 for c in members
        )
        return min(
            members,
            key=lambda c: abs(c.epc.total_floor_area_m2 - median_area),
        )

    @staticmethod
    def _apply_categorical_modes(
        predicted: EpcPropertyData,
        comparables: ComparableProperties,
        target_coordinates: Optional[Coordinates],
    ) -> None:
        """Override the predicted picture's homogeneous categoricals (wall /
        roof / floor construction + insulation, age band) with the cohort mode
        (robust to an atypical template, per ADR-0029 decision 4). The mode is
        physical-similarity-weighted (decision 5): each neighbour's vote decays
        with its distance from the cohort's physical centre, so the mode leans on
        the most representative neighbours. The components that vary
        *geographically* (age band, wall construction, floor construction; homes
        built together cluster) additionally take a geo-proximity weight, so a
        nearer neighbour counts for more; the rest (e.g. roof construction, which
        showed no geo signal) do not. Time-varying categoricals
        (`_RECENCY_WEIGHTED_CATEGORICALS`) take the recency-weighted mode
        instead. The template still supplies the geometry; only the categorical
        codes move to the mode."""
        if not predicted.sap_building_parts:
            return
        main: SapBuildingPart = predicted.sap_building_parts[0]
        members = comparables.members
        similarity: list[float] = _similarity_weights(members)
        geo: list[float] = _geo_weights(target_coordinates, members)
        similarity_geo: list[float] = _combine(similarity, geo)
        for attr in _MAIN_PART_CATEGORICALS:
            if attr in _RECENCY_WEIGHTED_CATEGORICALS:
                mode = _recency_weighted_mode(members, attr)
            else:
                weights = (
                    similarity_geo
                    if attr in _GEO_WEIGHTED_CATEGORICALS
                    else similarity
                )
                mode = _weighted_mode(
                    (_main_part_attr(c, attr) for c in members), weights
                )
            if mode is not None:
                setattr(main, attr, mode)
        floor_dims = main.sap_floor_dimensions
        if floor_dims:
            for attr in _FLOOR_DIM_CATEGORICALS:
                floor_weights = (
                    similarity_geo
                    if attr in _GEO_WEIGHTED_CATEGORICALS
                    else similarity
                )
                floor_mode = _weighted_int_mode(
                    (_main_floor_attr(c, attr) for c in members), floor_weights
                )
                if floor_mode is not None:
                    setattr(floor_dims[0], attr, floor_mode)

    @staticmethod
    def _apply_overrides(
        predicted: EpcPropertyData, target: PredictionTarget
    ) -> None:
        """Apply the known Landlord Overrides on top of the estimate: a known
        value always wins over the cohort mode (ADR-0029)."""
        if not predicted.sap_building_parts:
            return
        if target.wall_construction is not None:
            predicted.sap_building_parts[0].wall_construction = (
                target.wall_construction
            )


# The homogeneous categoricals carried directly on the main building part. Floor
# categoricals live on the main floor dimension and glazing on the windows; both
# are handled separately.
_MAIN_PART_CATEGORICALS: tuple[str, ...] = (
    "wall_construction",
    "wall_insulation_type",
    "construction_age_band",
    "roof_construction",
    "roof_insulation_thickness",
)

# Integer-coded categoricals on the main building part's ground-floor dimension.
_FLOOR_DIM_CATEGORICALS: tuple[str, ...] = (
    "floor_construction",
    "floor_insulation",
)

# Categoricals whose physical value CHANGES over time (e.g. loft top-ups), so a
# recent neighbour reflects the current state better than an old one; these take
# a recency-WEIGHTED mode. Permanent categoricals (wall / age) are not
# recency-weighted: doing so was net-negative on the validation corpus (it
# discards data that is still valid). `_RECENCY_TAU_YEARS` is the exponential
# decay constant (≈2.8-year half-life), chosen on the corpus (roof insulation
# +4pp / +12pp on the fixture).
_RECENCY_WEIGHTED_CATEGORICALS: frozenset[str] = frozenset(
    {"roof_insulation_thickness"}
)
_RECENCY_TAU_YEARS: float = 4.0
_DAYS_PER_YEAR: float = 365.0

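The quoted half-life follows directly from the decay constant. As a quick check, assuming the weight is `exp(-age / tau)` as in `_recency_weights`, the half-life is `tau * ln 2`. The names below are illustrative copies, not module members.

```python
import math

RECENCY_TAU_YEARS = 4.0  # mirrors `_RECENCY_TAU_YEARS`


def recency_weight(age_years: float, tau_years: float = RECENCY_TAU_YEARS) -> float:
    """Exponential decay exp(-age / tau): 1.0 for the newest cert."""
    return math.exp(-age_years / tau_years)


# The half-life is the age at which the weight falls to 0.5: tau * ln(2).
half_life_years = RECENCY_TAU_YEARS * math.log(2)
weight_at_half_life = recency_weight(half_life_years)
```

With a four-year constant the half-life comes out at about 2.77 years, which matches the "≈2.8-year" figure in the comment.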
# Physical-similarity weighting of the categorical mode (ADR-0029 decision 5): a
# comparable's vote decays exponentially with how far it sits from the cohort's
# physical centre (floor area from the median, construction age from the central
# (median) band), so an outlier-sized or outlier-era neighbour can't sway the
# mode. Scales chosen on the validation corpus (wall-insulation +2.8pp / roof
# +1.1pp / floor-construction +2.4pp / floor-insulation +1.2pp; gate-safe, no
# regression).
_SIMILARITY_SIZE_SCALE_M2: float = 20.0
_SIMILARITY_AGE_WEIGHT: float = 0.5
_AGE_BAND_ORDER: str = "ABCDEFGHIJKL"

# Geo-proximity weighting (#1227): a neighbour's vote decays with its haversine
# distance to the target, so a closer neighbour counts for more. Applied only to
# the components that showed a clear distance signal in the corpus: age band,
# wall + floor construction, glazing (homes built / retrofitted together cluster);
# roof construction showed no decay, so it is excluded. `_GEO_SCALE_KM` is the
# kernel length-scale (chosen on the corpus). Off when the target has no
# coordinates; neutral for a neighbour with none (never penalised for missing
# data). floor_construction lives on the floor dimension but shares this set.
_GEO_SCALE_KM: float = 0.1
_GEO_WEIGHTED_CATEGORICALS: frozenset[str] = frozenset(
    {"construction_age_band", "wall_construction", "floor_construction"}
)


def _main_part_attr(
    comparable: ComparableProperty, attr: str
) -> Optional[Union[int, str]]:
    parts: list[SapBuildingPart] = comparable.epc.sap_building_parts
    return getattr(parts[0], attr) if parts else None


def _main_floor_attr(comparable: ComparableProperty, attr: str) -> Optional[int]:
    parts: list[SapBuildingPart] = comparable.epc.sap_building_parts
    if not parts:
        return None
    dims = parts[0].sap_floor_dimensions
    value: Optional[int] = getattr(dims[0], attr) if dims else None
    return value


def _geo_weighted_floor_area(
    members: tuple[ComparableProperty, ...],
    target_coordinates: Optional[Coordinates],
) -> float:
    """The cohort's geo-proximity-weighted median floor area, the point estimate
    of the target's size. The median minimises mean absolute deviation, so it is
    the best single guess for an unknown neighbour's area; geo-weighting it leans
    the estimate toward the nearer neighbours, because homes built together share
    a footprint (the same street signal that already weights age / wall, #1227).
    Reduces exactly to the plain median when geo weighting is off (no target
    coordinates ⇒ uniform weights), preserving the MAD-minimising guarantee. Set
    independently of the structural template (the calculator derives heat loss
    from the building-part geometry, not this scalar, so the two need not agree)."""
    weights: list[float] = _geo_weights(target_coordinates, members)
    return _weighted_median(
        [
            (comparable.epc.total_floor_area_m2, weight)
            for comparable, weight in zip(members, weights)
        ]
    )


def _weighted_median(values_weights: list[tuple[float, float]]) -> float:
    """The weighted median of (value, weight) pairs: the smallest value at which
    the cumulative weight exceeds half the total. When the cumulative weight
    lands exactly on half, that value and the next one are averaged, so with
    uniform weights this reduces exactly to `statistics.median` (including the
    even-count midpoint average). Assumes a non-empty input."""
    ordered: list[tuple[float, float]] = sorted(values_weights)
    half: float = sum(weight for _, weight in ordered) / 2
    cumulative: float = 0.0
    for index, (value, weight) in enumerate(ordered):
        cumulative += weight
        if cumulative > half:
            return value
        if cumulative == half and index + 1 < len(ordered):
            return (value + ordered[index + 1][0]) / 2
    return ordered[-1][0]


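The claim that the weighted median reduces to the plain median under uniform weights can be checked on a toy cohort. The function below is a standalone copy of the rule for illustration; the floor areas and weights are made-up values.

```python
import statistics


def weighted_median(values_weights: list[tuple[float, float]]) -> float:
    """Standalone copy of the weighted-median rule, for illustration."""
    ordered = sorted(values_weights)
    half = sum(weight for _, weight in ordered) / 2
    cumulative = 0.0
    for index, (value, weight) in enumerate(ordered):
        cumulative += weight
        if cumulative > half:
            return value
        if cumulative == half and index + 1 < len(ordered):
            return (value + ordered[index + 1][0]) / 2
    return ordered[-1][0]


areas = [62.0, 70.0, 85.0, 120.0]

# Uniform weights: identical to the plain median (even count, so a midpoint).
uniform = weighted_median([(area, 1.0) for area in areas])

# Heavier weight on the two smallest homes pulls the estimate down.
near_weighted = weighted_median(list(zip(areas, [3.0, 3.0, 1.0, 1.0])))
```

Here the uniform case gives 77.5, the same as `statistics.median(areas)`, while the weighted case settles on 70.0 because the cumulative weight passes half the total at the second value.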
def _age_band_index(comparable: ComparableProperty) -> Optional[int]:
    """The main building part's construction-age-band position (A=0 … L=11), or
    None when no recognisable band is lodged."""
    band = _main_part_attr(comparable, "construction_age_band")
    if isinstance(band, str) and band in _AGE_BAND_ORDER:
        return _AGE_BAND_ORDER.index(band)
    return None


def _similarity_weights(members: tuple[ComparableProperty, ...]) -> list[float]:
    """A physical-similarity weight per comparable (ADR-0029 decision 5): the
    product of an exponential decay in its floor-area distance from the cohort
    median and in its age-band distance from the cohort's central band (the
    median band index, held in `modal_age`). A neighbour missing an age band
    contributes a neutral weight on that axis, so it is never penalised for
    absent data. Aligned with `members` index-for-index."""
    if not members:
        return []
    median_area: float = statistics.median(
        c.epc.total_floor_area_m2 for c in members
    )
    age_indices: list[Optional[int]] = [_age_band_index(c) for c in members]
    present_ages: list[int] = [i for i in age_indices if i is not None]
    # Note: computed as the median band index, not a strict mode.
    modal_age: Optional[float] = (
        statistics.median(present_ages) if present_ages else None
    )
    weights: list[float] = []
    for comparable, age_index in zip(members, age_indices):
        size_term: float = math.exp(
            -abs(comparable.epc.total_floor_area_m2 - median_area)
            / _SIMILARITY_SIZE_SCALE_M2
        )
        age_term: float = (
            math.exp(-_SIMILARITY_AGE_WEIGHT * abs(age_index - modal_age))
            if modal_age is not None and age_index is not None
            else 1.0
        )
        weights.append(size_term * age_term)
    return weights


def _weighted_mode(
    values: Iterable[Optional[Union[int, str]]], weights: list[float]
) -> Optional[Union[int, str]]:
    """The value with the greatest total weight (ties broken by first
    appearance), or None when no non-None value is present."""
    totals: dict[Union[int, str], float] = defaultdict(float)
    for value, weight in zip(values, weights):
        if value is not None:
            totals[value] += weight
    if not totals:
        return None
    return max(totals, key=lambda value: totals[value])


def _weighted_int_mode(
    values: Iterable[Optional[int]], weights: list[float]
) -> Optional[int]:
    """`_weighted_mode` narrowed to int-coded fields (keeps pyright strict happy
    when the target attribute is typed `Optional[int]`)."""
    totals: dict[int, float] = defaultdict(float)
    for value, weight in zip(values, weights):
        if value is not None:
            totals[value] += weight
    if not totals:
        return None
    return max(totals, key=lambda value: totals[value])


def _modal_share(
    values: Iterable[Optional[Union[int, str]]],
) -> Optional[float]:
    """The most common value's share of the present (non-None) values (a 0..1
    measure of how much the cohort agrees), or None when none are present."""
    present = [v for v in values if v is not None]
    if not present:
        return None
    modal_count: int = Counter(present).most_common(1)[0][1]
    return modal_count / len(present)


def _combine(left: list[float], right: list[float]) -> list[float]:
    """Element-wise product of two aligned weight vectors (compose weighting
    factors, e.g. similarity × geo-proximity)."""
    return [a * b for a, b in zip(left, right)]


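To see why a weighted mode can differ from a plain majority, consider the toy vote below. It is a sketch under stated assumptions: `weighted_mode` is a standalone copy of the rule, and the wall labels and weights are invented for illustration.

```python
from collections import defaultdict
from typing import Iterable, Optional, Union

Value = Union[int, str]


def weighted_mode(
    values: Iterable[Optional[Value]], weights: list[float]
) -> Optional[Value]:
    """Standalone copy of the weighted-mode rule, for illustration."""
    totals: dict[Value, float] = defaultdict(float)
    for value, weight in zip(values, weights):
        if value is not None:
            totals[value] += weight
    if not totals:
        return None
    return max(totals, key=lambda value: totals[value])


# Two distant "cavity" neighbours are outvoted by one close "solid" neighbour.
# The None entry is skipped however large its weight is.
values = ["cavity", "cavity", "solid", None]
weights = [0.2, 0.3, 0.9, 5.0]
winner = weighted_mode(values, weights)
```

The plain mode would be "cavity" (two votes to one), but the summed weights are 0.5 against 0.9, so the weighted mode picks "solid".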
def _haversine_km(origin: Coordinates, point: Coordinates) -> float:
    """Great-circle distance in km between two WGS84 points."""
    radius_km = 6371.0
    lat1, lat2 = math.radians(origin.latitude), math.radians(point.latitude)
    delta_lat = lat2 - lat1
    delta_lon = math.radians(point.longitude - origin.longitude)
    h = (
        math.sin(delta_lat / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin(delta_lon / 2) ** 2
    )
    return 2 * radius_km * math.asin(min(1.0, math.sqrt(h)))


def _geo_weights(
    target: Optional[Coordinates], members: tuple[ComparableProperty, ...]
) -> list[float]:
    """A geo-proximity weight per comparable: an exponential decay in haversine
    distance to the target. All-neutral (1.0) when the target has no coordinates
    (geo weighting off) or a neighbour has none (never penalised for absent
    data); aligned with `members` index-for-index."""
    if target is None:
        return [1.0] * len(members)
    weights: list[float] = []
    for comparable in members:
        coordinates = comparable.coordinates
        if coordinates is None:
            weights.append(1.0)
        else:
            weights.append(
                math.exp(-_haversine_km(target, coordinates) / _GEO_SCALE_KM)
            )
    return weights


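A rough feel for the 0.1 km length-scale: the sketch below computes the haversine distance for two points about 111 m apart and the resulting vote weight. The functions take bare latitude/longitude floats instead of the project's `Coordinates` type, and the sample coordinates are arbitrary; both are assumptions made to keep the example self-contained.

```python
import math


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in km between two WGS84 points (degrees in)."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    delta_phi = phi2 - phi1
    delta_lambda = math.radians(lon2 - lon1)
    h = (
        math.sin(delta_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(delta_lambda / 2) ** 2
    )
    return 2 * radius_km * math.asin(min(1.0, math.sqrt(h)))


def geo_weight(distance_km: float, scale_km: float = 0.1) -> float:
    """Exponential decay in distance; `scale_km` mirrors `_GEO_SCALE_KM`."""
    return math.exp(-distance_km / scale_km)


# One thousandth of a degree of latitude is roughly 111 m.
distance = haversine_km(51.5000, -0.1000, 51.5010, -0.1000)
weight = geo_weight(distance)
```

At roughly one length-scale away the vote is already down to about a third of a co-located neighbour's, so the kernel is effectively street-level.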
def _recency_weights(members: tuple[ComparableProperty, ...]) -> list[float]:
    """A recency weight per comparable: exponential decay in the cert's age
    relative to the newest in the cohort, so newer neighbours dominate. All-equal
    when no registration dates are lodged; an undated comparable in an otherwise
    dated cohort is treated as the oldest possible, so its weight is effectively
    zero. Aligned with `members`."""
    newest: date = max(
        (c.registration_date or date.min for c in members), default=date.min
    )
    return [
        math.exp(
            -((newest - (c.registration_date or date.min)).days / _DAYS_PER_YEAR)
            / _RECENCY_TAU_YEARS
        )
        for c in members
    ]


def _recency_weighted_choice(
    members: tuple[ComparableProperty, ...],
    value_of: Callable[[ComparableProperty], Optional[Union[int, str]]],
) -> Optional[Union[int, str]]:
    """The recency-weighted cohort mode of a per-comparable value: each
    neighbour's vote decays exponentially with the cert's age relative to the
    newest in the cohort, so newer neighbours dominate and a stale majority can't
    outvote the current state. Falls back to a plain mode when no registration
    dates are lodged (all ages 0 ⇒ equal weight). Returns None when no comparable
    supplies a value. Used for the time-varying components: those upgraded over a
    dwelling's life (loft top-ups)."""
    return _weighted_mode(
        (value_of(comparable) for comparable in members),
        _recency_weights(members),
    )


def _recency_weighted_mode(
    members: tuple[ComparableProperty, ...], attr: str
) -> Optional[Union[int, str]]:
    """`_recency_weighted_choice` over a main building-part attribute."""
    return _recency_weighted_choice(
        members, lambda comparable: _main_part_attr(comparable, attr)
    )


def _comparable_modal_glazing(
    comparable: ComparableProperty,
) -> Optional[Union[int, str]]:
    """A comparable's modal glazing type: the most common across its windows, or
    None when it lodges none. One glazing signal per neighbour, robust to a single
    odd window, matching how the harness scores `modal_glazing_type`."""
    types = [window.glazing_type for window in comparable.epc.sap_windows]
    return Counter(types).most_common(1)[0][0] if types else None


def _main_heating_detail(comparable: ComparableProperty) -> Optional[MainHeatingDetail]:
    """The primary heating system's detail row, or None when none is lodged."""
    details = comparable.epc.sap_heating.main_heating_details
    return details[0] if details else None


def _heating_signature(
    comparable: ComparableProperty,
) -> Optional[tuple[Union[int, str], Optional[int], bool]]:
    """The donor-matching signature (main fuel + heating category + cylinder
    presence): the coarse identity of the heating system. None when no main heating
    system is lodged, so the comparable is not a donor candidate."""
    detail = _main_heating_detail(comparable)
    if detail is None:
        return None
    return (
        detail.main_fuel_type,
        detail.main_heating_category,
        comparable.epc.has_hot_water_cylinder,
    )


def _heating_donor(members: tuple[ComparableProperty, ...]) -> Optional[ComparableProperty]:
    """The coherent heating donor: the comparable whose heating signature is the
    cohort mode, breaking ties toward the most recent cert (then certificate
    number, for determinism). None when no neighbour lodges a heating system."""
    signed = [(c, _heating_signature(c)) for c in members]
    signatures = [sig for _, sig in signed if sig is not None]
    if not signatures:
        return None
    modal = Counter(signatures).most_common(1)[0][0]
    matches = [c for c, sig in signed if sig == modal]
    return max(
        matches,
        key=lambda c: (c.registration_date or date.min, c.certificate_number),
    )
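The donor rule (modal signature first, then most recent cert) can be traced on a small cohort. This is a sketch only: `Neighbour` is a hypothetical stand-in that carries a pre-computed signature, and the fuel and category values are invented labels, not real RdSAP codes.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Neighbour:
    """Hypothetical stand-in: a cert with a pre-computed heating signature."""

    certificate_number: str
    registration_date: Optional[date]
    signature: Optional[tuple[str, int, bool]]  # (fuel, category, cylinder)


def heating_donor(members: list[Neighbour]) -> Optional[Neighbour]:
    signatures = [m.signature for m in members if m.signature is not None]
    if not signatures:
        return None
    modal = Counter(signatures).most_common(1)[0][0]
    matches = [m for m in members if m.signature == modal]
    # Most recent cert among the modal-signature matches wins.
    return max(
        matches,
        key=lambda m: (m.registration_date or date.min, m.certificate_number),
    )


members = [
    Neighbour("A1", date(2012, 5, 1), ("gas", 2, False)),
    Neighbour("A2", date(2021, 9, 1), ("gas", 2, False)),
    Neighbour("A3", date(2023, 1, 1), ("electric", 7, True)),
    Neighbour("A4", None, None),
]
donor = heating_donor(members)
```

The newest cert overall is `A3`, but its signature is in the minority, so the donor is `A2`: the most recent neighbour among those carrying the modal signature.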
287
domain/epc_prediction/prediction_comparison.py
Normal file
@ -0,0 +1,287 @@
"""Per-Property prediction comparison for the EPC Prediction validation harness
(ADR-0029).

`compare_prediction` scores a predicted `EpcPropertyData` against the actual one
on the accuracy signals the leave-one-out harness aggregates: classification
matches on the key categoricals (wall / roof / floor construction + insulation,
construction age band) and residuals on the geometry (window area + count,
building-parts count, floor area). Pure: the SAP residual is computed in the
runner, which has the calculator and the lodged SAP.
"""

from __future__ import annotations

from collections import Counter
from dataclasses import dataclass
from typing import Optional

from datatypes.epc.domain.epc_property_data import (
    EpcPropertyData,
    MainHeatingDetail,
    SapBuildingPart,
)


@dataclass(frozen=True)
class PredictionComparison:
    """One Property's prediction accuracy: per-component classification hits +
    geometry residuals (predicted − actual). `categorical_hits` maps a component
    name to its hit: True / False, or `None` ("not applicable") when the actual
    lodges no value there, so the harness can keep it out of the
    classification-rate denominator rather than score a free win. Keyed by name
    (not flat fields) so the component set can grow without reshaping the
    runner; see ADR-0030 Component Accuracy."""

    categorical_hits: dict[str, Optional[bool]]
    floor_area_residual: float
    building_parts_residual: int
    window_count_residual: int
    total_window_area_residual: float
    door_count_residual: int


def _main(epc: EpcPropertyData) -> SapBuildingPart:
    return epc.sap_building_parts[0]


def _main_floor_construction(epc: EpcPropertyData) -> Optional[int]:
    """The main building part's ground-floor construction code, or None when no
    floor dimension is lodged."""
    dims = _main(epc).sap_floor_dimensions
    return dims[0].floor_construction if dims else None


def _classify(predicted: object, actual: object) -> Optional[bool]:
    """A categorical hit: None ("not applicable") when the actual is absent,
    else whether the predicted value matches it."""
    if actual is None:
        return None
    return predicted == actual


# RdSAP construction age bands, oldest → newest. Adjacent bands carry
# near-identical U-values, so an off-by-one is treated as a (SAP-neutral) ±1 hit.
_AGE_BAND_ORDER: str = "ABCDEFGHIJKL"


def _age_band_within_one(predicted: object, actual: object) -> Optional[bool]:
    """A ±1-band age hit: None when the actual is absent, True on an exact or
    adjacent-band match, else False (issue #1222: exact match overstates the
    SAP impact of age-band misses)."""
    if actual is None:
        return None
    if predicted == actual:
        return True
    if (
        isinstance(predicted, str)
        and isinstance(actual, str)
        and predicted in _AGE_BAND_ORDER
        and actual in _AGE_BAND_ORDER
    ):
        return (
            abs(_AGE_BAND_ORDER.index(predicted) - _AGE_BAND_ORDER.index(actual))
            <= 1
        )
    return False


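The ±1 tolerance and the "not applicable" outcome can be illustrated together. The sketch below is a variation on `_age_band_within_one`, not a copy: it looks bands up in a dictionary of ordinals (an assumption made here so that only single band letters are recognised), and the hit-rate calculation at the end is a guess at how a harness might aggregate the results.

```python
from typing import Optional

# Band letter -> position, oldest to newest (same order as `_AGE_BAND_ORDER`).
AGE_BAND_ORDINAL = {band: index for index, band in enumerate("ABCDEFGHIJKL")}


def age_band_within_one(predicted: object, actual: object) -> Optional[bool]:
    if actual is None:
        return None  # not applicable: kept out of the denominator
    if predicted == actual:
        return True
    if not (isinstance(predicted, str) and isinstance(actual, str)):
        return False
    predicted_rung = AGE_BAND_ORDINAL.get(predicted)
    actual_rung = AGE_BAND_ORDINAL.get(actual)
    if predicted_rung is None or actual_rung is None:
        return False
    return abs(predicted_rung - actual_rung) <= 1


hits = [
    age_band_within_one("C", "C"),   # exact match
    age_band_within_one("C", "D"),   # adjacent band
    age_band_within_one("C", "E"),   # two bands out
    age_band_within_one("C", None),  # actual absent
]
# Hit rate over the applicable comparisons only.
applicable = [hit for hit in hits if hit is not None]
hit_rate = sum(applicable) / len(applicable)
```

Dropping the `None` result from the denominator gives two hits out of three applicable comparisons, where counting it as a hit would have scored a free win.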
# RdSAP roof-insulation thickness buckets, thinnest → thickest. Uninsulated is
# lodged as either 0 or "NI" (not insulated), so both map to the bottom rung;
# "ND" (no data) is off the scale entirely. Adjacent buckets carry near-identical
# roof U-values, so an off-by-one bucket is treated as a (SAP-neutral) ±1 hit,
# the same measurement honesty as the construction age band (issue #1222).
_ROOF_THICKNESS_ORDINAL: dict[object, int] = {
    0: 0,
    "NI": 0,
    "12mm": 1,
    "25mm": 2,
    "50mm": 3,
    "75mm": 4,
    "100mm": 5,
    "125mm": 6,
    "150mm": 7,
    "175mm": 8,
    "200mm": 9,
    "225mm": 10,
    "250mm": 11,
    "270mm": 12,
    "300mm": 13,
    "350mm": 14,
    "400mm+": 15,
}


def _roof_insulation_within_one(
    predicted: object, actual: object
) -> Optional[bool]:
    """A ±1-bucket roof-insulation hit: None when the actual is absent, True on an
    exact or adjacent-bucket match, else False. Off the ordered scale (e.g. the
    "ND" no-data category) only an exact match counts."""
    if actual is None:
        return None
    if predicted == actual:
        return True
    pred_rung = _ROOF_THICKNESS_ORDINAL.get(predicted)
    actual_rung = _ROOF_THICKNESS_ORDINAL.get(actual)
    if pred_rung is None or actual_rung is None:
        return False
    return abs(pred_rung - actual_rung) <= 1


def _main_heating_detail(epc: EpcPropertyData) -> Optional[MainHeatingDetail]:
|
||||
"""The primary heating system's detail row, or None when none is lodged."""
|
||||
details = epc.sap_heating.main_heating_details
|
||||
return details[0] if details else None
|
||||
|
||||
|
||||
def _heating_hits(
|
||||
predicted: EpcPropertyData, actual: EpcPropertyData
|
||||
) -> dict[str, Optional[bool]]:
|
||||
"""Classification hits for the heating components — the dominant SAP lever
|
||||
(ADR-0030). Main-system fields come off the primary `MainHeatingDetail`;
|
||||
hot-water + secondary fields off `SapHeating`."""
|
||||
pred_main = _main_heating_detail(predicted)
|
||||
actual_main = _main_heating_detail(actual)
|
||||
pred_h = predicted.sap_heating
|
||||
actual_h = actual.sap_heating
|
||||
return {
|
||||
"heating_main_fuel": _classify(
|
||||
getattr(pred_main, "main_fuel_type", None),
|
||||
getattr(actual_main, "main_fuel_type", None),
|
||||
),
|
||||
"heating_main_category": _classify(
|
||||
getattr(pred_main, "main_heating_category", None),
|
||||
getattr(actual_main, "main_heating_category", None),
|
||||
),
|
||||
"heating_main_control": _classify(
|
||||
getattr(pred_main, "main_heating_control", None),
|
||||
getattr(actual_main, "main_heating_control", None),
|
||||
),
|
||||
"water_heating_fuel": _classify(
|
||||
pred_h.water_heating_fuel, actual_h.water_heating_fuel
|
||||
),
|
||||
"water_heating_code": _classify(
|
||||
pred_h.water_heating_code, actual_h.water_heating_code
|
||||
),
|
||||
"has_hot_water_cylinder": _classify(
|
||||
predicted.has_hot_water_cylinder, actual.has_hot_water_cylinder
|
||||
),
|
||||
"cylinder_insulation_type": _classify(
|
||||
pred_h.cylinder_insulation_type, actual_h.cylinder_insulation_type
|
||||
),
|
||||
"secondary_heating_type": _classify(
|
||||
pred_h.secondary_heating_type, actual_h.secondary_heating_type
|
||||
),
|
||||
}
|
||||
|
||||
|
||||
def _modal_glazing_type(epc: EpcPropertyData) -> Optional[object]:
|
||||
"""The most common glazing type across the dwelling's windows, or None when
|
||||
none are lodged. A single dwelling-level glazing signal, robust to one
|
||||
odd window."""
|
||||
types = [w.glazing_type for w in epc.sap_windows]
|
||||
return Counter(types).most_common(1)[0][0] if types else None
|
||||
|
||||
|
||||
def _has_pv(epc: EpcPropertyData) -> bool:
|
||||
"""True iff the dwelling lodges any photovoltaic supply (either path)."""
|
||||
source = epc.sap_energy_source
|
||||
return source.photovoltaic_supply is not None or bool(
|
||||
source.photovoltaic_arrays
|
||||
)
|
||||
|
||||
|
||||
def _renewables_and_fabric_hits(
|
||||
predicted: EpcPropertyData, actual: EpcPropertyData
|
||||
) -> dict[str, Optional[bool]]:
|
||||
"""Hits for the remaining fabric-insulation, glazing and renewables
|
||||
components (ADR-0030). Presence flags (room-in-roof, PV, solar) are always
|
||||
applicable — predicting absence when present is a real miss."""
|
||||
return {
|
||||
"roof_insulation_thickness": _classify(
|
||||
_main(predicted).roof_insulation_thickness,
|
||||
_main(actual).roof_insulation_thickness,
|
||||
),
|
||||
"roof_insulation_thickness_pm1": _roof_insulation_within_one(
|
||||
_main(predicted).roof_insulation_thickness,
|
||||
_main(actual).roof_insulation_thickness,
|
||||
),
|
||||
"floor_insulation": _classify(
|
||||
_main_floor_insulation(predicted), _main_floor_insulation(actual)
|
||||
),
|
||||
"has_room_in_roof": _classify(
|
||||
_main(predicted).sap_room_in_roof is not None,
|
||||
_main(actual).sap_room_in_roof is not None,
|
||||
),
|
||||
"modal_glazing_type": _classify(
|
||||
_modal_glazing_type(predicted), _modal_glazing_type(actual)
|
||||
),
|
||||
"has_pv": _classify(_has_pv(predicted), _has_pv(actual)),
|
||||
"solar_water_heating": _classify(
|
||||
predicted.solar_water_heating, actual.solar_water_heating
|
||||
),
|
||||
}
|
||||
|
||||
|
||||
def _main_floor_insulation(epc: EpcPropertyData) -> Optional[int]:
|
||||
"""The main building part's ground-floor insulation code, or None when no
|
||||
floor dimension is lodged."""
|
||||
dims = _main(epc).sap_floor_dimensions
|
||||
return dims[0].floor_insulation if dims else None
|
||||
|
||||
|
||||
def _total_window_area(epc: EpcPropertyData) -> float:
|
||||
return sum(w.window_width * w.window_height for w in epc.sap_windows)
|
||||
|
||||
|
||||
def compare_prediction(
|
||||
predicted: EpcPropertyData, actual: EpcPropertyData
|
||||
) -> PredictionComparison:
|
||||
"""Compare a predicted picture against the actual one, field by field. All
|
||||
residuals are signed, predicted − actual."""
|
||||
fabric_hits: dict[str, Optional[bool]] = {
|
||||
"wall_construction": _classify(
|
||||
_main(predicted).wall_construction,
|
||||
_main(actual).wall_construction,
|
||||
),
|
||||
"wall_insulation_type": _classify(
|
||||
_main(predicted).wall_insulation_type,
|
||||
_main(actual).wall_insulation_type,
|
||||
),
|
||||
"construction_age_band": _classify(
|
||||
_main(predicted).construction_age_band,
|
||||
_main(actual).construction_age_band,
|
||||
),
|
||||
"construction_age_band_pm1": _age_band_within_one(
|
||||
_main(predicted).construction_age_band,
|
||||
_main(actual).construction_age_band,
|
||||
),
|
||||
"roof_construction": _classify(
|
||||
_main(predicted).roof_construction,
|
||||
_main(actual).roof_construction,
|
||||
),
|
||||
"floor_construction": _classify(
|
||||
_main_floor_construction(predicted),
|
||||
_main_floor_construction(actual),
|
||||
),
|
||||
}
|
||||
return PredictionComparison(
|
||||
categorical_hits={
|
||||
**fabric_hits,
|
||||
**_heating_hits(predicted, actual),
|
||||
**_renewables_and_fabric_hits(predicted, actual),
|
||||
},
|
||||
floor_area_residual=(
|
||||
predicted.total_floor_area_m2 - actual.total_floor_area_m2
|
||||
),
|
||||
building_parts_residual=(
|
||||
len(predicted.sap_building_parts) - len(actual.sap_building_parts)
|
||||
),
|
||||
window_count_residual=(
|
||||
len(predicted.sap_windows) - len(actual.sap_windows)
|
||||
),
|
||||
total_window_area_residual=(
|
||||
_total_window_area(predicted) - _total_window_area(actual)
|
||||
),
|
||||
door_count_residual=predicted.door_count - actual.door_count,
|
||||
)
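The ±1 ordinal hit used for the age band above can be sketched on its own. This is a minimal illustration, not the module's code: `AGE_BAND_ORDER` mirrors `_AGE_BAND_ORDER`, and the single-letter length check is an added assumption, included because Python's substring test treats `""` and `"AB"` as being "in" the ordered string.

```python
from typing import Optional

# Oldest -> newest, mirroring _AGE_BAND_ORDER in the diff above.
AGE_BAND_ORDER = "ABCDEFGHIJKL"


def age_band_within_one(predicted: object, actual: object) -> Optional[bool]:
    """None when the actual is absent, True on an exact or adjacent-band
    match, else False."""
    if actual is None:
        return None
    if predicted == actual:
        return True
    if not (isinstance(predicted, str) and isinstance(actual, str)):
        return False
    # Assumption (not in the original): only single letters are on the scale.
    # Without this, `"" in "ABC"` and `"AB" in "ABC"` would both pass.
    if len(predicted) != 1 or len(actual) != 1:
        return False
    if predicted not in AGE_BAND_ORDER or actual not in AGE_BAND_ORDER:
        return False
    gap = AGE_BAND_ORDER.index(predicted) - AGE_BAND_ORDER.index(actual)
    return abs(gap) <= 1
```

An off-scale value still counts as a hit when it matches exactly, which is the same behaviour the roof-insulation helper documents for its "ND" category.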
|
||||
68
domain/epc_prediction/prediction_target.py
Normal file
|
|
@ -0,0 +1,68 @@
|
|||
"""Assemble an EPC-less Property's PredictionTarget, with the eligibility gate
|
||||
(ADR-0031 slice-5d).
|
||||
|
||||
A `PredictionTarget` needs the target's own known inputs: its postcode (to find
|
||||
the cohort), coordinates (to distance-weight it), and the Landlord-Override
|
||||
attributes that condition selection — `property_type` (the HARD cohort filter),
|
||||
plus optional `built_form` / `wall_construction`. `property_type` is required: a
|
||||
Property whose type is unknown is gated out (never sized from a mixed-type
|
||||
cohort), so the builder returns None and the caller skips prediction.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from dataclasses import dataclass
|
||||
from typing import Optional, Union
|
||||
|
||||
from domain.geospatial.coordinates import Coordinates
|
||||
from domain.property.property import PropertyIdentity
|
||||
|
||||
|
||||
@dataclass(frozen=True)
|
||||
class PredictionTarget:
|
||||
"""The known inputs for the Property whose EPC we are predicting — the fields
|
||||
guaranteed at ingestion (plus any Landlord Overrides, added as they're used).
|
||||
`built_form` is often but not always known.
|
||||
"""
|
||||
|
||||
postcode: str
|
||||
property_type: str
|
||||
built_form: Optional[str] = None
|
||||
# A known Landlord Override (e.g. solid brick) conditions cohort selection —
|
||||
# matching comparables are emphasised while enough remain (ADR-0029).
|
||||
wall_construction: Optional[Union[int, str]] = None
|
||||
# The target Property's own coordinates (resolved from its UPRN), against
|
||||
# which neighbours are distance-weighted. None disables geo-weighting.
|
||||
coordinates: Optional[Coordinates] = None
|
||||
|
||||
|
||||
@dataclass(frozen=True)
|
||||
class PredictionTargetAttributes:
|
||||
"""The target Property's own attributes resolved from Landlord Overrides,
|
||||
needed to find and condition its cohort. `property_type` is the code-space
|
||||
value the cohort EPCs carry (e.g. "2"); None means it could not be resolved,
|
||||
which gates the Property out of prediction."""
|
||||
|
||||
property_type: Optional[str]
|
||||
built_form: Optional[str] = None
|
||||
wall_construction: Optional[Union[int, str]] = None
|
||||
|
||||
|
||||
def build_prediction_target(
|
||||
identity: PropertyIdentity,
|
||||
coordinates: Optional[Coordinates],
|
||||
attributes: PredictionTargetAttributes,
|
||||
) -> Optional[PredictionTarget]:
|
||||
"""The PredictionTarget for an EPC-less Property, or None when ineligible —
|
||||
`property_type` is the hard cohort filter, so a Property whose type is unknown
|
||||
is gated out of prediction (ADR-0031) rather than sized from a mixed-type
|
||||
cohort."""
|
||||
if attributes.property_type is None:
|
||||
return None
|
||||
return PredictionTarget(
|
||||
postcode=identity.postcode,
|
||||
property_type=attributes.property_type,
|
||||
built_form=attributes.built_form,
|
||||
wall_construction=attributes.wall_construction,
|
||||
coordinates=coordinates,
|
||||
)
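The eligibility gate in `build_prediction_target` can be exercised with small stand-ins. The `Identity`, `Attributes` and `Target` classes below are hypothetical, trimmed-down substitutes for `PropertyIdentity`, `PredictionTargetAttributes` and `PredictionTarget`; only the gating logic is taken from the diff.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Identity:  # hypothetical stand-in for PropertyIdentity
    postcode: str


@dataclass(frozen=True)
class Attributes:  # hypothetical stand-in for PredictionTargetAttributes
    property_type: Optional[str]
    built_form: Optional[str] = None


@dataclass(frozen=True)
class Target:  # hypothetical stand-in for PredictionTarget
    postcode: str
    property_type: str
    built_form: Optional[str] = None


def build_target(identity: Identity, attributes: Attributes) -> Optional[Target]:
    # An unknown property_type gates the Property out: the caller sees None
    # and skips prediction rather than using a mixed-type cohort.
    if attributes.property_type is None:
        return None
    return Target(identity.postcode, attributes.property_type, attributes.built_form)


eligible = build_target(Identity("AB1 2CD"), Attributes(property_type="2"))
gated = build_target(Identity("AB1 2CD"), Attributes(property_type=None))
```

Returning `None` rather than raising keeps the "skip this Property" decision with the caller, which appears to be how the ingestion orchestrator consumes it.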
|
||||
160
domain/epc_prediction/validation.py
Normal file
|
|
@ -0,0 +1,160 @@
|
|||
"""Component Accuracy aggregation for EPC Prediction (ADR-0030).
|
||||
|
||||
The leave-one-out scorer, calculator-FREE on purpose: it holds out each SAP 10.2
|
||||
target, predicts it from its (all-vintage) Comparable Properties, and aggregates
|
||||
the per-component classification hits + geometry residuals from
|
||||
`compare_prediction`. This is the *primary*, calculator-independent signal — the
|
||||
end-to-end SAP / carbon / PE check (which needs the calculator) is layered on top
|
||||
by the runner. The same function backs both the committed ratcheting gate and the
|
||||
offline national battle-test (one scorer, two harnesses).
|
||||
|
||||
Pure given the loaded cohorts: corpus IO (reading + mapping cert payloads) is the
|
||||
caller's job, so this is directly unit-testable.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from dataclasses import dataclass, field
|
||||
from datetime import date
|
||||
from typing import Iterable, Iterator, Optional, Sequence
|
||||
|
||||
from datatypes.epc.domain.epc_property_data import EpcPropertyData
|
||||
from domain.epc_prediction.comparable_properties import (
|
||||
ComparableProperty,
|
||||
select_comparables,
|
||||
)
|
||||
from domain.epc_prediction.epc_prediction import EpcPrediction
|
||||
from domain.epc_prediction.prediction_comparison import compare_prediction
|
||||
from domain.epc_prediction.prediction_target import PredictionTarget
|
||||
|
||||
# Only SAP 10.2 certs are valid held-out targets (ADR-0030) — the only vintage
|
||||
# with full-fidelity lodged components. The source cohort keeps all vintages.
|
||||
_SAP_10_2: float = 10.2
|
||||
|
||||
|
||||
def _empty_classification() -> dict[str, list[int]]:
|
||||
return {}
|
||||
|
||||
|
||||
def _empty_residuals() -> dict[str, list[float]]:
|
||||
return {}
|
||||
|
||||
|
||||
@dataclass
|
||||
class ComponentAccuracy:
|
||||
"""Aggregated leave-one-out Component Accuracy over a corpus.
|
||||
|
||||
`classification` maps a component name to [hits, applicable-total] (a
|
||||
not-applicable `None` hit is excluded from the total); `residuals` maps a
|
||||
numeric component to its signed (predicted − actual) values. `targets` counts
|
||||
the held-out SAP 10.2 properties scored.
|
||||
"""
|
||||
|
||||
classification: dict[str, list[int]] = field(
|
||||
default_factory=_empty_classification
|
||||
)
|
||||
residuals: dict[str, list[float]] = field(default_factory=_empty_residuals)
|
||||
targets: int = 0
|
||||
|
||||
def rate(self, component: str) -> Optional[float]:
|
||||
"""The classification hit-rate for a component, or None when nothing was
|
||||
applicable."""
|
||||
hits, total = self.classification.get(component, [0, 0])
|
||||
return hits / total if total else None
|
||||
|
||||
def mean_abs_residual(self, component: str) -> Optional[float]:
|
||||
"""Mean absolute residual for a numeric component, or None when empty."""
|
||||
values = self.residuals.get(component, [])
|
||||
return sum(abs(v) for v in values) / len(values) if values else None
|
||||
|
||||
|
||||
def _recency_key(comparable: ComparableProperty) -> tuple[date, str]:
|
||||
return (
|
||||
comparable.registration_date or date.min,
|
||||
comparable.certificate_number,
|
||||
)
|
||||
|
||||
|
||||
def _latest_per_address(cohort: Sequence[ComparableProperty]) -> list[ComparableProperty]:
|
||||
"""One held-out property per address — the latest cert, the best ground
|
||||
truth. Comparables with no address each stand alone."""
|
||||
latest: dict[str, ComparableProperty] = {}
|
||||
standalone: list[ComparableProperty] = []
|
||||
for c in cohort:
|
||||
if c.address is None:
|
||||
standalone.append(c)
|
||||
elif c.address not in latest or _recency_key(c) > _recency_key(
|
||||
latest[c.address]
|
||||
):
|
||||
latest[c.address] = c
|
||||
return list(latest.values()) + standalone
|
||||
|
||||
|
||||
def iter_predictions(
|
||||
cohorts: Iterable[Sequence[ComparableProperty]],
|
||||
*,
|
||||
target_sap_version: float = _SAP_10_2,
|
||||
) -> Iterator[tuple[EpcPropertyData, EpcPropertyData]]:
|
||||
"""Yield `(predicted, actual)` for every SAP-`target_sap_version` held-out
|
||||
target across the cohorts — the single leave-one-out orchestration the
|
||||
Component Accuracy scorer and the runner's calculator end-to-end both consume
|
||||
(ADR-0030: one scorer, two harnesses). A target is held out by whole address
|
||||
(so a re-lodgement can't leak) and predicted from its all-vintage cohort."""
|
||||
predictor = EpcPrediction()
|
||||
for cohort in cohorts:
|
||||
for held_out in _latest_per_address(cohort):
|
||||
if held_out.epc.sap_version != target_sap_version:
|
||||
continue
|
||||
others = [
|
||||
c
|
||||
for c in cohort
|
||||
if c.address is None or c.address != held_out.address
|
||||
]
|
||||
actual = held_out.epc
|
||||
target = PredictionTarget(
|
||||
postcode=actual.postcode,
|
||||
property_type=actual.property_type or "",
|
||||
built_form=actual.built_form,
|
||||
coordinates=held_out.coordinates,
|
||||
)
|
||||
comparables = select_comparables(target, others)
|
||||
if not comparables.members:
|
||||
continue
|
||||
yield predictor.predict(target, comparables), actual
|
||||
|
||||
|
||||
def evaluate_component_accuracy(
|
||||
cohorts: Iterable[Sequence[ComparableProperty]],
|
||||
*,
|
||||
target_sap_version: float = _SAP_10_2,
|
||||
) -> ComponentAccuracy:
|
||||
"""Score Component Accuracy by leave-one-out over each postcode cohort —
|
||||
aggregating the `compare_prediction` hits + residuals across every held-out
|
||||
SAP-`target_sap_version` target. Calculator-free (the primary signal)."""
|
||||
accuracy = ComponentAccuracy()
|
||||
for predicted, actual in iter_predictions(
|
||||
cohorts, target_sap_version=target_sap_version
|
||||
):
|
||||
comparison = compare_prediction(predicted, actual)
|
||||
accuracy.targets += 1
|
||||
for name, hit in comparison.categorical_hits.items():
|
||||
counter = accuracy.classification.setdefault(name, [0, 0])
|
||||
if hit is not None:
|
||||
counter[1] += 1
|
||||
counter[0] += int(hit)
|
||||
accuracy.residuals.setdefault("floor_area", []).append(
|
||||
comparison.floor_area_residual
|
||||
)
|
||||
accuracy.residuals.setdefault("window_count", []).append(
|
||||
float(comparison.window_count_residual)
|
||||
)
|
||||
accuracy.residuals.setdefault("total_window_area", []).append(
|
||||
comparison.total_window_area_residual
|
||||
)
|
||||
accuracy.residuals.setdefault("building_parts", []).append(
|
||||
float(comparison.building_parts_residual)
|
||||
)
|
||||
accuracy.residuals.setdefault("door_count", []).append(
|
||||
float(comparison.door_count_residual)
|
||||
)
|
||||
return accuracy
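The `[hits, applicable-total]` bookkeeping above can be shown in isolation. This is a sketch under the assumption that each comparison is reduced to a plain `dict[str, Optional[bool]]`; the `tally` and `rate` names are illustrative and not part of the module.

```python
from typing import Iterable, Optional


def tally(rows: Iterable[dict[str, Optional[bool]]]) -> dict[str, list[int]]:
    """Fold per-target hits into {component: [hits, applicable_total]}."""
    classification: dict[str, list[int]] = {}
    for row in rows:
        for name, hit in row.items():
            counter = classification.setdefault(name, [0, 0])
            if hit is not None:  # a None hit is "not applicable": skip it
                counter[1] += 1
                counter[0] += int(hit)
    return classification


def rate(classification: dict[str, list[int]], component: str) -> Optional[float]:
    """Hit-rate for a component, or None when nothing was applicable."""
    hits, total = classification.get(component, [0, 0])
    return hits / total if total else None


rows = [
    {"wall_construction": True, "roof_construction": None},
    {"wall_construction": False, "roof_construction": None},
    {"wall_construction": True},
]
counts = tally(rows)
```

A component that was never applicable still gets a `[0, 0]` entry, so it shows up in the report with a rate of `None` instead of disappearing.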
|
||||
|
|
@ -7,7 +7,7 @@ from datatypes.epc.domain.epc_property_data import EpcPropertyData
|
|||
from domain.geospatial.planning_restrictions import PlanningRestrictions
|
||||
from domain.property.site_notes import SiteNotes
|
||||
|
||||
SourcePath = Literal["site_notes", "epc_with_overlay"]
|
||||
SourcePath = Literal["site_notes", "epc_with_overlay", "predicted"]
|
||||
|
||||
|
||||
@dataclass(frozen=True)
|
||||
|
|
@ -38,6 +38,11 @@ class Property:
|
|||
identity: PropertyIdentity
|
||||
epc: Optional[EpcPropertyData] = None
|
||||
site_notes: Optional[SiteNotes] = None
|
||||
# A neighbour-synthesised EpcPropertyData (EPC Prediction gap-fill, ADR-0031),
|
||||
# held in its own slot so it coexists with any lodged `epc` (provenance is
|
||||
# structural). Used as the Effective EPC only as a last resort — when there is
|
||||
# neither a lodged EPC nor Site Notes; a real source always wins.
|
||||
predicted_epc: Optional[EpcPropertyData] = None
|
||||
# The current open-market value (a Property Valuation) — externally sourced
|
||||
# and mostly absent; feeds the Plan's Valuation Uplift £ forms (ADR-0018).
|
||||
current_market_value: Optional[float] = None
|
||||
|
|
@ -62,8 +67,11 @@ class Property:
|
|||
return "site_notes"
|
||||
if self.epc is not None:
|
||||
return "epc_with_overlay"
|
||||
if self.predicted_epc is not None:
|
||||
return "predicted"
|
||||
raise ValueError(
|
||||
"Property has neither Site Notes nor an EPC; no source path to model from"
|
||||
"Property has neither Site Notes, an EPC, nor a predicted EPC; "
|
||||
"no source path to model from"
|
||||
)
|
||||
|
||||
@property
|
||||
|
|
@ -71,10 +79,15 @@ class Property:
|
|||
"""The EpcPropertyData the modelling pipeline scores against.
|
||||
|
||||
Path 1: the Site Notes' surveyed data. Path 2: the public EPC (Landlord
|
||||
Overrides overlay is a later slice — returned as-is for now).
|
||||
Overrides overlay is a later slice — returned as-is for now). Path 3: a
|
||||
neighbour-synthesised EPC (EPC Prediction gap-fill, ADR-0031), used only
|
||||
when neither real source is present.
|
||||
"""
|
||||
if self.source_path == "site_notes":
|
||||
assert self.site_notes is not None
|
||||
return self.site_notes.to_epc_property_data()
|
||||
if self.source_path == "predicted":
|
||||
assert self.predicted_epc is not None
|
||||
return self.predicted_epc
|
||||
assert self.epc is not None
|
||||
return self.epc
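The source-path precedence described in the hunks above (a real source always wins, the predicted EPC is a last resort) can be sketched as a free function. The ordering of the first check is inferred from the surrounding docstrings rather than shown in full in the diff, so treat it as an assumption.

```python
from typing import Optional


def source_path(
    site_notes: Optional[object],
    epc: Optional[object],
    predicted_epc: Optional[object],
) -> str:
    """Pick the source path: Site Notes, then a lodged EPC, then a prediction."""
    if site_notes is not None:
        return "site_notes"
    if epc is not None:
        return "epc_with_overlay"
    if predicted_epc is not None:
        return "predicted"
    raise ValueError("no source path to model from")
```

Because the predicted EPC sits in its own slot, a Property can hold both a lodged and a predicted picture without the prediction ever shadowing the lodged one.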
|
||||
|
|
|
|||
129
harness/epc_prediction_corpus.py
Normal file
|
|
@ -0,0 +1,129 @@
|
|||
"""Load a postcode-clustered EPC corpus into ComparableProperty cohorts (ADR-0030).
|
||||
|
||||
The IO half of the EPC Prediction validation: read each postcode's cached cert
|
||||
payloads, map them through `EpcPropertyDataMapper.from_api_response`, and build
|
||||
`ComparableProperty`s carrying the register metadata (address + registration date) the
|
||||
leave-one-out scorer needs to dedupe re-lodgements and hold out a whole address.
|
||||
A cert the mapper rejects (unsupported schema, malformed) is skipped, never fatal.
|
||||
|
||||
Shared by the committed-fixture gate, the local validation script, and the
|
||||
offline national battle-test — the corpus directory differs, the loading does
|
||||
not. Layout: `<dir>/<POSTCODE>/<cert>.json` + `<dir>/_index.json`.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import hashlib
|
||||
import json
|
||||
from datetime import date
|
||||
from pathlib import Path
|
||||
from typing import Any, Optional
|
||||
|
||||
from datatypes.epc.domain.mapper import EpcPropertyDataMapper
|
||||
from domain.epc_prediction.comparable_properties import ComparableProperty
|
||||
from domain.geospatial.coordinates import Coordinates
|
||||
|
||||
# Identifying free-text fields blanked when freezing a payload into the committed
|
||||
# fixture (postcode is kept — it is coarse open data and the cohort key).
|
||||
_PII_BLANK_FIELDS = ("address_line_2", "address_line_3", "post_town")
|
||||
|
||||
|
||||
def load_corpus(corpus_dir: Path) -> list[list[ComparableProperty]]:
|
||||
"""Load every postcode cohort under `corpus_dir`. Returns one list of
|
||||
Comparables per postcode (the unit the leave-one-out scorer iterates)."""
|
||||
index_path = corpus_dir / "_index.json"
|
||||
if not index_path.exists():
|
||||
raise FileNotFoundError(
|
||||
f"no corpus index at {index_path} — run a corpus fetch first"
|
||||
)
|
||||
index: dict[str, list[str]] = json.loads(index_path.read_text())
|
||||
coordinates = load_coordinates(corpus_dir)
|
||||
return [
|
||||
_load_cohort(corpus_dir, postcode, certs, coordinates)
|
||||
for postcode, certs in index.items()
|
||||
]
|
||||
|
||||
|
||||
def _load_cohort(
|
||||
corpus_dir: Path,
|
||||
postcode: str,
|
||||
certs: list[str],
|
||||
coordinates: dict[int, Coordinates],
|
||||
) -> list[ComparableProperty]:
|
||||
cohort: list[ComparableProperty] = []
|
||||
for cert in certs:
|
||||
path = corpus_dir / postcode / f"{cert}.json"
|
||||
if not path.exists():
|
||||
continue
|
||||
raw: dict[str, Any] = json.loads(path.read_text())
|
||||
try:
|
||||
epc = EpcPropertyDataMapper.from_api_response(raw)
|
||||
except Exception: # noqa: BLE001 — a bad cert must not abort the sweep
|
||||
continue
|
||||
uprn = _uprn(raw)
|
||||
cohort.append(
|
||||
ComparableProperty(
|
||||
epc=epc,
|
||||
certificate_number=cert,
|
||||
address=_address(raw),
|
||||
registration_date=_registration_date(raw),
|
||||
coordinates=coordinates.get(uprn) if uprn is not None else None,
|
||||
)
|
||||
)
|
||||
return cohort
|
||||
|
||||
|
||||
def load_coordinates(corpus_dir: Path) -> dict[int, Coordinates]:
|
||||
"""The optional `_coordinates.json` sidecar (`{uprn: [lon, lat]}`), resolved
|
||||
from the OS Open-UPRN data by `fetch_corpus_coordinates.py`. Absent for a
|
||||
corpus without geo data — geo-weighting then simply stays off."""
|
||||
path = corpus_dir / "_coordinates.json"
|
||||
if not path.exists():
|
||||
return {}
|
||||
raw: dict[str, list[float]] = json.loads(path.read_text())
|
||||
return {
|
||||
int(uprn): Coordinates(longitude=lon_lat[0], latitude=lon_lat[1])
|
||||
for uprn, lon_lat in raw.items()
|
||||
}
|
||||
|
||||
|
||||
def _uprn(raw: dict[str, Any]) -> Optional[int]:
|
||||
value = raw.get("uprn")
|
||||
return int(value) if value is not None else None
|
||||
|
||||
|
||||
def stable_hash(prefix: str, value: str) -> str:
|
||||
"""A short, deterministic, one-way token for a free-text identifier. Stable
|
||||
across re-lodgements of the same address (normalised first), so dedup still
|
||||
collapses them — but the plaintext address never lands in the repo."""
|
||||
digest = hashlib.sha1(value.strip().upper().encode()).hexdigest()[:12]
|
||||
return f"{prefix}-{digest}"
|
||||
|
||||
|
||||
def anonymise_payload(raw: dict[str, Any]) -> dict[str, Any]:
|
||||
"""De-identify a cert payload for the committed fixture: hash the street
|
||||
address (`address_line_1`) and certificate number into stable tokens, blank
|
||||
the other free-text address lines, and keep everything else — postcode,
|
||||
registration date, SAP version, lodged figures, and all component fields —
|
||||
untouched (gov data is OGL; only the direct identifiers are removed)."""
|
||||
out = dict(raw)
|
||||
address = raw.get("address_line_1")
|
||||
if address:
|
||||
out["address_line_1"] = stable_hash("addr", str(address))
|
||||
cert = raw.get("certificate_number")
|
||||
if cert:
|
||||
out["certificate_number"] = stable_hash("cert", str(cert))
|
||||
for blank_field in _PII_BLANK_FIELDS:
|
||||
if blank_field in out:
|
||||
out[blank_field] = ""
|
||||
return out
|
||||
|
||||
|
||||
def _address(raw: dict[str, Any]) -> Optional[str]:
|
||||
value = raw.get("address_line_1")
|
||||
return str(value).strip().upper() if value else None
|
||||
|
||||
|
||||
def _registration_date(raw: dict[str, Any]) -> Optional[date]:
|
||||
value = raw.get("registration_date")
|
||||
return date.fromisoformat(str(value)) if value else None
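The reason `stable_hash` normalises before hashing is easiest to see with two spellings of the same address. The function body below is copied from the diff; the sample addresses are made up for illustration.

```python
import hashlib


def stable_hash(prefix: str, value: str) -> str:
    # Normalise (strip + upper-case) before hashing so that re-lodgements of
    # the same address collapse to one token.
    digest = hashlib.sha1(value.strip().upper().encode()).hexdigest()[:12]
    return f"{prefix}-{digest}"


# Two lodgements of the same (made-up) address with different casing/spacing.
first = stable_hash("addr", " 1 high street ")
second = stable_hash("addr", "1 HIGH STREET")
other = stable_hash("addr", "2 HIGH STREET")
```

Note that `_address` applies the same strip and upper-case normalisation when loading, so a token written by `anonymise_payload` stays a stable dedup key on reload.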
|
||||
|
|
@ -25,6 +25,12 @@ class EpcPropertyModel(SQLModel, table=True):
|
|||
property_id: Optional[int] = Field(default=None)
|
||||
portfolio_id: Optional[int] = Field(default=None)
|
||||
uploaded_file_id: Optional[int] = Field(default=None)
|
||||
# Provenance of this EPC picture: "lodged" (a real public/landlord EPC) or
|
||||
# "predicted" (EPC Prediction gap-fill, ADR-0031). A property may hold one of
|
||||
# each, so reads filter on it. Defaults to "lodged" — every existing row is a
|
||||
# real EPC. (Requires a matching `source` column in the Drizzle schema — see
|
||||
# docs/handover; the SQLModel mirror is what the test DB builds from.)
|
||||
source: str = Field(default="lodged")
|
||||
|
||||
# Identity / admin
|
||||
uprn: Optional[int] = Field(default=None)
|
||||
|
|
@ -190,6 +196,7 @@ class EpcPropertyModel(SQLModel, table=True):
|
|||
data: EpcPropertyData,
|
||||
property_id: Optional[int] = None,
|
||||
portfolio_id: Optional[int] = None,
|
||||
source: str = "lodged",
|
||||
) -> EpcPropertyModel:
|
||||
es = data.sap_energy_source
|
||||
h = data.sap_heating
|
||||
|
|
@ -202,6 +209,7 @@ class EpcPropertyModel(SQLModel, table=True):
|
|||
return cls(
|
||||
property_id=property_id,
|
||||
portfolio_id=portfolio_id,
|
||||
source=source,
|
||||
uprn=data.uprn,
|
||||
uprn_source=data.uprn_source,
|
||||
report_reference=data.report_reference,
|
||||
|
|
|
|||
|
|
@ -5,7 +5,18 @@ from dataclasses import dataclass
|
|||
from typing import Any, Optional, Protocol
|
||||
|
||||
from datatypes.epc.domain.epc_property_data import EpcPropertyData
|
||||
from domain.epc_prediction.comparable_properties import (
|
||||
ComparableProperty,
|
||||
select_comparables,
|
||||
)
|
||||
from domain.epc_prediction.epc_prediction import EpcPrediction
|
||||
from domain.epc_prediction.prediction_target import (
|
||||
PredictionTargetAttributes,
|
||||
build_prediction_target,
|
||||
)
|
||||
from domain.geospatial.coordinates import Coordinates
|
||||
from domain.geospatial.spatial_reference import SpatialReference
|
||||
from domain.property.property import PropertyIdentity
|
||||
from repositories.geospatial.geospatial_repository import GeospatialRepository
|
||||
from repositories.unit_of_work import UnitOfWork
|
||||
|
||||
|
|
@ -16,6 +27,19 @@ class EpcFetcher(Protocol):
|
|||
def get_by_uprn(self, uprn: int) -> Optional[EpcPropertyData]: ...
|
||||
|
||||
|
||||
class ComparablesRepo(Protocol):
|
||||
"""The cohort source for EPC Prediction (e.g. EpcComparablePropertiesRepository)."""
|
||||
|
||||
def candidates_for(self, postcode: str) -> list[ComparableProperty]: ...
|
||||
|
||||
|
||||
class PredictionAttributesReader(Protocol):
|
||||
"""Resolves an EPC-less Property's prediction attributes from Landlord
|
||||
Overrides (e.g. the property_overrides read adapter)."""
|
||||
|
||||
def attributes_for(self, property_id: int) -> PredictionTargetAttributes: ...
|
||||
|
||||
|
||||
class SolarFetcher(Protocol):
|
||||
"""The slice of the Google Solar client Ingestion needs (e.g. GoogleSolarApiClient)."""
|
||||
|
||||
|
|
@ -24,6 +48,17 @@ class SolarFetcher(Protocol):
|
|||
) -> dict[str, Any]: ...
|
||||
|
||||
|
||||
@dataclass
|
||||
class _Prep:
|
||||
"""A property's transactional inputs read in the unit phase, before external
|
||||
IO: its identity (postcode + uprn) and, when the predictor is wired, its
|
||||
resolved prediction attributes (so the no-unit fetch phase can predict)."""
|
||||
|
||||
property_id: int
|
||||
identity: PropertyIdentity
|
||||
attributes: Optional[PredictionTargetAttributes]
|
||||
|
||||
|
||||
@dataclass
|
||||
class _Fetched:
|
||||
"""One property's externally-fetched source data, awaiting the write phase."""
|
||||
|
|
@ -31,6 +66,7 @@ class _Fetched:
|
|||
property_id: int
|
||||
uprn: int
|
||||
epc: Optional[EpcPropertyData]
|
||||
predicted_epc: Optional[EpcPropertyData]
|
||||
solar_insights: Optional[dict[str, Any]]
|
||||
spatial: Optional[SpatialReference]
|
||||
|
||||
|
|
@ -59,46 +95,104 @@ class IngestionOrchestrator:
|
|||
epc_fetcher: EpcFetcher,
|
||||
geospatial_repo: GeospatialRepository,
|
||||
solar_fetcher: SolarFetcher,
|
||||
comparables_repo: Optional[ComparablesRepo] = None,
|
||||
prediction_attributes_reader: Optional[PredictionAttributesReader] = None,
|
||||
epc_prediction: Optional[EpcPrediction] = None,
|
||||
) -> None:
|
||||
self._unit_of_work = unit_of_work
|
||||
self._epc_fetcher = epc_fetcher
|
||||
self._geospatial_repo = geospatial_repo
|
||||
self._solar_fetcher = solar_fetcher
|
||||
# EPC Prediction gap-fill (ADR-0031): when all three are wired, an EPC-less
|
||||
# Property is predicted from its postcode cohort and persisted to the
|
||||
# predicted slot. When any is absent, prediction is simply off and
|
||||
# ingestion behaves exactly as before.
|
||||
self._comparables_repo = comparables_repo
|
||||
self._prediction_attributes_reader = prediction_attributes_reader
|
||||
self._epc_prediction = epc_prediction
|
||||
|
||||
def run(self, property_ids: list[int]) -> None:
|
||||
uprns = self._uprns_for(property_ids)
|
||||
fetched = [self._fetch(property_id, uprn) for property_id, uprn in uprns]
|
||||
preps = self._prepare(property_ids)
|
||||
fetched = [self._fetch(prep) for prep in preps]
|
||||
self._persist(fetched)
|
||||
|
||||
- def _uprns_for(self, property_ids: list[int]) -> list[tuple[int, int]]:
+ def _prepare(self, property_ids: list[int]) -> list[_Prep]:
  # A short read unit; properties with no UPRN (e.g. landlord_property_id
- # only) are skipped — a later Site-Notes path covers them.
+ # only) are skipped — a later Site-Notes path covers them. Prediction
+ # attributes (Landlord Overrides) are resolved here, in-unit, so the
+ # no-unit fetch phase holds everything it needs to predict.
  with self._unit_of_work() as uow:
  properties = uow.property.get_many(property_ids)
- return [
- (property_id, prop.identity.uprn)
- for property_id, prop in zip(property_ids, properties, strict=True)
- if prop.identity.uprn is not None
- ]
+ preps: list[_Prep] = []
+ for property_id, prop in zip(property_ids, properties, strict=True):
+ if prop.identity.uprn is None:
+ continue
+ attributes = (
+ self._prediction_attributes_reader.attributes_for(property_id)
+ if self._prediction_attributes_reader is not None
+ else None
+ )
+ preps.append(_Prep(property_id, prop.identity, attributes))
+ return preps
|
||||
|
||||
- def _fetch(self, property_id: int, uprn: int) -> _Fetched:
+ def _fetch(self, prep: _Prep) -> _Fetched:
|
||||
# No unit open here — this is the external-IO phase. One spatial
|
||||
# reference lookup yields the coordinates (which drive the Solar fetch)
|
||||
# and the planning protections (cached for Modelling, ADR-0020).
|
||||
+ uprn = prep.identity.uprn
+ assert uprn is not None # _prepare drops UPRN-less properties
  epc = self._epc_fetcher.get_by_uprn(uprn)
  solar_insights: Optional[dict[str, Any]] = None
  spatial: Optional[SpatialReference] = self._geospatial_repo.spatial_for(uprn)
- if spatial is not None and spatial.coordinates is not None:
+ coordinates = spatial.coordinates if spatial is not None else None
+ if coordinates is not None:
  solar_insights = self._solar_fetcher.get_building_insights(
- spatial.coordinates.longitude, spatial.coordinates.latitude
+ coordinates.longitude, coordinates.latitude
  )
- return _Fetched(property_id, uprn, epc, solar_insights, spatial)
+ predicted_epc = (
+ self._predict(prep.identity, coordinates, prep.attributes)
+ if epc is None
+ else None
+ )
+ return _Fetched(
+ prep.property_id, uprn, epc, predicted_epc, solar_insights, spatial
+ )
|
||||
|
||||
def _predict(
|
||||
self,
|
||||
identity: PropertyIdentity,
|
||||
coordinates: Optional[Coordinates],
|
||||
attributes: Optional[PredictionTargetAttributes],
|
||||
) -> Optional[EpcPropertyData]:
|
||||
"""Synthesise the EPC-less Property's picture from its postcode cohort, or
|
||||
None when the predictor is unwired, the Property is gated out (unknown
|
||||
property type), or no comparables survive selection (ADR-0031)."""
|
||||
if (
|
||||
self._comparables_repo is None
|
||||
or self._epc_prediction is None
|
||||
or attributes is None
|
||||
):
|
||||
return None
|
||||
target = build_prediction_target(identity, coordinates, attributes)
|
||||
if target is None:
|
||||
return None
|
||||
candidates = self._comparables_repo.candidates_for(identity.postcode)
|
||||
comparables = select_comparables(target, candidates)
|
||||
if not comparables.members:
|
||||
return None
|
||||
return self._epc_prediction.predict(target, comparables)
|
||||
|
||||
def _persist(self, fetched: list[_Fetched]) -> None:
|
||||
with self._unit_of_work() as uow:
|
||||
for item in fetched:
|
||||
if item.epc is not None:
|
||||
uow.epc.save(item.epc, property_id=item.property_id)
|
||||
elif item.predicted_epc is not None:
|
||||
uow.epc.save(
|
||||
item.predicted_epc,
|
||||
property_id=item.property_id,
|
||||
source="predicted",
|
||||
)
|
||||
# The live `solar` table is keyed by UPRN and needs the fetch's
|
||||
# coordinates; insights are only set when those coordinates were
|
||||
# resolved, so spatial.coordinates is non-None alongside them.
|
||||
|
|
|
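The gap-fill rule in the orchestrator hunk above (a lodged EPC always wins; prediction runs only for an EPC-less Property and only when the predictor is wired) can be sketched in isolation. This is a minimal illustration, not the orchestrator's code: `choose_epc` and its string payloads are hypothetical stand-ins for the real `_fetch` / `_predict` pair and `EpcPropertyData`.

```python
from typing import Callable, Optional


def choose_epc(
    lodged: Optional[str],
    predictor: Optional[Callable[[], Optional[str]]],
) -> tuple[Optional[str], Optional[str]]:
    """Return (lodged, predicted) for one property.

    Prediction is attempted only when there is no lodged EPC and a
    predictor is wired; otherwise the predicted slot stays empty.
    """
    if lodged is not None or predictor is None:
        return lodged, None
    # The predictor itself may still return None (gated out, empty cohort).
    return None, predictor()


print(choose_epc("lodged-epc", lambda: "predicted-epc"))
print(choose_epc(None, lambda: "predicted-epc"))
print(choose_epc(None, None))
```

Keeping the decision a pure function of "what was fetched" and "what is wired" is what lets ingestion behave exactly as before when any prediction dependency is absent.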
|||
0
repositories/comparable_properties/__init__.py
Normal file
|
|
@ -0,0 +1,24 @@
|
|||
"""The ComparableProperties repository port (ADR-0029 decision 3; ADR-0031).
|
||||
|
||||
Owns the cohort IO for EPC Prediction — given a target's postcode, return the
|
||||
candidate `ComparableProperty`s (the postcode's other lodged certs, mapped to
|
||||
`EpcPropertyData` with their register metadata + resolved coordinates). The pure
|
||||
domain `select_comparables` then filters these into the reference cohort, and
|
||||
`EpcPrediction.predict` synthesises the picture. Kept a port so the orchestrator
|
||||
depends on the cohort source abstractly and tests substitute a fake.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from abc import ABC, abstractmethod
|
||||
|
||||
from domain.epc_prediction.comparable_properties import ComparableProperty
|
||||
|
||||
|
||||
class ComparablePropertiesRepository(ABC):
|
||||
@abstractmethod
|
||||
def candidates_for(self, postcode: str) -> list[ComparableProperty]:
|
||||
"""Every candidate neighbour in `postcode` — one `ComparableProperty` per lodged
|
||||
cert, carrying its `EpcPropertyData`, certificate number, address,
|
||||
registration date, and resolved coordinates (None when unresolvable)."""
|
||||
...
|
||||
|
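The port's docstring says tests substitute a fake for the cohort source. A minimal sketch of such a fake follows, under stated assumptions: `FakeComparable` is a hypothetical, trimmed stand-in for `ComparableProperty` (the real dataclass carries more fields), and `ComparablesPort` / `InMemoryComparablePropertiesRepository` are illustrative names, not classes from this repository.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FakeComparable:
    """Hypothetical stand-in for ComparableProperty (illustration only)."""

    certificate_number: str
    address: str
    coordinates: Optional[tuple[float, float]] = None


class ComparablesPort(ABC):
    """Mirrors the port's shape: postcode in, candidate list out."""

    @abstractmethod
    def candidates_for(self, postcode: str) -> list[FakeComparable]: ...


class InMemoryComparablePropertiesRepository(ComparablesPort):
    def __init__(self, by_postcode: dict[str, list[FakeComparable]]) -> None:
        self._by_postcode = by_postcode

    def candidates_for(self, postcode: str) -> list[FakeComparable]:
        # An unknown postcode yields an empty cohort rather than raising.
        return list(self._by_postcode.get(postcode, []))


repo = InMemoryComparablePropertiesRepository(
    {"AB1 2CD": [FakeComparable("cert-1", "1 Example Street")]}
)
print(repo.candidates_for("AB1 2CD"))
print(repo.candidates_for("ZZ9 9ZZ"))
```

Returning an empty list for an unknown postcode lines up with the orchestrator's "no comparables survive selection" path, which yields no prediction.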
|
@ -0,0 +1,82 @@
|
|||
"""EPC-API + geospatial adapter for the ComparableProperties port (ADR-0031).
|
||||
|
||||
Assembles a postcode's candidate cohort: the EPC search lists the postcode's
|
||||
lodged certs, each is fetched + mapped to `EpcPropertyData`, and the certs' UPRNs
|
||||
are resolved to coordinates in one batched geospatial read (closely-numbered
|
||||
UPRNs share a partition). Register metadata the cert itself doesn't carry
|
||||
(address, registration date) is threaded off the search row.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from datetime import date
|
||||
from typing import Optional, Protocol
|
||||
|
||||
from datatypes.epc.domain.epc_property_data import EpcPropertyData
|
||||
from datatypes.epc.search.epc_search_result import EpcSearchResult
|
||||
from domain.epc_prediction.comparable_properties import ComparableProperty
|
||||
from domain.geospatial.coordinates import Coordinates
|
||||
from repositories.comparable_properties.comparable_properties_repository import (
|
||||
ComparablePropertiesRepository,
|
||||
)
|
||||
|
||||
|
||||
class CohortEpcClient(Protocol):
|
||||
"""The slice of the EPC-API client the cohort fetch needs (e.g.
|
||||
`EpcClientService`)."""
|
||||
|
||||
def search_by_postcode(self, postcode: str) -> list[EpcSearchResult]: ...
|
||||
|
||||
def get_by_certificate_number(self, cert_num: str) -> EpcPropertyData: ...
|
||||
|
||||
|
||||
class CohortGeospatial(Protocol):
|
||||
"""The geospatial slice the cohort fetch needs — batch UPRN→coordinate."""
|
||||
|
||||
def coordinates_for_uprns(
|
||||
self, uprns: list[int]
|
||||
) -> dict[int, Coordinates]: ...
|
||||
|
||||
|
||||
class EpcComparablePropertiesRepository(ComparablePropertiesRepository):
|
||||
def __init__(
|
||||
self, epc_client: CohortEpcClient, geospatial: CohortGeospatial
|
||||
) -> None:
|
||||
self._epc_client = epc_client
|
||||
self._geospatial = geospatial
|
||||
|
||||
def candidates_for(self, postcode: str) -> list[ComparableProperty]:
|
||||
results: list[EpcSearchResult] = self._epc_client.search_by_postcode(
|
||||
postcode
|
||||
)
|
||||
uprns: list[int] = [r.uprn for r in results if r.uprn is not None]
|
||||
coordinates: dict[int, Coordinates] = self._geospatial.coordinates_for_uprns(
|
||||
uprns
|
||||
)
|
||||
return [self._comparable(result, coordinates) for result in results]
|
||||
|
||||
def _comparable(
|
||||
self, result: EpcSearchResult, coordinates: dict[int, Coordinates]
|
||||
) -> ComparableProperty:
|
||||
epc: EpcPropertyData = self._epc_client.get_by_certificate_number(
|
||||
result.certificate_number
|
||||
)
|
||||
resolved: Optional[Coordinates] = (
|
||||
coordinates.get(result.uprn) if result.uprn is not None else None
|
||||
)
|
||||
return ComparableProperty(
|
||||
epc=epc,
|
||||
certificate_number=result.certificate_number,
|
||||
address=result.address_line_1,
|
||||
registration_date=_parse_date(result.registration_date),
|
||||
coordinates=resolved,
|
||||
)
|
||||
|
||||
|
||||
def _parse_date(value: str) -> Optional[date]:
|
||||
"""The register's ISO registration date, or None when unparseable (the
|
||||
predictor falls back to an unweighted recency)."""
|
||||
try:
|
||||
return date.fromisoformat(value[:10])
|
||||
except ValueError:
|
||||
return None
|
||||
|
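The date handling in `_parse_date` above relies on slicing the first ten characters before calling `date.fromisoformat`, so a full timestamp still parses. A small standalone sketch of the same technique; `parse_registration_date` is a renamed copy for illustration and the sample strings are made up.

```python
from datetime import date
from typing import Optional


def parse_registration_date(value: str) -> Optional[date]:
    """Keep only the leading YYYY-MM-DD part; None when it does not parse."""
    try:
        return date.fromisoformat(value[:10])
    except ValueError:
        return None


print(parse_registration_date("2024-03-01"))
print(parse_registration_date("2024-03-01T09:30:00Z"))
print(parse_registration_date("01/03/2024"))
```

Only `ValueError` is caught, so a non-string value (for example `None` from a missing field) would raise `TypeError` instead of returning `None`; the `str` annotation assumes the search row always carries a string.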
|
@ -45,7 +45,7 @@ from infrastructure.postgres.epc_property_table import (
|
|||
EpcRenewableHeatIncentiveModel,
|
||||
EpcWindowModel,
|
||||
)
|
||||
- from repositories.epc.epc_repository import EpcRepository
+ from repositories.epc.epc_repository import EpcRepository, EpcSource
|
||||
from utilities.private import private
|
||||
|
||||
_T = TypeVar("_T")
|
||||
|
|
@ -88,14 +88,16 @@ class EpcPostgresRepository(EpcRepository):
|
|||
  data: EpcPropertyData,
  property_id: Optional[int] = None,
  portfolio_id: Optional[int] = None,
+ source: EpcSource = "lodged",
  ) -> int:
- # Idempotent on property_id: a re-run replaces the property's EPC graph
- # rather than duplicating it (ADR-0012). Anonymous saves (no property_id)
- # always insert.
+ # Idempotent on (property_id, source): a re-run replaces the property's
+ # EPC graph for THAT source rather than duplicating it (ADR-0012), and a
+ # predicted save leaves the lodged one intact, and vice versa (ADR-0031).
+ # Anonymous saves (no property_id) always insert.
  if property_id is not None:
- self._delete_for_property(property_id)
+ self._delete_for_property(property_id, source)
  parent = EpcPropertyModel.from_epc_property_data(
- data, property_id=property_id, portfolio_id=portfolio_id
+ data, property_id=property_id, portfolio_id=portfolio_id, source=source
  )
  self._session.add(parent)
  self._session.flush()
|
||||
|
|
@ -154,15 +156,16 @@ class EpcPostgresRepository(EpcRepository):
|
|||
)
|
||||
return epc_property_id
|
||||
|
||||
- def _delete_for_property(self, property_id: int) -> None:
- """Remove the property's existing EPC graph (parent + child tables) so a
- re-save replaces rather than duplicates (ADR-0012)."""
+ def _delete_for_property(self, property_id: int, source: EpcSource) -> None:
+ """Remove the property's existing EPC graph for `source` (parent + child
+ tables) so a re-save replaces rather than duplicates (ADR-0012), without
+ disturbing the other source's slot (ADR-0031)."""
  epc_ids = [
  i
  for i in self._session.exec(
- select(EpcPropertyModel.id).where(
- EpcPropertyModel.property_id == property_id
- )
+ select(EpcPropertyModel.id)
+ .where(EpcPropertyModel.property_id == property_id)
+ .where(EpcPropertyModel.source == source)
  ).all()
  if i is not None
  ]
|
||||
|
|
@ -200,9 +203,20 @@ class EpcPostgresRepository(EpcRepository):
|
|||
)
|
||||
|
||||
def get_for_property(self, property_id: int) -> Optional[EpcPropertyData]:
|
||||
return self._get_for_property(property_id, source="lodged")
|
||||
|
||||
def get_predicted_for_property(
|
||||
self, property_id: int
|
||||
) -> Optional[EpcPropertyData]:
|
||||
return self._get_for_property(property_id, source="predicted")
|
||||
|
||||
def _get_for_property(
|
||||
self, property_id: int, source: EpcSource
|
||||
) -> Optional[EpcPropertyData]:
|
||||
row = self._session.exec(
|
||||
select(EpcPropertyModel)
|
||||
.where(EpcPropertyModel.property_id == property_id)
|
||||
.where(EpcPropertyModel.source == source)
|
||||
.order_by(EpcPropertyModel.id) # type: ignore[arg-type]
|
||||
).first()
|
||||
if row is None or row.id is None:
|
||||
|
|
@ -212,13 +226,26 @@ class EpcPostgresRepository(EpcRepository):
|
|||
def get_for_properties(
|
||||
self, property_ids: list[int]
|
||||
) -> dict[int, EpcPropertyData]:
|
||||
- """Bulk-hydrate a batch's EPCs in a handful of per-table IN queries
- (ADR-0012), not N x per-property. Load-whole per ADR-0002."""
+ """Bulk-hydrate a batch's LODGED EPCs, keyed by property_id."""
+ return self._for_properties(property_ids, source="lodged")
|
||||
|
||||
def get_predicted_for_properties(
|
||||
self, property_ids: list[int]
|
||||
) -> dict[int, EpcPropertyData]:
|
||||
"""Bulk-hydrate a batch's PREDICTED EPCs (ADR-0031), keyed by property_id."""
|
||||
return self._for_properties(property_ids, source="predicted")
|
||||
|
||||
def _for_properties(
|
||||
self, property_ids: list[int], source: EpcSource
|
||||
) -> dict[int, EpcPropertyData]:
|
||||
"""Bulk-hydrate a batch's EPCs of one `source` in a handful of per-table IN
|
||||
queries (ADR-0012), not N x per-property. Load-whole per ADR-0002."""
|
||||
if not property_ids:
|
||||
return {}
|
||||
parents = self._session.exec(
|
||||
select(EpcPropertyModel)
|
||||
.where(col(EpcPropertyModel.property_id).in_(property_ids))
|
||||
.where(EpcPropertyModel.source == source)
|
||||
.order_by(EpcPropertyModel.id) # type: ignore[arg-type]
|
||||
).all()
|
||||
parent_by_property: dict[int, EpcPropertyModel] = {}
|
||||
|
|
|
|||
|
|
@ -1,10 +1,14 @@
|
|||
from __future__ import annotations
|
||||
|
||||
from abc import ABC, abstractmethod
|
||||
- from typing import Optional
+ from typing import Literal, Optional
|
||||
|
||||
from datatypes.epc.domain.epc_property_data import EpcPropertyData
|
||||
|
||||
# Provenance of a persisted EPC picture (ADR-0031): a real "lodged" EPC, or a
|
||||
# "predicted" one synthesised by EPC Prediction. A property can hold one of each.
|
||||
EpcSource = Literal["lodged", "predicted"]
|
||||
|
||||
|
||||
class EpcRepository(ABC):
|
||||
"""Persists and loads the structured EPC Property Data slice.
|
||||
|
|
@ -12,7 +16,8 @@ class EpcRepository(ABC):
|
|||
`save` writes the `EpcPropertyData` to the `epc_property` parent row and its
|
||||
child tables; `get` reconstructs the persisted projection back into an
|
||||
`EpcPropertyData`. Round-trip fidelity over that projection is pinned by the
|
||||
- Slice-1 round-trip test (Hestia-Homes/Model#1129).
+ Slice-1 round-trip test (Hestia-Homes/Model#1129). Each EPC carries a
+ `source` so a lodged and a predicted picture coexist per property (ADR-0031).
|
||||
"""
|
||||
|
||||
@abstractmethod
|
||||
|
|
@ -21,18 +26,36 @@ class EpcRepository(ABC):
|
|||
data: EpcPropertyData,
|
||||
property_id: int | None = None,
|
||||
portfolio_id: int | None = None,
|
||||
source: EpcSource = "lodged",
|
||||
) -> int: ...
|
||||
|
||||
@abstractmethod
|
||||
def get(self, epc_property_id: int) -> EpcPropertyData: ...
|
||||
|
||||
@abstractmethod
|
||||
- def get_for_property(self, property_id: int) -> Optional[EpcPropertyData]: ...
+ def get_for_property(self, property_id: int) -> Optional[EpcPropertyData]:
+ """The property's LODGED EPC (the predicted slot is read separately)."""
+ ...
|
||||
|
||||
@abstractmethod
|
||||
def get_predicted_for_property(
|
||||
self, property_id: int
|
||||
) -> Optional[EpcPropertyData]:
|
||||
"""The property's PREDICTED EPC (EPC Prediction gap-fill), or None."""
|
||||
...
|
||||
|
||||
@abstractmethod
|
||||
def get_for_properties(
|
||||
self, property_ids: list[int]
|
||||
) -> dict[int, EpcPropertyData]:
|
||||
- """Bulk-hydrate a batch's EPCs, keyed by property_id (only those with an
- EPC are present). A handful of per-table queries, not N per property."""
+ """Bulk-hydrate a batch's LODGED EPCs, keyed by property_id (only those
+ with one are present). A handful of per-table queries, not N per property."""
|
||||
...
|
||||
|
||||
@abstractmethod
|
||||
def get_predicted_for_properties(
|
||||
self, property_ids: list[int]
|
||||
) -> dict[int, EpcPropertyData]:
|
||||
"""Bulk-hydrate a batch's PREDICTED EPCs (ADR-0031), keyed by property_id
|
||||
(only those with one are present)."""
|
||||
...
|
||||
|
|
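The repository contract above gives each property two independent slots, one per `EpcSource`, with `save` idempotent on `(property_id, source)`. A minimal in-memory model of that contract is sketched below. `TwoSlotEpcStore` is hypothetical and stores plain strings instead of `EpcPropertyData`; it illustrates the semantics only, not the Postgres adapter.

```python
from __future__ import annotations

from typing import Literal, Optional

EpcSource = Literal["lodged", "predicted"]


class TwoSlotEpcStore:
    """Hypothetical in-memory model of the (property_id, source) slots."""

    def __init__(self) -> None:
        self._slots: dict[tuple[int, EpcSource], str] = {}

    def save(self, data: str, property_id: int, source: EpcSource = "lodged") -> None:
        # Replaces only this source's slot; the other slot is left untouched.
        self._slots[(property_id, source)] = data

    def get_for_property(self, property_id: int) -> Optional[str]:
        return self._slots.get((property_id, "lodged"))

    def get_predicted_for_property(self, property_id: int) -> Optional[str]:
        return self._slots.get((property_id, "predicted"))


store = TwoSlotEpcStore()
store.save("predicted-v1", property_id=7, source="predicted")
store.save("predicted-v2", property_id=7, source="predicted")  # re-run replaces
store.save("lodged-v1", property_id=7)  # default source is "lodged"
print(store.get_for_property(7), store.get_predicted_for_property(7))
```

Defaulting `source` to `"lodged"` mirrors the signature in the diff, so existing callers that never pass a source keep writing to the lodged slot.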
|
|||
|
|
@ -18,6 +18,21 @@ class GeospatialRepository(ABC):
|
|||
@abstractmethod
|
||||
def coordinates_for(self, uprn: int) -> Optional[Coordinates]: ...
|
||||
|
||||
def coordinates_for_uprns(
|
||||
self, uprns: list[int]
|
||||
) -> dict[int, Coordinates]:
|
||||
"""Resolve many UPRNs at once, returning only those covered. The default
|
||||
is a per-UPRN loop; adapters whose storage is partitioned (e.g. the S3
|
||||
Open-UPRN parquet) override this to read each partition once for all the
|
||||
UPRNs it covers — far fewer reads when the UPRNs are co-located, as
|
||||
closely-numbered UPRNs share a partition."""
|
||||
resolved: dict[int, Coordinates] = {}
|
||||
for uprn in uprns:
|
||||
coordinates = self.coordinates_for(uprn)
|
||||
if coordinates is not None:
|
||||
resolved[uprn] = coordinates
|
||||
return resolved
|
||||
|
||||
def spatial_for(self, uprn: int) -> Optional[SpatialReference]:
|
||||
"""The Property's coordinates and planning protections together, in one
|
||||
reference lookup (ADR-0020) — Ingestion uses the coordinates to drive
|
||||
|
|
|
|||
|
|
@ -1,5 +1,6 @@
|
|||
from __future__ import annotations
|
||||
|
||||
from collections import defaultdict
|
||||
from collections.abc import Callable
|
||||
from typing import Any, Optional
|
||||
|
||||
|
|
@ -62,6 +63,30 @@ class GeospatialS3Repository(GeospatialRepository):
|
|||
reference: Optional[SpatialReference] = self.spatial_for(uprn)
|
||||
return reference.coordinates if reference is not None else None
|
||||
|
||||
def coordinates_for_uprns(
|
||||
self, uprns: list[int]
|
||||
) -> dict[int, Coordinates]:
|
||||
"""Batch lookup that reads the meta once, groups the UPRNs by their
|
||||
covering partition, and reads each partition once for all the UPRNs it
|
||||
covers (co-located UPRNs share a partition, so a cohort is typically one
|
||||
or two reads). Uncovered / absent UPRNs are omitted from the result."""
|
||||
meta = self._read_parquet(_META_KEY)
|
||||
by_partition: dict[str, list[int]] = defaultdict(list)
|
||||
for uprn in uprns:
|
||||
covering = meta[(meta["lower"] <= uprn) & (meta["upper"] >= uprn)]
|
||||
if not covering.empty:
|
||||
by_partition[str(covering["filenames"].iloc[0])].append(uprn)
|
||||
resolved: dict[int, Coordinates] = {}
|
||||
for filename, partition_uprns in by_partition.items():
|
||||
partition = self._read_parquet(f"spatial/{filename}")
|
||||
rows = partition[partition["UPRN"].isin(partition_uprns)]
|
||||
for _, row in rows.iterrows():
|
||||
resolved[int(row["UPRN"])] = Coordinates(
|
||||
longitude=float(row["LONGITUDE"]),
|
||||
latitude=float(row["LATITUDE"]),
|
||||
)
|
||||
return resolved
|
||||
|
||||
def planning_restrictions_for(self, uprn: int) -> Optional[PlanningRestrictions]:
|
||||
reference: Optional[SpatialReference] = self.spatial_for(uprn)
|
||||
return reference.restrictions if reference is not None else None
|
||||
|
|
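The batch lookup in `coordinates_for_uprns` hinges on one step: bucketing UPRNs by the partition whose inclusive range covers them, so each partition is read once. That step can be sketched without pandas or S3. The `META` rows and filenames below are invented for illustration; the real metadata lives in a parquet file with `lower`, `upper` and `filenames` columns.

```python
from collections import defaultdict

# Hypothetical partition metadata: (lower, upper, filename), inclusive bounds.
META = [
    (1, 1000, "part-000.parquet"),
    (1001, 2000, "part-001.parquet"),
]


def group_by_partition(uprns: list[int]) -> dict[str, list[int]]:
    """Bucket each UPRN under the first partition whose range covers it."""
    by_partition: dict[str, list[int]] = defaultdict(list)
    for uprn in uprns:
        for lower, upper, filename in META:
            if lower <= uprn <= upper:
                by_partition[filename].append(uprn)
                break
        # A UPRN no partition covers is simply omitted from the result.
    return dict(by_partition)


groups = group_by_partition([10, 11, 1500, 999_999])
print(groups)
```

With the groups in hand, the number of partition reads equals `len(groups)` rather than the number of UPRNs, which is the saving the docstring describes for co-located cohorts.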
|
|||
23
repositories/property/prediction_target_attributes_reader.py
Normal file
|
|
@ -0,0 +1,23 @@
|
|||
"""Read port for an EPC-less Property's prediction attributes (ADR-0031 slice-5d).
|
||||
|
||||
Returns the `property_type` / `built_form` / `wall_construction` resolved from
|
||||
Landlord Overrides that `build_prediction_target` needs. Kept a port because the
|
||||
real adapter — a read over the `property_overrides` fact layer — is being built
|
||||
separately (see docs/HANDOVER_EPC_PREDICTION_WIRING.md); the ingestion wiring
|
||||
depends on this abstraction and tests substitute a fake.
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from abc import ABC, abstractmethod
|
||||
|
||||
from domain.epc_prediction.prediction_target import PredictionTargetAttributes
|
||||
|
||||
|
||||
class PredictionTargetAttributesReader(ABC):
|
||||
@abstractmethod
|
||||
def attributes_for(self, property_id: int) -> PredictionTargetAttributes:
|
||||
"""The Property's resolved prediction attributes. `property_type` is None
|
||||
when it could not be resolved — which gates the Property out of
|
||||
prediction (`build_prediction_target` returns None)."""
|
||||
...
|
||||
|
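Since the real adapter for this read port is built separately, tests need a fake. The sketch below shows one possible shape, together with the gating rule the docstring describes (an unresolved `property_type` rules the Property out of prediction). `Attributes` is a hypothetical stand-in for `PredictionTargetAttributes`, `is_predictable` only imitates the gate inside `build_prediction_target`, and the sample values are illustrative.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Attributes:
    """Hypothetical stand-in for PredictionTargetAttributes."""

    property_type: Optional[str]
    built_form: Optional[str] = None
    wall_construction: Optional[str] = None


class FakeAttributesReader:
    def __init__(self, by_property: dict[int, Attributes]) -> None:
        self._by_property = by_property

    def attributes_for(self, property_id: int) -> Attributes:
        # Properties with nothing resolved come back with property_type=None.
        return self._by_property.get(property_id, Attributes(property_type=None))


def is_predictable(attributes: Attributes) -> bool:
    """Illustrative gate: an unknown property type rules prediction out."""
    return attributes.property_type is not None


reader = FakeAttributesReader({1: Attributes("house", "semi-detached")})
print(is_predictable(reader.attributes_for(1)))
print(is_predictable(reader.attributes_for(2)))
```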
|
@ -72,6 +72,7 @@ class PropertyPostgresRepository(PropertyRepository):
|
|||
return Property(
|
||||
identity=identity,
|
||||
epc=self._epc().get_for_property(property_id),
|
||||
predicted_epc=self._epc().get_predicted_for_property(property_id),
|
||||
planning_restrictions=_restrictions_of(row.uprn, restrictions),
|
||||
)
|
||||
|
||||
|
|
@ -83,6 +84,7 @@ class PropertyPostgresRepository(PropertyRepository):
|
|||
).all()
|
||||
row_by_id = {row.id: row for row in rows}
|
||||
epcs = self._epc().get_for_properties(property_ids)
|
||||
predicted_epcs = self._epc().get_predicted_for_properties(property_ids)
|
||||
restrictions: dict[int, PlanningRestrictions] = self._restrictions_for(
|
||||
[row.uprn for row in rows if row.uprn is not None]
|
||||
)
|
||||
|
|
@ -101,6 +103,7 @@ class PropertyPostgresRepository(PropertyRepository):
|
|||
landlord_property_id=row.landlord_property_id,
|
||||
),
|
||||
epc=epcs.get(property_id),
|
||||
predicted_epc=predicted_epcs.get(property_id),
|
||||
planning_restrictions=_restrictions_of(row.uprn, restrictions),
|
||||
)
|
||||
)
|
||||
|
|
|
|||
117
scripts/build_epc_prediction_fixture.py
Normal file
|
|
@ -0,0 +1,117 @@
|
|||
"""Freeze a small, anonymised EPC Prediction fixture for the Tier-1 gate (ADR-0030).
|
||||
|
||||
Curates a deterministic subset of the local scratch corpus
|
||||
(`/tmp/epc_prediction_corpus`, gitignored) into a committed fixture under
|
||||
`tests/fixtures/epc_prediction/`. Selection keeps postcodes that can actually be
|
||||
scored — at least one SAP 10.2 target plus a second distinct address to predict
|
||||
it from. Every payload is run through `anonymise_payload` first, so the street
|
||||
address + certificate number become opaque tokens and no plaintext address lands
|
||||
in the repo (postcode + component data are open gov data and kept).
|
||||
|
||||
The committed fixture is the deterministic basis for the ratcheting gate; the
|
||||
large scratch corpus stays local for iteration + the offline battle-test.
|
||||
|
||||
USAGE
|
||||
-----
|
||||
PYTHONPATH=. python scripts/build_epc_prediction_fixture.py
|
||||
|
||||
Source: $EPC_PREDICTION_CORPUS (default /tmp/epc_prediction_corpus).
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
import os
|
||||
from pathlib import Path
|
||||
from typing import Any
|
||||
|
||||
from harness.epc_prediction_corpus import anonymise_payload, stable_hash
|
||||
|
||||
SOURCE = Path(os.environ.get("EPC_PREDICTION_CORPUS", "/tmp/epc_prediction_corpus"))
|
||||
FIXTURE = Path("tests/fixtures/epc_prediction")
|
||||
|
||||
_SAP_10_2 = "10.2"
|
||||
_MAX_POSTCODES = 15 # keep the committed fixture small
|
||||
_MAX_COHORT = 25 # cap certs per postcode to bound repo size
|
||||
|
||||
|
||||
def _load_payloads(
|
||||
postcode: str, certs: list[str]
|
||||
) -> list[tuple[str, dict[str, Any]]]:
|
||||
"""The `(source cert number, payload)` pairs for a postcode — the cert
|
||||
number lives in the index/filename, not the cached payload."""
|
||||
payloads: list[tuple[str, dict[str, Any]]] = []
|
||||
for cert in certs:
|
||||
path = SOURCE / postcode / f"{cert}.json"
|
||||
if path.exists():
|
||||
payloads.append((cert, json.loads(path.read_text())))
|
||||
return payloads
|
||||
|
||||
|
||||
def _qualifies(payloads: list[tuple[str, dict[str, Any]]]) -> bool:
|
||||
"""A postcode is usable iff it has ≥1 SAP 10.2 cert (a valid target) and ≥2
|
||||
distinct addresses (so the target has at least one neighbour to predict it)."""
|
||||
has_target = any(
|
||||
str(p.get("sap_version")) == _SAP_10_2 for _, p in payloads
|
||||
)
|
||||
addresses = {
|
||||
str(p.get("address_line_1", "")).strip().upper() for _, p in payloads
|
||||
}
|
||||
return has_target and len(addresses) >= 2
|
||||
|
||||
|
||||
def main() -> None:
|
||||
index: dict[str, list[str]] = json.loads(
|
||||
(SOURCE / "_index.json").read_text()
|
||||
)
|
||||
fixture_index: dict[str, list[str]] = {}
|
||||
kept_uprns: set[str] = set()
|
||||
total_certs = 0
|
||||
for postcode, certs in index.items():
|
||||
if len(fixture_index) >= _MAX_POSTCODES:
|
||||
break
|
||||
payloads = _load_payloads(postcode, certs)[:_MAX_COHORT]  # cap before qualifying, so the kept cohort itself is scoreable
|
||||
if not _qualifies(payloads):
|
||||
continue
|
||||
kept: list[str] = []
|
||||
for cert, raw in payloads[:_MAX_COHORT]:
|
||||
cert_token = stable_hash("cert", cert)
|
||||
anon = anonymise_payload(raw)
|
||||
out = FIXTURE / postcode / f"{cert_token}.json"
|
||||
out.parent.mkdir(parents=True, exist_ok=True)
|
||||
out.write_text(json.dumps(anon))
|
||||
kept.append(cert_token)
|
||||
uprn = raw.get("uprn")
|
||||
if uprn is not None:
|
||||
kept_uprns.add(str(int(uprn)))
|
||||
fixture_index[postcode] = kept
|
||||
total_certs += len(kept)
|
||||
(FIXTURE / "_index.json").parent.mkdir(parents=True, exist_ok=True)
|
||||
(FIXTURE / "_index.json").write_text(json.dumps(fixture_index, indent=2))
|
||||
_write_coordinates(kept_uprns)
|
||||
print(
|
||||
f"wrote {len(fixture_index)} postcodes / {total_certs} anonymised certs "
|
||||
f"to {FIXTURE}"
|
||||
)
|
||||
|
||||
|
||||
def _write_coordinates(kept_uprns: set[str]) -> None:
|
||||
"""Carry the geo-proximity coordinates for the kept UPRNs into the committed
|
||||
fixture (subset of the corpus `_coordinates.json`), so the gate exercises
|
||||
geo-weighting without S3. Skipped when the corpus has no coordinates sidecar.
|
||||
Coordinates are OS OpenData (OGL) and add no identifiability beyond the UPRN
|
||||
already kept in the fixture."""
|
||||
source = SOURCE / "_coordinates.json"
|
||||
if not source.exists():
|
||||
return
|
||||
corpus_coords: dict[str, list[float]] = json.loads(source.read_text())
|
||||
fixture_coords = {
|
||||
uprn: corpus_coords[uprn]
|
||||
for uprn in kept_uprns
|
||||
if uprn in corpus_coords
|
||||
}
|
||||
(FIXTURE / "_coordinates.json").write_text(json.dumps(fixture_coords))
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
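The selection rule in `_qualifies` (at least one SAP 10.2 cert, and at least two distinct addresses after trimming and upper-casing) is easy to check on toy data. The sketch below restates it over bare payload dicts; `qualifies` is a simplified copy that drops the cert-number half of each pair, and the payloads are invented.

```python
from typing import Any

SAP_10_2 = "10.2"


def qualifies(payloads: list[dict[str, Any]]) -> bool:
    """Usable iff >=1 SAP 10.2 cert and >=2 distinct normalised addresses."""
    has_target = any(str(p.get("sap_version")) == SAP_10_2 for p in payloads)
    addresses = {str(p.get("address_line_1", "")).strip().upper() for p in payloads}
    return has_target and len(addresses) >= 2


cohort = [
    {"sap_version": "10.2", "address_line_1": "1 Example Street"},
    {"sap_version": "9.94", "address_line_1": "2 Example Street"},
]
same_address = [
    {"sap_version": "10.2", "address_line_1": "1 Example Street"},
    {"sap_version": "9.94", "address_line_1": " 1 example street "},
]
no_target = [
    {"sap_version": "9.94", "address_line_1": "1 Example Street"},
    {"sap_version": "9.94", "address_line_1": "2 Example Street"},
]
print(qualifies(cohort), qualifies(same_address), qualifies(no_target))
```

Note that the rule is evaluated over whatever list it is given: if the cohort is capped after the check, the capped slice is not guaranteed to still contain a SAP 10.2 target.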
101
scripts/fetch_corpus_coordinates.py
Normal file
|
|
@ -0,0 +1,101 @@
|
|||
"""One-time: resolve coordinates for every EPC Prediction corpus UPRN (#1227).
|
||||
|
||||
Reads the OS Open-UPRN parquet from S3 (DATA_BUCKET / spatial/) via boto3 and
|
||||
resolves each corpus cert's `uprn` to WGS84 lon/lat. UPRNs are grouped by their
|
||||
covering partition (the same UPRN-range bucketing `GeospatialS3Repository` uses),
|
||||
so each ~1.7 MB partition is read at most once, as in the batch lookup
|
||||
`GeospatialS3Repository.coordinates_for_uprns`. Caches `{uprn: [lon, lat]}` locally
|
||||
(gitignored) so the validation harness can score intra-postcode distances
|
||||
without S3.
|
||||
|
||||
USAGE
|
||||
-----
|
||||
set -a; . backend/.env; set +a
|
||||
PYTHONPATH=. python scripts/fetch_corpus_coordinates.py
|
||||
|
||||
Source corpus: $EPC_PREDICTION_CORPUS (default /tmp/epc_prediction_corpus).
|
||||
Output: <corpus>/_coordinates.json
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import io
|
||||
import json
|
||||
import os
|
||||
from collections import defaultdict
|
||||
from pathlib import Path
|
||||
from typing import Any
|
||||
|
||||
import boto3
|
||||
import pandas as pd
|
||||
|
||||
CORPUS = Path(os.environ.get("EPC_PREDICTION_CORPUS", "/tmp/epc_prediction_corpus"))
|
||||
# Sidecar inside the corpus dir, so `load_corpus` picks it up automatically.
|
||||
OUT = CORPUS / "_coordinates.json"
|
||||
_BUCKET = os.environ["DATA_BUCKET"]
|
||||
_META_KEY = "spatial/filename_meta.parquet"
|
||||
|
||||
|
||||
def _reader() -> Any:
|
||||
# boto3.client is overloaded per-service in the installed stubs; bind to Any.
|
||||
boto3_client: Any = boto3.client # pyright: ignore[reportUnknownMemberType, reportUnknownVariableType]
|
||||
s3: Any = boto3_client("s3")
|
||||
|
||||
def read_parquet(key: str) -> pd.DataFrame:
|
||||
response: dict[str, Any] = s3.get_object(Bucket=_BUCKET, Key=key)
|
||||
body: bytes = response["Body"].read()
|
||||
return pd.read_parquet(io.BytesIO(body))
|
||||
|
||||
return read_parquet
|
||||
|
||||
|
||||
def _corpus_uprns() -> set[int]:
|
||||
index: dict[str, list[str]] = json.loads((CORPUS / "_index.json").read_text())
|
||||
uprns: set[int] = set()
|
||||
for postcode, certs in index.items():
|
||||
for cert in certs:
|
||||
path = CORPUS / postcode / f"{cert}.json"
|
||||
if not path.exists():
|
||||
continue
|
||||
raw: dict[str, Any] = json.loads(path.read_text())
|
||||
uprn = raw.get("uprn")
|
||||
if uprn is not None:
|
||||
uprns.add(int(uprn))
|
||||
return uprns
|
||||
|
||||
|
||||
def main() -> None:
|
||||
read_parquet = _reader()
|
||||
uprns = _corpus_uprns()
|
||||
print(f"corpus UPRNs: {len(uprns)}")
|
||||
|
||||
meta = read_parquet(_META_KEY)
|
||||
# Group each UPRN by its covering partition (lower <= uprn <= upper), so each
|
||||
# partition file is read once for all the UPRNs it covers.
|
||||
by_partition: dict[str, list[int]] = defaultdict(list)
|
||||
uncovered = 0
|
||||
for uprn in uprns:
|
||||
covering = meta[(meta["lower"] <= uprn) & (meta["upper"] >= uprn)]
|
||||
if covering.empty:
|
||||
uncovered += 1
|
||||
continue
|
||||
by_partition[str(covering["filenames"].iloc[0])].append(uprn)
|
||||
print(f"distinct partitions to read: {len(by_partition)}; uncovered: {uncovered}")
|
||||
|
||||
coords: dict[str, list[float]] = {}
|
||||
for i, (filename, part_uprns) in enumerate(sorted(by_partition.items()), 1):
|
||||
partition = read_parquet(f"spatial/{filename}")
|
||||
rows = partition[partition["UPRN"].isin(part_uprns)]
|
||||
for _, row in rows.iterrows():
|
||||
coords[str(int(row["UPRN"]))] = [
|
||||
float(row["LONGITUDE"]),
|
||||
float(row["LATITUDE"]),
|
||||
]
|
||||
print(f" [{i}/{len(by_partition)}] {filename}: +{len(rows)}")
|
||||
|
||||
OUT.write_text(json.dumps(coords))
|
||||
print(f"resolved {len(coords)}/{len(uprns)} UPRNs -> {OUT}")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
162
scripts/fetch_epc_prediction_corpus.py
Normal file
|
|
@ -0,0 +1,162 @@
|
|||
"""Build the frozen postcode-clustered corpus for EPC Prediction validation
|
||||
(ADR-0029).
|
||||
|
||||
WHAT THIS IS FOR
|
||||
----------------
|
||||
EPC Prediction estimates an EPC-less Property's `EpcPropertyData` from its
|
||||
**Comparable Properties** — the other certs in its postcode. Validating that
|
||||
needs *geographic clusters* (many certs per postcode), not random certs, so the
|
||||
leave-one-out harness can drop one cert and predict it from its neighbours.
|
||||
|
||||
This script builds that corpus once, offline-reusable: it samples postcodes
|
||||
from the register (an unbiased spread over dates/regions), then for each
|
||||
postcode downloads **every** domestic cert's full schema payload — the exact
|
||||
shape `EpcPropertyDataMapper.from_api_response` consumes — grouped on disk by
|
||||
postcode. The validation harness then runs entirely against this cache: fast,
|
||||
deterministic, no rate limits.
|
||||
|
||||
Pair it with `validate_epc_prediction.py` (the leave-one-out accuracy harness).
|
||||
|
||||
HOW THE SAMPLE IS DRAWN
|
||||
-----------------------
|
||||
Postcodes are seeded by sampling random PAGES of `/api/domestic/search` across
|
||||
a past date window (the register orders by registration date, so random pages
|
||||
give an unbiased postcode spread). Each seed cert contributes its postcode; we
|
||||
take the first N distinct postcodes and pull each one's *entire* cohort via
|
||||
`search_by_postcode` -> per-cert `/api/certificate`.
|
||||
|
||||
USAGE
|
||||
-----
|
||||
PYTHONPATH=. python scripts/fetch_epc_prediction_corpus.py
|
||||
|
||||
Resumable — re-running skips certs already cached, so it is safe to interrupt.
|
||||
Token is read from `backend/.env` (`OPEN_EPC_API_TOKEN`). The register rejects
|
||||
a `date_end` that includes today, so keep the window in the past.
|
||||
|
||||
Cache dir defaults to `/tmp/epc_prediction_corpus`, overridable via the
|
||||
`EPC_PREDICTION_CORPUS` env var. Layout:
|
||||
<cache>/<POSTCODE_NOSPACE>/<cert_number>.json # raw API `data` payload
|
||||
<cache>/_index.json # {postcode: [cert, ...]}
|
||||
"""
|
||||
|
||||
import json
|
||||
import os
|
||||
import random
|
||||
import time
|
||||
from pathlib import Path
|
||||
|
||||
import httpx
|
||||
from dotenv import load_dotenv
|
||||
|
||||
load_dotenv("backend/.env")
|
||||
TOKEN = os.environ["OPEN_EPC_API_TOKEN"]
|
||||
BASE = "https://api.get-energy-performance-data.communities.gov.uk"
|
||||
H = {"Authorization": f"Bearer {TOKEN}", "Accept": "application/json"}
|
||||
CACHE = Path(os.environ.get("EPC_PREDICTION_CORPUS", "/tmp/epc_prediction_corpus"))
|
||||
CACHE.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
# Seed-postcode sampling. `date_end` must be strictly before today. TOTAL_PAGES
|
||||
# is the `totalPages` the search returns for this window at page_size=100 —
|
||||
# re-probe if you change the window (it only needs to be an upper bound for the
|
||||
# random page draw; out-of-range pages just return fewer rows).
|
||||
WINDOW = {"date_start": "2026-01-01", "date_end": "2026-05-31"}
|
||||
TOTAL_PAGES = 7402
|
||||
SEED_PAGES = 20 # random search pages → postcode seeds
|
||||
N_POSTCODES = 150 # distinct postcodes to pull full cohorts for
|
||||
random.seed(2026) # reproducible draw
|
||||
|
||||
|
||||
def _get(url: str, params: dict[str, object], timeout: float = 20.0, tries: int = 5):
    """GET with retry/backoff on 429 + 5xx (honours Retry-After).

    Returns the last response received, or None if every attempt raised.
    """
    r = None
    for i in range(tries):
        backoff = 1.5 * (i + 1)  # linear backoff: 1.5s, 3s, 4.5s, ...
        try:
            r = httpx.get(url, params=params, headers=H, timeout=timeout)
        except httpx.HTTPError:
            time.sleep(backoff)
            continue
        if r.status_code == 429 or r.status_code >= 500:
            ra = r.headers.get("Retry-After")
            try:
                # Retry-After may be an HTTP-date rather than a number of seconds.
                delay = float(ra) if ra else backoff
            except ValueError:
                delay = backoff
            time.sleep(delay)
            continue
        return r
    return r
|
||||
|
||||
|
||||
def _normalise_postcode(postcode: str) -> str:
|
||||
return postcode.replace(" ", "").upper()
|
||||
|
||||
|
||||
def sample_postcodes() -> list[str]:
    """Draw distinct postcodes from random search pages across the window."""
    pages = sorted(random.sample(range(1, TOTAL_PAGES + 1), SEED_PAGES))
    seen: dict[str, None] = {}  # dict used as an insertion-ordered set
    for p in pages:
        r = _get(
            f"{BASE}/api/domestic/search",
            {**WINDOW, "current_page": p, "page_size": 100},
        )
        if r is None or not r.is_success:
            print(f" seed page {p} -> {getattr(r, 'status_code', 'ERR')}")
            continue
        for row in r.json().get("data", []):
            pc = row.get("postcode")
            if pc:
                seen[_normalise_postcode(pc)] = None
        print(f" page {p}: cumulative {len(seen)} distinct postcodes")
        if len(seen) >= N_POSTCODES:
            break
    return list(seen)[:N_POSTCODES]
|
||||
|
||||
|
||||
def cohort_cert_numbers(postcode: str) -> list[str]:
|
||||
r = _get(f"{BASE}/api/domestic/search", {"postcode": postcode})
|
||||
if r is None or not r.is_success:
|
||||
return []
|
||||
return [
|
||||
row["certificateNumber"]
|
||||
for row in r.json().get("data", [])
|
||||
if row.get("certificateNumber")
|
||||
]
|
||||
|
||||
|
||||
def fetch_cert(postcode: str, cert: str) -> bool:
    """Fetch + cache one cert's raw `data` payload. Returns True on success
    (or already-cached)."""
    out = CACHE / postcode / f"{cert}.json"
    if out.exists():
        return True  # resumable: a cached cert is never re-downloaded
    r = _get(f"{BASE}/api/certificate", {"certificate_number": cert})
    if r is None or not r.is_success:
        return False
    try:
        payload = r.json()["data"]
    except (KeyError, ValueError):
        return False
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(payload))
    return True
|
||||
|
||||
|
||||
def main() -> None:
|
||||
print("sampling seed postcodes ...")
|
||||
postcodes = sample_postcodes()
|
||||
print(f"pulling full cohorts for {len(postcodes)} postcodes into {CACHE} ...")
|
||||
index: dict[str, list[str]] = {}
|
||||
t0 = time.time()
|
||||
total_certs = 0
|
||||
for i, pc in enumerate(postcodes, 1):
|
||||
certs = cohort_cert_numbers(pc)
|
||||
fetched = [c for c in certs if fetch_cert(pc, c)]
|
||||
index[pc] = fetched
|
||||
total_certs += len(fetched)
|
||||
print(f" [{i}/{len(postcodes)}] {pc}: {len(fetched)}/{len(certs)} certs")
|
||||
(CACHE / "_index.json").write_text(json.dumps(index, indent=2))
|
||||
print(
|
||||
f"DONE in {time.time() - t0:.0f}s: {len(postcodes)} postcodes, "
|
||||
f"{total_certs} certs cached under {CACHE}"
|
||||
)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
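The cache layout described in the docstring above (`<cache>/<POSTCODE_NOSPACE>/<cert_number>.json` plus `_index.json`) can be read back with the standard library alone. The following is a minimal sketch under stated assumptions, not the repo's `load_corpus`: `read_cached_cohorts` is a hypothetical helper, and it returns the raw payload dicts without mapping them to `EpcPropertyData`.

```python
import json
from pathlib import Path
from typing import Any


def read_cached_cohorts(cache: Path) -> dict[str, list[dict[str, Any]]]:
    """Hypothetical reader: map each indexed postcode to its cached payloads."""
    index_path = cache / "_index.json"
    if not index_path.exists():
        return {}
    # _index.json has the shape {postcode: [cert, ...]}.
    index: dict[str, list[str]] = json.loads(index_path.read_text())
    cohorts: dict[str, list[dict[str, Any]]] = {}
    for postcode, certs in index.items():
        payloads: list[dict[str, Any]] = []
        for cert in certs:
            path = cache / postcode / f"{cert}.json"
            if path.exists():  # tolerate an index entry whose file is missing
                payloads.append(json.loads(path.read_text()))
        cohorts[postcode] = payloads
    return cohorts
```

Grouping by postcode directory keeps each cohort together on disk, which is what lets a leave-one-out harness drop one cert and predict it from the rest of the same directory.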
197
scripts/fetch_epc_prediction_dense_corpus.py
Normal file
|
|
@ -0,0 +1,197 @@
|
|||
"""Build a *geographically dense* postcode-clustered corpus for EPC Prediction
|
||||
(cross-postcode geo expansion — follow-up to ADR-0029 / issue #1227, #1237).
|
||||
|
||||
WHY A SECOND CORPUS
|
||||
-------------------
|
||||
`fetch_epc_prediction_corpus.py` samples *scattered* national postcodes — fine
|
||||
for intra-postcode validation, but a held-out target's true geo-neighbours (the
|
||||
adjacent postcodes on its street) are NOT in that corpus, so the cross-postcode
|
||||
geo lever (distance-weighting a cohort that spans postcode boundaries) and
|
||||
built-form-aware sizing (#1237) cannot be measured on it.
|
||||
|
||||
This builds dense clusters instead: each of K reproducible seed postcodes is
|
||||
expanded — via postcodes.io's nearest-postcode endpoint — into EVERY unit
|
||||
postcode within `RADIUS_M`, and each of those gets its full EPC cohort pulled.
|
||||
The result is a handful of dense neighbourhoods (a target's real neighbours ARE
|
||||
in-corpus) spread across the country (the seeds are nationally sampled, so the
|
||||
validation set stays diverse).
|
||||
|
||||
postcodes.io is a CORPUS-BUILD dependency only (a free, public, OGL postcode
|
||||
service) — the predictor stays pure. The gov EPC API has no area/prefix search
|
||||
(a partial postcode 400s; only a full unit is accepted), which is why the
|
||||
neighbour enumeration is external.
|
||||
|
||||
USAGE
|
||||
-----
|
||||
PYTHONPATH=. python scripts/fetch_epc_prediction_dense_corpus.py # full
|
||||
PYTHONPATH=. python scripts/fetch_epc_prediction_dense_corpus.py --pilot # 2 seeds
|
||||
|
||||
Resumable — re-running skips cached certs. Token from `backend/.env`. Cache dir
|
||||
defaults to `/tmp/epc_prediction_dense_corpus` (separate from the scattered one),
|
||||
overridable via `EPC_PREDICTION_DENSE_CORPUS`. Layout matches the other corpus
|
||||
(`<POSTCODE_NOSPACE>/<cert>.json` + `_index.json`), so `load_corpus` and the
|
||||
coordinate resolver consume it unchanged.
|
||||
"""
|
||||
|
||||
import json
|
||||
import os
|
||||
import random
|
||||
import sys
|
||||
import time
|
||||
from pathlib import Path
|
||||
from typing import Any, Optional
|
||||
|
||||
import httpx
|
||||
from dotenv import load_dotenv
|
||||
|
||||
load_dotenv("backend/.env")
|
||||
TOKEN = os.environ["OPEN_EPC_API_TOKEN"]
|
||||
BASE = "https://api.get-energy-performance-data.communities.gov.uk"
|
||||
H = {"Authorization": f"Bearer {TOKEN}", "Accept": "application/json"}
|
||||
POSTCODES_IO = "https://api.postcodes.io"
|
||||
CACHE = Path(
|
||||
os.environ.get("EPC_PREDICTION_DENSE_CORPUS", "/tmp/epc_prediction_dense_corpus")
|
||||
)
|
||||
CACHE.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
# Seed sampling mirrors the scattered fetch (random search pages → an unbiased
|
||||
# national postcode spread), then each seed is densified. `date_end` must be
|
||||
# strictly before today.
|
||||
WINDOW = {"date_start": "2026-01-01", "date_end": "2026-05-31"}
|
||||
TOTAL_PAGES = 7402
|
||||
SEED_PAGES = 8 # random search pages → seed postcodes
|
||||
N_SEEDS = 25 # dense neighbourhood clusters to build
|
||||
RADIUS_M = 300 # postcodes.io nearest-postcode radius around each seed
|
||||
MAX_PER_SEED = 60 # cap unit postcodes per seed (dense urban seeds can be huge)
|
||||
random.seed(2026) # reproducible draw
|
||||
|
||||
|
||||
def _get(url: str, params: dict[str, Any], headers: Optional[dict[str, str]] = None,
         timeout: float = 20.0, tries: int = 5) -> Optional[httpx.Response]:
    """GET with retry/backoff on 429 + 5xx (honours Retry-After).

    Returns the last response received, or None if every attempt raised.
    """
    r = None
    for i in range(tries):
        backoff = 1.5 * (i + 1)  # linear backoff: 1.5s, 3s, 4.5s, ...
        try:
            r = httpx.get(url, params=params, headers=headers or {}, timeout=timeout)
        except httpx.HTTPError:
            time.sleep(backoff)
            continue
        if r.status_code == 429 or r.status_code >= 500:
            ra = r.headers.get("Retry-After")
            try:
                # Retry-After may be an HTTP-date rather than a number of seconds.
                delay = float(ra) if ra else backoff
            except ValueError:
                delay = backoff
            time.sleep(delay)
            continue
        return r
    return r
|
||||
|
||||
|
||||
def _normalise_postcode(postcode: str) -> str:
|
||||
return postcode.replace(" ", "").upper()
|
||||
|
||||
|
||||
def sample_seed_postcodes(n_seeds: int) -> list[str]:
|
||||
"""Draw distinct seed postcodes from random search pages across the window."""
|
||||
pages = sorted(random.sample(range(1, TOTAL_PAGES + 1), SEED_PAGES))
|
||||
seen: dict[str, None] = {}
|
||||
for p in pages:
|
||||
r = _get(
|
||||
f"{BASE}/api/domestic/search",
|
||||
{**WINDOW, "current_page": p, "page_size": 100},
|
||||
headers=H,
|
||||
)
|
||||
if r is None or not r.is_success:
|
||||
print(f" seed page {p} -> {getattr(r, 'status_code', 'ERR')}")
|
||||
continue
|
||||
for row in r.json().get("data", []):
|
||||
pc = row.get("postcode")
|
||||
if pc:
|
||||
seen[pc] = None
|
||||
if len(seen) >= n_seeds:
|
||||
break
|
||||
return list(seen)[:n_seeds]
|
||||
|
||||
|
||||
def nearby_postcodes(seed: str) -> list[str]:
    """Unit postcodes within `RADIUS_M` of `seed`, via postcodes.io's
    nearest-postcode endpoint (queried at the seed's own coordinates). Returns the
    seed itself plus its neighbours, deduped and capped at `MAX_PER_SEED`. Falls
    back to `[seed]` if either lookup fails."""
    s = _get(f"{POSTCODES_IO}/postcodes/{seed.replace(' ', '%20')}", {})
    if s is None or not s.is_success:
        return [seed]
    res: dict[str, Any] = s.json().get("result") or {}
    lat: Any = res.get("latitude")
    lon: Any = res.get("longitude")
    if lat is None or lon is None:
        return [seed]
    r = _get(
        f"{POSTCODES_IO}/postcodes",
        {"lon": lon, "lat": lat, "radius": RADIUS_M, "limit": 100},
    )
    if r is None or not r.is_success:
        return [seed]
    items: list[dict[str, Any]] = r.json().get("result") or []
    found: list[str] = [str(x["postcode"]) for x in items if x.get("postcode")]
    # Compare space-insensitively so the seed is not repeated when the two
    # services format the same postcode differently.
    seed_key = _normalise_postcode(seed)
    ordered = [seed] + [p for p in found if _normalise_postcode(p) != seed_key]
    return ordered[:MAX_PER_SEED]
|
||||
|
||||
|
||||
def cohort_cert_numbers(postcode: str) -> list[str]:
|
||||
r = _get(f"{BASE}/api/domestic/search", {"postcode": postcode}, headers=H)
|
||||
if r is None or not r.is_success:
|
||||
return []
|
||||
return [
|
||||
row["certificateNumber"]
|
||||
for row in r.json().get("data", [])
|
||||
if row.get("certificateNumber")
|
||||
]
|
||||
|
||||
|
||||
def fetch_cert(postcode_nospace: str, cert: str) -> bool:
|
||||
"""Fetch + cache one cert's raw `data` payload (True on success / cached)."""
|
||||
out = CACHE / postcode_nospace / f"{cert}.json"
|
||||
if out.exists():
|
||||
return True
|
||||
r = _get(f"{BASE}/api/certificate", {"certificate_number": cert}, headers=H)
|
||||
if r is None or not r.is_success:
|
||||
return False
|
||||
try:
|
||||
payload = r.json()["data"]
|
||||
except (KeyError, ValueError):
|
||||
return False
|
||||
out.parent.mkdir(parents=True, exist_ok=True)
|
||||
out.write_text(json.dumps(payload))
|
||||
return True
|
||||
|
||||
|
||||
def main() -> None:
|
||||
pilot = "--pilot" in sys.argv
|
||||
n_seeds = 2 if pilot else N_SEEDS
|
||||
print(f"sampling {n_seeds} seed postcodes ...")
|
||||
seeds = sample_seed_postcodes(n_seeds)
|
||||
print(f"seeds: {seeds}")
|
||||
|
||||
index: dict[str, list[str]] = {}
|
||||
t0 = time.time()
|
||||
total_certs = 0
|
||||
for si, seed in enumerate(seeds, 1):
|
||||
neighbourhood = nearby_postcodes(seed)
|
||||
print(f"\n[seed {si}/{len(seeds)}] {seed}: {len(neighbourhood)} postcodes "
|
||||
f"within {RADIUS_M}m")
|
||||
for pc in neighbourhood:
|
||||
nospace = _normalise_postcode(pc)
|
||||
if nospace in index:
|
||||
continue # neighbourhoods can overlap; fetch each postcode once
|
||||
certs = cohort_cert_numbers(pc)
|
||||
fetched = [c for c in certs if fetch_cert(nospace, c)]
|
||||
if fetched:
|
||||
index[nospace] = fetched
|
||||
total_certs += len(fetched)
|
||||
print(f" cumulative: {len(index)} postcodes, {total_certs} certs")
|
||||
(CACHE / "_index.json").write_text(json.dumps(index, indent=2))
|
||||
print(
|
||||
f"\nDONE in {time.time() - t0:.0f}s: {len(seeds)} seeds, "
|
||||
f"{len(index)} postcodes, {total_certs} certs under {CACHE}"
|
||||
)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
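The script delegates the radius search to postcodes.io, so it never computes a distance itself. For intuition about what "within `RADIUS_M`" means, the sketch below filters points by great-circle distance locally. It is illustrative only: `haversine_m` and `within_radius` are hypothetical helpers, the coordinates are made up, and a spherical Earth is assumed.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius (spherical approximation)


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def within_radius(
    seed: tuple[float, float],
    candidates: dict[str, tuple[float, float]],
    radius_m: float,
) -> list[str]:
    """Names of candidates no further than radius_m from seed, nearest first."""
    lat0, lon0 = seed
    scored = [
        (haversine_m(lat0, lon0, lat, lon), name)
        for name, (lat, lon) in candidates.items()
    ]
    return [name for dist, name in sorted(scored) if dist <= radius_m]


# Made-up coordinates, for illustration only.
seed = (53.8000, -1.5500)
candidates = {
    "NEAR": (53.8010, -1.5500),  # 0.001 deg of latitude, roughly 111 m north
    "FAR": (53.8100, -1.5500),   # 0.01 deg of latitude, roughly 1.1 km north
}
print(within_radius(seed, candidates, 300))
```

The same distance could also serve as the input to a distance weight for a cohort that spans postcode boundaries, though the weighting function used by the predictor is not shown in this diff.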
177
scripts/validate_epc_prediction.py
Normal file
|
|
@ -0,0 +1,177 @@
|
|||
"""Leave-one-out accuracy harness for EPC Prediction (ADR-0029).
|
||||
|
||||
Runs entirely against the frozen postcode-clustered corpus
|
||||
(`fetch_epc_prediction_corpus.py`). For every cert that has neighbours, it
|
||||
drops that cert from its postcode cohort, predicts it from the rest using only
|
||||
its *guaranteed* inputs (property type + built form), and compares the predicted
|
||||
`EpcPropertyData` to the actual one.
|
||||
|
||||
Reports the ADR-0029 metrics:
|
||||
- classification rate: main wall construction (extend as coverage grows);
|
||||
- geometry residuals: floor area, window count + total window area, building
|
||||
parts (mean signed + mean absolute);
|
||||
- SAP reported three ways — predicted-then-calculated vs (a) the actual lodged
|
||||
SAP, (b) the calculator on the actual components, (c) the neighbour-mean SAP
|
||||
baseline (the number predict-then-calculate must beat).
|
||||
|
||||
USAGE
|
||||
-----
|
||||
PYTHONPATH=. python scripts/validate_epc_prediction.py
|
||||
|
||||
Corpus dir: $EPC_PREDICTION_CORPUS (default /tmp/epc_prediction_corpus).
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import os
|
||||
import statistics
|
||||
from pathlib import Path
|
||||
from typing import Optional
|
||||
|
||||
from datatypes.epc.domain.epc_property_data import EpcPropertyData
|
||||
from domain.epc_prediction.comparable_properties import ComparableProperty
|
||||
from domain.epc_prediction.validation import (
|
||||
evaluate_component_accuracy,
|
||||
iter_predictions,
|
||||
)
|
||||
from domain.sap10_calculator.calculator import Sap10Calculator, SapResult
|
||||
from harness.epc_prediction_corpus import load_corpus
|
||||
|
||||
_KG_PER_TONNE: float = 1000.0
|
||||
|
||||
CORPUS = Path(os.environ.get("EPC_PREDICTION_CORPUS", "/tmp/epc_prediction_corpus"))
|
||||
|
||||
|
||||
def _result(
|
||||
calculator: Sap10Calculator, epc: EpcPropertyData
|
||||
) -> Optional[SapResult]:
|
||||
try:
|
||||
return calculator.calculate(epc)
|
||||
except Exception: # noqa: BLE001 — some pictures don't score; count as misses
|
||||
return None
|
||||
|
||||
|
||||
def _co2_tonnes(result: SapResult) -> float:
|
||||
"""Calculated annual CO2 in tonnes, matching the lodged `co2_emissions_current`
|
||||
scale (see domain/property_baseline/performance.py)."""
|
||||
return result.co2_kg_per_yr / _KG_PER_TONNE
|
||||
|
||||
|
||||
def main() -> None:
|
||||
cohorts = load_corpus(CORPUS)
|
||||
calculator = Sap10Calculator()
|
||||
|
||||
# PRIMARY signal — Component Accuracy, calculator-free (the shared scorer).
|
||||
accuracy = evaluate_component_accuracy(cohorts)
|
||||
|
||||
print(f"corpus: {CORPUS}")
|
||||
print(f"predicted {accuracy.targets} SAP-10.2 held-out targets\n")
|
||||
print("--- Component Accuracy (PRIMARY, calculator-independent) ---")
|
||||
for name, (hits, total) in accuracy.classification.items():
|
||||
if total:
|
||||
print(f"CLASSIFICATION {name}: {hits}/{total} = {hits / total:.1%}")
|
||||
print()
|
||||
_floor_area_error(cohorts)
|
||||
_residual("floor_area (m2)", accuracy.residuals.get("floor_area", []))
|
||||
_residual("window_count", accuracy.residuals.get("window_count", []))
|
||||
_residual(
|
||||
"total_window_area (m2)", accuracy.residuals.get("total_window_area", [])
|
||||
)
|
||||
_residual("building_parts", accuracy.residuals.get("building_parts", []))
|
||||
_residual("door_count", accuracy.residuals.get("door_count", []))
|
||||
|
||||
# SECONDARY guard — end-to-end vs API-lodged, calculator-FLOORED. Re-walks the
|
||||
# same held-out targets (one orchestration via iter_predictions).
|
||||
sap_vs_lodged: list[float] = []
|
||||
co2_vs_lodged: list[float] = []
|
||||
pei_vs_lodged: list[float] = []
|
||||
    # Calculator floors, per metric: calc(actual) vs lodged. Each is the error
    # the end-to-end path cannot beat (the API-path mapper/calculator residual, a
    # separate workstream), so it shows how much of a metric's pred-vs-lodged
    # gap comes from the calculator rather than the prediction. PEI carries a far
    # larger floor than SAP (~16 kWh/m2 vs ~1.6 SAP points), so the headline PEI
    # MAE must not be read as pure prediction error (issue #1228).
|
||||
sap_floor: list[float] = []
|
||||
co2_floor: list[float] = []
|
||||
pei_floor: list[float] = []
|
||||
for predicted, actual in iter_predictions(cohorts):
|
||||
pred_result = _result(calculator, predicted)
|
||||
actual_result = _result(calculator, actual)
|
||||
lodged_sap = actual.energy_rating_current
|
||||
lodged_co2 = actual.co2_emissions_current
|
||||
lodged_pei = actual.energy_consumption_current
|
||||
if pred_result is not None:
|
||||
if lodged_sap is not None:
|
||||
sap_vs_lodged.append(
|
||||
abs(pred_result.sap_score_continuous - lodged_sap)
|
||||
)
|
||||
if lodged_co2 is not None:
|
||||
co2_vs_lodged.append(abs(_co2_tonnes(pred_result) - lodged_co2))
|
||||
if lodged_pei is not None:
|
||||
pei_vs_lodged.append(
|
||||
abs(pred_result.primary_energy_kwh_per_m2 - lodged_pei)
|
||||
)
|
||||
if actual_result is not None:
|
||||
if lodged_sap is not None:
|
||||
sap_floor.append(
|
||||
abs(actual_result.sap_score_continuous - lodged_sap)
|
||||
)
|
||||
if lodged_co2 is not None:
|
||||
co2_floor.append(abs(_co2_tonnes(actual_result) - lodged_co2))
|
||||
if lodged_pei is not None:
|
||||
pei_floor.append(
|
||||
abs(actual_result.primary_energy_kwh_per_m2 - lodged_pei)
|
||||
)
|
||||
|
||||
print()
|
||||
print("--- End-to-end vs API-lodged (SECONDARY, calculator-FLOORED) ---")
|
||||
_sap_line("SAP |pred − lodged|", sap_vs_lodged)
|
||||
_sap_line("CO2 (t) |pred − lodged|", co2_vs_lodged)
|
||||
_sap_line("PEI (kWh/m2) |pred − lodged|", pei_vs_lodged)
|
||||
_sap_line(" floor: SAP |calc(actual) − lodged|", sap_floor)
|
||||
_sap_line(" floor: CO2 |calc(actual) − lodged|", co2_floor)
|
||||
_sap_line(" floor: PEI |calc(actual) − lodged|", pei_floor)
|
||||
|
||||
|
||||
def _floor_area_error(cohorts: list[list[ComparableProperty]]) -> None:
    """Floor-area accuracy as MAE (m²) and MAPE (% of the actual), plus the
    typical (median actual) size, so the absolute error can be read relative to
    how big dwellings are. The predicted area is the cohort median, set
    independently of the geo/similarity weighting that drives the categoricals."""
    pairs = [
        (predicted.total_floor_area_m2, actual.total_floor_area_m2)
        for predicted, actual in iter_predictions(cohorts)
    ]
    # Skip targets with a missing or zero actual area (MAPE would divide by
    # zero) and targets with no predicted area.
    valid = [(p, a) for p, a in pairs if a and p is not None]
    if not valid:
        print("RESIDUAL floor_area: (none)")
        return
    mae = statistics.mean(abs(p - a) for p, a in valid)
    mape = statistics.mean(abs(p - a) / a for p, a in valid)
    typical = statistics.median(a for _, a in valid)
    print(
        f"RESIDUAL floor_area: MAE {mae:.2f} m2 | MAPE {mape:.1%} | "
        f"typical (median actual) {typical:.0f} m2 (n={len(valid)})"
    )
|
||||
|
||||
|
||||
def _residual(label: str, values: list[float]) -> None:
|
||||
if not values:
|
||||
print(f"RESIDUAL {label}: (none)")
|
||||
return
|
||||
mean_signed = statistics.mean(values)
|
||||
mean_abs = statistics.mean(abs(v) for v in values)
|
||||
print(f"RESIDUAL {label}: mean {mean_signed:+.2f} | mean|·| {mean_abs:.2f} "
|
||||
f"(n={len(values)})")
|
||||
|
||||
|
||||
def _sap_line(label: str, values: list[float]) -> None:
|
||||
if not values:
|
||||
print(f"{label}: (none)")
|
||||
return
|
||||
print(f"{label}: MAE {statistics.mean(values):.2f} | "
|
||||
f"median {statistics.median(values):.2f} (n={len(values)})")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
||||
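The harness reports three different summaries of the same residuals: a signed mean (bias), a mean absolute error, and for floor area a MAPE. A small worked example shows why all three are needed. This is a sketch with made-up numbers; `residual_summary` is a hypothetical helper, and the sign convention (predicted minus actual) is an assumption, since the shared scorer is not shown here.

```python
import statistics


def residual_summary(predicted: list[float], actual: list[float]) -> dict[str, float]:
    """Signed mean, MAE and MAPE over paired predictions (hypothetical helper)."""
    # Skip zero/missing actuals so the MAPE never divides by zero.
    pairs = [(p, a) for p, a in zip(predicted, actual) if a]
    # Assumed sign convention: residual = predicted - actual.
    residuals = [p - a for p, a in pairs]
    return {
        "mean_signed": statistics.mean(residuals),
        "mae": statistics.mean(abs(r) for r in residuals),
        "mape": statistics.mean(abs(p - a) / a for p, a in pairs),
    }


# Three made-up floor areas in m2: one under-prediction, one exact, one over.
summary = residual_summary([80.0, 100.0, 60.0], [90.0, 100.0, 50.0])
print(summary)
```

Here the signed mean is zero because the +10 and -10 errors cancel, while the MAE is still about 6.67 m2. The MAPE weights the error on the 50 m2 dwelling (20%) more heavily than the same 10 m2 error on the 90 m2 dwelling (about 11%), which is why the harness prints it next to the typical dwelling size.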
0
tests/domain/epc_prediction/__init__.py
Normal file
173
tests/domain/epc_prediction/test_comparable_properties.py
Normal file
|
|
@ -0,0 +1,173 @@
|
"""Behaviour of Comparable Properties selection (ADR-0029): given a prediction
|
||||
target's known inputs and the raw postcode cohort, choose + weight the
|
||||
comparables EPC Prediction will synthesise from. Filter-then-relax ladder:
|
||||
hard filters on identity (property type, built form) + known overrides while
|
||||
enough remain, weighted by recency × similarity. Pure domain logic.
|
||||
"""
|
||||
|
||||
from datetime import date
|
||||
from typing import Optional, Union
|
||||
|
||||
from datatypes.epc.domain.epc_property_data import EpcPropertyData, SapBuildingPart
|
||||
from domain.epc_prediction.comparable_properties import (
|
||||
ComparableProperty,
|
||||
ComparableProperties,
|
||||
select_comparables,
|
||||
)
|
||||
from domain.epc_prediction.prediction_target import PredictionTarget
|
||||
|
||||
|
||||
def _comparable(
|
||||
*,
|
||||
property_type: str,
|
||||
certificate_number: str,
|
||||
built_form: str = "1",
|
||||
wall_construction: Optional[Union[int, str]] = None,
|
||||
address: Optional[str] = None,
|
||||
registration_date: Optional[date] = None,
|
||||
) -> ComparableProperty:
|
||||
"""A ComparableProperty carrying only the fields under test (opaque EpcPropertyData
|
||||
with property_type / built_form / main wall set — the partial-instance idiom)."""
|
||||
epc: EpcPropertyData = object.__new__(EpcPropertyData)
|
||||
epc.property_type = property_type
|
||||
epc.built_form = built_form
|
||||
main: SapBuildingPart = object.__new__(SapBuildingPart)
|
||||
if wall_construction is not None:
|
||||
main.wall_construction = wall_construction
|
||||
epc.sap_building_parts = [main]
|
||||
return ComparableProperty(
|
||||
epc=epc,
|
||||
certificate_number=certificate_number,
|
||||
address=address,
|
||||
registration_date=registration_date,
|
||||
)
|
||||
|
||||
|
||||
def test_selects_only_candidates_of_the_same_property_type() -> None:
|
||||
# Arrange — a target house (property_type "2"); cohort of 2 houses + 1 flat.
|
||||
target = PredictionTarget(postcode="LS6 1AA", property_type="2")
|
||||
candidates = [
|
||||
_comparable(property_type="2", certificate_number="A"),
|
||||
_comparable(property_type="2", certificate_number="B"),
|
||||
_comparable(property_type="1", certificate_number="C"),
|
||||
]
|
||||
|
||||
# Act
|
||||
result: ComparableProperties = select_comparables(target, candidates)
|
||||
|
||||
# Assert — the flat is excluded; the two houses remain.
|
||||
assert {c.certificate_number for c in result.members} == {"A", "B"}
|
||||
|
||||
|
||||
def test_dedupes_re_lodgements_to_the_latest_cert_per_address() -> None:
|
||||
# Arrange — a register cohort with one address (FLAT 3) lodged three times.
|
||||
# Comparables are one-per-real-neighbour, so a re-lodged address must not
|
||||
# count three times towards the mode; the latest cert is its current state.
|
||||
target = PredictionTarget(postcode="LS6 1AA", property_type="2")
|
||||
candidates = [
|
||||
_comparable(
|
||||
property_type="2",
|
||||
certificate_number="OLD",
|
||||
address="FLAT 3",
|
||||
registration_date=date(2020, 4, 6),
|
||||
),
|
||||
_comparable(
|
||||
property_type="2",
|
||||
certificate_number="MID",
|
||||
address="FLAT 3",
|
||||
registration_date=date(2021, 2, 1),
|
||||
),
|
||||
_comparable(
|
||||
property_type="2",
|
||||
certificate_number="NEW",
|
||||
address="FLAT 3",
|
||||
registration_date=date(2025, 1, 20),
|
||||
),
|
||||
_comparable(
|
||||
property_type="2",
|
||||
certificate_number="OTHER",
|
||||
address="FLAT 5",
|
||||
registration_date=date(2024, 9, 27),
|
||||
),
|
||||
]
|
||||
|
||||
# Act
|
||||
result: ComparableProperties = select_comparables(target, candidates)
|
||||
|
||||
# Assert — FLAT 3 collapses to its latest cert; FLAT 5 is untouched.
|
||||
assert {c.certificate_number for c in result.members} == {"NEW", "OTHER"}
|
||||
|
||||
|
||||
def test_filters_to_the_known_built_form_when_enough_remain() -> None:
|
||||
# Arrange — a mid-terrace target (built_form "4"); cohort of 5 mid-terraces
|
||||
# + 2 detached, all houses. The built form is known and leaves ≥ k, so it is
|
||||
# applied as a hard filter.
|
||||
target = PredictionTarget(
|
||||
postcode="LS6 1AA", property_type="2", built_form="4"
|
||||
)
|
||||
candidates = [
|
||||
_comparable(property_type="2", built_form="4", certificate_number=f"T{i}")
|
||||
for i in range(5)
|
||||
] + [
|
||||
_comparable(property_type="2", built_form="1", certificate_number=f"D{i}")
|
||||
for i in range(2)
|
||||
]
|
||||
|
||||
# Act
|
||||
result: ComparableProperties = select_comparables(
|
||||
target, candidates, minimum_cohort=5
|
||||
)
|
||||
|
||||
# Assert — only the five mid-terraces survive.
|
||||
assert {c.certificate_number for c in result.members} == {
|
||||
"T0", "T1", "T2", "T3", "T4"
|
||||
}
|
||||
|
||||
|
||||
def test_known_wall_override_emphasises_matching_comparables() -> None:
|
||||
# Arrange — a mixed street: 5 solid-brick (code 2) + 3 cavity (code 1) houses.
|
||||
# We KNOW the target is solid brick (a Landlord Override), and the filter
|
||||
# leaves ≥ k, so cavity neighbours are dropped (the border-property case).
|
||||
target = PredictionTarget(
|
||||
postcode="LS6 1AA", property_type="2", wall_construction=2
|
||||
)
|
||||
candidates = [
|
||||
_comparable(property_type="2", wall_construction=2, certificate_number=f"S{i}")
|
||||
for i in range(5)
|
||||
] + [
|
||||
_comparable(property_type="2", wall_construction=1, certificate_number=f"C{i}")
|
||||
for i in range(3)
|
||||
]
|
||||
|
||||
# Act
|
||||
result: ComparableProperties = select_comparables(
|
||||
target, candidates, minimum_cohort=5
|
||||
)
|
||||
|
||||
# Assert — only the solid-brick comparables remain.
|
||||
assert {c.certificate_number for c in result.members} == {
|
||||
"S0", "S1", "S2", "S3", "S4"
|
||||
}
|
||||
|
||||
|
||||
def test_known_wall_override_relaxes_when_too_few_match() -> None:
|
||||
# Arrange — only 2 solid-brick but 6 cavity houses; the override would leave
|
||||
# 2 (< k=5), so it relaxes to keep the full type cohort (graceful degradation).
|
||||
target = PredictionTarget(
|
||||
postcode="LS6 1AA", property_type="2", wall_construction=2
|
||||
)
|
||||
candidates = [
|
||||
_comparable(property_type="2", wall_construction=2, certificate_number=f"S{i}")
|
||||
for i in range(2)
|
||||
] + [
|
||||
_comparable(property_type="2", wall_construction=1, certificate_number=f"C{i}")
|
||||
for i in range(6)
|
||||
]
|
||||
|
||||
# Act
|
||||
result: ComparableProperties = select_comparables(
|
||||
target, candidates, minimum_cohort=5
|
||||
)
|
||||
|
||||
# Assert — relaxed: all eight houses retained.
|
||||
assert len(result.members) == 8
|
||||
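The behaviours pinned by the tests above (a hard property-type filter, dedupe to the latest cert per address, and known fields that filter only while at least *k* comparables remain) can be sketched in a few lines. This is not the repo's `select_comparables`: `Candidate`, `latest_per_address`, and `select` are hypothetical stand-ins, the default `k=5` is an assumption, and weighting is left out entirely.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Candidate:  # hypothetical stand-in for ComparableProperty
    cert: str
    property_type: str
    built_form: str = "1"
    wall: Optional[int] = None
    address: Optional[str] = None
    registered: Optional[date] = None


def latest_per_address(cands: list[Candidate]) -> list[Candidate]:
    """Keep one cert per address (the most recent); address-less certs pass through."""
    latest: dict[str, Candidate] = {}
    passthrough: list[Candidate] = []
    for c in cands:
        if c.address is None:
            passthrough.append(c)
            continue
        held = latest.get(c.address)
        if held is None or (c.registered or date.min) > (held.registered or date.min):
            latest[c.address] = c
    return passthrough + list(latest.values())


def select(
    cands: list[Candidate],
    *,
    property_type: str,
    built_form: Optional[str] = None,
    wall: Optional[int] = None,
    k: int = 5,
) -> list[Candidate]:
    """Filter-then-relax: each known field filters only while at least k remain."""
    pool = latest_per_address([c for c in cands if c.property_type == property_type])
    for known, attr in ((built_form, "built_form"), (wall, "wall")):
        if known is None:
            continue  # unknown on the target, so nothing to filter on
        narrowed = [c for c in pool if getattr(c, attr) == known]
        if len(narrowed) >= k:
            pool = narrowed  # enough survivors: keep the hard filter
        # otherwise relax: leave the pool as it was
    return pool


solid = [Candidate(f"S{i}", "2", wall=2) for i in range(2)]
cavity = [Candidate(f"C{i}", "2", wall=1) for i in range(6)]
print(len(select(solid + cavity, property_type="2", wall=2)))
```

With only two solid-brick matches the wall filter would leave fewer than five comparables, so it is relaxed and all eight houses are kept, mirroring `test_known_wall_override_relaxes_when_too_few_match`.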
118
tests/domain/epc_prediction/test_component_accuracy_gate.py
Normal file
|
|
@ -0,0 +1,118 @@
|
|||
"""Tier-1 ratcheting Component Accuracy gate (ADR-0030).
|
||||
|
||||
Runs the calculator-free leave-one-out scorer over the committed, anonymised
|
||||
fixture and asserts each per-component classification rate / geometry residual is
|
||||
no worse than a committed baseline. Because the prediction is deterministic and
|
||||
the fixture is frozen, every run reproduces the same numbers exactly — so a
|
||||
failure means a real *regression* in prediction quality, never sample noise.
|
||||
|
||||
The floors / ceilings are the currently-measured values and only ever **tighten**
|
||||
(the repo's no-tolerance-widening ethos applied to an aggregate): when prediction
|
||||
improves, ratchet the relevant floor up in the same change. The end-to-end
|
||||
SAP / carbon / PE guards are deliberately *not* here — they need the calculator,
|
||||
whose API-path residual is a separate workstream; the component floors are the
|
||||
real gate (ADR-0030).
|
||||
"""
|
||||
|
||||
from pathlib import Path
|
||||
|
||||
import pytest
|
||||
|
||||
from domain.epc_prediction.validation import (
|
||||
ComponentAccuracy,
|
||||
evaluate_component_accuracy,
|
||||
)
|
||||
from harness.epc_prediction_corpus import load_corpus
|
||||
|
||||
_FIXTURE = Path(__file__).parents[3] / "tests" / "fixtures" / "epc_prediction"
|
||||
|
||||
# Minimum classification hit-rate per component (ratchet floors). Tighten — never
|
||||
# loosen — as prediction improves. Values are the measured rates over the frozen
|
||||
# 36-target fixture; a 1e-3 tolerance absorbs float rounding only.
|
||||
_RATE_FLOORS: dict[str, float] = {
|
||||
"wall_construction": 0.8889,
|
||||
"wall_insulation_type": 0.8333,
|
||||
"construction_age_band": 0.6389,
|
||||
"construction_age_band_pm1": 0.8333,
|
||||
"roof_construction": 0.7222,
|
||||
"floor_construction": 0.8125,
|
||||
"heating_main_fuel": 0.9722,
|
||||
"heating_main_category": 0.9444,
|
||||
"heating_main_control": 0.8056,
|
||||
"water_heating_fuel": 0.9722,
|
||||
"water_heating_code": 0.9444,
|
||||
"has_hot_water_cylinder": 0.8889,
|
||||
"cylinder_insulation_type": 0.5000,
|
||||
"secondary_heating_type": 0.0000,
|
||||
"roof_insulation_thickness": 0.4118,
|
||||
"roof_insulation_thickness_pm1": 0.4118,
|
||||
"floor_insulation": 0.9375,
|
||||
"has_room_in_roof": 0.8333,
|
||||
"modal_glazing_type": 0.5556,
|
||||
"has_pv": 1.0000,
|
||||
"solar_water_heating": 1.0000,
|
||||
}
|
||||
|
||||
# Maximum mean absolute residual per numeric component (ratchet ceilings).
|
||||
# window_count is deliberately excluded — it is cosmetic for SAP (issue #1222):
|
||||
# the predicted picture clusters at a mapper-default 4 windows while actuals
|
||||
# spread 1-21, yet total_window_area (the SAP-relevant signal) stays tight.
|
||||
#
|
||||
# floor_area was re-baselined 11.8983 -> 12.0378 when floor-area sizing moved from
|
||||
# the plain cohort median to the geo-proximity-weighted median (a *method* change,
|
||||
# not a loosening). The change is a clear win on the full 514-target corpus
|
||||
# (MAE 10.48 -> 9.73 / MAPE 13.2% -> 12.2%); the n=36 frozen fixture moved +0.14
|
||||
# the other way as small-sample noise (one target's shift moves an n=36 MAE more
|
||||
# than that). The ceiling still pins the new deterministic value exactly, so the
|
||||
# tighten-only ratchet resumes from here.
|
||||
_RESIDUAL_CEILINGS: dict[str, float] = {
|
||||
"floor_area": 12.0378,
|
||||
"total_window_area": 4.4067,
|
||||
"building_parts": 0.3333,
|
||||
"door_count": 0.6389,
|
||||
}
|
||||
|
||||
_TOLERANCE = 1e-3
|
||||
|
||||
|
||||
@pytest.fixture(scope="module")
|
||||
def accuracy() -> ComponentAccuracy:
|
||||
if not (_FIXTURE / "_index.json").exists():
|
||||
pytest.skip(f"no EPC Prediction fixture at {_FIXTURE}")
|
||||
return evaluate_component_accuracy(load_corpus(_FIXTURE))
|
||||
|
||||
|
||||
def test_fixture_yields_the_expected_target_count(
|
||||
accuracy: ComponentAccuracy,
|
||||
) -> None:
|
||||
# The frozen fixture must still produce its full set of SAP-10.2 targets — a
|
||||
# drop means the fixture or the target filter changed.
|
||||
assert accuracy.targets >= 36
|
||||
|
||||
|
||||
@pytest.mark.parametrize("component,floor", sorted(_RATE_FLOORS.items()))
|
||||
def test_classification_rate_does_not_regress(
|
||||
accuracy: ComponentAccuracy, component: str, floor: float
|
||||
) -> None:
|
||||
# Arrange / Act
|
||||
rate = accuracy.rate(component)
|
||||
|
||||
# Assert — the component is still applicable and at or above its floor.
|
||||
assert rate is not None, f"{component} had no applicable targets"
|
||||
assert rate >= floor - _TOLERANCE, (
|
||||
f"{component} classification regressed: {rate:.4f} < floor {floor:.4f}"
|
||||
)
|
||||
|
||||
|
||||
@pytest.mark.parametrize("component,ceiling", sorted(_RESIDUAL_CEILINGS.items()))
|
||||
def test_residual_does_not_regress(
|
||||
accuracy: ComponentAccuracy, component: str, ceiling: float
|
||||
) -> None:
|
||||
# Arrange / Act
|
||||
mean_abs = accuracy.mean_abs_residual(component)
|
||||
|
||||
# Assert — the mean absolute residual is at or below its ceiling.
|
||||
assert mean_abs is not None, f"{component} had no residuals"
|
||||
assert mean_abs <= ceiling + _TOLERANCE, (
|
||||
f"{component} residual regressed: {mean_abs:.4f} > ceiling {ceiling:.4f}"
|
||||
)
|
||||
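# A minimal sketch of the tighten-only ratchet checks used above, assuming the
# same 1e-3 tolerance. The helper names (passes_floor, passes_ceiling,
# tightened_ceiling) are hypothetical and are not part of the test module; they
# only restate the two assertions as plain functions.

```python
def passes_floor(rate: float, floor: float, tolerance: float = 1e-3) -> bool:
    """True when a classification rate is at or above its floor, within tolerance."""
    return rate >= floor - tolerance


def passes_ceiling(residual: float, ceiling: float, tolerance: float = 1e-3) -> bool:
    """True when a mean absolute residual is at or below its ceiling, within tolerance."""
    return residual <= ceiling + tolerance


def tightened_ceiling(current: float, observed: float) -> float:
    """Hypothetical helper: a tighten-only ratchet never raises the ceiling."""
    return min(current, observed)
```

# The tolerance absorbs rounding in the four-decimal constants; a method change
# such as the floor_area re-baseline is a deliberate manual edit, not something
# tightened_ceiling would ever produce.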
545
tests/domain/epc_prediction/test_epc_prediction.py
Normal file
|
|
@ -0,0 +1,545 @@
|
|||
"""Behaviour of EPC Prediction synthesis (ADR-0029): turn the selected
|
||||
Comparable Properties into a predicted EpcPropertyData. Hybrid: copy a coherent
|
||||
representative template's structure (building parts, windows, geometry), set the
|
||||
homogeneous categoricals to the recency-weighted cohort mode, apply Landlord
|
||||
Overrides on top. Pure domain logic.
|
||||
"""
|
||||
|
||||
from datetime import date
|
||||
from typing import Optional, Union
|
||||
|
||||
from datatypes.epc.domain.epc_property_data import (
|
||||
EpcPropertyData,
|
||||
MainHeatingDetail,
|
||||
SapBuildingPart,
|
||||
SapFloorDimension,
|
||||
SapHeating,
|
||||
SapWindow,
|
||||
)
|
||||
from domain.geospatial.coordinates import Coordinates
|
||||
from domain.epc_prediction.comparable_properties import (
|
||||
ComparableProperty,
|
||||
ComparableProperties,
|
||||
)
|
||||
from domain.epc_prediction.epc_prediction import (
|
||||
EpcPrediction,
|
||||
PredictionConfidence,
|
||||
)
|
||||
from domain.epc_prediction.prediction_target import PredictionTarget
|
||||
|
||||
|
||||
def _epc(
|
||||
*,
|
||||
building_parts: int = 1,
|
||||
floor_area: float = 80.0,
|
||||
wall_construction: Union[int, str] = 1,
|
||||
wall_insulation_type: Union[int, str] = 1,
|
||||
construction_age_band: str = "K",
|
||||
roof_construction: Optional[int] = 1,
|
||||
roof_insulation_thickness: Optional[Union[str, int]] = 100,
|
||||
floor_construction: Optional[int] = 1,
|
||||
floor_insulation: Optional[int] = 1,
|
||||
glazing_type: Union[int, str] = 3,
|
||||
main_fuel_type: Union[int, str] = 1,
|
||||
main_heating_category: Optional[int] = 1,
|
||||
main_heating_control: Union[int, str] = 1,
|
||||
water_heating_fuel: Optional[int] = 1,
|
||||
water_heating_code: Optional[int] = 1,
|
||||
has_hot_water_cylinder: bool = True,
|
||||
solar_water_heating: bool = False,
|
||||
) -> EpcPropertyData:
|
||||
epc: EpcPropertyData = object.__new__(EpcPropertyData)
|
||||
epc.property_type = "2"
|
||||
epc.built_form = "4"
|
||||
epc.total_floor_area_m2 = floor_area
|
||||
parts: list[SapBuildingPart] = []
|
||||
for _ in range(building_parts):
|
||||
part: SapBuildingPart = object.__new__(SapBuildingPart)
|
||||
part.wall_construction = wall_construction
|
||||
part.wall_insulation_type = wall_insulation_type
|
||||
part.construction_age_band = construction_age_band
|
||||
part.roof_construction = roof_construction
|
||||
part.roof_insulation_thickness = roof_insulation_thickness
|
||||
floor_dim: SapFloorDimension = object.__new__(SapFloorDimension)
|
||||
floor_dim.floor_construction = floor_construction
|
||||
floor_dim.floor_insulation = floor_insulation
|
||||
part.sap_floor_dimensions = [floor_dim]
|
||||
parts.append(part)
|
||||
epc.sap_building_parts = parts
|
||||
window: SapWindow = object.__new__(SapWindow)
|
||||
window.window_width = 1.0
|
||||
window.window_height = 1.0
|
||||
window.glazing_type = glazing_type
|
||||
epc.sap_windows = [window]
|
||||
heating: SapHeating = object.__new__(SapHeating)
|
||||
detail: MainHeatingDetail = object.__new__(MainHeatingDetail)
|
||||
detail.main_fuel_type = main_fuel_type
|
||||
detail.main_heating_category = main_heating_category
|
||||
detail.main_heating_control = main_heating_control
|
||||
heating.main_heating_details = [detail]
|
||||
heating.water_heating_fuel = water_heating_fuel
|
||||
heating.water_heating_code = water_heating_code
|
||||
heating.cylinder_insulation_type = 1
|
||||
heating.secondary_heating_type = None
|
||||
epc.sap_heating = heating
|
||||
epc.has_hot_water_cylinder = has_hot_water_cylinder
|
||||
epc.solar_water_heating = solar_water_heating
|
||||
return epc
|
||||
|
||||
|
||||
def _cohort(*epcs: EpcPropertyData) -> ComparableProperties:
|
||||
return ComparableProperties(
|
||||
members=tuple(
|
||||
ComparableProperty(epc=e, certificate_number=str(i)) for i, e in enumerate(epcs)
|
||||
)
|
||||
)
|
||||
|
||||
|
||||
def _dated_cohort(
|
||||
*dated: tuple[EpcPropertyData, date],
|
||||
) -> ComparableProperties:
|
||||
return ComparableProperties(
|
||||
members=tuple(
|
||||
ComparableProperty(epc=e, certificate_number=str(i), registration_date=d)
|
||||
for i, (e, d) in enumerate(dated)
|
||||
)
|
||||
)
|
||||
|
||||
|
||||
def test_predicts_a_picture_by_copying_a_representative_template() -> None:
|
||||
# Arrange — a single comparable with a distinctive structure (2 building
|
||||
# parts, 92 m²); with nothing else to go on it is the template.
|
||||
template = _epc(building_parts=2, floor_area=92.0)
|
||||
target = PredictionTarget(postcode="LS6 1AA", property_type="2")
|
||||
|
||||
# Act
|
||||
predicted: EpcPropertyData = EpcPrediction().predict(target, _cohort(template))
|
||||
|
||||
# Assert — the structure is copied wholesale (and it is a copy, not the same
|
||||
# object — the baseline must never be mutated).
|
||||
assert len(predicted.sap_building_parts) == 2
|
||||
assert predicted.total_floor_area_m2 == 92.0
|
||||
assert predicted is not template
|
||||
|
||||
|
||||
def test_template_is_the_member_closest_to_the_cohort_median_size() -> None:
|
||||
# Arrange — the cohort spans a wide range of sizes; members[0] is an atypical
|
||||
# tiny 20 m² outlier. A single neighbour's geometry is copied wholesale, so
|
||||
# the template must be the size-representative member (closest to the median),
|
||||
# not whoever happens to come first (ADR-0029 decision 4: closest on size).
|
||||
cohort = _cohort(
|
||||
_epc(floor_area=20.0),
|
||||
_epc(floor_area=80.0),
|
||||
_epc(floor_area=200.0),
|
||||
)
|
||||
|
||||
# Act
|
||||
predicted: EpcPropertyData = EpcPrediction().predict(
|
||||
PredictionTarget(postcode="LS6 1AA", property_type="2"), cohort
|
||||
)
|
||||
|
||||
# Assert — the 80 m² member (the median) seeds the structure, not the 20 m²
|
||||
# outlier sitting at members[0].
|
||||
assert predicted.total_floor_area_m2 == 80.0
|
||||
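# A minimal sketch of the template choice the test above describes: take the
# member whose floor area is closest to the cohort median. The function name
# pick_template_index is hypothetical; breaking ties towards the first member
# is an assumption taken from the "first member closest to the median" wording
# used elsewhere in this file.

```python
from statistics import median


def pick_template_index(floor_areas: list[float]) -> int:
    """Index of the member closest to the cohort median floor area.

    min() keeps the first of several equally close members, so ties resolve
    towards the earlier cohort member.
    """
    centre = median(floor_areas)
    return min(range(len(floor_areas)), key=lambda i: abs(floor_areas[i] - centre))
```

# An empty cohort raises statistics.StatisticsError here; how the real
# synthesis handles that case is not shown in this diff.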
|
||||
|
||||
def test_sets_main_wall_construction_to_the_cohort_mode() -> None:
|
||||
# Arrange — the template (members[0]) is solid brick (2), but the cohort
|
||||
# majority is cavity (1). The homogeneous categorical should follow the mode,
|
||||
# not the one template, so the prediction is robust to an atypical template.
|
||||
cohort = _cohort(
|
||||
_epc(wall_construction=2),
|
||||
_epc(wall_construction=1),
|
||||
_epc(wall_construction=1),
|
||||
_epc(wall_construction=1),
|
||||
)
|
||||
|
||||
# Act
|
||||
predicted: EpcPropertyData = EpcPrediction().predict(
|
||||
PredictionTarget(postcode="LS6 1AA", property_type="2"), cohort
|
||||
)
|
||||
|
||||
# Assert — cavity (the mode) wins over the solid-brick template.
|
||||
assert predicted.sap_building_parts[0].wall_construction == 1
|
||||
|
||||
|
||||
def test_sets_the_other_homogeneous_categoricals_to_the_cohort_mode() -> None:
|
||||
# Arrange — the median-size template (members[0], 80 m²) is an atypical
|
||||
# outlier on every categorical; the cohort majority disagrees. Age band,
|
||||
# wall insulation, roof construction and floor construction are all
|
||||
# homogeneous categoricals, so each should follow its mode, not the one
|
||||
# template (ADR-0029 decision 4).
|
||||
cohort = _cohort(
|
||||
_epc(
|
||||
floor_area=80.0,
|
||||
construction_age_band="A",
|
||||
wall_insulation_type=9,
|
||||
roof_construction=7,
|
||||
floor_construction=7,
|
||||
),
|
||||
_epc(
|
||||
construction_age_band="K",
|
||||
wall_insulation_type=1,
|
||||
roof_construction=2,
|
||||
floor_construction=3,
|
||||
),
|
||||
_epc(
|
||||
construction_age_band="K",
|
||||
wall_insulation_type=1,
|
||||
roof_construction=2,
|
||||
floor_construction=3,
|
||||
),
|
||||
)
|
||||
|
||||
# Act
|
||||
predicted: EpcPropertyData = EpcPrediction().predict(
|
||||
PredictionTarget(postcode="LS6 1AA", property_type="2"), cohort
|
||||
)
|
||||
|
||||
# Assert — every categorical follows the cohort mode over the outlier
|
||||
# template.
|
||||
main = predicted.sap_building_parts[0]
|
||||
assert main.construction_age_band == "K"
|
||||
assert main.wall_insulation_type == 1
|
||||
assert main.roof_construction == 2
|
||||
assert main.sap_floor_dimensions[0].floor_construction == 3
|
||||
|
||||
|
||||
def test_modes_roof_and_floor_insulation() -> None:
|
||||
# Arrange — the median-size template (members[0]) is an outlier on roof
|
||||
# insulation thickness and floor insulation; the cohort majority disagrees.
|
||||
# These are independent fabric categoricals, so each should follow its
|
||||
# cohort mode like the construction categoricals do.
|
||||
cohort = _cohort(
|
||||
_epc(floor_area=80.0, roof_insulation_thickness=25, floor_insulation=9),
|
||||
_epc(roof_insulation_thickness=300, floor_insulation=2),
|
||||
_epc(roof_insulation_thickness=300, floor_insulation=2),
|
||||
)
|
||||
|
||||
# Act
|
||||
predicted: EpcPropertyData = EpcPrediction().predict(
|
||||
PredictionTarget(postcode="LS6 1AA", property_type="2"), cohort
|
||||
)
|
||||
|
||||
# Assert — each follows the cohort mode over the outlier template.
|
||||
main = predicted.sap_building_parts[0]
|
||||
assert main.roof_insulation_thickness == 300
|
||||
assert main.sap_floor_dimensions[0].floor_insulation == 2
|
||||
|
||||
|
||||
def test_recency_weights_roof_insulation_mode() -> None:
|
||||
# Arrange — an old majority (three 2015 certs at 100 mm) and a recent
|
||||
# minority (two 2025 certs at 300 mm). Roof insulation is topped up over
|
||||
# time, so the recent neighbours reflect the current state: the recency-
|
||||
# weighted mode must pick 300 over the plain-majority 100.
|
||||
cohort = _dated_cohort(
|
||||
(_epc(roof_insulation_thickness=100), date(2015, 1, 1)),
|
||||
(_epc(roof_insulation_thickness=100), date(2015, 1, 1)),
|
||||
(_epc(roof_insulation_thickness=100), date(2015, 1, 1)),
|
||||
(_epc(roof_insulation_thickness=300), date(2025, 1, 1)),
|
||||
(_epc(roof_insulation_thickness=300), date(2025, 1, 1)),
|
||||
)
|
||||
|
||||
# Act
|
||||
predicted: EpcPropertyData = EpcPrediction().predict(
|
||||
PredictionTarget(postcode="LS6 1AA", property_type="2"), cohort
|
||||
)
|
||||
|
||||
# Assert — recency overrides the stale majority.
|
||||
assert predicted.sap_building_parts[0].roof_insulation_thickness == 300
|
||||
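# A sketch of one way a recency-weighted mode could behave, to illustrate the
# test above. The exponential decay, the 5-year half-life and the choice of the
# newest certificate as the reference date are all assumptions made for this
# example; the weighting actually used by EpcPrediction is not shown in this diff.

```python
from collections import defaultdict
from datetime import date
from typing import Hashable, Sequence


def recency_weighted_mode(
    dated_values: Sequence[tuple[Hashable, date]],
    half_life_years: float = 5.0,
) -> Hashable:
    """Return the value with the largest total recency weight.

    Each observation weighs 0.5 ** (age / half_life), with age measured in
    years back from the newest certificate in the cohort.
    """
    newest = max(lodged for _, lodged in dated_values)
    totals: dict[Hashable, float] = defaultdict(float)
    for value, lodged in dated_values:
        age_years = (newest - lodged).days / 365.25
        totals[value] += 0.5 ** (age_years / half_life_years)
    return max(totals, key=totals.__getitem__)
```

# Under these assumptions the three 2015 certificates weigh about 0.25 each
# (0.75 in total) against 1.0 each for the two 2025 certificates, so 300 wins.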
|
||||
|
||||
def test_floor_area_is_the_cohort_median_not_the_templates_own_area() -> None:
|
||||
# Arrange — an even-sized cohort whose median (70) falls between members, so
|
||||
# the size-representative template (the first member closest to the median,
|
||||
# 60 m²) does not itself sit on the median. The predicted floor area is a
|
||||
# point estimate of the target's size, best served by the cohort median (the
|
||||
# absolute-error-minimising estimator), decoupled from whichever template seeds the
|
||||
# structure.
|
||||
cohort = _cohort(
|
||||
_epc(floor_area=40.0),
|
||||
_epc(floor_area=60.0),
|
||||
_epc(floor_area=80.0),
|
||||
_epc(floor_area=100.0),
|
||||
)
|
||||
|
||||
# Act
|
||||
predicted: EpcPropertyData = EpcPrediction().predict(
|
||||
PredictionTarget(postcode="LS6 1AA", property_type="2"), cohort
|
||||
)
|
||||
|
||||
# Assert — the floor area is the cohort median (70), not the template's 60.
|
||||
assert predicted.total_floor_area_m2 == 70.0
|
||||
|
||||
|
||||
def test_floor_area_leans_toward_the_nearest_neighbours_size() -> None:
|
||||
# Arrange — three FAR neighbours are 60 m²; one neighbour AT the target is
|
||||
# 120 m². The plain median would be 60, but homes built together share a
|
||||
# footprint, so the geo-proximity-weighted median leans toward the near
|
||||
# neighbour's size.
|
||||
here = Coordinates(longitude=0.0, latitude=0.0)
|
||||
far = Coordinates(longitude=1.0, latitude=1.0)  # ~157 km away
|
||||
cohort = ComparableProperties(
|
||||
members=(
|
||||
ComparableProperty(_epc(floor_area=60.0), "1", coordinates=far),
|
||||
ComparableProperty(_epc(floor_area=60.0), "2", coordinates=far),
|
||||
ComparableProperty(_epc(floor_area=60.0), "3", coordinates=far),
|
||||
ComparableProperty(_epc(floor_area=120.0), "4", coordinates=here),
|
||||
)
|
||||
)
|
||||
target = PredictionTarget(
|
||||
postcode="LS6 1AA", property_type="2", coordinates=here
|
||||
)
|
||||
|
||||
# Act
|
||||
predicted: EpcPropertyData = EpcPrediction().predict(target, cohort)
|
||||
|
||||
# Assert — the near neighbour's size dominates the far majority.
|
||||
assert predicted.total_floor_area_m2 == 120.0
|
||||
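# A sketch of a geo-proximity-weighted median, to make the behaviour in the
# test above concrete. The inverse-distance kernel 1 / (1 + km) and the helper
# names are assumptions for illustration only; the diff does not show the
# kernel EpcPrediction really uses. Coordinates are (longitude, latitude) pairs.

```python
import math
from typing import Optional, Sequence

Point = tuple[float, float]  # (longitude, latitude) in degrees


def haversine_km(a: Point, b: Point) -> float:
    """Great-circle distance between two points on a spherical Earth."""
    lon1, lat1, lon2, lat2 = map(math.radians, (*a, *b))
    h = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def proximity_weights(target: Optional[Point], members: Sequence[Point]) -> list[float]:
    """Uniform weights without target coordinates, else 1 / (1 + distance in km)."""
    if target is None:
        return [1.0] * len(members)
    return [1.0 / (1.0 + haversine_km(target, m)) for m in members]


def weighted_median(values: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted median; averages the two middle values on an exact half split."""
    pairs = sorted(zip(values, weights))
    half = sum(weights) / 2.0
    running = 0.0
    for index, (value, weight) in enumerate(pairs):
        running += weight
        if math.isclose(running, half) and index + 1 < len(pairs):
            return (value + pairs[index + 1][0]) / 2.0
        if running > half:
            return value
    return pairs[-1][0]
```

# With uniform weights this reduces to the plain median, which is why dropping
# the target coordinates falls back to the unweighted cohort median.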
|
||||
|
||||
def test_floor_area_median_is_unweighted_without_target_coordinates() -> None:
|
||||
# Arrange — identical cohort, but the target has no coordinates, so geo
|
||||
# weighting is off and the floor area reduces to the plain cohort median (60).
|
||||
here = Coordinates(longitude=0.0, latitude=0.0)
|
||||
far = Coordinates(longitude=1.0, latitude=1.0)
|
||||
cohort = ComparableProperties(
|
||||
members=(
|
||||
ComparableProperty(_epc(floor_area=60.0), "1", coordinates=far),
|
||||
ComparableProperty(_epc(floor_area=60.0), "2", coordinates=far),
|
||||
ComparableProperty(_epc(floor_area=60.0), "3", coordinates=far),
|
||||
ComparableProperty(_epc(floor_area=120.0), "4", coordinates=here),
|
||||
)
|
||||
)
|
||||
target = PredictionTarget(postcode="LS6 1AA", property_type="2")
|
||||
|
||||
# Act
|
||||
predicted: EpcPropertyData = EpcPrediction().predict(target, cohort)
|
||||
|
||||
# Assert — without target coordinates, the plain median (60) wins.
|
||||
assert predicted.total_floor_area_m2 == 60.0
|
||||
|
||||
|
||||
def test_categorical_mode_leans_on_size_similar_neighbours() -> None:
|
||||
# Arrange — a count majority (three) carries wall-insulation 9, but two of
|
||||
# them are 400 m² size outliers; the cohort centre (median 100 m²) holds
|
||||
# wall-insulation 1. Physical-similarity weighting down-weights the outliers,
|
||||
# so the size-representative value 1 wins over the plain-count majority 9.
|
||||
cohort = _cohort(
|
||||
_epc(floor_area=100.0, wall_insulation_type=1),
|
||||
_epc(floor_area=100.0, wall_insulation_type=1),
|
||||
_epc(floor_area=100.0, wall_insulation_type=9),
|
||||
_epc(floor_area=400.0, wall_insulation_type=9),
|
||||
_epc(floor_area=400.0, wall_insulation_type=9),
|
||||
)
|
||||
|
||||
# Act
|
||||
predicted: EpcPropertyData = EpcPrediction().predict(
|
||||
PredictionTarget(postcode="LS6 1AA", property_type="2"), cohort
|
||||
)
|
||||
|
||||
# Assert — the size-similar value wins over the outlier-driven majority.
|
||||
assert predicted.sap_building_parts[0].wall_insulation_type == 1
|
||||
|
||||
|
||||
def test_categorical_mode_leans_on_age_similar_neighbours() -> None:
|
||||
# Arrange — same size throughout (so size weighting is neutral). A count
|
||||
# majority (three) carries wall-insulation 9, but two of them are age-band A
|
||||
# outliers while the cohort's modal band is K. Age-similarity weighting
|
||||
# down-weights the outliers, so the band-representative value 1 wins.
|
||||
cohort = _cohort(
|
||||
_epc(construction_age_band="K", wall_insulation_type=1),
|
||||
_epc(construction_age_band="K", wall_insulation_type=1),
|
||||
_epc(construction_age_band="K", wall_insulation_type=9),
|
||||
_epc(construction_age_band="A", wall_insulation_type=9),
|
||||
_epc(construction_age_band="A", wall_insulation_type=9),
|
||||
)
|
||||
|
||||
# Act
|
||||
predicted: EpcPropertyData = EpcPrediction().predict(
|
||||
PredictionTarget(postcode="LS6 1AA", property_type="2"), cohort
|
||||
)
|
||||
|
||||
# Assert — the age-similar value wins over the outlier-driven majority.
|
||||
assert predicted.sap_building_parts[0].wall_insulation_type == 1
|
||||
|
||||
|
||||
def test_confidence_reports_cohort_size_and_unanimous_agreement() -> None:
|
||||
# Arrange — a unanimous cohort: three neighbours, all cavity-walled (1).
|
||||
cohort = _cohort(
|
||||
_epc(wall_construction=1),
|
||||
_epc(wall_construction=1),
|
||||
_epc(wall_construction=1),
|
||||
)
|
||||
|
||||
# Act
|
||||
confidence: PredictionConfidence = EpcPrediction().confidence(cohort)
|
||||
|
||||
# Assert — three neighbours, total agreement on the wall construction.
|
||||
assert confidence.cohort_size == 3
|
||||
assert confidence.agreement("wall_construction") == 1.0
|
||||
|
||||
|
||||
def test_confidence_agreement_is_the_modal_share_of_the_cohort() -> None:
|
||||
# Arrange — three of four neighbours are cavity (1), one is solid brick (2),
|
||||
# so the cohort is split on the wall construction.
|
||||
cohort = _cohort(
|
||||
_epc(wall_construction=1),
|
||||
_epc(wall_construction=1),
|
||||
_epc(wall_construction=1),
|
||||
_epc(wall_construction=2),
|
||||
)
|
||||
|
||||
# Act
|
||||
confidence: PredictionConfidence = EpcPrediction().confidence(cohort)
|
||||
|
||||
# Assert — agreement is the modal value's share of the cohort: 3 of 4.
|
||||
share: Optional[float] = confidence.agreement("wall_construction")
|
||||
assert share is not None
|
||||
assert abs(share - 0.75) <= 1e-9
|
||||
|
||||
|
||||
def test_confidence_excludes_absent_component_values_from_the_denominator() -> None:
|
||||
# Arrange — two neighbours lodge a roof construction (both code 2); one lodges
|
||||
# none. The missing value must not dilute the agreement to 2/3.
|
||||
cohort = _cohort(
|
||||
_epc(roof_construction=2),
|
||||
_epc(roof_construction=2),
|
||||
_epc(roof_construction=None),
|
||||
)
|
||||
|
||||
# Act
|
||||
confidence: PredictionConfidence = EpcPrediction().confidence(cohort)
|
||||
|
||||
# Assert — agreement counts only the two present, unanimous values (1.0),
|
||||
# while the cohort size still reflects all three neighbours.
|
||||
share: Optional[float] = confidence.agreement("roof_construction")
|
||||
assert share is not None
|
||||
assert abs(share - 1.0) <= 1e-9
|
||||
assert confidence.cohort_size == 3
|
||||
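# A minimal sketch of the agreement figure the three confidence tests pin down:
# the modal value's share of the members that actually lodge a value. The
# function name modal_share is hypothetical; returning None when no member has
# a value is an assumption suggested by the Optional[float] annotation above.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def modal_share(values: Sequence[Optional[Hashable]]) -> Optional[float]:
    """Share of present (non-None) values that equal the most common value."""
    present = [value for value in values if value is not None]
    if not present:
        return None
    _, top_count = Counter(present).most_common(1)[0]
    return top_count / len(present)
```

# Filtering absent values before dividing is what keeps a missing roof
# construction from diluting a unanimous pair down to 2/3.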
|
||||
|
||||
def test_heating_is_a_coherent_donor_not_the_structural_template() -> None:
|
||||
# Arrange — the size-representative template (median 80 m²) runs an atypical
|
||||
# system (fuel 99, no cylinder), but the cohort's modal heating signature is a
|
||||
# gas system (fuel 1) with a cylinder, including a recent 2024 cert. Heating
|
||||
# sub-fields can't be field-moded, so the whole SapHeating cluster must be
|
||||
# copied from the coherent modal donor — the most recent among the matches —
|
||||
# not inherited from the structural template.
|
||||
cohort = _dated_cohort(
|
||||
(
|
||||
_epc(
|
||||
floor_area=80.0,
|
||||
main_fuel_type=99,
|
||||
main_heating_control=99,
|
||||
has_hot_water_cylinder=False,
|
||||
),
|
||||
date(2016, 1, 1),
|
||||
),
|
||||
(_epc(main_fuel_type=1, main_heating_control=5), date(2018, 1, 1)),
|
||||
(_epc(main_fuel_type=1, main_heating_control=5), date(2019, 1, 1)),
|
||||
(_epc(main_fuel_type=1, main_heating_control=7), date(2024, 1, 1)),
|
||||
)
|
||||
|
||||
# Act
|
||||
predicted: EpcPropertyData = EpcPrediction().predict(
|
||||
PredictionTarget(postcode="LS6 1AA", property_type="2"), cohort
|
||||
)
|
||||
|
||||
# Assert — heating comes coherently from the modal-signature donor (gas +
|
||||
# cylinder), the most recent match (control 7 from 2024), not the template's
|
||||
# fuel 99.
|
||||
detail = predicted.sap_heating.main_heating_details[0]
|
||||
assert detail.main_fuel_type == 1
|
||||
assert detail.main_heating_control == 7
|
||||
assert predicted.has_hot_water_cylinder is True
|
||||
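# A sketch of the "coherent donor" idea tested above: find the cohort's modal
# heating signature, then copy the whole heating cluster from the most recent
# member that matches it. HeatingRecord and pick_heating_donor are hypothetical
# names, and treating (fuel, has_cylinder) as the signature is an assumption
# read off the test comment, not the confirmed definition.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class HeatingRecord:
    fuel: int
    control: int
    has_cylinder: bool
    lodged: date

    @property
    def signature(self) -> tuple[int, bool]:
        # Assumed signature: the fields that must agree for a coherent system.
        return (self.fuel, self.has_cylinder)


def pick_heating_donor(records: list[HeatingRecord]) -> HeatingRecord:
    """Most recently lodged record among those sharing the modal signature."""
    modal_signature, _ = Counter(r.signature for r in records).most_common(1)[0]
    matches = [r for r in records if r.signature == modal_signature]
    return max(matches, key=lambda r: r.lodged)
```

# Copying one donor's record wholesale keeps fuel, controls and cylinder
# mutually consistent, which taking a per-field mode could not guarantee.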
|
||||
|
||||
def test_glazing_follows_the_recency_weighted_cohort_mode() -> None:
|
||||
# Arrange — an old majority single-glazed (type 1, 2015) and a recent
|
||||
# minority double-glazed (type 3, 2025). Glazing is retrofitted over time
|
||||
# (single → double), so the recent neighbours reflect the current state: the
|
||||
# recency-weighted mode must pick double over the stale single-glazed
|
||||
# majority, like roof insulation thickness.
|
||||
cohort = _dated_cohort(
|
||||
(_epc(glazing_type=1), date(2015, 1, 1)),
|
||||
(_epc(glazing_type=1), date(2015, 1, 1)),
|
||||
(_epc(glazing_type=1), date(2015, 1, 1)),
|
||||
(_epc(glazing_type=3), date(2025, 1, 1)),
|
||||
(_epc(glazing_type=3), date(2025, 1, 1)),
|
||||
)
|
||||
|
||||
# Act
|
||||
predicted: EpcPropertyData = EpcPrediction().predict(
|
||||
PredictionTarget(postcode="LS6 1AA", property_type="2"), cohort
|
||||
)
|
||||
|
||||
# Assert — every predicted window takes the recent glazing over the majority.
|
||||
assert all(window.glazing_type == 3 for window in predicted.sap_windows)
|
||||
|
||||
|
||||
def test_geo_proximity_weights_the_nearest_neighbour() -> None:
|
||||
# Arrange — same size + age (so similarity weighting is uniform). Three FAR
|
||||
# neighbours are cavity (1); one neighbour AT the target is solid brick (2).
|
||||
# Wall construction is a geo-weighted component, so the near neighbour
|
||||
# outweighs the far majority.
|
||||
here = Coordinates(longitude=0.0, latitude=0.0)
|
||||
far = Coordinates(longitude=1.0, latitude=1.0)  # ~157 km away
|
||||
cohort = ComparableProperties(
|
||||
members=(
|
||||
ComparableProperty(_epc(wall_construction=1), "1", coordinates=far),
|
||||
ComparableProperty(_epc(wall_construction=1), "2", coordinates=far),
|
||||
ComparableProperty(_epc(wall_construction=1), "3", coordinates=far),
|
||||
ComparableProperty(_epc(wall_construction=2), "4", coordinates=here),
|
||||
)
|
||||
)
|
||||
target = PredictionTarget(
|
||||
postcode="LS6 1AA", property_type="2", coordinates=here
|
||||
)
|
||||
|
||||
# Act
|
||||
predicted: EpcPropertyData = EpcPrediction().predict(target, cohort)
|
||||
|
||||
# Assert — the near neighbour's wall wins over the far majority.
|
||||
assert predicted.sap_building_parts[0].wall_construction == 2
|
||||
|
||||
|
||||
def test_geo_proximity_is_off_without_target_coordinates() -> None:
|
||||
# Arrange — identical cohort, but the target has no coordinates, so geo
|
||||
# weighting is disabled and the plain cohort majority (cavity, 1) wins.
|
||||
here = Coordinates(longitude=0.0, latitude=0.0)
|
||||
far = Coordinates(longitude=1.0, latitude=1.0)
|
||||
cohort = ComparableProperties(
|
||||
members=(
|
||||
ComparableProperty(_epc(wall_construction=1), "1", coordinates=far),
|
||||
ComparableProperty(_epc(wall_construction=1), "2", coordinates=far),
|
||||
ComparableProperty(_epc(wall_construction=1), "3", coordinates=far),
|
||||
ComparableProperty(_epc(wall_construction=2), "4", coordinates=here),
|
||||
)
|
||||
)
|
||||
target = PredictionTarget(postcode="LS6 1AA", property_type="2")
|
||||
|
||||
# Act
|
||||
predicted: EpcPropertyData = EpcPrediction().predict(target, cohort)
|
||||
|
||||
# Assert — without target coordinates, the majority wins (geo off).
|
||||
assert predicted.sap_building_parts[0].wall_construction == 1
|
||||
|
||||
|
||||
def test_applies_a_known_wall_override_over_the_mode() -> None:
|
||||
# Arrange — the cohort mode is cavity (1), but we KNOW the target is solid
|
||||
# brick (2), a Landlord Override. The known value must win over the estimate.
|
||||
cohort = _cohort(
|
||||
_epc(wall_construction=1),
|
||||
_epc(wall_construction=1),
|
||||
_epc(wall_construction=1),
|
||||
)
|
||||
target = PredictionTarget(
|
||||
postcode="LS6 1AA", property_type="2", wall_construction=2
|
||||
)
|
||||
|
||||
# Act
|
||||
predicted: EpcPropertyData = EpcPrediction().predict(target, cohort)
|
||||
|
||||
# Assert: the known Landlord Override takes precedence over the cohort mode.
|
||||
assert predicted.sap_building_parts[0].wall_construction == 2
|
||||
365
tests/domain/epc_prediction/test_prediction_comparison.py
Normal file
|
|
@ -0,0 +1,365 @@
|
|||
"""Behaviour of the per-Property prediction comparison (ADR-0029): given a
|
||||
predicted EpcPropertyData and the actual one, report the accuracy signals the
|
||||
validation harness aggregates — classification matches on the key categoricals
|
||||
and residuals on the geometry. Pure; SAP residual is computed in the runner
|
||||
(it needs the calculator + lodged SAP).
|
||||
"""
|
||||
|
||||
from typing import Optional, Union
|
||||
|
||||
from datatypes.epc.domain.epc_property_data import (
|
||||
EpcPropertyData,
|
||||
MainHeatingDetail,
|
||||
PhotovoltaicSupply,
|
||||
SapBuildingPart,
|
||||
SapEnergySource,
|
||||
SapFloorDimension,
|
||||
SapHeating,
|
||||
SapRoomInRoof,
|
||||
SapWindow,
|
||||
)
|
||||
from domain.epc_prediction.prediction_comparison import compare_prediction
|
||||
|
||||
|
||||
def _epc(
|
||||
*,
|
||||
wall_construction: int = 1,
|
||||
wall_insulation_type: Union[int, str] = 1,
|
||||
construction_age_band: str = "K",
|
||||
roof_construction: Optional[int] = 1,
|
||||
roof_insulation_thickness: Optional[Union[str, int]] = 100,
|
||||
floor_construction: Optional[int] = 1,
|
||||
floor_insulation: Optional[int] = 1,
|
||||
has_room_in_roof: bool = False,
|
||||
floor_area: float = 80.0,
|
||||
building_parts: int = 1,
|
||||
windows: Optional[list[tuple[float, float]]] = None,
|
||||
glazing_type: Union[int, str] = 3,
|
||||
door_count: int = 2,
|
||||
has_pv: bool = False,
|
||||
solar_water_heating: bool = False,
|
||||
main_fuel_type: Union[int, str] = 20,
|
||||
main_heating_category: Optional[int] = 2,
|
||||
main_heating_control: Union[int, str] = 2100,
|
||||
water_heating_fuel: Optional[int] = 20,
|
||||
water_heating_code: Optional[int] = 901,
|
||||
has_hot_water_cylinder: bool = True,
|
||||
cylinder_insulation_type: Optional[Union[int, str]] = 1,
|
||||
secondary_heating_type: Optional[Union[int, str]] = None,
|
||||
) -> EpcPropertyData:
|
||||
epc: EpcPropertyData = object.__new__(EpcPropertyData)
|
||||
epc.total_floor_area_m2 = floor_area
|
||||
epc.door_count = door_count
|
||||
epc.solar_water_heating = solar_water_heating
|
||||
parts: list[SapBuildingPart] = []
|
||||
for _ in range(building_parts):
|
||||
part: SapBuildingPart = object.__new__(SapBuildingPart)
|
||||
part.wall_construction = wall_construction
|
||||
part.wall_insulation_type = wall_insulation_type
|
||||
part.construction_age_band = construction_age_band
|
||||
part.roof_construction = roof_construction
|
||||
part.roof_insulation_thickness = roof_insulation_thickness
|
||||
part.sap_room_in_roof = (
|
||||
object.__new__(SapRoomInRoof) if has_room_in_roof else None
|
||||
)
|
||||
floor_dim: SapFloorDimension = object.__new__(SapFloorDimension)
|
||||
floor_dim.floor_construction = floor_construction
|
||||
floor_dim.floor_insulation = floor_insulation
|
||||
part.sap_floor_dimensions = [floor_dim]
|
||||
parts.append(part)
|
||||
epc.sap_building_parts = parts
|
||||
detail: MainHeatingDetail = object.__new__(MainHeatingDetail)
|
||||
detail.main_fuel_type = main_fuel_type
|
||||
detail.main_heating_category = main_heating_category
|
||||
detail.main_heating_control = main_heating_control
|
||||
heating: SapHeating = object.__new__(SapHeating)
|
||||
heating.main_heating_details = [detail]
|
||||
heating.water_heating_fuel = water_heating_fuel
|
||||
heating.water_heating_code = water_heating_code
|
||||
heating.cylinder_insulation_type = cylinder_insulation_type
|
||||
heating.secondary_heating_type = secondary_heating_type
|
||||
epc.sap_heating = heating
|
||||
epc.has_hot_water_cylinder = has_hot_water_cylinder
|
||||
sap_windows: list[SapWindow] = []
|
||||
for width, height in windows or []:
|
||||
w: SapWindow = object.__new__(SapWindow)
|
||||
w.window_width = width
|
||||
w.window_height = height
|
||||
w.glazing_type = glazing_type
|
||||
sap_windows.append(w)
|
||||
epc.sap_windows = sap_windows
|
||||
energy: SapEnergySource = object.__new__(SapEnergySource)
|
||||
energy.photovoltaic_supply = (
|
||||
object.__new__(PhotovoltaicSupply) if has_pv else None
|
||||
)
|
||||
energy.photovoltaic_arrays = None
|
||||
epc.sap_energy_source = energy
|
||||
return epc
|
||||
|
||||
|
||||
def test_scores_age_band_within_one_band() -> None:
|
||||
# Arrange — predicted age band K, actual J (adjacent). Adjacent RdSAP age
|
||||
# bands carry near-identical U-values, so an off-by-one is ~SAP-neutral: it
|
||||
# misses the exact hit but counts as a ±1-band hit (issue #1222).
|
||||
predicted = _epc(construction_age_band="K")
|
||||
actual = _epc(construction_age_band="J")
|
||||
|
||||
# Act
|
||||
hits = compare_prediction(predicted, actual).categorical_hits
|
||||
|
||||
# Assert
|
||||
assert hits["construction_age_band"] is False
|
||||
assert hits["construction_age_band_pm1"] is True
|
||||
|
||||
|
||||
def test_age_band_two_apart_misses_both() -> None:
|
||||
# Arrange — predicted K, actual H (three bands apart): a real miss on both.
|
||||
predicted = _epc(construction_age_band="K")
|
||||
actual = _epc(construction_age_band="H")
|
||||
|
||||
# Act
|
||||
hits = compare_prediction(predicted, actual).categorical_hits
|
||||
|
||||
# Assert
|
||||
assert hits["construction_age_band"] is False
|
||||
assert hits["construction_age_band_pm1"] is False
|
||||
|
||||
|
||||
def test_scores_roof_insulation_within_one_bucket() -> None:
|
||||
# Arrange — predicted 250mm, actual 270mm (adjacent RdSAP buckets). Adjacent
|
||||
# thicknesses carry near-identical roof U-values, so it misses the exact hit
|
||||
# but counts as a ±1-bucket hit, like the age band (issue #1222).
|
||||
predicted = _epc(roof_insulation_thickness="250mm")
|
||||
actual = _epc(roof_insulation_thickness="270mm")
|
||||
|
||||
# Act
|
||||
hits = compare_prediction(predicted, actual).categorical_hits
|
||||
|
||||
# Assert
|
||||
assert hits["roof_insulation_thickness"] is False
|
||||
assert hits["roof_insulation_thickness_pm1"] is True
|
||||
|
||||
|
||||
def test_roof_insulation_two_buckets_apart_misses_both() -> None:
|
||||
# Arrange: predicted 100mm, actual 200mm (two buckets apart: 100/150/200):
|
||||
# a real miss on both exact and ±1.
|
||||
predicted = _epc(roof_insulation_thickness="100mm")
|
||||
actual = _epc(roof_insulation_thickness="200mm")
|
||||
|
||||
# Act
|
||||
hits = compare_prediction(predicted, actual).categorical_hits
|
||||
|
||||
# Assert
|
||||
assert hits["roof_insulation_thickness"] is False
|
||||
assert hits["roof_insulation_thickness_pm1"] is False
|
||||
|
||||
|
||||
def test_roof_insulation_off_scale_no_data_only_exact_counts() -> None:
|
||||
# Arrange — actual is the off-scale "ND" (no-data) category; a non-equal
|
||||
# prediction can't be an adjacent-bucket hit.
|
||||
predicted = _epc(roof_insulation_thickness="200mm")
|
||||
actual = _epc(roof_insulation_thickness="ND")
|
||||
|
||||
# Act
|
||||
hits = compare_prediction(predicted, actual).categorical_hits
|
||||
|
||||
# Assert
|
||||
assert hits["roof_insulation_thickness"] is False
|
||||
assert hits["roof_insulation_thickness_pm1"] is False
|
||||
|
||||
|
||||
def test_flags_a_correct_main_wall_construction_classification() -> None:
|
||||
# Arrange — predicted and actual agree on cavity (1).
|
||||
predicted = _epc(wall_construction=1)
|
||||
actual = _epc(wall_construction=1)
|
||||
|
||||
# Act
|
||||
comparison = compare_prediction(predicted, actual)
|
||||
|
||||
# Assert
|
||||
assert comparison.categorical_hits["wall_construction"] is True
|
||||
|
||||
|
||||
def test_flags_an_incorrect_main_wall_construction_classification() -> None:
|
||||
# Arrange — predicted cavity (1), actual solid brick (2).
|
||||
predicted = _epc(wall_construction=1)
|
||||
actual = _epc(wall_construction=2)
|
||||
|
||||
# Act
|
||||
comparison = compare_prediction(predicted, actual)
|
||||
|
||||
# Assert
|
||||
assert comparison.categorical_hits["wall_construction"] is False
|
||||
|
||||
|
||||
def test_classifies_the_extra_homogeneous_categoricals() -> None:
|
||||
# Arrange — predicted agrees with the actual on age band, roof construction
|
||||
# and floor construction; only the wall insulation type differs.
|
||||
predicted = _epc(
|
||||
construction_age_band="K",
|
||||
wall_insulation_type=2,
|
||||
roof_construction=3,
|
||||
floor_construction=1,
|
||||
)
|
||||
actual = _epc(
|
||||
construction_age_band="K",
|
||||
wall_insulation_type=1,
|
||||
roof_construction=3,
|
||||
floor_construction=1,
|
||||
)
|
||||
|
||||
# Act
|
||||
comparison = compare_prediction(predicted, actual)
|
||||
|
||||
# Assert
|
||||
assert comparison.categorical_hits["construction_age_band"] is True
|
||||
assert comparison.categorical_hits["wall_insulation_type"] is False
|
||||
assert comparison.categorical_hits["roof_construction"] is True
|
||||
assert comparison.categorical_hits["floor_construction"] is True
|
||||
|
||||
|
||||
def test_classifies_the_heating_components() -> None:
|
||||
# Arrange — predicted and actual agree on everything heating except the main
|
||||
# fuel (predicted oil 28, actual gas 20) and secondary heating (predicted
|
||||
# none, actual a wood stove 693). Heating is the dominant SAP lever, so each
|
||||
# heating component is scored (ADR-0030 Component Accuracy).
|
||||
predicted = _epc(
|
||||
main_fuel_type=28,
|
||||
main_heating_category=2,
|
||||
main_heating_control=2100,
|
||||
water_heating_fuel=20,
|
||||
water_heating_code=901,
|
||||
has_hot_water_cylinder=True,
|
||||
cylinder_insulation_type=1,
|
||||
secondary_heating_type=None,
|
||||
)
|
||||
actual = _epc(
|
||||
main_fuel_type=20,
|
||||
main_heating_category=2,
|
||||
main_heating_control=2100,
|
||||
water_heating_fuel=20,
|
||||
water_heating_code=901,
|
||||
has_hot_water_cylinder=True,
|
||||
cylinder_insulation_type=1,
|
||||
secondary_heating_type=693,
|
||||
)
|
||||
|
||||
# Act
|
||||
hits = compare_prediction(predicted, actual).categorical_hits
|
||||
|
||||
# Assert
|
||||
assert hits["heating_main_fuel"] is False
|
||||
assert hits["heating_main_category"] is True
|
||||
assert hits["heating_main_control"] is True
|
||||
assert hits["water_heating_fuel"] is True
|
||||
assert hits["water_heating_code"] is True
|
||||
assert hits["has_hot_water_cylinder"] is True
|
||||
assert hits["cylinder_insulation_type"] is True
|
||||
# Secondary heating is absent in the prediction but present in the actual —
|
||||
# a real miss (predicted None ≠ actual 693), not "not applicable".
|
||||
assert hits["secondary_heating_type"] is False
|
||||
|
||||
|
||||
def test_classifies_fabric_insulation_and_room_in_roof() -> None:
|
||||
# Arrange — predicted and actual disagree on roof insulation thickness and on
|
||||
# whether there's a room-in-roof, but agree on floor insulation.
|
||||
predicted = _epc(
|
||||
roof_insulation_thickness=100,
|
||||
floor_insulation=1,
|
||||
has_room_in_roof=False,
|
||||
)
|
||||
actual = _epc(
|
||||
roof_insulation_thickness=270,
|
||||
floor_insulation=1,
|
||||
has_room_in_roof=True,
|
||||
)
|
||||
|
||||
# Act
|
||||
hits = compare_prediction(predicted, actual).categorical_hits
|
||||
|
||||
# Assert
|
||||
assert hits["roof_insulation_thickness"] is False
|
||||
assert hits["floor_insulation"] is True
|
||||
# Room-in-roof presence is always applicable — predicting "no RR" when there
|
||||
# is one is a real miss, not "not applicable".
|
||||
assert hits["has_room_in_roof"] is False
|
||||
|
||||
|
||||
def test_classifies_glazing_renewables_and_door_count() -> None:
|
||||
# Arrange — predicted glazing type, PV and solar disagree with the actual;
|
||||
# door count is over-predicted by one.
|
||||
predicted = _epc(
|
||||
windows=[(1.0, 1.0), (1.0, 1.0)],
|
||||
glazing_type=3,
|
||||
has_pv=False,
|
||||
solar_water_heating=False,
|
||||
door_count=3,
|
||||
)
|
||||
actual = _epc(
|
||||
windows=[(1.0, 1.0), (1.0, 1.0)],
|
||||
glazing_type=4,
|
||||
has_pv=True,
|
||||
solar_water_heating=True,
|
||||
door_count=2,
|
||||
)
|
||||
|
||||
# Act
|
||||
comparison = compare_prediction(predicted, actual)
|
||||
hits = comparison.categorical_hits
|
||||
|
||||
# Assert
|
||||
assert hits["modal_glazing_type"] is False
|
||||
assert hits["has_pv"] is False
|
||||
assert hits["solar_water_heating"] is False
|
||||
assert comparison.door_count_residual == 1
|
||||
|
||||
|
||||
def test_categorical_hit_is_not_applicable_when_actual_is_absent() -> None:
|
||||
# Arrange — the actual lodges no roof construction (a flat under another
|
||||
# dwelling). A hit there is not applicable, not a free win, so it must not
|
||||
# count towards the roof classification rate.
|
||||
predicted = _epc(roof_construction=3)
|
||||
actual = _epc(roof_construction=None)
|
||||
|
||||
# Act
|
||||
comparison = compare_prediction(predicted, actual)
|
||||
|
||||
# Assert
|
||||
assert comparison.categorical_hits["roof_construction"] is None
|
||||
|
||||
|
||||
def test_reports_the_floor_area_residual_as_predicted_minus_actual() -> None:
|
||||
# Arrange — predicted 90 m², actual 100 m² (a 10 m² under-prediction).
|
||||
predicted = _epc(floor_area=90.0)
|
||||
actual = _epc(floor_area=100.0)
|
||||
|
||||
# Act
|
||||
comparison = compare_prediction(predicted, actual)
|
||||
|
||||
# Assert — signed residual, predicted − actual.
|
||||
assert abs(comparison.floor_area_residual - (-10.0)) <= 1e-9
|
||||
|
||||
|
||||
def test_reports_the_building_parts_count_residual() -> None:
|
||||
# Arrange — predicted a single part; the actual has a main + an extension.
|
||||
predicted = _epc(building_parts=1)
|
||||
actual = _epc(building_parts=2)
|
||||
|
||||
# Act
|
||||
comparison = compare_prediction(predicted, actual)
|
||||
|
||||
# Assert — predicted − actual.
|
||||
assert comparison.building_parts_residual == -1
|
||||
|
||||
|
||||
def test_reports_window_count_and_total_area_residuals() -> None:
|
||||
# Arrange — predicted 2 windows (3 m² total); actual 1 window (1 m²).
|
||||
predicted = _epc(windows=[(1.0, 1.0), (2.0, 1.0)])
|
||||
actual = _epc(windows=[(1.0, 1.0)])
|
||||
|
||||
# Act
|
||||
comparison = compare_prediction(predicted, actual)
|
||||
|
||||
# Assert
|
||||
assert comparison.window_count_residual == 1
|
||||
assert abs(comparison.total_window_area_residual - 2.0) <= 1e-9
|
||||
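The ±1-bucket scoring exercised by the age-band and roof-insulation tests above can be sketched in isolation. This is a minimal sketch, not the project's `compare_prediction`: the function name `ordinal_hits`, the `ROOF_BUCKETS` scale, and the choice to count an exact hit as a ±1 hit are assumptions made for illustration. Only the 100/150/200 and 250/270 adjacencies come from the tests themselves.

```python
from typing import Optional, Sequence

# Illustrative ordinal scale only (an assumption, not the real RdSAP bucket list).
ROOF_BUCKETS: tuple[str, ...] = ("100mm", "150mm", "200mm", "250mm", "270mm")


def ordinal_hits(
    predicted: Optional[str], actual: Optional[str], scale: Sequence[str]
) -> tuple[Optional[bool], Optional[bool]]:
    """Return (exact_hit, within_one_bucket_hit).

    Both are None when the actual is absent: the comparison is not applicable.
    """
    if actual is None:
        return None, None
    if predicted == actual:
        # Assumption: an exact hit also counts as a +/-1 hit.
        return True, True
    if predicted not in scale or actual not in scale:
        # Off-scale categories such as "ND" have no neighbours,
        # so a non-equal prediction cannot be an adjacent-bucket hit.
        return False, False
    distance = abs(scale.index(predicted) - scale.index(actual))
    return False, distance <= 1
```

Keeping the off-scale check ahead of the index lookup means a no-data category never needs a position on the scale, which mirrors the "only exact counts" behaviour the ND test asserts.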
56
tests/domain/epc_prediction/test_prediction_target.py
Normal file
|
|
@ -0,0 +1,56 @@
|
|||
"""Assembling an EPC-less Property's PredictionTarget, and the eligibility gate:
|
||||
a Property whose property type is unknown is not predicted (ADR-0031 slice-5d)."""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
from typing import Optional
|
||||
|
||||
from domain.epc_prediction.prediction_target import (
|
||||
PredictionTarget,
|
||||
PredictionTargetAttributes,
|
||||
build_prediction_target,
|
||||
)
|
||||
from domain.geospatial.coordinates import Coordinates
|
||||
from domain.property.property import PropertyIdentity
|
||||
|
||||
|
||||
def _identity(postcode: str = "LS6 1AA") -> PropertyIdentity:
|
||||
return PropertyIdentity(
|
||||
portfolio_id=1, postcode=postcode, address="1 Some Street", uprn=12345
|
||||
)
|
||||
|
||||
|
||||
def test_target_is_assembled_from_identity_coords_and_overrides() -> None:
|
||||
# Arrange — a known property type + built form + wall (Landlord Overrides),
|
||||
# and resolved coordinates.
|
||||
here = Coordinates(longitude=-1.55, latitude=53.81)
|
||||
attributes = PredictionTargetAttributes(
|
||||
property_type="2", built_form="3", wall_construction=1
|
||||
)
|
||||
|
||||
# Act
|
||||
target: Optional[PredictionTarget] = build_prediction_target(
|
||||
_identity(), here, attributes
|
||||
)
|
||||
|
||||
# Assert — every known input is threaded onto the target.
|
||||
assert target is not None
|
||||
assert target.postcode == "LS6 1AA"
|
||||
assert target.property_type == "2"
|
||||
assert target.built_form == "3"
|
||||
assert target.wall_construction == 1
|
||||
assert target.coordinates is here
|
||||
|
||||
|
||||
def test_an_unknown_property_type_gates_the_property_out() -> None:
|
||||
# Arrange — property type is the hard cohort filter; without it the Property
|
||||
# must not be predicted from a mixed-type cohort (ADR-0031).
|
||||
attributes = PredictionTargetAttributes(property_type=None)
|
||||
|
||||
# Act
|
||||
target: Optional[PredictionTarget] = build_prediction_target(
|
||||
_identity(), None, attributes
|
||||
)
|
||||
|
||||
# Assert — gated out: no target to predict from.
|
||||
assert target is None
|
||||
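The eligibility gate tested above (no property type, no target) can be sketched with plain dataclasses. This is a hedged illustration, not the real `build_prediction_target`: `SketchAttributes`, `SketchTarget`, and `build_target_sketch` are hypothetical names, and coordinates are reduced to a simple tuple.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SketchAttributes:
    """Landlord-known inputs (hypothetical stand-in for PredictionTargetAttributes)."""

    property_type: Optional[str] = None
    built_form: Optional[str] = None
    wall_construction: Optional[int] = None


@dataclass(frozen=True)
class SketchTarget:
    """What the predictor would receive (hypothetical stand-in for PredictionTarget)."""

    postcode: str
    property_type: str
    built_form: Optional[str]
    wall_construction: Optional[int]
    coordinates: Optional[tuple[float, float]]


def build_target_sketch(
    postcode: str,
    coordinates: Optional[tuple[float, float]],
    attributes: SketchAttributes,
) -> Optional[SketchTarget]:
    """Return a target, or None when the hard cohort filter is unknown."""
    if attributes.property_type is None:
        # Property type is the hard cohort filter: without it the cohort
        # would mix types, so the Property is gated out of prediction.
        return None
    return SketchTarget(
        postcode=postcode,
        property_type=attributes.property_type,
        built_form=attributes.built_form,
        wall_construction=attributes.wall_construction,
        coordinates=coordinates,
    )
```

Returning `None` rather than raising keeps the gate cheap for a caller that iterates over many Properties and simply skips the ineligible ones.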
123
tests/domain/epc_prediction/test_validation.py
Normal file
|
|
@ -0,0 +1,123 @@
|
|||
"""Behaviour of the Component Accuracy leave-one-out scorer (ADR-0030): given
|
||||
loaded postcode cohorts, hold out each SAP 10.2 target, predict it from its
|
||||
all-vintage neighbours, and aggregate the per-component hits + residuals. Pure
|
||||
(no IO, no calculator) — corpus loading is the caller's job.
|
||||
"""
|
||||
|
||||
from datetime import date
|
||||
from typing import Optional, Union
|
||||
|
||||
from datatypes.epc.domain.epc_property_data import (
|
||||
EpcPropertyData,
|
||||
MainHeatingDetail,
|
||||
SapBuildingPart,
|
||||
SapEnergySource,
|
||||
SapFloorDimension,
|
||||
SapHeating,
|
||||
)
|
||||
from domain.epc_prediction.comparable_properties import ComparableProperty
|
||||
from domain.epc_prediction.validation import evaluate_component_accuracy
|
||||
|
||||
|
||||
def _comparable(
|
||||
*,
|
||||
certificate_number: str,
|
||||
address: str,
|
||||
sap_version: float,
|
||||
wall_construction: Union[int, str] = 1,
|
||||
registration_date: Optional[date] = None,
|
||||
) -> ComparableProperty:
|
||||
"""A ComparableProperty carrying a fully-populated opaque EpcPropertyData — every
|
||||
field the predictor + comparison read (the partial-instance idiom)."""
|
||||
epc: EpcPropertyData = object.__new__(EpcPropertyData)
|
||||
epc.sap_version = sap_version
|
||||
epc.postcode = "LS6 1AA"
|
||||
epc.property_type = "2"
|
||||
epc.built_form = "4"
|
||||
epc.total_floor_area_m2 = 80.0
|
||||
epc.door_count = 2
|
||||
epc.solar_water_heating = False
|
||||
epc.has_hot_water_cylinder = True
|
||||
part: SapBuildingPart = object.__new__(SapBuildingPart)
|
||||
part.wall_construction = wall_construction
|
||||
part.wall_insulation_type = 1
|
||||
part.construction_age_band = "K"
|
||||
part.roof_construction = 1
|
||||
part.roof_insulation_thickness = 100
|
||||
part.sap_room_in_roof = None
|
||||
floor_dim: SapFloorDimension = object.__new__(SapFloorDimension)
|
||||
floor_dim.floor_construction = 1
|
||||
floor_dim.floor_insulation = 1
|
||||
part.sap_floor_dimensions = [floor_dim]
|
||||
epc.sap_building_parts = [part]
|
||||
epc.sap_windows = []
|
||||
detail: MainHeatingDetail = object.__new__(MainHeatingDetail)
|
||||
detail.main_fuel_type = 20
|
||||
detail.main_heating_category = 2
|
||||
detail.main_heating_control = 2100
|
||||
heating: SapHeating = object.__new__(SapHeating)
|
||||
heating.main_heating_details = [detail]
|
||||
heating.water_heating_fuel = 20
|
||||
heating.water_heating_code = 901
|
||||
heating.cylinder_insulation_type = 1
|
||||
heating.secondary_heating_type = None
|
||||
epc.sap_heating = heating
|
||||
energy: SapEnergySource = object.__new__(SapEnergySource)
|
||||
energy.photovoltaic_supply = None
|
||||
energy.photovoltaic_arrays = None
|
||||
epc.sap_energy_source = energy
|
||||
return ComparableProperty(
|
||||
epc=epc,
|
||||
certificate_number=certificate_number,
|
||||
address=address,
|
||||
registration_date=registration_date,
|
||||
)
|
||||
|
||||
|
||||
def test_scores_only_sap_10_2_targets() -> None:
|
||||
# Arrange — a cohort of two distinct addresses: one SAP 10.2, one older
|
||||
# (SAP 9.94). Only the 10.2 cert is a valid held-out target; the older one
|
||||
# is kept as source evidence (its components are still valid).
|
||||
cohort = [
|
||||
_comparable(
|
||||
certificate_number="A", address="1 THE ROW", sap_version=10.2
|
||||
),
|
||||
_comparable(
|
||||
certificate_number="B", address="2 THE ROW", sap_version=9.94
|
||||
),
|
||||
]
|
||||
|
||||
# Act
|
||||
accuracy = evaluate_component_accuracy([cohort])
|
||||
|
||||
# Assert — exactly one target scored (the 10.2 cert), predicted from the
|
||||
# older neighbour; the older cert was never held out.
|
||||
assert accuracy.targets == 1
|
||||
assert accuracy.rate("wall_construction") == 1.0
|
||||
|
||||
|
||||
def test_aggregates_a_wall_classification_miss() -> None:
|
||||
# Arrange — two 10.2 certs with opposite walls: one solid brick (2), one
|
||||
# cavity (1). Each is the other's only neighbour, so the predicted mode misses.
|
||||
cohort = [
|
||||
_comparable(
|
||||
certificate_number="A",
|
||||
address="1 THE ROW",
|
||||
sap_version=10.2,
|
||||
wall_construction=2,
|
||||
),
|
||||
_comparable(
|
||||
certificate_number="B",
|
||||
address="2 THE ROW",
|
||||
sap_version=10.2,
|
||||
wall_construction=1,
|
||||
),
|
||||
]
|
||||
|
||||
# Act
|
||||
accuracy = evaluate_component_accuracy([cohort])
|
||||
|
||||
# Assert — both are 10.2 targets, and each is predicted from the other (the
|
||||
# opposite wall), so wall_construction is missed both times.
|
||||
assert accuracy.targets == 2
|
||||
assert accuracy.rate("wall_construction") == 0.0
|
||||
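The leave-one-out scoring described in this file's docstring can be sketched for a single component. This is a simplified illustration under stated assumptions, not `evaluate_component_accuracy`: the `Cert` dataclass and `leave_one_out_wall_rate` are hypothetical, and the prediction here is an unweighted mode, whereas the real predictor may weight neighbours by proximity, recency and similarity.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Cert:
    """A minimal certificate (hypothetical stand-in for ComparableProperty)."""

    address: str
    sap_version: float
    wall_construction: int


def leave_one_out_wall_rate(
    cohorts: list[list[Cert]],
) -> tuple[int, Optional[float]]:
    """Return (targets scored, wall hit rate) across all cohorts."""
    targets = 0
    hits = 0
    for cohort in cohorts:
        for held_out in cohort:
            if held_out.sap_version != 10.2:
                # Older vintages stay in the cohort as evidence
                # but are never held out as validation targets.
                continue
            neighbours = [c for c in cohort if c.address != held_out.address]
            if not neighbours:
                continue  # nothing to predict from
            walls = Counter(c.wall_construction for c in neighbours)
            predicted = walls.most_common(1)[0][0]
            targets += 1
            hits += int(predicted == held_out.wall_construction)
    return targets, (hits / targets if targets else None)
```

Excluding neighbours by address rather than by certificate number is a choice made here so that an older cert for the same dwelling cannot leak the answer; whether the real scorer does the same is not shown in this diff.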
|
|
@ -98,6 +98,44 @@ def test_effective_epc_follows_the_selected_source_path() -> None:
|
|||
assert epc_property.effective_epc is public_epc
|
||||
|
||||
|
||||
def test_source_path_is_predicted_when_only_a_predicted_epc_is_present() -> None:
|
||||
# Arrange — no lodged EPC, no Site Notes; just a neighbour-synthesised picture
|
||||
# (EPC Prediction gap-fill, ADR-0031).
|
||||
predicted = _epc()
|
||||
prop = Property(identity=_identity(), predicted_epc=predicted)
|
||||
|
||||
# Act / Assert — predicted is the last-resort source, not a raise
|
||||
assert prop.source_path == "predicted"
|
||||
assert prop.effective_epc is predicted
|
||||
|
||||
|
||||
def test_a_lodged_epc_wins_over_a_predicted_epc() -> None:
|
||||
# Arrange — both a real lodged EPC and a neighbour-synthesised one are present;
|
||||
# the real source must win (prediction is last-resort only, ADR-0031).
|
||||
lodged = _epc()
|
||||
predicted = _epc()
|
||||
prop = Property(identity=_identity(), epc=lodged, predicted_epc=predicted)
|
||||
|
||||
# Act / Assert
|
||||
assert prop.source_path == "epc_with_overlay"
|
||||
assert prop.effective_epc is lodged
|
||||
|
||||
|
||||
def test_site_notes_win_over_a_predicted_epc() -> None:
|
||||
# Arrange — Site Notes and a predicted EPC are present; the survey wins.
|
||||
survey_epc = _epc()
|
||||
predicted = _epc()
|
||||
prop = Property(
|
||||
identity=_identity(),
|
||||
site_notes=SiteNotes(surveyed_at=date(2024, 6, 1), epc=survey_epc),
|
||||
predicted_epc=predicted,
|
||||
)
|
||||
|
||||
# Act / Assert
|
||||
assert prop.source_path == "site_notes"
|
||||
assert prop.effective_epc is survey_epc
|
||||
|
||||
|
||||
def test_property_with_no_source_raises() -> None:
|
||||
# Arrange
|
||||
prop = Property(identity=_identity())
|
||||
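The source-path precedence these tests pin down can be sketched as a small resolver. This is an assumption-laden illustration, not the `Property` implementation: `resolve_source_path` is a hypothetical function, the ordering of Site Notes ahead of a lodged EPC is assumed (the tests above only show each beating a predicted EPC), and `ValueError` is a placeholder for whatever the real class raises.

```python
def resolve_source_path(
    *, has_site_notes: bool, has_lodged_epc: bool, has_predicted_epc: bool
) -> str:
    """Pick the source of the Effective EPC; prediction is the last resort."""
    if has_site_notes:
        return "site_notes"
    if has_lodged_epc:
        return "epc_with_overlay"
    if has_predicted_epc:
        return "predicted"
    # Placeholder exception type: the real error is not shown in this diff.
    raise ValueError("Property has no EPC source")
```

Ordering the checks from most to least trusted means a predicted EPC can sit alongside real data without ever shadowing it.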
|
|
|
|||
0
tests/e2e/__init__.py
Normal file
177
tests/e2e/test_epc_prediction_e2e.py
Normal file
|
|
@ -0,0 +1,177 @@
|
|||
"""END-TO-END showcase: an EPC-less Property flows through Ingestion, gets a
|
||||
predicted EPC synthesised from its postcode cohort, is persisted to the predicted
|
||||
slot, and comes back out of the Property repository resolving as the Effective
|
||||
EPC (ADR-0031).
|
||||
|
||||
This is the full production path with ONLY the external HTTP clients faked (the
|
||||
EPC API, the geospatial S3 reader, the Solar API) — everything else is the real
|
||||
thing: the real Postgres Unit of Work, the real EPC + Property repositories
|
||||
against the test database, the real `EpcComparablePropertiesRepository`, and the
|
||||
real `EpcPrediction`. It is the canonical "see the whole flow" reference; the
|
||||
narrower unit tests live in:
|
||||
- tests/orchestration/test_ingestion_prediction.py (orchestrator: gate / persist)
|
||||
- tests/repositories/epc/test_epc_predicted_slot.py (the lodged|predicted slot)
|
||||
- tests/domain/property/test_property.py (the "predicted" source path)
|
||||
- tests/domain/epc_prediction/test_prediction_target.py (the eligibility gate)
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import json
|
||||
from pathlib import Path
|
||||
from typing import Any, Optional
|
||||
|
||||
from sqlalchemy import Engine
|
||||
from sqlmodel import Session
|
||||
|
||||
from datatypes.epc.domain.epc_property_data import EpcPropertyData
|
||||
from datatypes.epc.domain.mapper import EpcPropertyDataMapper
|
||||
from datatypes.epc.search.epc_search_result import EpcSearchResult
|
||||
from domain.epc_prediction.epc_prediction import EpcPrediction
|
||||
from domain.epc_prediction.prediction_target import PredictionTargetAttributes
|
||||
from domain.geospatial.coordinates import Coordinates
|
||||
from domain.geospatial.planning_restrictions import PlanningRestrictions
|
||||
from domain.geospatial.spatial_reference import SpatialReference
|
||||
from domain.property.property import Property
|
||||
from infrastructure.postgres.property_table import PropertyRow
|
||||
from orchestration.ingestion_orchestrator import IngestionOrchestrator
|
||||
from repositories.comparable_properties.epc_comparable_properties_repository import (
|
||||
EpcComparablePropertiesRepository,
|
||||
)
|
||||
from repositories.epc.epc_postgres_repository import EpcPostgresRepository
|
||||
from repositories.geospatial.geospatial_repository import GeospatialRepository
|
||||
from repositories.postgres_unit_of_work import PostgresUnitOfWork
|
||||
from repositories.property.property_postgres_repository import (
|
||||
PropertyPostgresRepository,
|
||||
)
|
||||
from repositories.spatial.spatial_postgres_repository import SpatialPostgresRepository
|
||||
|
||||
_JSON_SAMPLES = Path(__file__).resolve().parents[2] / "backend/epc_api/json_samples"
|
||||
_POSTCODE = "LS6 1AA"
|
||||
|
||||
|
||||
def _epc() -> EpcPropertyData:
|
||||
raw: dict[str, Any] = json.loads(
|
||||
(_JSON_SAMPLES / "RdSAP-Schema-21.0.0" / "epc.json").read_text()
|
||||
)
|
||||
return EpcPropertyDataMapper.from_api_response(raw)
|
||||
|
||||
|
||||
# --- fakes for the external boundaries (everything else is real) ----
|
||||
|
||||
|
||||
class _FakeCohortEpcClient:
|
||||
"""Stands in for the live EPC API: the postcode's lodged certs + their data."""
|
||||
|
||||
def __init__(self, results: list[EpcSearchResult]) -> None:
|
||||
self._results = results
|
||||
|
||||
def search_by_postcode(self, postcode: str) -> list[EpcSearchResult]:
|
||||
return self._results
|
||||
|
||||
def get_by_certificate_number(self, cert_num: str) -> EpcPropertyData:
|
||||
return _epc()
|
||||
|
||||
|
||||
class _FakeGeospatialRepo(GeospatialRepository):
|
||||
"""Stands in for the S3 Open-UPRN reader: UPRN → coordinates."""
|
||||
|
||||
def __init__(self, coords: dict[int, Coordinates]) -> None:
|
||||
self._coords = coords
|
||||
|
||||
def coordinates_for(self, uprn: int) -> Optional[Coordinates]:
|
||||
return self._coords.get(uprn)
|
||||
|
||||
def spatial_for(self, uprn: int) -> Optional[SpatialReference]:
|
||||
coordinates = self._coords.get(uprn)
|
||||
if coordinates is None:
|
||||
return None
|
||||
return SpatialReference(
|
||||
coordinates=coordinates, restrictions=PlanningRestrictions()
|
||||
)
|
||||
|
||||
|
||||
class _NoEpcFetcher:
|
||||
"""The target Property is EPC-less — the EPC API finds nothing for its UPRN."""
|
||||
|
||||
def get_by_uprn(self, uprn: int) -> Optional[EpcPropertyData]:
|
||||
return None
|
||||
|
||||
|
||||
class _NoSolarFetcher:
|
||||
"""Stands in for the Solar API: returns no building insights."""
|
||||
def get_building_insights(
|
||||
self, longitude: float, latitude: float
|
||||
) -> dict[str, Any]:
|
||||
return {}
|
||||
|
||||
|
||||
class _FakeAttributesReader:
|
||||
"""Stands in for Jun-te's property_overrides read adapter: the landlord-known
|
||||
property type (here a House, code "0", matching the cohort)."""
|
||||
|
||||
def attributes_for(self, property_id: int) -> PredictionTargetAttributes:
|
||||
return PredictionTargetAttributes(property_type="0", built_form="2")
|
||||
|
||||
|
||||
def _cohort_results() -> list[EpcSearchResult]:
|
||||
return [
|
||||
EpcSearchResult(
|
||||
certificate_number=f"CERT-{i}",
|
||||
address_line_1=f"{i} Neighbour Road",
|
||||
address_line_2=None,
|
||||
address_line_3=None,
|
||||
address_line_4=None,
|
||||
postcode=_POSTCODE,
|
||||
post_town="LEEDS",
|
||||
uprn=20000 + i,
|
||||
current_energy_efficiency_band="D",
|
||||
registration_date=f"2023-0{i + 1}-01",
|
||||
)
|
||||
for i in range(3)
|
||||
]
|
||||
|
||||
|
||||
def test_epc_less_property_is_predicted_persisted_and_resolved_end_to_end(
|
||||
db_engine: Engine,
|
||||
) -> None:
|
||||
# Arrange — an EPC-less Property exists in the database (postcode + UPRN known,
|
||||
# no EPC lodged), plus its postcode cohort behind the faked EPC API.
|
||||
with Session(db_engine) as session:
|
||||
row = PropertyRow(
|
||||
portfolio_id=1, postcode=_POSTCODE, address="1 Target Street", uprn=10000
|
||||
)
|
||||
session.add(row)
|
||||
session.commit()
|
||||
property_id = row.id
|
||||
assert property_id is not None
|
||||
|
||||
cohort_coords = {20000 + i: Coordinates(longitude=-1.55, latitude=53.81) for i in range(3)}
|
||||
comparables_repo = EpcComparablePropertiesRepository(
|
||||
_FakeCohortEpcClient(_cohort_results()), _FakeGeospatialRepo(cohort_coords)
|
||||
)
|
||||
orchestrator = IngestionOrchestrator(
|
||||
unit_of_work=lambda: PostgresUnitOfWork(lambda: Session(db_engine)),
|
||||
epc_fetcher=_NoEpcFetcher(),
|
||||
geospatial_repo=_FakeGeospatialRepo({10000: Coordinates(longitude=-1.55, latitude=53.81)}),
|
||||
solar_fetcher=_NoSolarFetcher(),
|
||||
comparables_repo=comparables_repo,
|
||||
prediction_attributes_reader=_FakeAttributesReader(),
|
||||
epc_prediction=EpcPrediction(),
|
||||
)
|
||||
|
||||
# Act — run Ingestion: no lodged EPC found → predict from the cohort → persist.
|
||||
orchestrator.run([property_id])
|
||||
|
||||
# Assert — reloading the Property through the real repository, its Effective
|
||||
# EPC is the predicted picture, flagged by the "predicted" source path.
|
||||
with Session(db_engine) as session:
|
||||
epc_repo = EpcPostgresRepository(session)
|
||||
prop: Property = PropertyPostgresRepository(
|
||||
session, epc_repo, SpatialPostgresRepository(session)
|
||||
).get(property_id)
|
||||
|
||||
assert prop.epc is None # no lodged EPC
|
||||
assert prop.predicted_epc is not None # a predicted one was persisted
|
||||
assert prop.source_path == "predicted"
|
||||
assert prop.effective_epc is prop.predicted_epc
|
||||
assert prop.effective_epc.property_type == "0"
|
||||
1
tests/fixtures/epc_prediction/BD24JG/cert-01f1488000e8.json
vendored
Normal file
File diff suppressed because one or more lines are too long
1
tests/fixtures/epc_prediction/BD24JG/cert-15c0ce8ea563.json
vendored
Normal file
File diff suppressed because one or more lines are too long
1
tests/fixtures/epc_prediction/BD24JG/cert-1ef220911b4b.json
vendored
Normal file
File diff suppressed because one or more lines are too long
1
tests/fixtures/epc_prediction/BD24JG/cert-208c5dfbaee2.json
vendored
Normal file
File diff suppressed because one or more lines are too long
1
tests/fixtures/epc_prediction/BD24JG/cert-496b8c226d26.json
vendored
Normal file
File diff suppressed because one or more lines are too long
1
tests/fixtures/epc_prediction/BD24JG/cert-4cce4c3fb33b.json
vendored
Normal file
File diff suppressed because one or more lines are too long
1
tests/fixtures/epc_prediction/BD24JG/cert-5e427854bd6d.json
vendored
Normal file
File diff suppressed because one or more lines are too long
1
tests/fixtures/epc_prediction/BD24JG/cert-5f4d37dacdf8.json
vendored
Normal file
File diff suppressed because one or more lines are too long
1
tests/fixtures/epc_prediction/BD24JG/cert-6439eb9f1504.json
vendored
Normal file
File diff suppressed because one or more lines are too long
1
tests/fixtures/epc_prediction/BD24JG/cert-75aa70bdd22a.json
vendored
Normal file
File diff suppressed because one or more lines are too long
1
tests/fixtures/epc_prediction/BD24JG/cert-7e61d706db57.json
vendored
Normal file
File diff suppressed because one or more lines are too long
1
tests/fixtures/epc_prediction/BD24JG/cert-8d1b1e15063c.json
vendored
Normal file
File diff suppressed because one or more lines are too long
1
tests/fixtures/epc_prediction/BD24JG/cert-cc1c722822ba.json
vendored
Normal file
File diff suppressed because one or more lines are too long
1
tests/fixtures/epc_prediction/BD24JG/cert-d1396ff56fec.json
vendored
Normal file
File diff suppressed because one or more lines are too long
1
tests/fixtures/epc_prediction/BD24JG/cert-d7e3196e1a0c.json
vendored
Normal file
File diff suppressed because one or more lines are too long
1
tests/fixtures/epc_prediction/BD24JG/cert-dcb2c6ff3317.json
vendored
Normal file
File diff suppressed because one or more lines are too long
1
tests/fixtures/epc_prediction/BD24JG/cert-e648d6164f10.json
vendored
Normal file
File diff suppressed because one or more lines are too long
1
tests/fixtures/epc_prediction/BD24JG/cert-e8c34b2323e0.json
vendored
Normal file
File diff suppressed because one or more lines are too long
1
tests/fixtures/epc_prediction/BD24JG/cert-eb71d39605ae.json
vendored
Normal file
File diff suppressed because one or more lines are too long
1
tests/fixtures/epc_prediction/BD24JG/cert-eed1ed76757a.json
vendored
Normal file
File diff suppressed because one or more lines are too long
1
tests/fixtures/epc_prediction/BD24JG/cert-f089b44ae169.json
vendored
Normal file
File diff suppressed because one or more lines are too long
1
tests/fixtures/epc_prediction/BD24JG/cert-f326c2524ab3.json
vendored
Normal file
File diff suppressed because one or more lines are too long
1
tests/fixtures/epc_prediction/BD24JG/cert-f3aae3d2c3c9.json
vendored
Normal file
File diff suppressed because one or more lines are too long
1
tests/fixtures/epc_prediction/BD24JG/cert-f481cd1abc1f.json
vendored
Normal file
File diff suppressed because one or more lines are too long
1
tests/fixtures/epc_prediction/BD24JG/cert-f52356b57b37.json
vendored
Normal file
File diff suppressed because one or more lines are too long
1
tests/fixtures/epc_prediction/CF481ND/cert-0dd25677d889.json
vendored
Normal file
File diff suppressed because one or more lines are too long
1
tests/fixtures/epc_prediction/CF481ND/cert-3077aedcbe8b.json
vendored
Normal file
File diff suppressed because one or more lines are too long
1
tests/fixtures/epc_prediction/CF481ND/cert-5aad1cfe207c.json
vendored
Normal file
File diff suppressed because one or more lines are too long
1
tests/fixtures/epc_prediction/CF481ND/cert-5b3816460805.json
vendored
Normal file
File diff suppressed because one or more lines are too long
1
tests/fixtures/epc_prediction/CF481ND/cert-6030fde8e888.json
vendored
Normal file
File diff suppressed because one or more lines are too long
1
tests/fixtures/epc_prediction/CF481ND/cert-6e6d1776f8b7.json
vendored
Normal file
File diff suppressed because one or more lines are too long
1
tests/fixtures/epc_prediction/CF481ND/cert-7791c2c9073d.json
vendored
Normal file
File diff suppressed because one or more lines are too long
1
tests/fixtures/epc_prediction/CF481ND/cert-8a1b88d2a80a.json
vendored
Normal file
File diff suppressed because one or more lines are too long
1
tests/fixtures/epc_prediction/CF481ND/cert-96d09ac53f57.json
vendored
Normal file
File diff suppressed because one or more lines are too long
1
tests/fixtures/epc_prediction/CF481ND/cert-ca16b6a09f55.json
vendored
Normal file
File diff suppressed because one or more lines are too long
1
tests/fixtures/epc_prediction/CF481ND/cert-e54dae311758.json
vendored
Normal file
1
tests/fixtures/epc_prediction/CF481ND/cert-e54dae311758.json
vendored
Normal file
File diff suppressed because one or more lines are too long
1
tests/fixtures/epc_prediction/CF481ND/cert-f5de74d7fffc.json
vendored
Normal file
1
tests/fixtures/epc_prediction/CF481ND/cert-f5de74d7fffc.json
vendored
Normal file
File diff suppressed because one or more lines are too long
1
tests/fixtures/epc_prediction/CV15QJ/cert-1b4b3d26f79c.json
vendored
Normal file
1
tests/fixtures/epc_prediction/CV15QJ/cert-1b4b3d26f79c.json
vendored
Normal file
File diff suppressed because one or more lines are too long
1
tests/fixtures/epc_prediction/CV15QJ/cert-346dc8ab15a0.json
vendored
Normal file
1
tests/fixtures/epc_prediction/CV15QJ/cert-346dc8ab15a0.json
vendored
Normal file
File diff suppressed because one or more lines are too long
1
tests/fixtures/epc_prediction/CV15QJ/cert-526df35482d7.json
vendored
Normal file
1
tests/fixtures/epc_prediction/CV15QJ/cert-526df35482d7.json
vendored
Normal file
File diff suppressed because one or more lines are too long
1
tests/fixtures/epc_prediction/CV15QJ/cert-73d50930d0ac.json
vendored
Normal file
1
tests/fixtures/epc_prediction/CV15QJ/cert-73d50930d0ac.json
vendored
Normal file
File diff suppressed because one or more lines are too long
1
tests/fixtures/epc_prediction/CV15QJ/cert-8105f351163f.json
vendored
Normal file
1
tests/fixtures/epc_prediction/CV15QJ/cert-8105f351163f.json
vendored
Normal file
File diff suppressed because one or more lines are too long
1
tests/fixtures/epc_prediction/CV15QJ/cert-840682d5f191.json
vendored
Normal file
1
tests/fixtures/epc_prediction/CV15QJ/cert-840682d5f191.json
vendored
Normal file
File diff suppressed because one or more lines are too long
1
tests/fixtures/epc_prediction/CV15QJ/cert-f9dee3ea91ac.json
vendored
Normal file
1
tests/fixtures/epc_prediction/CV15QJ/cert-f9dee3ea91ac.json
vendored
Normal file
File diff suppressed because one or more lines are too long
1
tests/fixtures/epc_prediction/CV15QJ/cert-fc5fe3d2a055.json
vendored
Normal file
1
tests/fixtures/epc_prediction/CV15QJ/cert-fc5fe3d2a055.json
vendored
Normal file
File diff suppressed because one or more lines are too long
1
tests/fixtures/epc_prediction/CV78UG/cert-0246fdfa9718.json
vendored
Normal file
1
tests/fixtures/epc_prediction/CV78UG/cert-0246fdfa9718.json
vendored
Normal file
File diff suppressed because one or more lines are too long
1
tests/fixtures/epc_prediction/CV78UG/cert-2e6f5943059a.json
vendored
Normal file
1
tests/fixtures/epc_prediction/CV78UG/cert-2e6f5943059a.json
vendored
Normal file
File diff suppressed because one or more lines are too long
1
tests/fixtures/epc_prediction/CV78UG/cert-7d9beea6555e.json
vendored
Normal file
1
tests/fixtures/epc_prediction/CV78UG/cert-7d9beea6555e.json
vendored
Normal file
File diff suppressed because one or more lines are too long
1
tests/fixtures/epc_prediction/CV78UG/cert-89894e90fc9c.json
vendored
Normal file
1
tests/fixtures/epc_prediction/CV78UG/cert-89894e90fc9c.json
vendored
Normal file
File diff suppressed because one or more lines are too long
1
tests/fixtures/epc_prediction/CV78UG/cert-91dd248e55ee.json
vendored
Normal file
1
tests/fixtures/epc_prediction/CV78UG/cert-91dd248e55ee.json
vendored
Normal file
File diff suppressed because one or more lines are too long
1
tests/fixtures/epc_prediction/CV78UG/cert-924d78d64f06.json
vendored
Normal file
1
tests/fixtures/epc_prediction/CV78UG/cert-924d78d64f06.json
vendored
Normal file
File diff suppressed because one or more lines are too long
1
tests/fixtures/epc_prediction/CV78UG/cert-94454d5d782e.json
vendored
Normal file
1
tests/fixtures/epc_prediction/CV78UG/cert-94454d5d782e.json
vendored
Normal file
File diff suppressed because one or more lines are too long
1
tests/fixtures/epc_prediction/CV78UG/cert-9b4787ad7813.json
vendored
Normal file
1
tests/fixtures/epc_prediction/CV78UG/cert-9b4787ad7813.json
vendored
Normal file
File diff suppressed because one or more lines are too long
1
tests/fixtures/epc_prediction/CV78UG/cert-9c0b5437b98b.json
vendored
Normal file
1
tests/fixtures/epc_prediction/CV78UG/cert-9c0b5437b98b.json
vendored
Normal file
File diff suppressed because one or more lines are too long
1
tests/fixtures/epc_prediction/CV78UG/cert-aef738e4b1c0.json
vendored
Normal file
1
tests/fixtures/epc_prediction/CV78UG/cert-aef738e4b1c0.json
vendored
Normal file
File diff suppressed because one or more lines are too long
1
tests/fixtures/epc_prediction/CV78UG/cert-ba51394914cf.json
vendored
Normal file
1
tests/fixtures/epc_prediction/CV78UG/cert-ba51394914cf.json
vendored
Normal file
File diff suppressed because one or more lines are too long
1
tests/fixtures/epc_prediction/CV78UG/cert-cd99c8b93a27.json
vendored
Normal file
1
tests/fixtures/epc_prediction/CV78UG/cert-cd99c8b93a27.json
vendored
Normal file
File diff suppressed because one or more lines are too long
Some files were not shown because too many files have changed in this diff.