# ARA Backend Redesign — Design PRD **Status**: Draft for team review **Author**: Khalim Conn-Kowlessar (with Claude grill session) **Branch**: `ara-backend-design-prd` **Scope**: Service architecture + domain model + contracts for the new modelling backend. Linked sub-PRDs cover ML training pipeline, DB schema migration, and historical EPC re-mapping. --- ## 1. Context ### 1.1 The forcing function The current modelling backend (`backend/engine/engine.py` — `model_engine`, 1331 LOC) was built as an MVP. It is: - **Tightly coupled** to a specific gov EPC API that is being **decommissioned on 30 May 2026** (~17 days from today). - **A monolith** — one async function reaches into DB modules, HTTP clients, ML lambdas, S3, and queue infrastructure directly. - **Bottlenecked on a single person** — Khalim is the only contributor able to safely modify the engine because no one else can predict the blast radius of a change. - **Already returning erroneous data** from the old API (clients are aware). The replacement API is partially built (`backend/epc_client/epc_client_service.py`) on the current feature branch. ### 1.2 What needs to change Beyond just swapping API clients, this is the moment to **rebuild the backend into a production-grade, contribute-able codebase**, with: - A clear domain model rooted in the new EPC schema (`EpcPropertyData`). - Service boundaries that other team members can read, fix, and extend without needing the entire mental model. - Repository-mediated persistence so business logic can be tested without spinning up a database. - A separation between **data fetching** (slow, IO-heavy, external) and **modelling** (deterministic, fast, internal). - Baseline kWh and bills derived deterministically from the Effective EPC (SAP physics + UCL correction + per-fuel rates from a refreshable repo) rather than from the EPC's recorded cost fields (which use fuel rates pinned to the inspection date) or from an ML kWh prediction. ### 1.3 Out of scope for this PRD These ship as **linked sub-PRDs**: - **Sub-PRD (ii) — ML training pipeline** (autogluon repo + parquet generation in this repo + scoring model retraining for the new EPC schema) - **Sub-PRD (iii) — DB schema migration** (new tables: `site_notes`, `landlord_overrides`, EPC cache, parallel write strategy) - **Sub-PRD (iv) — Historical EPC re-mapping** (one-off + ongoing batch job: legacy stored EPCs → new `EpcPropertyData` shape) The contracts this PRD defines are the inputs each sub-PRD consumes. --- ## 2. Goals and non-goals ### 2.1 Goals 1. **Survive the 30 May API shutdown** — even if it means a brief degraded window, modelling continues to function against the new gov EPC API. 2. **Decouple data fetching from modelling** — modelling never makes external HTTP calls; it reads everything from repositories. 3. **Make every service unit-testable against fakes** — no test needs a real DB, a real gov API, or a real ML lambda to verify business logic. 4. **Establish a single `Property` aggregate root** as the domain centrepiece; all 9 modelling concerns are slices of one aggregate. 5. **Versioned ML data contract** — the EPC-to-features transform is the single shared artifact between this repo and the autogluon repo. 6. **Per-property UI surfaces** — fetched data can be shown to users for review and override **before** modelling runs; modelling is triggered separately. This will enable a landlord facing version of the product where we fetch the open data, present back to the user for review and then perform the modelling. ### 2.2 Non-goals - Multi-region deploy, GDPR-class data minimisation work, or compliance reporting — separate workstreams. - Replacement of the front-end. The new APIs preserve enough of the existing response shape that the FE migrates incrementally. - Removing pandas. The ML transform output is a parquet-friendly DataFrame-like shape; that stays. - A workflow engine (Prefect / Temporal / Airflow). Coordinator-class orchestration plus the existing SQS-fanout pattern is sufficient at the scale we serve. --- ## 3. Cutover plan Forced cut-over, driven by the 30 May deadline. There is no strangler period because the Old EPC API death takes `model_engine` with it. ### 3.1 Phase 0 — Status quo (now → 30 May) - `model_engine` keeps running against the Old EPC API for as long as it works. - Build of the 9 new services starts **this week**, in parallel to the old engine continuing to serve traffic. - The new `ara/` package lives alongside `backend/` but is not yet wired into any production endpoint. - Goal: keep the lights on until the API dies; start the build immediately so the dark period is short. ### 3.2 Phase 1 — Forced cut-over (30 May onwards) - On 30 May the Old EPC API dies; `model_engine` ceases to function for any new modelling run. - Some downtime is expected and accepted. Clients are aware. - Modelling resumes when the new pipeline is ready end-to-end. Remains to be decided if we have a per-portfolio flag, purely for the front end to reference old tables where necessary. No parallel pipelines, no traffic split — the new pipeline is the only pipeline. - **Calico** and **Hyde** are the first live clients onto the new pipeline in June. - `model_engine`, `SearchEpc`, the legacy `Property`, and surrounding modules in `backend/` are deleted once the new pipeline is serving all traffic. ### 3.3 What is *not* done - No strangler — there is nothing to strangle once the Old EPC API dies on 30 May. - No parallel-shadow run — would double compute and require diff tooling we don't have, while the old engine is already known to return bad data so diffs would be noise. - TBC per-portfolio feature flag. Without this, the cut-over is all-or-nothing. All old portfolios are broken. --- ## 4. Architecture overview ``` ┌─────────────────────────────────────────────────────────────────────┐ │ Trigger endpoint(s) │ │ (one or two — see §4.5; deferred decision) │ └───────────┬──────────────────────────────────────────┬──────────────┘ │ │ ▼ ▼ ┌─────────────────┐ ┌─────────────────┐ │ IngestionPipe │ SQS, batches of N │ ModellingPipe │ │ ----------- │ ◄─────────────────────│ ----------- │ │ Fetchers run │ │ Reads via Repos │ │ Persist via │ │ Calls Services │ │ Repos │ │ ML predictions │ └────────┬────────┘ └────────┬────────┘ │ │ └───────────────► Repos ◄─────────────────┘ │ ▼ ┌──────────────────┐ │ Postgres tables │ │ (property, │ │ epc_cache, │ │ site_notes, │ │ landlord_ │ │ overrides, │ │ plans, etc.) │ └──────────────────┘ ┌──────────────────────────┐ │ RefreshOrchestrator │ triggers Ingestion → diff → conditionally Modelling └──────────────────────────┘ ``` ### 4.1 Class taxonomy Every class falls into exactly one of four roles: | Role | Job | Examples | |------|-----|----------| | **Fetchers** | Call external APIs. Return raw response data. No DB. | `EpcClientService`, `GeospatialFetcher`, `SolarFetcher`, `SiteNotesIngester` | | **Repos** | Persist and load domain aggregates. SQL hidden inside. No external IO. | `PropertyRepo`, `EpcCacheRepo`, `SiteNotesRepo`, `LandlordOverridesRepo`, `RecommendationsRepo`, `GenericDataRepo`, `SubtaskRepo` | | **Services** | Business logic over domain objects. No external IO except via injected Fetchers / Repos. | `EpcRemappingService`, `EpcPredictionService`, `EpcEnergyDerivationService`, `KwhImpactService`, `ImpactPredictionService`, `RecommendationService`, `OptimiserService`, `FeatureBuilder`, `ResultsPersister` | | **Orchestrators** | Compose Fetchers + Services + Repos to produce an end-to-end result. The only place where step order is encoded. | `IngestionPipeline`, `ModellingPipeline`, `RefreshOrchestrator` | This taxonomy is **strict**. A class that fetches *and* persists belongs in the Service layer and depends on a Fetcher + a Repo. No back-channels. ### 4.2 Two pipelines, one direction Data flows one way only: **Ingestion → Repos → Modelling**. - **Ingestion** writes; never calls Modelling. - **Modelling** reads; never calls Fetchers. If Modelling needs fresh data, it returns "stale" and the caller decides whether to ingest first. This makes Modelling a pure function of repository state, which is the property that makes it reproducible, debuggable, and testable. ### 4.3 RefreshOrchestrator Sits above both pipelines. Job: 1. Trigger `IngestionPipeline` for a portfolio. 2. After ingestion completes, ask repos: "did anything change vs the last modelled snapshot?" 3. If yes, trigger `ModellingPipeline`. If no, return early. This avoids re-modelling 100k properties when only 200 had refreshed EPC data. ### 4.4 SQS fanout (preserved from current architecture) The existing `trigger_plan_entrypoint` SQS-chunking pattern is kept. Both pipelines fan out per batch of ~30–100 properties (tuneable). Each consumer runs one batch end-to-end through the relevant pipeline. UPRN partitioning: the trigger endpoint groups UPRNs by **locality** (postcode prefix / UPRN range) before chunking, so each batch maximises shared upstream fetches (one geospatial-range pull serves all 30 properties in the batch). ### 4.5 One endpoint for v1 For Phase 1 we ship **one trigger endpoint** that internally chains Ingestion → Modelling via `RefreshOrchestrator`. This matches the current FastAPI-fronted Lambda pattern (the FastAPI app in `services//` is a thin entrypoint that invokes the modelling Lambda). We can split into two endpoints later (refresh-only vs model-only) once a real workflow demands it — e.g. a Landlord-Override edit that should re-model without re-fetching open data. The class taxonomy and `RefreshOrchestrator` boundary allow this split without re-architecting. ### 4.6 Trigger contract The trigger payload is reduced compared to today's `PlanTriggerRequest` ([backend/app/plan/schemas.py:98](../../backend/app/plan/schemas.py#L98)) — most of what's currently in the request body moves into the persisted `Scenario` aggregate. ```python class ModelTriggerRequest(BaseModel): portfolio_id: UUID property_ids: list[UUID] | S3Ref # inline up to ~10k, S3 ref above scenario_ids: list[UUID] # 1+; resolved + pinned to ScenarioSnapshot at fan-out task_id: UUID subtask_id: UUID # SQS state machine, preserved from today ``` Everything that used to ride at the top level dies or moves: - `goal`, `budget`, `goal_value`, `inclusions`, `exclusions`, `required_measures`, `enforce_fabric_first`, `scenario_name`, `housing_type` → into `Scenario` / `ScenarioPhase`. - `patches_file_path`, `already_installed_file_path`, `non_invasive_recommendations_file_path` → gone; Landlord Overrides covers all three. - `valuation_file_path` → gone; `ValuationService` derives it. - `ashp_cop`, `default_u_values` → `HeatingSystemAssumptionsRepo` / global config; not per-trigger. - `multi_plan` → gone; `scenario_ids: list[...]` handles N runs natively (one Plan per scenario per property). - `event_type`, `epc_certificate_number`, `lmk_key`, `file_format`, `sheet_name`, `index_start`/`index_end`, `file_type` → ingestion-side concerns; if needed, ride on a separate ingestion-trigger payload. **Scenario snapshotting**: at fan-out time `RefreshOrchestrator` reads each requested `Scenario`, writes a `ScenarioSnapshot` keyed by `(task_id, scenario_id)`, and per-batch SQS messages reference the snapshot. Mid-run edits to the live `Scenario` do not affect an in-flight modelling job. Snapshots are read-only and can be garbage-collected after the task completes. --- ## 5. Domain model ### 5.1 Aggregate root: `Property` `Property` is the centrepiece. Every service operates on one or more `Property` instances. Every repo writes one slice of `Property`. The aggregate carries all state for a single property's modelling run. ```python @dataclass class PropertyIdentity: portfolio_id: UUID uprn: Optional[int] landlord_property_id: Optional[str] address: AddressLines postcode: str @dataclass class Property: identity: PropertyIdentity # --- Source data — modelling path is determined by which of these are set --- epc: Optional[EpcPropertyData] # from gov API (or remapped historical) site_notes: Optional[SiteNotes] # our own survey; supersedes EPC when present landlord_overrides: Optional[LandlordOverrides] # sparse, only meaningful when epc set # --- Enrichments --- geospatial: Optional[GeoSpatial] solar: Optional[SolarPotential] epc_anomaly_flags: Optional[EpcAnomalyFlags] # from EpcPredictionService vs neighbours # --- Modelling outputs --- baseline_performance: Optional[BaselinePerformance] # carries lodged + effective pair; see §5.4 recommendations: list[Recommendation] impact_predictions: Optional[ImpactPredictions] plans: list[Plan] # one per Scenario the property was modelled against # --- Derived --- @property def source_path(self) -> Literal["site_notes", "epc_with_overlay"]: ... @property def effective_epc(self) -> EpcPropertyData: """The EPC the modelling pipeline actually scores against.""" ... ``` ### 5.2 `Properties` collection A first-class iterable, so batch operations are obvious: ```python @dataclass class Properties: items: list[Property] def __iter__(self) -> Iterator[Property]: ... def __len__(self) -> int: ... def filter(self, pred: Callable[[Property], bool]) -> "Properties": ... def map(self, fn: Callable[[Property], Property]) -> "Properties": ... def with_landlord_overrides(self) -> "Properties": ... ``` Services typically take and return `Properties`, not lists. ### 5.3 Other aggregates | Aggregate | Owns | Repo | |---|---|---| | `Property` | property identity, epc, site_notes, landlord_overrides, enrichments, modelling results | `PropertyRepo` | | `Plan` | per-property modelling output for one Scenario: ordered `phases: list[PlanPhase]`, each carrying its `OptimisedPackage`, ending state snapshot, and rolled-over options | `RecommendationsRepo` | | `Scenario` | portfolio-wide scenario metadata (goal, budget, exclusions, housing type) plus ordered `phases: list[ScenarioPhase]`; each phase carries `measure_types_allowed`, phase budget, phase target | `RecommendationsRepo` | | `ScenarioSnapshot` | frozen copy of a `Scenario` pinned at trigger time, keyed by `(task_id, scenario_id)`, so mid-run scenario edits don't affect an in-flight modelling job | `RecommendationsRepo` | | `Subtask` / `Task` | SQS fanout state | `SubtaskRepo` | | `EpcCache` | gov-API responses keyed by UPRN, with freshness/TTL | `EpcCacheRepo` | | `GenericData` | UPRN-range geospatial, postcode lookups, shared static data | `GenericDataRepo` | | `FuelRates` | time-versioned, region-aware per-fuel rates (pence/kWh), standing charges, SEG export rate, calorific values | `FuelRatesRepo` | | `CarbonFactors` | time-versioned per-fuel CO2 emission factors (kgCO2e/kWh); Defra publishes annually | `CarbonFactorsRepo` | | `HeatingSystemAssumptions` | boiler efficiency tables, ASHP/GSHP COPs, solar-thermal coverage proportion; per-property physical assumptions, not fuel-market data | `HeatingSystemAssumptionsRepo` | Aggregates are loaded **whole** — never half a `Property`. If a slice is too large to load eagerly (e.g. recommendation history), it lives in a separate aggregate. A single-phase Scenario is `phases: []` with all measure types allowed and the full budget on it — no special-case path through the pipeline. ### 5.4 `BaselinePerformance` carries lodged + effective ```python @dataclass class BaselinePerformance: # As-lodged: unmodified EPC fields (or Site Notes' recorded values where Site Notes are the source). lodged_sap: int lodged_band: Epc lodged_carbon: float lodged_heat_demand: float # Effective: what the modelling pipeline actually scored against. # Equals lodged when neither rebaselining trigger fires; equals ML output when rebaselined. effective_sap: int effective_band: Epc effective_carbon: float effective_heat_demand: float # kWh / fuel split / bills — always derived deterministically from the Effective EPC by # EpcEnergyDerivationService (SAP physics + UCL correction + FuelRates lookup). # Lodged kWh / bills are not stored separately — the EPC's recorded cost fields are pinned to # inspection-date fuel rates, so we always re-derive bills from current FuelRates regardless. annual_kwh: float fuel_split: dict[Fuel, float] annual_bills: dict[Fuel, float] rebaselined: bool rebaseline_reason: Optional[Literal["pre_sap10", "physical_state_changed", "both"]] ``` The pair lets the FE show "lodged rating vs SAP10-equivalent rebaselined rating" side by side without a separate query. Both fields are always populated; when no rebaselining trigger fires, `effective_*` equals `lodged_*`. --- ## 6. Source-of-truth and overlay precedence There are exactly **two modelling paths**. The `Property.source_path` property selects. ### 6.1 Path 1 — Site notes If a `Property` has `site_notes` and they are newer than any available EPC (or no EPC exists), site notes are the **complete** source of truth: - `effective_epc` = `site_notes.to_epc_property_data()`. - EPC fields not covered by site notes — **none expected**. Site notes are committed to being a full-coverage survey. Treat any gap as a survey-quality bug, not a fallback signal. - `LandlordOverrides` are not applicable in Path 1 (the survey supersedes). ### 6.2 Path 2 — EPC with landlord overlay If a `Property` has no site notes (or the EPC is newer): - `effective_epc` = `epc` with `landlord_overrides` applied as a sparse field-level overlay (`landlord > epc`). - `LandlordOverrides` are sparse: each row represents one corrected field. Schema TBD at implementation time; assume flat input via Excel/CSV for v1, with a flag to revisit shape after first customer onboarding. ### 6.3 Recency tie-break When a property has **both** site notes and a public EPC, the newer of the two wins. Rationale: a recent EPC may reflect retrofit work done after our survey; conversely a recent survey reflects on-site observations the EPC cannot capture. This tie-break is implemented in `Property.source_path` and may be tuned later (e.g. always prefer surveys regardless of date, or per-portfolio policy). ### 6.4 Rebaselining trigger ML re-predicts SAP / carbon / heat when **either** of these holds: 1. **Pre-SAP10 schema** — `effective_epc.sap_version < 10.0`. The EPC was rated under SAP 2012 (or earlier) and we want a SAP10-equivalent baseline so all properties are scored against the same model version. Canonical signal is the `sap_version: float` field; fall back to `schema_type` string, then to `lodgement_date` if both are absent. Site Notes are assumed SAP10 by construction (PasHub / ECMK produce them now) — Path 1 typically doesn't trigger this leg. 2. **Physical state changed** — `effective_epc` differs from the lodged EPC's physical fields (walls / heating / windows / etc.). Triggered by Landlord Overrides changing physical state, or by Site Notes that contradict the lodged EPC. When triggered, a single ML call re-predicts SAP/carbon/heat with the current Effective EPC state as input. Both reasons can fire together; the prediction is still one call. kWh is **always** re-derived via `EpcEnergyDerivationService` — even when no ML rebaseline runs — because the EPC's recorded cost fields use fuel rates pinned to the inspection date, and current rates from `FuelRatesRepo` are what we want to surface to users. The diff mechanism for "physical state changed" (content hash, dirty flag, etc.) is an implementation detail; start with a content hash of the physical-state subset of `EpcPropertyData` stored alongside the previous run. ### 6.5 Deprecated concepts - **Patches** (`patch_epc`) — removed. Functionality subsumed by `LandlordOverrides`. - **Already-installed measures** — likely subsumed by `LandlordOverrides` ("we have a heat pump now" → override heating fields). Confirmed at implementation time. - **Non-invasive recommendations** — TBD whether this concept survives; not blocking. --- ## 7. Persistence: repositories and unit of work ### 7.1 What a repository is A repository owns the SQL for one aggregate. Nothing else writes SQL for that aggregate. Callers see only domain objects. ```python class PropertyRepo(Protocol): def get(self, identity: PropertyIdentity) -> Optional[Property]: ... def bulk_save(self, uow: UnitOfWork, properties: Properties) -> None: ... def find_by_portfolio(self, portfolio_id: UUID) -> Properties: ... def find_stale(self, portfolio_id: UUID, threshold: timedelta) -> Properties: ... ``` Implementation references current `db_funcs.*` modules during phase 0 to avoid a big-bang SQL rewrite, but the interface is fixed. ### 7.2 Unit of Work Multi-table writes inside a single aggregate, or across aggregates that share a transaction (e.g. property + plan + recommendations) go through a `UnitOfWork`: ```python with self.uow_factory() as uow: self.property_repo.bulk_save(uow, properties) self.recommendations_repo.bulk_save(uow, plans) uow.commit() ``` UoW owns the SQLAlchemy session lifecycle. Repos use the session passed in via the UoW. Outside a UoW, repos use a short-lived read session. ### 7.3 Repository inventory | Repo | Tables it owns | |------|----------------| | `PropertyRepo` | `properties`, `property_details_epc`, `property_spatial` | | `EpcCacheRepo` | new table: `epc_api_cache` (TTL, raw API response, mapped `EpcPropertyData`) | | `SiteNotesRepo` | new table: `site_notes` (replaces current `energy_assessments`) | | `LandlordOverridesRepo` | new table: `landlord_overrides` (sparse, per-field rows for audit) | | `RecommendationsRepo` | `plans`, `plan_phases`, `recommendations`, `recommendation_parts`, `scenarios`, `scenario_phases`, `scenario_snapshots` | | `GenericDataRepo` | new table or S3-backed: UPRN-range geospatial + postcode-keyed shared static data | | `FuelRatesRepo` | new table: `fuel_rates` — `(fuel_type, rate_pence_per_kwh, standing_charge_pence_per_day, calorific_value_kwh_per_unit, unit, effective_from, effective_to, region_code Optional, source)`. SEG export rate is a row with `fuel_type = 'electricity_export'`. | | `CarbonFactorsRepo` | new table: `carbon_factors` — `(fuel_type, kgco2e_per_kwh, effective_from, effective_to, source)`. Defra publishes annually. | | `HeatingSystemAssumptionsRepo` | new table(s): boiler efficiency, ASHP/GSHP COP, solar-thermal coverage proportion. Static-ish, manual refresh. | | `SubtaskRepo` | `tasks`, `subtasks` (existing) | DDL migrations are scoped to sub-PRD (iii). ### 7.4 Fakes For tests, each repo has a `FakeXRepo` companion backed by a dict. Service unit tests inject fakes. No DB required. --- ## 8. ML contract ### 8.1 Where ML lives | Concern | Owner | |---|---| | Defining the EPC → features transform | **This repo** (`ara.domain.ml.EpcMlTransform`) | | Loading data, applying transform, writing training parquet to S3 | **This repo** (sub-PRD (ii) batch job) | | Training, hyperparameter search, deployment | **Autogluon repo** | | Scoring at modelling time | **This repo** (`FeatureBuilder` calls `EpcMlTransform`, sends DataFrame to deployed lambda) | The autogluon repo is intentionally **dumb**: it consumes parquet, knows which column is the target, knows which columns to ignore. It has no EPC semantics. ### 8.2 `EpcMlTransform` A separate class (not a method on `EpcPropertyData`), because: - The data class stays clean of training-infrastructure concerns. - Versioned transforms (`EpcMlTransformV1`, `EpcMlTransformV2`) swap easily. - Future need: injection of normalisation stats from the training set is straightforward on a class, awkward on a dataclass. ```python class EpcMlTransform: VERSION: str = "1.0.0" # semver def to_row(self, epc: EpcPropertyData) -> dict[str, Any]: ... def to_rows(self, properties: Properties) -> pd.DataFrame: ... def schema(self) -> dict[str, type]: ... # for parquet emission + validation ``` The interesting work — flattening `List[SapWindow]`, `List[SapBuildingPart]` into fixed-width columns — lives inside this class. Domain decisions (top-N windows, aggregate roofs, etc.) are encoded here and reviewed by Khalim. Sub-PRD (ii) goes into detail. ### 8.3 Versioning - Transform class is **semver-tagged** (`VERSION = "1.0.0"`). - S3 path for training parquet includes the version: `s3://.../training/v1.0.0/...`. - Deployed scoring lambda is tagged with the transform version it was trained against. - Modelling pipeline asserts at startup that its `EpcMlTransform.VERSION` matches the deployed lambda's tag; mismatch = hard fail at deploy time. Bump major when removing or renaming columns. Bump minor when adding optional columns (older models still scoreable; new models can be trained against new fields). ### 8.4 ML model families Both ML calls (rebaselining + per-measure impact) use the same `EpcMlTransform`: | Service | Lambda | Target | |---|---|---| | `RebaseliningService` (S4b) | `baseline-models-*` | SAP / carbon / heat demand under the current Effective EPC state (SAP10-equivalent) | | `ImpactPredictionService` (S6) | `impact-models-*` | SAP / carbon / heat demand impact per measure (and per battery option, using new EPC battery fields) | Annual kWh and bills are never an ML target — derived deterministically by `EpcEnergyDerivationService` (S4a). Recommendation kWh delta is derived from the SAP delta predicted by S6 plus heating-system fuel + COP, not via a separate ML call. The two families are trained against the same input feature schema; only target columns differ. Sub-PRD (ii) handles training-time details. --- ## 9. Service catalogue The classes below implement the pipeline end-to-end. Detailed signatures are deliberately left for implementers — this PRD documents purpose, dependencies, and rough shape; per-service grill sessions produce the contracts. **Out of the legacy engine** (deleted, not migrated): `PredictionMatrix` (debug-only, moves to test fixtures), `extract_portfolio_aggregation_data` (dead code, FE aggregates dynamically per §10), inspections plumbing (`inspections_map` is initialised but never populated in the current engine), patches / `already_installed` / `non_invasive_recommendations` (subsumed by Landlord Overrides), ECO4 / WHLG funding integration (`get_funding_data` and `optimise_with_scenarios`' funding paths), the pre-recommendation kWh ML lambda (`KWH_MODEL_PREFIXES`), and floor-count / heat-loss-perimeter estimation from geospatial (now on `EpcPropertyData`). Address matching (`address2UPRN`) lives as a separate service, not inside `EpcClientService`. ### 9.1 Fetchers (called by `IngestionPipeline`) | # | Class | Purpose | Dependencies | |---|---|---|---| | F1 | `EpcClientService` | Fetches EPCs from new gov API. Already exists at `backend/epc_client/`. Scope narrows compared to current `SearchEpc` — address matching (`address2uprn`) and OS API estimation are not its concern. | httpx | | F2 | `GeospatialFetcher` | Fetches UPRN-range geospatial data. Replaces `OpenUprnClient`. **Floor count and heat-loss perimeter estimation are no longer needed** — both are now on `EpcPropertyData` directly (`number_of_storeys`, `SapFloorDimension.heat_loss_perimeter_m`). Scope reduces to building geometry and postcode-area context. | S3 / Ordnance Survey API | | F3 | `SolarFetcher` | Wraps Google Solar API; building-level + unit-level scenes. | Google Solar API | | F4 | `SiteNotesIngester` | Loads site notes from Excel uploads / structured input. Persists via `SiteNotesRepo`. | S3, repo | | F5 | `FuelRatesFetcher` | Scheduled ETL — scrapes Ofgem regional caps and per-fuel rates, writes timeseries rows to `FuelRatesRepo`. Manual CSV upload fallback for off-cycle corrections. | Ofgem feed, repo | | F6 | `CarbonFactorsFetcher` | Same shape as F5 against Defra's annual CO2 factor publication. | Defra feed, repo | ### 9.2 Domain services (called by `ModellingPipeline`) | # | Class | Original-list # | Purpose | Reads | Writes | |---|---|---|---|---|---| | S1 | `EpcRemappingService` | 4 | Re-map legacy / historical EPCs into new `EpcPropertyData` shape. | `EpcCacheRepo` | `EpcCacheRepo` (mapped column) | | S2 | `EpcPredictionService` | 3 | For every property: produce predicted EPC + per-field anomaly flags vs neighbours. Used both for gap-fill (Path 2 if EPC missing) and UI surfacing. | `EpcCacheRepo`, `GenericDataRepo` | — | | S3 | `FeatureBuilder` | (new) | Wraps `EpcMlTransform`. Converts `Properties` → scoring DataFrame. | — | — | | S4a | `EpcEnergyDerivationService` | (new) | Derives annual kWh + fuel split + bills from the Effective EPC. Deterministic, no ML. Pipeline: (1) source regulated PEUI — either from `energy_consumption_current × floor_area` when EPC field present and no physical override, or from SAP physics (heat demand × area + SAP hot-water + SAP lighting) for Site Notes / overridden cases; (2) add appliance + cooking via SAP Appendix L formulas (port of [`AnnualBillSavings.estimate_appliances_energy_use`](../../backend/ml_models/AnnualBillSavings.py)); (3) apply UCL per-band correction (Few et al. 2023, Table 3), keyed on the **post-state Effective EPC's band** — not the lodged band; (4) decompose total PEUI into end-use shares via SAP-physics proportions; (5) primary→delivered per fuel using SAP primary factors; (6) bills = delivered kWh per fuel × current rate from `FuelRatesRepo` + standing charges + SEG credits. CO2 emissions from `CarbonFactorsRepo`. | `FuelRatesRepo`, `CarbonFactorsRepo`, `HeatingSystemAssumptionsRepo` | — | | S4b | `RebaseliningService` | (new, partial overlap with old "rebaselining" logic) | Triggered by §6.4 conditions (pre-SAP10 schema **or** physical state changed). Calls SAP/carbon/heat ML lambdas to produce SAP10-equivalent baseline against the current Effective EPC state. Both `BaselinePerformance.lodged_*` and `effective_*` are populated downstream — pair is always stored, equal when not rebaselined. kWh is re-derived via S4a, not ML. | `FeatureBuilder` | — | | S5 | `RecommendationService` | 6 | Generates per-property recommendations against the current rolling Effective EPC. Invoked **once per (scenario × phase)** — filters candidates to the phase's `measure_types_allowed`, returns candidates eligible against the post-prior-phase state. Replaces current `Recommendations` (1383 LOC). | `MaterialsRepo` | — | | S6 | `ImpactPredictionService` | 7 | Calls SAP / carbon / heat impact ML lambda for **every** candidate recommendation (FE displays all options to user). Invoked per (scenario × phase) with the rolling state's feature vector. Recommendation kWh delta is derived deterministically from SAP delta + heating-system fuel/COP, not from a separate ML call. Battery impact uses the new EPC battery fields (`energy_pv_battery_count`, `energy_pv_battery_capacity`) as ML inputs — the deterministic `BatterySAPScorer` from the legacy engine is replaced by ML prediction. | `FeatureBuilder` | — | | S7 | `OptimiserService` | 8 | Per-phase optimisation against rolling state. Reads `PlanPhase.state_at_end[n-1]` to honour cross-phase constraints (fabric-first, heat-pump-needs-insulation, ventilation). Wraps current `CostOptimiser` / `GainOptimiser` / `optimise_with_scenarios` minus the dead ECO-funding paths. Unselected candidates roll into phase n+1's candidate pool (auto vs user-marked TBD, §15). | — | — | | S8 | `ValuationService` | — | Estimates per-property valuation (current + post-retrofit) from academic-paper-based regression on EPC change, property type, region. Improvement on the existing `PropertyValuation.estimate` code — exact shape deferred to per-service grill. | — | — | | S9 | `ResultsPersister` | 9 | Final step: writes Plan (with `phases[]`) + Recommendations + Property updates via repos under one UoW, per scenario. | — | All write repos | ### 9.3 Orchestrators | # | Class | Purpose | |---|---|---| | O1 | `IngestionPipeline` | Per-batch SQS consumer. Calls F1–F4, persists via repos. | | O2 | `ModellingPipeline` | Per-batch SQS consumer. Reads from repos, runs S1→S8 in order, ends with persistence. | | O3 | `RefreshOrchestrator` | Top-level: triggers Ingestion → diff → optionally Modelling. | ### 9.4 `ModellingPipeline` step order For each `Property` in the batch, against each pinned `ScenarioSnapshot` from the trigger payload: ``` Per-property setup (runs once regardless of scenario count): 1. PropertyRepo.get() → Property (epc, site_notes, overrides, geospatial, solar) 2. EpcRemappingService — if epc is in legacy schema, upgrade to current 3. EpcPredictionService — predicted EPC + per-field anomaly flags (always runs) 4. Compute Property.effective_epc (path-1 or path-2) 5. RebaseliningService — IF §6.4 conditions hold (pre-SAP10 OR physical state changed), re-predict SAP/carbon/heat via ML against the Effective EPC state. Populate BaselinePerformance.lodged_* + effective_*. 6. EpcEnergyDerivationService — SAP-physics + UCL (post-state band) + FuelRates → kWh, fuel split, bills. Per-scenario loop: Per-phase loop (in scenario phase order): 7. RecommendationService — generate candidate measures, restricted to phase's measure_types_allowed, against the rolling Effective EPC state (baseline for phase 1; updated for phase 2+). 8. ImpactPredictionService — predict SAP/carbon/heat impact for those candidates, ML scored against the rolling state's feature vector. All candidates scored (FE shows options). 9. OptimiserService — select package within phase budget + phase goal. Reads earlier-phase state to honour cross-phase constraints (fabric-first, heat-pump-needs-insulation, ventilation). 10. Apply package → roll state forward (simulate post-package SAP / kWh / bills via S4a + impact predictions from step 8). Record `PlanPhase.state_at_end`. Unselected options become `PlanPhase.rolled_over_options` and are eligible candidates next phase. 11. ResultsPersister — write Plan (phases[]) + Recommendations under one UoW for this scenario. ``` Steps 1–6 run **once per property** regardless of scenario count. Steps 7–10 run **once per (scenario × phase)** per property. Step 11 runs once per scenario per property. Batching: steps 5, 8 batch the whole batch into one ML call where possible. Step 8's cost scales with `N_phases × N_scenarios × N_candidate_measures`; multi-phase pays its own ML bill, single-phase scenarios cost the same as today. Note vs the current `model_engine`: the **pre-recommendation** kWh ML call has been removed. Baseline kWh now comes from `EpcEnergyDerivationService` (SAP physics + UCL + FuelRates). ML is reserved for SAP/carbon/heat (rebaselining + impact prediction). Recommendation-level kWh delta is derived deterministically from the impact-predicted SAP delta plus heating-system fuel + COP from `HeatingSystemAssumptionsRepo`; no separate kWh ML lambda. **Open future change** (flagged §15): SAP-impact-of-a-measure is not strictly additive — installing measure A changes the SAP impact of measure B. The current per-measure ML scoring + linear optimisation approximates this. A future iteration may pre-define candidate packages and ML-score whole packages, accepting the combinatorial cost in return for accuracy. Defer until implementation reveals where the approximation hurts. ### 9.5 Per-service contracts — deferred Method signatures, return types, error semantics, and edge-case behaviour are **explicitly out of scope** for this PRD. The implementer of each service runs a `/grill-me` session against this document and produces a detailed sub-design before coding. --- ## 10. Cross-batch concerns | Concern | Status | Approach | |---|---|---| | Building-level solar adjustment | Deferred — future TODO, not implemented today. | The current `building_ids` block in `model_engine` is dead-ish; it operates on the in-process batch only. New design preserves that limitation. Future feature: a post-modelling consolidation pass that groups results by `building_id` across batches and re-optimises. | | Portfolio aggregation | Dropped. | Front-end computes aggregations dynamically from per-property plans. `extract_portfolio_aggregation_data` in current engine is dead code (defined, never called) — deleting. | | Shared upstream data | Handled by orchestrator partitioning + `GenericDataRepo`. | Trigger endpoint groups UPRNs by postcode / UPRN-range before SQS chunking so each batch maximises intra-batch sharing. `GenericDataRepo` caches across batches so first batch pays, subsequent batches hit cache. | --- ## 11. Repository layout — monorepo via uv workspaces The repo is restructured as a Python monorepo using **uv workspaces**. Shared types and shared infra live as workspace packages under `packages/`; each deployable Lambda or microservice lives as its own package under `services/`. Each `services//` has its own `pyproject.toml`, `Dockerfile`, and Lambda image — the bundle contains only that service's deps + its workspace deps, keeping cold-start size and package weight contained. ``` / ├── pyproject.toml # workspace root ├── uv.lock │ ├── packages/ # shared workspace packages — imported by services/ │ ├── domain/ # "domna-domain" │ │ ├── pyproject.toml │ │ └── src/domain/ │ │ ├── property.py # Property, Properties, PropertyIdentity │ │ ├── site_notes.py │ │ ├── landlord_overrides.py │ │ ├── baseline_performance.py # lodged + effective pair │ │ ├── plan.py # Plan, PlanPhase, OptimisedPackage │ │ ├── scenario.py # Scenario, ScenarioPhase, ScenarioSnapshot │ │ ├── recommendation.py │ │ ├── geospatial.py │ │ ├── solar.py │ │ ├── anomaly_flags.py │ │ └── ml/ │ │ ├── transform.py # EpcMlTransform (versioned) │ │ └── schema.py │ │ │ ├── repos/ # "domna-repos" — persistence, no business logic │ │ ├── pyproject.toml │ │ └── src/repos/ │ │ ├── unit_of_work.py │ │ ├── property_repo.py │ │ ├── epc_cache_repo.py │ │ ├── site_notes_repo.py │ │ ├── landlord_overrides_repo.py │ │ ├── recommendations_repo.py │ │ ├── generic_data_repo.py │ │ ├── fuel_rates_repo.py │ │ ├── carbon_factors_repo.py │ │ ├── heating_system_assumptions_repo.py │ │ └── subtask_repo.py │ │ │ ├── fetchers/ # "domna-fetchers" — external API clients │ │ ├── pyproject.toml │ │ └── src/fetchers/ │ │ ├── epc_client.py # wraps backend/epc_client/ │ │ ├── geospatial.py │ │ ├── solar.py │ │ ├── fuel_rates_fetcher.py │ │ └── carbon_factors_fetcher.py │ │ │ └── utils/ # "domna-utils" — logging, AWS, S3, cloudwatch, subtasks │ ├── pyproject.toml │ └── src/utils/ │ ├── services/ # deployable units, one Lambda image each │ ├── ara/ # the modelling backend │ │ ├── pyproject.toml # deps: domna-domain, domna-repos, domna-fetchers, domna-utils, ML libs │ │ ├── Dockerfile │ │ ├── src/ara/ │ │ │ ├── services/ # EpcRemappingService, EpcPredictionService, │ │ │ │ # EpcEnergyDerivationService, RebaseliningService, │ │ │ │ # FeatureBuilder, RecommendationService, │ │ │ │ # ImpactPredictionService, OptimiserService, │ │ │ │ # ValuationService, ResultsPersister │ │ │ ├── orchestrators/ # IngestionPipeline, ModellingPipeline, RefreshOrchestrator │ │ │ └── lambdas/ # handler.py per Lambda + event-shape contracts │ │ └── tests/ │ │ ├── fakes/ # FakePropertyRepo, FakeEpcClient, etc. │ │ ├── unit/ # service tests using fakes only │ │ └── integration/ # real DB + real SQS via localstack │ │ │ ├── address2uprn/ # messy-address → UPRN matching, pre-modelling step │ │ ├── pyproject.toml │ │ ├── Dockerfile │ │ └── src/address2uprn/ │ ├── hubspot/ # existing Hubspot ETL │ ├── pashub/ # PasHub survey ingestion │ ├── ecmk/ # ECMK assessment ingestion │ └── magicplan/ # MagicPlan integration │ ├── backend/ # legacy FastAPI app + microservices, kept until cut-over │ ├── app/ # FastAPI; thin entrypoints that invoke service Lambdas │ └── ... # legacy engine, SearchEpc, etc.; deleted after cut-over │ ├── datatypes/ # existing — EPC schemas; eventually folds into packages/domain/ └── docs/ └── adr/ # architectural decision records ``` **Boundary properties** (enforced by package structure, not convention): - A `services//` package can `import domain.*`, `import repos.*`, `import fetchers.*`, `import utils.*`. It **cannot** import another service's modules — they're separate distributions with no cross-import path. - ADR-0003 (Ingestion / Modelling separation) is preserved: modelling services in `services/ara/src/ara/services/` depend only on `repos.*` + `domain.*`, never on fetchers. Orchestrators are the only place fetchers and services meet. **Migration** (incremental, not big-bang): 1. Carve out `packages/domain/` first — fold `datatypes/epc/domain/` + the new aggregate types into it. 2. Carve out `packages/utils/` from current `utils/` + `backend/utils/`. 3. Carve out `packages/repos/` and `packages/fetchers/` once `services/ara/` is being built and needs them. 4. `services/ara/` is greenfield — no legacy code lives in it. 5. `services/address2uprn/`, `services/pashub/`, etc. are split out as their owners pick them up. 6. `backend/` shrinks to the FastAPI entrypoint layer once everything else has moved. **Reused intact** (no rewrite needed at carve-out time): - `backend/epc_client/` → folds into `packages/fetchers/src/fetchers/epc_client.py`. - `datatypes/epc/domain/` → folds into `packages/domain/src/domain/epc/`. - `recommendations/optimiser/` → wrapped by `services/ara/src/ara/services/optimiser.py`. - `backend/app/db/` → repos delegate into `db_funcs.*` until SQL is rewritten under sub-PRD (iii). --- ## 12. Testing strategy ### 12.1 Unit tests (the bulk) Every service test injects fake fetchers and fake repos. No DB, no network, no ML lambda. A service test verifies one slice of logic in 5–30 lines. Example: ```python def test_epc_prediction_flags_anomalous_wall_type(): neighbours = [_make_epc(wall_construction="solid") for _ in range(5)] target = _make_property(epc=_make_epc(wall_construction="cavity")) repo = FakeGenericDataRepo(neighbours_by_postcode={target.identity.postcode: neighbours}) svc = EpcPredictionService(generic_repo=repo) result = svc.run(Properties([target])) assert result[0].epc_anomaly_flags.wall_construction == "differs_from_neighbours" ``` ### 12.2 Integration tests One per pipeline (Ingestion, Modelling, Refresh). Real Postgres (testcontainers or localstack), fake fetchers (hitting recorded fixtures), fake ML lambdas (returning canned predictions). Catches schema / SQL / transaction issues. ### 12.3 Contract tests The transform (`EpcMlTransform`) has its own test suite: - Golden file: given a fixed `Property`, output matches an expected DataFrame row exactly. - Schema test: the output columns exactly match a checked-in CSV header (so autogluon team sees breakage on PR). ### 12.4 What is NOT tested - The autogluon repo's training code — owned there. - The gov EPC API behaviour — assumed via the official spec. - Front-end aggregation logic — owned there. --- ## 13. Observability Each pipeline step emits a **structured log line** at start and end with: ``` {step, property_id, uprn, portfolio_id, subtask_id, duration_ms, outcome, error?} ``` Errors propagate with the `Property.identity` attached, so a portfolio of 100k can be triaged by grep. The existing task/subtask state machine is preserved — `IngestionPipeline` and `ModellingPipeline` update subtask status at start (`in progress`), end (`complete` / `failed`), with the CloudWatch log URL attached as today. CloudWatch alarms exist on subtask failure rate; thresholds remain unchanged. --- ## 14. Data flow: a worked example A landlord uploads a corrected heating system for UPRN 12345 via the UI. 1. **UI** → `POST /properties/12345/overrides` → writes to `landlord_overrides` table via `LandlordOverridesRepo`. 2. **RefreshOrchestrator** invoked (either automatically on override-write, or by a "re-model" button). Notes: ingestion is *not* triggered because no external state changed. 3. **ModellingPipeline** invoked on a batch of `[12345]`: - Reads `Property(uprn=12345)` from `PropertyRepo`. - `Property.effective_epc` = epc + landlord_overrides → heating system fields differ from baseline. - `RebaseliningService` triggered: ML re-predicts SAP / carbon / heat against the new effective EPC. - `EpcEnergyDerivationService` re-runs over the new effective EPC to derive baseline kWh + fuel split + bills (no ML). - `RecommendationService` regenerates recommendations against the new baseline. - `OptimiserService` re-picks optimal package. - `ResultsPersister` writes new plan under one UoW (old plan is superseded; whether to soft-archive is a sub-PRD (iii) decision). Total external calls: zero. The override write is the only thing that hit a network boundary, and that was the inbound HTTP from the UI. --- ## 15. Open questions for team review 1. **One endpoint vs two** (§4.5) — **resolved**: single endpoint for Phase 1; split later when a real workflow demands it. 2. **`LandlordOverrides` shape** (§6.2) — flat-Excel-shape for v1, with a flag to revisit after first customer. 3. **`already_installed` and `non_invasive_recommendations`** (§6.5) — both likely subsumed by overlay, but final call deferred. 4. **Recency tie-break policy** (§6.3) — default "newer wins"; team to consider per-portfolio override. 5. **`GenericDataRepo` storage backend** — Postgres table, S3, or DynamoDB. Postgres is the path of least infra change; recommend defaulting to that. 6. **Soft-archive vs hard-overwrite** for superseded plans (§14) — affects audit / undo behaviour. Defer to sub-PRD (iii). 7. **Building-level optimisation as a Phase 2 service** (§10) — agreed deferred; flag for roadmap discussion. 8. **Transform versioning policy** (§8.3) — semver chosen; team to confirm bump conventions. 9. **UCL EPC-correction model** (§9.2 S4a) — **resolved**: Few et al. 2023 (Energy & Buildings 288, 113024). Implementation pattern already in [`AnnualBillSavings.adjust_energy_to_metered`](../../backend/ml_models/AnnualBillSavings.py) — port the per-band gradients/intercepts (Table 3) into `EpcEnergyDerivationService`, keyed on the post-state Effective EPC band. 10. **Fuel-price source for bill calculation** (§9.2 S4a) — **resolved**: `FuelRatesRepo` is a time-versioned, region-aware table; ETL by `FuelRatesFetcher` (Ofgem feed + manual upload fallback). Per-portfolio override deferred to v2 — confirm whether Calico / Hyde have bulk-buy contracts before first onboarding. 11. **kWh handling under Rebaselining** (§9.4) — **resolved**: ML re-predicts SAP/carbon/heat only; `EpcEnergyDerivationService` re-derives kWh from the rebaselined Effective EPC. Heating-fuel-type change is handled naturally because S4a re-reads heating fields from the Effective EPC. 12. **Phase rollover semantics** (§9.2 S7) — when a candidate measure isn't selected in phase n, does it auto-roll into phase n+1's candidate pool, or does the user mark which measure types can roll? Auto is simpler; user-marked is more flexible. Decide at scenario-builder UX time. 13. **Package-level vs per-measure ML scoring** (§9.4) — SAP impact of a measure is not strictly additive; the current per-measure scoring + linear optimisation approximates this. A future iteration may pre-define candidate packages and ML-score whole packages. Defer until per-service grill on `OptimiserService`. 14. **UCL extrapolation scope** (§9.2 S4a) — the Few et al. paper is gas-heated, no PV, England + Wales only. Current legacy code applies the correction to all properties regardless. Keep silent extrapolation for v1, or stratify (no correction for non-gas / PV) and surface uncertainty to FE? Defer to per-service grill. 15. **`ValuationService` rebuild** (§9.2 S8) — existing `PropertyValuation.estimate` cites several papers; the rebuild should improve the regression. Shape deferred to per-service grill. 16. **Battery-via-ML cutover** (§9.2 S6) — confirm the new ML model is trained against `energy_pv_battery_count` + `energy_pv_battery_capacity` and the legacy `BatterySAPScorer` can be retired without regression for battery-equipped properties. --- ## 16. Linked sub-PRDs (placeholders) - **Sub-PRD (ii) — ML training pipeline** — `docs/sub-prds/ml-training-pipeline.md` (TBC) - **Sub-PRD (iii) — DB schema migration** — `docs/sub-prds/db-schema-migration.md` (TBC) - **Sub-PRD (iv) — Historical EPC re-mapping** — `docs/sub-prds/historical-epc-remap.md` (TBC) Each sub-PRD owner: TBC. Each is independently reviewable but consumes the contracts defined in §5 (`Property` aggregate), §7 (repos), §8 (ML transform). --- ## 17. Next steps 1. Team review of this PRD (target: ~1 week). 2. Open follow-up grill sessions per service (`/grill-me` on each of S1–S8 + F1–F4) before that service is implemented. 3. Break into issues via `/to-issues` against the project tracker. 4. Stand up the empty `ara/` package skeleton + fakes + first integration-test scaffold as PR-1. 5. Land services in dependency order: domain → repos → fetchers → services → orchestrators → API. Phase 1 milestone gate: first portfolio (Calico or Hyde) routed through the new pipeline end-to-end in June, with a manual spot-check on 5 representative properties to confirm outputs are reasonable. No parity-against-old-engine check — the old engine is dead by then.