mirror of
https://github.com/Hestia-Homes/Model.git
synced 2026-06-08 11:17:27 +00:00
initial prd
This commit is contained in:
parent
be133ab70f
commit
fcbaf58a40
1 changed files with 648 additions and 0 deletions
648
ara_backend_design.md
Normal file
648
ara_backend_design.md
Normal file
|
|
@ -0,0 +1,648 @@
|
|||
# ARA Backend Redesign — Design PRD
|
||||
|
||||
**Status**: Draft for team review
|
||||
**Author**: Khalim Conn-Kowlessar (with Claude grill session)
|
||||
**Branch**: `ara-backend-design-prd`
|
||||
**Scope**: Service architecture + domain model + contracts for the new modelling backend. Linked sub-PRDs cover ML training pipeline, DB schema migration, and historical EPC re-mapping.
|
||||
|
||||
---
|
||||
|
||||
## 1. Context
|
||||
|
||||
### 1.1 The forcing function
|
||||
|
||||
The current modelling backend (`backend/engine/engine.py` — `model_engine`, 1331 LOC) was built as an MVP. It is:
|
||||
|
||||
- **Tightly coupled** to a specific gov EPC API that is being **decommissioned on 30 May 2026** (~17 days from today).
|
||||
- **A monolith** — one async function reaches into DB modules, HTTP clients, ML lambdas, S3, and queue infrastructure directly.
|
||||
- **Bottlenecked on a single person** — Khalim is the only contributor able to safely modify the engine because no one else can predict the blast radius of a change.
|
||||
- **Already returning erroneous data** from the old API (clients are aware). The replacement API is partially built (`backend/epc_client/epc_client_service.py`) on the current feature branch.
|
||||
|
||||
### 1.2 What needs to change
|
||||
|
||||
Beyond just swapping API clients, this is the moment to **rebuild the backend into a production-grade, contribute-able codebase**, with:
|
||||
|
||||
- A clear domain model rooted in the new EPC schema (`EpcPropertyData`).
|
||||
- Service boundaries that other team members can read, fix, and extend without needing the entire mental model.
|
||||
- Repository-mediated persistence so business logic can be tested without spinning up a database.
|
||||
- A separation between **data fetching** (slow, IO-heavy, external) and **modelling** (deterministic, fast, internal).
|
||||
|
||||
### 1.3 Out of scope for this PRD
|
||||
|
||||
These ship as **linked sub-PRDs**:
|
||||
|
||||
- **Sub-PRD (ii) — ML training pipeline** (autogluon repo + parquet generation in this repo + scoring model retraining for the new EPC schema)
|
||||
- **Sub-PRD (iii) — DB schema migration** (new tables: `site_notes`, `landlord_overrides`, EPC cache, parallel write strategy)
|
||||
- **Sub-PRD (iv) — Historical EPC re-mapping** (one-off + ongoing batch job: legacy stored EPCs → new `EpcPropertyData` shape)
|
||||
|
||||
The contracts this PRD defines are the inputs each sub-PRD consumes.
|
||||
|
||||
---
|
||||
|
||||
## 2. Goals and non-goals
|
||||
|
||||
### 2.1 Goals
|
||||
|
||||
1. **Survive the 30 May API shutdown** — even if it means a brief degraded window, modelling continues to function against the new gov EPC API.
|
||||
2. **Decouple data fetching from modelling** — modelling never makes external HTTP calls; it reads everything from repositories.
|
||||
3. **Make every service unit-testable against fakes** — no test needs a real DB, a real gov API, or a real ML lambda to verify business logic.
|
||||
4. **Establish a single `Property` aggregate root** as the domain centrepiece; all 9 modelling concerns are slices of one aggregate.
|
||||
5. **Versioned ML data contract** — the EPC-to-features transform is the single shared artifact between this repo and the autogluon repo.
|
||||
6. **Per-property UI surfaces** — fetched data is shown to users for review and override **before** modelling runs; modelling is triggered separately.
|
||||
|
||||
### 2.2 Non-goals
|
||||
|
||||
- Multi-region deploy, GDPR-class data minimisation work, or compliance reporting — separate workstreams.
|
||||
- Replacement of the front-end. The new APIs preserve enough of the existing response shape that the FE migrates incrementally.
|
||||
- Removing pandas. The ML transform output is a parquet-friendly DataFrame-like shape; that stays.
|
||||
- A workflow engine (Prefect / Temporal / Airflow). Coordinator-class orchestration plus the existing SQS-fanout pattern is sufficient at the scale we serve.
|
||||
|
||||
---
|
||||
|
||||
## 3. Cutover strategy
|
||||
|
||||
Two-phase cutover, driven by the 30 May deadline.
|
||||
|
||||
### 3.1 Phase 0 — Stopgap (now → end of May)
|
||||
|
||||
- The current `model_engine` keeps running. `SearchEpc` is rewired to delegate to `EpcClientService` (the new gov API client already built on this branch).
|
||||
- Old-schema EPCs persisted in the DB are read as-is; the EPC re-mapping service is not yet wired in.
|
||||
- Goal: no modelling outage at the API death date. Some degraded behaviour acceptable; clients are aware.
|
||||
|
||||
### 3.2 Phase 1 — Strangler (June → ~Q4 2026)
|
||||
|
||||
- New `ara/` package built alongside the old code. New endpoints expose the new pipeline. The old `model_engine` keeps running.
|
||||
- Per-portfolio feature flag: when set, the trigger endpoint routes the portfolio through the new pipeline. Default is the old pipeline.
|
||||
- Each of the 9 services is built, tested, and ships independently. Adding a service to the new pipeline does not require deleting the old one.
|
||||
- When confidence is high (last portfolio migrated, no regressions seen for N weeks), the old engine is deleted.
|
||||
|
||||
### 3.3 What is **not** done
|
||||
|
||||
- No parallel-shadow run (run both, diff outputs). Reason: doubles compute per plan, requires diff tooling we don't have, and the old engine is already known to return bad data — diffs would be noise.
|
||||
- No big-bang switch. Reason: 9 services is too much change to land in one PR.
|
||||
|
||||
---
|
||||
|
||||
## 4. Architecture overview
|
||||
|
||||
```
|
||||
┌─────────────────────────────────────────────────────────────────────┐
|
||||
│ Trigger endpoint(s) │
|
||||
│ (one or two — see §4.5; deferred decision) │
|
||||
└───────────┬──────────────────────────────────────────┬──────────────┘
|
||||
│ │
|
||||
▼ ▼
|
||||
┌─────────────────┐ ┌─────────────────┐
|
||||
│ IngestionPipe │ SQS, batches of N │ ModellingPipe │
|
||||
│ ----------- │ ◄─────────────────────│ ----------- │
|
||||
│ Fetchers run │ │ Reads via Repos │
|
||||
│ Persist via │ │ Calls Services │
|
||||
│ Repos │ │ ML predictions │
|
||||
└────────┬────────┘ └────────┬────────┘
|
||||
│ │
|
||||
└───────────────► Repos ◄─────────────────┘
|
||||
│
|
||||
▼
|
||||
┌──────────────────┐
|
||||
│ Postgres tables │
|
||||
│ (property, │
|
||||
│ epc_cache, │
|
||||
│ site_notes, │
|
||||
│ landlord_ │
|
||||
│ overrides, │
|
||||
│ plans, etc.) │
|
||||
└──────────────────┘
|
||||
|
||||
┌──────────────────────────┐
|
||||
│ RefreshOrchestrator │ triggers Ingestion → diff → conditionally Modelling
|
||||
└──────────────────────────┘
|
||||
```
|
||||
|
||||
### 4.1 Class taxonomy
|
||||
|
||||
Every class falls into exactly one of four roles:
|
||||
|
||||
| Role | Job | Examples |
|
||||
|------|-----|----------|
|
||||
| **Fetchers** | Call external APIs. Return raw response data. No DB. | `EpcClientService`, `GeospatialFetcher`, `SolarFetcher`, `SiteNotesIngester` |
|
||||
| **Repos** | Persist and load domain aggregates. SQL hidden inside. No external IO. | `PropertyRepo`, `EpcCacheRepo`, `SiteNotesRepo`, `LandlordOverridesRepo`, `RecommendationsRepo`, `GenericDataRepo`, `SubtaskRepo` |
|
||||
| **Services** | Business logic over domain objects. No external IO except via injected Fetchers / Repos. | `EpcRemappingService`, `EpcPredictionService`, `KwhPredictionService`, `ImpactPredictionService`, `RecommendationService`, `OptimiserService`, `FeatureBuilder`, `ResultsPersister` |
|
||||
| **Orchestrators** | Compose Fetchers + Services + Repos to produce an end-to-end result. The only place where step order is encoded. | `IngestionPipeline`, `ModellingPipeline`, `RefreshOrchestrator` |
|
||||
|
||||
This taxonomy is **strict**. A class that fetches *and* persists belongs in the Service layer and depends on a Fetcher + a Repo. No back-channels.
|
||||
|
||||
### 4.2 Two pipelines, one direction
|
||||
|
||||
Data flows one way only: **Ingestion → Repos → Modelling**.
|
||||
|
||||
- **Ingestion** writes; never calls Modelling.
|
||||
- **Modelling** reads; never calls Fetchers.
|
||||
|
||||
If Modelling needs fresh data, it returns "stale" and the caller decides whether to ingest first. This makes Modelling a pure function of repository state, which is the property that makes it reproducible, debuggable, and testable.
|
||||
|
||||
### 4.3 RefreshOrchestrator
|
||||
|
||||
Sits above both pipelines. Job:
|
||||
|
||||
1. Trigger `IngestionPipeline` for a portfolio.
|
||||
2. After ingestion completes, ask repos: "did anything change vs the last modelled snapshot?"
|
||||
3. If yes, trigger `ModellingPipeline`. If no, return early.
|
||||
|
||||
This avoids re-modelling 100k properties when only 200 had refreshed EPC data.
|
||||
|
||||
### 4.4 SQS fanout (preserved from current architecture)
|
||||
|
||||
The existing `trigger_plan_entrypoint` SQS-chunking pattern is kept. Both pipelines fan out per batch of ~30–100 properties (tuneable). Each consumer runs one batch end-to-end through the relevant pipeline.
|
||||
|
||||
UPRN partitioning: the trigger endpoint groups UPRNs by **locality** (postcode prefix / UPRN range) before chunking, so each batch maximises shared upstream fetches (one geospatial-range pull serves all 30 properties in the batch).
|
||||
|
||||
### 4.5 One API or two? (deferred)
|
||||
|
||||
The team will decide at implementation time whether Ingestion and Modelling sit behind:
|
||||
|
||||
- **(a) One unified API** with a single trigger endpoint that runs both phases.
|
||||
- **(b) Two APIs**, each with its own trigger, RefreshOrchestrator chains them.
|
||||
|
||||
Either is workable if the class taxonomy is preserved. Deferred to implementation review.
|
||||
|
||||
---
|
||||
|
||||
## 5. Domain model
|
||||
|
||||
### 5.1 Aggregate root: `Property`
|
||||
|
||||
`Property` is the centrepiece. Every service operates on one or more `Property` instances. Every repo writes one slice of `Property`. The aggregate carries all state for a single property's modelling run.
|
||||
|
||||
```python
|
||||
@dataclass
|
||||
class PropertyIdentity:
|
||||
portfolio_id: UUID
|
||||
uprn: Optional[int]
|
||||
landlord_property_id: Optional[str]
|
||||
address: AddressLines
|
||||
postcode: str
|
||||
|
||||
@dataclass
|
||||
class Property:
|
||||
identity: PropertyIdentity
|
||||
|
||||
# --- Source data — modelling path is determined by which of these are set ---
|
||||
epc: Optional[EpcPropertyData] # from gov API (or remapped historical)
|
||||
site_notes: Optional[SiteNotes] # our own survey; supersedes EPC when present
|
||||
landlord_overrides: Optional[LandlordOverrides] # sparse, only meaningful when epc set
|
||||
|
||||
# --- Enrichments ---
|
||||
geospatial: Optional[GeoSpatial]
|
||||
solar: Optional[SolarPotential]
|
||||
epc_anomaly_flags: Optional[EpcAnomalyFlags] # from EpcPredictionService vs neighbours
|
||||
|
||||
# --- Modelling outputs ---
|
||||
baseline_predictions: Optional[BaselinePredictions] # SAP/carbon/heat after rebaselining
|
||||
recommendations: list[Recommendation]
|
||||
impact_predictions: Optional[ImpactPredictions]
|
||||
optimised_package: Optional[OptimisedPackage]
|
||||
|
||||
# --- Derived ---
|
||||
@property
|
||||
def source_path(self) -> Literal["site_notes", "epc_with_overlay"]: ...
|
||||
|
||||
@property
|
||||
def effective_epc(self) -> EpcPropertyData:
|
||||
"""The EPC the modelling pipeline actually scores against."""
|
||||
...
|
||||
```
|
||||
|
||||
### 5.2 `Properties` collection
|
||||
|
||||
A first-class iterable, so batch operations are obvious:
|
||||
|
||||
```python
|
||||
@dataclass
|
||||
class Properties:
|
||||
items: list[Property]
|
||||
|
||||
def __iter__(self) -> Iterator[Property]: ...
|
||||
def __len__(self) -> int: ...
|
||||
def filter(self, pred: Callable[[Property], bool]) -> "Properties": ...
|
||||
def map(self, fn: Callable[[Property], Property]) -> "Properties": ...
|
||||
def with_landlord_overrides(self) -> "Properties": ...
|
||||
```
|
||||
|
||||
Services typically take and return `Properties`, not lists.
|
||||
|
||||
### 5.3 Other aggregates
|
||||
|
||||
| Aggregate | Owns | Repo |
|
||||
|---|---|---|
|
||||
| `Property` | property identity, epc, site_notes, landlord_overrides, enrichments, modelling results | `PropertyRepo` |
|
||||
| `Plan` | per-property modelling output, scenario membership, plan + recommendations + parts | `RecommendationsRepo` |
|
||||
| `Scenario` | portfolio-wide scenario metadata | `RecommendationsRepo` |
|
||||
| `Subtask` / `Task` | SQS fanout state | `SubtaskRepo` |
|
||||
| `EpcCache` | gov-API responses keyed by UPRN, with freshness/TTL | `EpcCacheRepo` |
|
||||
| `GenericData` | UPRN-range geospatial, postcode lookups, shared static data | `GenericDataRepo` |
|
||||
|
||||
Aggregates are loaded **whole** — never half a `Property`. If a slice is too large to load eagerly (e.g. recommendation history), it lives in a separate aggregate.
|
||||
|
||||
---
|
||||
|
||||
## 6. Source-of-truth and overlay precedence
|
||||
|
||||
There are exactly **two modelling paths**. The `Property.source_path` property selects.
|
||||
|
||||
### 6.1 Path 1 — Site notes
|
||||
|
||||
If a `Property` has `site_notes` and they are newer than any available EPC (or no EPC exists), site notes are the **complete** source of truth:
|
||||
|
||||
- `effective_epc` = `site_notes.to_epc_property_data()`.
|
||||
- EPC fields not covered by site notes — **none expected**. Site notes are committed to being a full-coverage survey. Treat any gap as a survey-quality bug, not a fallback signal.
|
||||
- `LandlordOverrides` are not applicable in Path 1 (the survey supersedes).
|
||||
|
||||
### 6.2 Path 2 — EPC with landlord overlay
|
||||
|
||||
If a `Property` has no site notes (or the EPC is newer):
|
||||
|
||||
- `effective_epc` = `epc` with `landlord_overrides` applied as a sparse field-level overlay (`landlord > epc`).
|
||||
- `LandlordOverrides` are sparse: each row represents one corrected field. Schema TBD at implementation time; assume flat input via Excel/CSV for v1, with a flag to revisit shape after first customer onboarding.
|
||||
|
||||
### 6.3 Recency tie-break
|
||||
|
||||
When a property has **both** site notes and a public EPC, the newer of the two wins. Rationale: a recent EPC may reflect retrofit work done after our survey; conversely a recent survey reflects on-site observations the EPC cannot capture.
|
||||
|
||||
This tie-break is implemented in `Property.source_path` and may be tuned later (e.g. always prefer surveys regardless of date, or per-portfolio policy).
|
||||
|
||||
### 6.4 Rebaselining trigger
|
||||
|
||||
The modelling pipeline re-predicts SAP / carbon / heat / kwh whenever:
|
||||
|
||||
- `effective_epc` differs from the canonical baseline (i.e. raw EPC with no overrides), **or**
|
||||
- The previous modelling snapshot is missing or stale.
|
||||
|
||||
The exact diff mechanism (hash of effective EPC, dirty-flag on overrides, timestamp comparison) is an implementation detail; recommendation is to start with a content hash stored alongside the previous run.
|
||||
|
||||
### 6.5 Deprecated concepts
|
||||
|
||||
- **Patches** (`patch_epc`) — removed. Functionality subsumed by `LandlordOverrides`.
|
||||
- **Already-installed measures** — likely subsumed by `LandlordOverrides` ("we have a heat pump now" → override heating fields). Confirmed at implementation time.
|
||||
- **Non-invasive recommendations** — TBD whether this concept survives; not blocking.
|
||||
|
||||
---
|
||||
|
||||
## 7. Persistence: repositories and unit of work
|
||||
|
||||
### 7.1 What a repository is
|
||||
|
||||
A repository owns the SQL for one aggregate. Nothing else writes SQL for that aggregate. Callers see only domain objects.
|
||||
|
||||
```python
|
||||
class PropertyRepo(Protocol):
|
||||
def get(self, identity: PropertyIdentity) -> Optional[Property]: ...
|
||||
def bulk_save(self, uow: UnitOfWork, properties: Properties) -> None: ...
|
||||
def find_by_portfolio(self, portfolio_id: UUID) -> Properties: ...
|
||||
def find_stale(self, portfolio_id: UUID, threshold: timedelta) -> Properties: ...
|
||||
```
|
||||
|
||||
Implementation references current `db_funcs.*` modules during phase 0 to avoid a big-bang SQL rewrite, but the interface is fixed.
|
||||
|
||||
### 7.2 Unit of Work
|
||||
|
||||
Multi-table writes inside a single aggregate, or across aggregates that share a transaction (e.g. property + plan + recommendations) go through a `UnitOfWork`:
|
||||
|
||||
```python
|
||||
with self.uow_factory() as uow:
|
||||
self.property_repo.bulk_save(uow, properties)
|
||||
self.recommendations_repo.bulk_save(uow, plans)
|
||||
uow.commit()
|
||||
```
|
||||
|
||||
UoW owns the SQLAlchemy session lifecycle. Repos use the session passed in via the UoW. Outside a UoW, repos use a short-lived read session.
|
||||
|
||||
### 7.3 Repository inventory
|
||||
|
||||
| Repo | Tables it owns |
|
||||
|------|----------------|
|
||||
| `PropertyRepo` | `properties`, `property_details_epc`, `property_spatial` |
|
||||
| `EpcCacheRepo` | new table: `epc_api_cache` (TTL, raw API response, mapped `EpcPropertyData`) |
|
||||
| `SiteNotesRepo` | new table: `site_notes` (replaces current `energy_assessments`) |
|
||||
| `LandlordOverridesRepo` | new table: `landlord_overrides` (sparse, per-field rows for audit) |
|
||||
| `RecommendationsRepo` | `plans`, `recommendations`, `recommendation_parts`, `scenarios` |
|
||||
| `GenericDataRepo` | new table or S3-backed: UPRN-range geospatial + postcode-keyed shared static data |
|
||||
| `SubtaskRepo` | `tasks`, `subtasks` (existing) |
|
||||
|
||||
DDL migrations are scoped to sub-PRD (iii).
|
||||
|
||||
### 7.4 Fakes
|
||||
|
||||
For tests, each repo has a `FakeXRepo` companion backed by a dict. Service unit tests inject fakes. No DB required.
|
||||
|
||||
---
|
||||
|
||||
## 8. ML contract
|
||||
|
||||
### 8.1 Where ML lives
|
||||
|
||||
| Concern | Owner |
|
||||
|---|---|
|
||||
| Defining the EPC → features transform | **This repo** (`ara.domain.ml.EpcMlTransform`) |
|
||||
| Loading data, applying transform, writing training parquet to S3 | **This repo** (sub-PRD (ii) batch job) |
|
||||
| Training, hyperparameter search, deployment | **Autogluon repo** |
|
||||
| Scoring at modelling time | **This repo** (`FeatureBuilder` calls `EpcMlTransform`, sends DataFrame to deployed lambda) |
|
||||
|
||||
The autogluon repo is intentionally **dumb**: it consumes parquet, knows which column is the target, knows which columns to ignore. It has no EPC semantics.
|
||||
|
||||
### 8.2 `EpcMlTransform`
|
||||
|
||||
A separate class (not a method on `EpcPropertyData`), because:
|
||||
|
||||
- The data class stays clean of training-infrastructure concerns.
|
||||
- Versioned transforms (`EpcMlTransformV1`, `EpcMlTransformV2`) swap easily.
|
||||
- Future need: injection of normalisation stats from the training set is straightforward on a class, awkward on a dataclass.
|
||||
|
||||
```python
|
||||
class EpcMlTransform:
|
||||
VERSION: str = "1.0.0" # semver
|
||||
|
||||
def to_row(self, epc: EpcPropertyData) -> dict[str, Any]: ...
|
||||
def to_rows(self, properties: Properties) -> pd.DataFrame: ...
|
||||
def schema(self) -> dict[str, type]: ... # for parquet emission + validation
|
||||
```
|
||||
|
||||
The interesting work — flattening `List[SapWindow]`, `List[SapBuildingPart]` into fixed-width columns — lives inside this class. Domain decisions (top-N windows, aggregate roofs, etc.) are encoded here and reviewed by Khalim. Sub-PRD (ii) goes into detail.
|
||||
|
||||
### 8.3 Versioning
|
||||
|
||||
- Transform class is **semver-tagged** (`VERSION = "1.0.0"`).
|
||||
- S3 path for training parquet includes the version: `s3://.../training/v1.0.0/...`.
|
||||
- Deployed scoring lambda is tagged with the transform version it was trained against.
|
||||
- Modelling pipeline asserts at startup that its `EpcMlTransform.VERSION` matches the deployed lambda's tag; mismatch = hard fail at deploy time.
|
||||
|
||||
Bump major when removing or renaming columns. Bump minor when adding optional columns (older models still scoreable; new models can be trained against new fields).
|
||||
|
||||
### 8.4 Two model families, one transform
|
||||
|
||||
Both ML services use the same transform:
|
||||
|
||||
| Service | Lambda | Target |
|
||||
|---|---|---|
|
||||
| `KwhPredictionService` (service #5) | `kwh-models-*` | annual kWh + bills |
|
||||
| `ImpactPredictionService` (service #7) | `impact-models-*` | SAP, carbon, heat demand, post-retrofit kWh |
|
||||
|
||||
The two families are trained against the same input feature schema; only target columns differ. Sub-PRD (ii) handles training-time details.
|
||||
|
||||
---
|
||||
|
||||
## 9. Service catalogue
|
||||
|
||||
Twelve classes implement the modelling pipeline end-to-end. Detailed signatures are deliberately left for implementers — this PRD documents purpose, dependencies, and rough shape.
|
||||
|
||||
### 9.1 Fetchers (called by `IngestionPipeline`)
|
||||
|
||||
| # | Class | Purpose | Dependencies |
|
||||
|---|---|---|---|
|
||||
| F1 | `EpcClientService` | Fetches EPCs from new gov API. Already exists at `backend/epc_client/`. | httpx |
|
||||
| F2 | `GeospatialFetcher` | Fetches UPRN-range geospatial data (replaces `OpenUprnClient` use in current engine). | S3 / Ordnance Survey API |
|
||||
| F3 | `SolarFetcher` | Wraps Google Solar API; building-level + unit-level scenes. | Google Solar API |
|
||||
| F4 | `SiteNotesIngester` | Loads site notes from Excel uploads / structured input. Persists via `SiteNotesRepo`. | S3, repo |
|
||||
|
||||
### 9.2 Domain services (called by `ModellingPipeline`)
|
||||
|
||||
| # | Class | Original-list # | Purpose | Reads | Writes |
|
||||
|---|---|---|---|---|---|
|
||||
| S1 | `EpcRemappingService` | 4 | Re-map legacy / historical EPCs into new `EpcPropertyData` shape. | `EpcCacheRepo` | `EpcCacheRepo` (mapped column) |
|
||||
| S2 | `EpcPredictionService` | 3 | For every property: produce predicted EPC + per-field anomaly flags vs neighbours. Used both for gap-fill (Path 2 if EPC missing) and UI surfacing. | `EpcCacheRepo`, `GenericDataRepo` | — |
|
||||
| S3 | `FeatureBuilder` | (new) | Wraps `EpcMlTransform`. Converts `Properties` → scoring DataFrame. | — | — |
|
||||
| S4 | `KwhPredictionService` | 5 | Calls kWh + bills ML lambda; attaches results to `Property.baseline_predictions` / per-measure. | `FeatureBuilder` | — |
|
||||
| S5 | `RecommendationService` | 6 | Generates per-property recommendations using `effective_epc`, materials, exclusions, etc. Replaces current `Recommendations` (1383 LOC). | `MaterialsRepo` | — |
|
||||
| S6 | `ImpactPredictionService` | 7 | Calls SAP / carbon / heat / bills impact lambda for each recommendation. | `FeatureBuilder` | — |
|
||||
| S7 | `OptimiserService` | 8 | Produces optimised retrofit packages. Wraps current `CostOptimiser` / `GainOptimiser` / `optimise_with_scenarios`. | — | — |
|
||||
| S8 | `ResultsPersister` | 9 | Final step: writes plans, recommendations, property updates via repos under one UoW. | — | All write repos |
|
||||
|
||||
### 9.3 Orchestrators
|
||||
|
||||
| # | Class | Purpose |
|
||||
|---|---|---|
|
||||
| O1 | `IngestionPipeline` | Per-batch SQS consumer. Calls F1–F4, persists via repos. |
|
||||
| O2 | `ModellingPipeline` | Per-batch SQS consumer. Reads from repos, runs S1→S8 in order, ends with persistence. |
|
||||
| O3 | `RefreshOrchestrator` | Top-level: triggers Ingestion → diff → optionally Modelling. |
|
||||
|
||||
### 9.4 `ModellingPipeline` step order
|
||||
|
||||
For each `Property` in the batch:
|
||||
|
||||
```
|
||||
1. PropertyRepo.get() → Property (epc, site_notes, overrides, geospatial, solar)
|
||||
2. EpcRemappingService — if epc is in legacy schema, upgrade to current
|
||||
3. EpcPredictionService — produce predicted EPC + anomaly flags (always runs)
|
||||
4. Compute Property.effective_epc (path-1 or path-2)
|
||||
5. KwhPredictionService — baseline kwh + bills
|
||||
6. RecommendationService — generate candidate measures
|
||||
7. ImpactPredictionService — predict per-measure impact
|
||||
8. OptimiserService — select optimal package
|
||||
9. KwhPredictionService — re-score on optimised package (tenant savings)
|
||||
10. ResultsPersister — write Plan + Recommendations under one UoW
|
||||
```
|
||||
|
||||
Steps 1–4 are per-property. Steps 5–9 batch the whole batch into one ML call where possible (the lambdas accept a DataFrame; today's code already batches).
|
||||
|
||||
### 9.5 Per-service contracts — deferred
|
||||
|
||||
Method signatures, return types, error semantics, and edge-case behaviour are **explicitly out of scope** for this PRD. The implementer of each service runs a `/grill-me` session against this document and produces a detailed sub-design before coding.
|
||||
|
||||
---
|
||||
|
||||
## 10. Cross-batch concerns
|
||||
|
||||
| Concern | Status | Approach |
|
||||
|---|---|---|
|
||||
| Building-level solar adjustment | Deferred — future TODO, not implemented today. | The current `building_ids` block in `model_engine` is dead-ish; it operates on the in-process batch only. New design preserves that limitation. Future feature: a post-modelling consolidation pass that groups results by `building_id` across batches and re-optimises. |
|
||||
| Portfolio aggregation | Dropped. | Front-end computes aggregations dynamically from per-property plans. `extract_portfolio_aggregation_data` in current engine is dead code (defined, never called) — deleting. |
|
||||
| Shared upstream data | Handled by orchestrator partitioning + `GenericDataRepo`. | Trigger endpoint groups UPRNs by postcode / UPRN-range before SQS chunking so each batch maximises intra-batch sharing. `GenericDataRepo` caches across batches so first batch pays, subsequent batches hit cache. |
|
||||
|
||||
---
|
||||
|
||||
## 11. Directory layout
|
||||
|
||||
Proposal — team to tweak.
|
||||
|
||||
```
|
||||
ara/ # new top-level package, sibling of backend/
|
||||
├── domain/
|
||||
│ ├── __init__.py
|
||||
│ ├── property.py # Property aggregate
|
||||
│ ├── properties.py # Properties collection
|
||||
│ ├── identity.py # PropertyIdentity, AddressLines
|
||||
│ ├── site_notes.py # SiteNotes (replaces energy_assessment)
|
||||
│ ├── landlord_overrides.py
|
||||
│ ├── geospatial.py
|
||||
│ ├── solar.py
|
||||
│ ├── recommendations.py # Recommendation, OptimisedPackage
|
||||
│ ├── predictions.py # BaselinePredictions, ImpactPredictions
|
||||
│ ├── anomaly_flags.py # EpcAnomalyFlags
|
||||
│ └── ml/
|
||||
│ ├── __init__.py
|
||||
│ ├── transform.py # EpcMlTransform (versioned)
|
||||
│ └── schema.py # scoring DataFrame schema
|
||||
│
|
||||
├── fetchers/
|
||||
│ ├── __init__.py
|
||||
│ ├── epc_client.py # alias / re-export of backend/epc_client/
|
||||
│ ├── geospatial.py
|
||||
│ ├── solar.py
|
||||
│ └── site_notes_ingester.py
|
||||
│
|
||||
├── repos/
|
||||
│ ├── __init__.py
|
||||
│ ├── unit_of_work.py
|
||||
│ ├── property_repo.py
|
||||
│ ├── epc_cache_repo.py
|
||||
│ ├── site_notes_repo.py
|
||||
│ ├── landlord_overrides_repo.py
|
||||
│ ├── recommendations_repo.py
|
||||
│ ├── generic_data_repo.py
|
||||
│ └── subtask_repo.py
|
||||
│
|
||||
├── services/
|
||||
│ ├── __init__.py
|
||||
│ ├── epc_remapping.py
|
||||
│ ├── epc_prediction.py # nearby-similar + anomaly flags
|
||||
│ ├── feature_builder.py # uses domain.ml.EpcMlTransform
|
||||
│ ├── kwh_prediction.py
|
||||
│ ├── impact_prediction.py
|
||||
│ ├── recommendation.py
|
||||
│ ├── optimiser.py # wraps recommendations/optimiser/
|
||||
│ └── results_persister.py
|
||||
│
|
||||
├── orchestrators/
|
||||
│ ├── __init__.py
|
||||
│ ├── ingestion_pipeline.py
|
||||
│ ├── modelling_pipeline.py
|
||||
│ └── refresh_orchestrator.py
|
||||
│
|
||||
├── api/
|
||||
│ ├── __init__.py
|
||||
│ ├── routers/
|
||||
│ │ ├── ingestion.py # if two APIs
|
||||
│ │ └── modelling.py
|
||||
│ └── schemas/ # request/response Pydantic models
|
||||
│
|
||||
└── tests/
|
||||
├── fakes/ # FakePropertyRepo, FakeEpcClient, etc.
|
||||
├── unit/ # service tests using fakes only
|
||||
└── integration/ # real DB + real SQS via localstack
|
||||
```
|
||||
|
||||
`backend/` continues to host the legacy code during phase 1. Once the last portfolio is migrated, `backend/engine/`, `backend/SearchEpc.py`, `backend/Property.py` are deleted.
|
||||
|
||||
Reused intact (no rewrite needed):
|
||||
|
||||
- `backend/epc_client/` — the new gov API client. Wrapped by `ara/fetchers/epc_client.py`.
|
||||
- `datatypes/epc/domain/` — the new EPC schema. `Property.epc: EpcPropertyData` references it directly.
|
||||
- `recommendations/optimiser/` — wrapped by `ara/services/optimiser.py`.
|
||||
- `backend/app/db/` — repos delegate into `db_funcs.*` until the SQL is rewritten under sub-PRD (iii).
|
||||
|
||||
---
|
||||
|
||||
## 12. Testing strategy
|
||||
|
||||
### 12.1 Unit tests (the bulk)
|
||||
|
||||
Every service test injects fake fetchers and fake repos. No DB, no network, no ML lambda. A service test verifies one slice of logic in 5–30 lines.
|
||||
|
||||
Example:
|
||||
|
||||
```python
|
||||
def test_epc_prediction_flags_anomalous_wall_type():
|
||||
neighbours = [_make_epc(wall_construction="solid") for _ in range(5)]
|
||||
target = _make_property(epc=_make_epc(wall_construction="cavity"))
|
||||
repo = FakeGenericDataRepo(neighbours_by_postcode={target.identity.postcode: neighbours})
|
||||
|
||||
svc = EpcPredictionService(generic_repo=repo)
|
||||
result = svc.run(Properties([target]))
|
||||
|
||||
assert result[0].epc_anomaly_flags.wall_construction == "differs_from_neighbours"
|
||||
```
|
||||
|
||||
### 12.2 Integration tests
|
||||
|
||||
One per pipeline (Ingestion, Modelling, Refresh). Real Postgres (testcontainers or localstack), fake fetchers (hitting recorded fixtures), fake ML lambdas (returning canned predictions). Catches schema / SQL / transaction issues.
|
||||
|
||||
### 12.3 Contract tests
|
||||
|
||||
The transform (`EpcMlTransform`) has its own test suite:
|
||||
|
||||
- Golden file: given a fixed `Property`, output matches an expected DataFrame row exactly.
|
||||
- Schema test: the output columns exactly match a checked-in CSV header (so autogluon team sees breakage on PR).
|
||||
|
||||
### 12.4 What is NOT tested
|
||||
|
||||
- The autogluon repo's training code — owned there.
|
||||
- The gov EPC API behaviour — assumed via the official spec.
|
||||
- Front-end aggregation logic — owned there.
|
||||
|
||||
---
|
||||
|
||||
## 13. Observability
|
||||
|
||||
Each pipeline step emits a **structured log line** at start and end with:
|
||||
|
||||
```
|
||||
{step, property_id, uprn, portfolio_id, subtask_id, duration_ms, outcome, error?}
|
||||
```
|
||||
|
||||
Errors propagate with the `Property.identity` attached, so a portfolio of 100k can be triaged by grep.
|
||||
|
||||
The existing task/subtask state machine is preserved — `IngestionPipeline` and `ModellingPipeline` update subtask status at start (`in progress`), end (`complete` / `failed`), with the CloudWatch log URL attached as today.
|
||||
|
||||
CloudWatch alarms exist on subtask failure rate; thresholds remain unchanged.
|
||||
|
||||
---
|
||||
|
||||
## 14. Data flow: a worked example
|
||||
|
||||
A landlord uploads a corrected heating system for UPRN 12345 via the UI.
|
||||
|
||||
1. **UI** → `POST /properties/12345/overrides` → writes to `landlord_overrides` table via `LandlordOverridesRepo`.
|
||||
2. **RefreshOrchestrator** invoked (either automatically on override-write, or by a "re-model" button). Notes: ingestion is *not* triggered because no external state changed.
|
||||
3. **ModellingPipeline** invoked on a batch of `[12345]`:
|
||||
- Reads `Property(uprn=12345)` from `PropertyRepo`.
|
||||
- `Property.effective_epc` = epc + landlord_overrides → heating system fields differ from baseline.
|
||||
- Rebaselining triggered: `KwhPredictionService` re-predicts baseline SAP / carbon / heat / kwh.
|
||||
- `RecommendationService` regenerates recommendations against the new baseline.
|
||||
- `OptimiserService` re-picks optimal package.
|
||||
- `ResultsPersister` writes new plan under one UoW (old plan is superseded; whether to soft-archive is a sub-PRD (iii) decision).
|
||||
|
||||
Total external calls: zero. The override write is the only thing that hit a network boundary, and that was the inbound HTTP from the UI.
|
||||
|
||||
---
|
||||
|
||||
## 15. Open questions for team review
|
||||
|
||||
1. **One API vs two** (§4.5) — clean interfaces allow either; pick at implementation.
|
||||
2. **`LandlordOverrides` shape** (§6.2) — flat-Excel-shape for v1, with a flag to revisit after first customer.
|
||||
3. **`already_installed` and `non_invasive_recommendations`** (§6.5) — both likely subsumed by overlay, but final call deferred.
|
||||
4. **Recency tie-break policy** (§6.3) — default "newer wins"; team to consider per-portfolio override.
|
||||
5. **`GenericDataRepo` storage backend** — Postgres table, S3, or DynamoDB. Postgres is the path of least infra change; recommend defaulting to that.
|
||||
6. **Soft-archive vs hard-overwrite** for superseded plans (§14) — affects audit / undo behaviour. Defer to sub-PRD (iii).
|
||||
7. **Building-level optimisation as a Phase 2 service** (§10) — agreed deferred; flag for roadmap discussion.
|
||||
8. **Transform versioning policy** (§8.3) — semver chosen; team to confirm bump conventions.
|
||||
|
||||
---
|
||||
|
||||
## 16. Linked sub-PRDs (placeholders)
|
||||
|
||||
- **Sub-PRD (ii) — ML training pipeline** — `docs/sub-prds/ml-training-pipeline.md` (TBC)
|
||||
- **Sub-PRD (iii) — DB schema migration** — `docs/sub-prds/db-schema-migration.md` (TBC)
|
||||
- **Sub-PRD (iv) — Historical EPC re-mapping** — `docs/sub-prds/historical-epc-remap.md` (TBC)
|
||||
|
||||
Each sub-PRD owner: TBC. Each is independently reviewable but consumes the contracts defined in §5 (`Property` aggregate), §7 (repos), §8 (ML transform).
|
||||
|
||||
---
|
||||
|
||||
## 17. Next steps
|
||||
|
||||
1. Team review of this PRD (target: ~1 week).
|
||||
2. Open follow-up grill sessions per service (`/grill-me` on each of S1–S8 + F1–F4) before that service is implemented.
|
||||
3. Break into issues via `/to-issues` against the project tracker.
|
||||
4. Stand up the empty `ara/` package skeleton + fakes + first integration-test scaffold as PR-1.
|
||||
5. Land services in dependency order: domain → repos → fetchers → services → orchestrators → API.
|
||||
|
||||
Phase 1 milestone gate: first portfolio routed through new pipeline with parity against old engine (manual spot-check on 5 representative properties).
|
||||
Loading…
Add table
Reference in a new issue