33 KiB
ARA Backend Redesign — Design PRD
Status: Draft for team review
Author: Khalim Conn-Kowlessar (with Claude grill session)
Branch: ara-backend-design-prd
Scope: Service architecture + domain model + contracts for the new modelling backend. Linked sub-PRDs cover ML training pipeline, DB schema migration, and historical EPC re-mapping.
1. Context
1.1 The forcing function
The current modelling backend (backend/engine/engine.py — model_engine, 1331 LOC) was built as an MVP. It is:
- Tightly coupled to a specific gov EPC API that is being decommissioned on 30 May 2026 (~17 days from today).
- A monolith — one async function reaches into DB modules, HTTP clients, ML lambdas, S3, and queue infrastructure directly.
- Bottlenecked on a single person — Khalim is the only contributor able to safely modify the engine because no one else can predict the blast radius of a change.
- Already returning erroneous data from the old API (clients are aware). The replacement API is partially built (
backend/epc_client/epc_client_service.py) on the current feature branch.
1.2 What needs to change
Beyond just swapping API clients, this is the moment to rebuild the backend into a production-grade, contribute-able codebase, with:
- A clear domain model rooted in the new EPC schema (
EpcPropertyData). - Service boundaries that other team members can read, fix, and extend without needing the entire mental model.
- Repository-mediated persistence so business logic can be tested without spinning up a database.
- A separation between data fetching (slow, IO-heavy, external) and modelling (deterministic, fast, internal).
1.3 Out of scope for this PRD
These ship as linked sub-PRDs:
- Sub-PRD (ii) — ML training pipeline (autogluon repo + parquet generation in this repo + scoring model retraining for the new EPC schema)
- Sub-PRD (iii) — DB schema migration (new tables:
site_notes,landlord_overrides, EPC cache, parallel write strategy) - Sub-PRD (iv) — Historical EPC re-mapping (one-off + ongoing batch job: legacy stored EPCs → new
EpcPropertyDatashape)
The contracts this PRD defines are the inputs each sub-PRD consumes.
2. Goals and non-goals
2.1 Goals
- Survive the 30 May API shutdown — even if it means a brief degraded window, modelling continues to function against the new gov EPC API.
- Decouple data fetching from modelling — modelling never makes external HTTP calls; it reads everything from repositories.
- Make every service unit-testable against fakes — no test needs a real DB, a real gov API, or a real ML lambda to verify business logic.
- Establish a single
Propertyaggregate root as the domain centrepiece; all 9 modelling concerns are slices of one aggregate. - Versioned ML data contract — the EPC-to-features transform is the single shared artifact between this repo and the autogluon repo.
- Per-property UI surfaces — fetched data is shown to users for review and override before modelling runs; modelling is triggered separately.
2.2 Non-goals
- Multi-region deploy, GDPR-class data minimisation work, or compliance reporting — separate workstreams.
- Replacement of the front-end. The new APIs preserve enough of the existing response shape that the FE migrates incrementally.
- Removing pandas. The ML transform output is a parquet-friendly DataFrame-like shape; that stays.
- A workflow engine (Prefect / Temporal / Airflow). Coordinator-class orchestration plus the existing SQS-fanout pattern is sufficient at the scale we serve.
3. Cutover strategy
Two-phase cutover, driven by the 30 May deadline.
3.1 Phase 0 — Stopgap (now → end of May)
- The current
model_enginekeeps running.SearchEpcis rewired to delegate toEpcClientService(the new gov API client already built on this branch). - Old-schema EPCs persisted in the DB are read as-is; the EPC re-mapping service is not yet wired in.
- Goal: no modelling outage at the API death date. Some degraded behaviour acceptable; clients are aware.
3.2 Phase 1 — Strangler (June → ~Q4 2026)
- New
ara/package built alongside the old code. New endpoints expose the new pipeline. The oldmodel_enginekeeps running. - Per-portfolio feature flag: when set, the trigger endpoint routes the portfolio through the new pipeline. Default is the old pipeline.
- Each of the 9 services is built, tested, and ships independently. Adding a service to the new pipeline does not require deleting the old one.
- When confidence is high (last portfolio migrated, no regressions seen for N weeks), the old engine is deleted.
3.3 What is not done
- No parallel-shadow run (run both, diff outputs). Reason: doubles compute per plan, requires diff tooling we don't have, and the old engine is already known to return bad data — diffs would be noise.
- No big-bang switch. Reason: 9 services is too much change to land in one PR.
4. Architecture overview
┌─────────────────────────────────────────────────────────────────────┐
│ Trigger endpoint(s) │
│ (one or two — see §4.5; deferred decision) │
└───────────┬──────────────────────────────────────────┬──────────────┘
│ │
▼ ▼
┌─────────────────┐ ┌─────────────────┐
│ IngestionPipe │ SQS, batches of N │ ModellingPipe │
│ ----------- │ ◄─────────────────────│ ----------- │
│ Fetchers run │ │ Reads via Repos │
│ Persist via │ │ Calls Services │
│ Repos │ │ ML predictions │
└────────┬────────┘ └────────┬────────┘
│ │
└───────────────► Repos ◄─────────────────┘
│
▼
┌──────────────────┐
│ Postgres tables │
│ (property, │
│ epc_cache, │
│ site_notes, │
│ landlord_ │
│ overrides, │
│ plans, etc.) │
└──────────────────┘
┌──────────────────────────┐
│ RefreshOrchestrator │ triggers Ingestion → diff → conditionally Modelling
└──────────────────────────┘
4.1 Class taxonomy
Every class falls into exactly one of four roles:
| Role | Job | Examples |
|---|---|---|
| Fetchers | Call external APIs. Return raw response data. No DB. | EpcClientService, GeospatialFetcher, SolarFetcher, SiteNotesIngester |
| Repos | Persist and load domain aggregates. SQL hidden inside. No external IO. | PropertyRepo, EpcCacheRepo, SiteNotesRepo, LandlordOverridesRepo, RecommendationsRepo, GenericDataRepo, SubtaskRepo |
| Services | Business logic over domain objects. No external IO except via injected Fetchers / Repos. | EpcRemappingService, EpcPredictionService, KwhPredictionService, ImpactPredictionService, RecommendationService, OptimiserService, FeatureBuilder, ResultsPersister |
| Orchestrators | Compose Fetchers + Services + Repos to produce an end-to-end result. The only place where step order is encoded. | IngestionPipeline, ModellingPipeline, RefreshOrchestrator |
This taxonomy is strict. A class that fetches and persists belongs in the Service layer and depends on a Fetcher + a Repo. No back-channels.
4.2 Two pipelines, one direction
Data flows one way only: Ingestion → Repos → Modelling.
- Ingestion writes; never calls Modelling.
- Modelling reads; never calls Fetchers.
If Modelling needs fresh data, it returns "stale" and the caller decides whether to ingest first. This makes Modelling a pure function of repository state, which is the property that makes it reproducible, debuggable, and testable.
4.3 RefreshOrchestrator
Sits above both pipelines. Job:
- Trigger
IngestionPipelinefor a portfolio. - After ingestion completes, ask repos: "did anything change vs the last modelled snapshot?"
- If yes, trigger
ModellingPipeline. If no, return early.
This avoids re-modelling 100k properties when only 200 had refreshed EPC data.
4.4 SQS fanout (preserved from current architecture)
The existing trigger_plan_entrypoint SQS-chunking pattern is kept. Both pipelines fan out per batch of ~30–100 properties (tuneable). Each consumer runs one batch end-to-end through the relevant pipeline.
UPRN partitioning: the trigger endpoint groups UPRNs by locality (postcode prefix / UPRN range) before chunking, so each batch maximises shared upstream fetches (one geospatial-range pull serves all 30 properties in the batch).
4.5 One API or two? (deferred)
The team will decide at implementation time whether Ingestion and Modelling sit behind:
- (a) One unified API with a single trigger endpoint that runs both phases.
- (b) Two APIs, each with its own trigger, RefreshOrchestrator chains them.
Either is workable if the class taxonomy is preserved. Deferred to implementation review.
5. Domain model
5.1 Aggregate root: Property
Property is the centrepiece. Every service operates on one or more Property instances. Every repo writes one slice of Property. The aggregate carries all state for a single property's modelling run.
@dataclass
class PropertyIdentity:
portfolio_id: UUID
uprn: Optional[int]
landlord_property_id: Optional[str]
address: AddressLines
postcode: str
@dataclass
class Property:
identity: PropertyIdentity
# --- Source data — modelling path is determined by which of these are set ---
epc: Optional[EpcPropertyData] # from gov API (or remapped historical)
site_notes: Optional[SiteNotes] # our own survey; supersedes EPC when present
landlord_overrides: Optional[LandlordOverrides] # sparse, only meaningful when epc set
# --- Enrichments ---
geospatial: Optional[GeoSpatial]
solar: Optional[SolarPotential]
epc_anomaly_flags: Optional[EpcAnomalyFlags] # from EpcPredictionService vs neighbours
# --- Modelling outputs ---
baseline_predictions: Optional[BaselinePredictions] # SAP/carbon/heat after rebaselining
recommendations: list[Recommendation]
impact_predictions: Optional[ImpactPredictions]
optimised_package: Optional[OptimisedPackage]
# --- Derived ---
@property
def source_path(self) -> Literal["site_notes", "epc_with_overlay"]: ...
@property
def effective_epc(self) -> EpcPropertyData:
"""The EPC the modelling pipeline actually scores against."""
...
5.2 Properties collection
A first-class iterable, so batch operations are obvious:
@dataclass
class Properties:
items: list[Property]
def __iter__(self) -> Iterator[Property]: ...
def __len__(self) -> int: ...
def filter(self, pred: Callable[[Property], bool]) -> "Properties": ...
def map(self, fn: Callable[[Property], Property]) -> "Properties": ...
def with_landlord_overrides(self) -> "Properties": ...
Services typically take and return Properties, not lists.
5.3 Other aggregates
| Aggregate | Owns | Repo |
|---|---|---|
Property |
property identity, epc, site_notes, landlord_overrides, enrichments, modelling results | PropertyRepo |
Plan |
per-property modelling output, scenario membership, plan + recommendations + parts | RecommendationsRepo |
Scenario |
portfolio-wide scenario metadata | RecommendationsRepo |
Subtask / Task |
SQS fanout state | SubtaskRepo |
EpcCache |
gov-API responses keyed by UPRN, with freshness/TTL | EpcCacheRepo |
GenericData |
UPRN-range geospatial, postcode lookups, shared static data | GenericDataRepo |
Aggregates are loaded whole — never half a Property. If a slice is too large to load eagerly (e.g. recommendation history), it lives in a separate aggregate.
6. Source-of-truth and overlay precedence
There are exactly two modelling paths. The Property.source_path property selects.
6.1 Path 1 — Site notes
If a Property has site_notes and they are newer than any available EPC (or no EPC exists), site notes are the complete source of truth:
effective_epc=site_notes.to_epc_property_data().- EPC fields not covered by site notes — none expected. Site notes are committed to being a full-coverage survey. Treat any gap as a survey-quality bug, not a fallback signal.
LandlordOverridesare not applicable in Path 1 (the survey supersedes).
6.2 Path 2 — EPC with landlord overlay
If a Property has no site notes (or the EPC is newer):
effective_epc=epcwithlandlord_overridesapplied as a sparse field-level overlay (landlord > epc).LandlordOverridesare sparse: each row represents one corrected field. Schema TBD at implementation time; assume flat input via Excel/CSV for v1, with a flag to revisit shape after first customer onboarding.
6.3 Recency tie-break
When a property has both site notes and a public EPC, the newer of the two wins. Rationale: a recent EPC may reflect retrofit work done after our survey; conversely a recent survey reflects on-site observations the EPC cannot capture.
This tie-break is implemented in Property.source_path and may be tuned later (e.g. always prefer surveys regardless of date, or per-portfolio policy).
6.4 Rebaselining trigger
The modelling pipeline re-predicts SAP / carbon / heat / kwh whenever:
effective_epcdiffers from the canonical baseline (i.e. raw EPC with no overrides), or- The previous modelling snapshot is missing or stale.
The exact diff mechanism (hash of effective EPC, dirty-flag on overrides, timestamp comparison) is an implementation detail; recommendation is to start with a content hash stored alongside the previous run.
6.5 Deprecated concepts
- Patches (
patch_epc) — removed. Functionality subsumed byLandlordOverrides. - Already-installed measures — likely subsumed by
LandlordOverrides("we have a heat pump now" → override heating fields). Confirmed at implementation time. - Non-invasive recommendations — TBD whether this concept survives; not blocking.
7. Persistence: repositories and unit of work
7.1 What a repository is
A repository owns the SQL for one aggregate. Nothing else writes SQL for that aggregate. Callers see only domain objects.
class PropertyRepo(Protocol):
def get(self, identity: PropertyIdentity) -> Optional[Property]: ...
def bulk_save(self, uow: UnitOfWork, properties: Properties) -> None: ...
def find_by_portfolio(self, portfolio_id: UUID) -> Properties: ...
def find_stale(self, portfolio_id: UUID, threshold: timedelta) -> Properties: ...
Implementation references current db_funcs.* modules during phase 0 to avoid a big-bang SQL rewrite, but the interface is fixed.
7.2 Unit of Work
Multi-table writes inside a single aggregate, or across aggregates that share a transaction (e.g. property + plan + recommendations) go through a UnitOfWork:
with self.uow_factory() as uow:
self.property_repo.bulk_save(uow, properties)
self.recommendations_repo.bulk_save(uow, plans)
uow.commit()
UoW owns the SQLAlchemy session lifecycle. Repos use the session passed in via the UoW. Outside a UoW, repos use a short-lived read session.
7.3 Repository inventory
| Repo | Tables it owns |
|---|---|
PropertyRepo |
properties, property_details_epc, property_spatial |
EpcCacheRepo |
new table: epc_api_cache (TTL, raw API response, mapped EpcPropertyData) |
SiteNotesRepo |
new table: site_notes (replaces current energy_assessments) |
LandlordOverridesRepo |
new table: landlord_overrides (sparse, per-field rows for audit) |
RecommendationsRepo |
plans, recommendations, recommendation_parts, scenarios |
GenericDataRepo |
new table or S3-backed: UPRN-range geospatial + postcode-keyed shared static data |
SubtaskRepo |
tasks, subtasks (existing) |
DDL migrations are scoped to sub-PRD (iii).
7.4 Fakes
For tests, each repo has a FakeXRepo companion backed by a dict. Service unit tests inject fakes. No DB required.
8. ML contract
8.1 Where ML lives
| Concern | Owner |
|---|---|
| Defining the EPC → features transform | This repo (ara.domain.ml.EpcMlTransform) |
| Loading data, applying transform, writing training parquet to S3 | This repo (sub-PRD (ii) batch job) |
| Training, hyperparameter search, deployment | Autogluon repo |
| Scoring at modelling time | This repo (FeatureBuilder calls EpcMlTransform, sends DataFrame to deployed lambda) |
The autogluon repo is intentionally dumb: it consumes parquet, knows which column is the target, knows which columns to ignore. It has no EPC semantics.
8.2 EpcMlTransform
A separate class (not a method on EpcPropertyData), because:
- The data class stays clean of training-infrastructure concerns.
- Versioned transforms (
EpcMlTransformV1,EpcMlTransformV2) swap easily. - Future need: injection of normalisation stats from the training set is straightforward on a class, awkward on a dataclass.
class EpcMlTransform:
VERSION: str = "1.0.0" # semver
def to_row(self, epc: EpcPropertyData) -> dict[str, Any]: ...
def to_rows(self, properties: Properties) -> pd.DataFrame: ...
def schema(self) -> dict[str, type]: ... # for parquet emission + validation
The interesting work — flattening List[SapWindow], List[SapBuildingPart] into fixed-width columns — lives inside this class. Domain decisions (top-N windows, aggregate roofs, etc.) are encoded here and reviewed by Khalim. Sub-PRD (ii) goes into detail.
8.3 Versioning
- Transform class is semver-tagged (
VERSION = "1.0.0"). - S3 path for training parquet includes the version:
s3://.../training/v1.0.0/.... - Deployed scoring lambda is tagged with the transform version it was trained against.
- Modelling pipeline asserts at startup that its
EpcMlTransform.VERSIONmatches the deployed lambda's tag; mismatch = hard fail at deploy time.
Bump major when removing or renaming columns. Bump minor when adding optional columns (older models still scoreable; new models can be trained against new fields).
8.4 Two model families, one transform
Both ML services use the same transform:
| Service | Lambda | Target |
|---|---|---|
KwhPredictionService (service #5) |
kwh-models-* |
annual kWh + bills |
ImpactPredictionService (service #7) |
impact-models-* |
SAP, carbon, heat demand, post-retrofit kWh |
The two families are trained against the same input feature schema; only target columns differ. Sub-PRD (ii) handles training-time details.
9. Service catalogue
Twelve classes implement the modelling pipeline end-to-end. Detailed signatures are deliberately left for implementers — this PRD documents purpose, dependencies, and rough shape.
9.1 Fetchers (called by IngestionPipeline)
| # | Class | Purpose | Dependencies |
|---|---|---|---|
| F1 | EpcClientService |
Fetches EPCs from new gov API. Already exists at backend/epc_client/. |
httpx |
| F2 | GeospatialFetcher |
Fetches UPRN-range geospatial data (replaces OpenUprnClient use in current engine). |
S3 / Ordnance Survey API |
| F3 | SolarFetcher |
Wraps Google Solar API; building-level + unit-level scenes. | Google Solar API |
| F4 | SiteNotesIngester |
Loads site notes from Excel uploads / structured input. Persists via SiteNotesRepo. |
S3, repo |
9.2 Domain services (called by ModellingPipeline)
| # | Class | Original-list # | Purpose | Reads | Writes |
|---|---|---|---|---|---|
| S1 | EpcRemappingService |
4 | Re-map legacy / historical EPCs into new EpcPropertyData shape. |
EpcCacheRepo |
EpcCacheRepo (mapped column) |
| S2 | EpcPredictionService |
3 | For every property: produce predicted EPC + per-field anomaly flags vs neighbours. Used both for gap-fill (Path 2 if EPC missing) and UI surfacing. | EpcCacheRepo, GenericDataRepo |
— |
| S3 | FeatureBuilder |
(new) | Wraps EpcMlTransform. Converts Properties → scoring DataFrame. |
— | — |
| S4 | KwhPredictionService |
5 | Calls kWh + bills ML lambda; attaches results to Property.baseline_predictions / per-measure. |
FeatureBuilder |
— |
| S5 | RecommendationService |
6 | Generates per-property recommendations using effective_epc, materials, exclusions, etc. Replaces current Recommendations (1383 LOC). |
MaterialsRepo |
— |
| S6 | ImpactPredictionService |
7 | Calls SAP / carbon / heat / bills impact lambda for each recommendation. | FeatureBuilder |
— |
| S7 | OptimiserService |
8 | Produces optimised retrofit packages. Wraps current CostOptimiser / GainOptimiser / optimise_with_scenarios. |
— | — |
| S8 | ResultsPersister |
9 | Final step: writes plans, recommendations, property updates via repos under one UoW. | — | All write repos |
9.3 Orchestrators
| # | Class | Purpose |
|---|---|---|
| O1 | IngestionPipeline |
Per-batch SQS consumer. Calls F1–F4, persists via repos. |
| O2 | ModellingPipeline |
Per-batch SQS consumer. Reads from repos, runs S1→S8 in order, ends with persistence. |
| O3 | RefreshOrchestrator |
Top-level: triggers Ingestion → diff → optionally Modelling. |
9.4 ModellingPipeline step order
For each Property in the batch:
1. PropertyRepo.get() → Property (epc, site_notes, overrides, geospatial, solar)
2. EpcRemappingService — if epc is in legacy schema, upgrade to current
3. EpcPredictionService — produce predicted EPC + anomaly flags (always runs)
4. Compute Property.effective_epc (path-1 or path-2)
5. KwhPredictionService — baseline kwh + bills
6. RecommendationService — generate candidate measures
7. ImpactPredictionService — predict per-measure impact
8. OptimiserService — select optimal package
9. KwhPredictionService — re-score on optimised package (tenant savings)
10. ResultsPersister — write Plan + Recommendations under one UoW
Steps 1–4 are per-property. Steps 5–9 batch the whole batch into one ML call where possible (the lambdas accept a DataFrame; today's code already batches).
9.5 Per-service contracts — deferred
Method signatures, return types, error semantics, and edge-case behaviour are explicitly out of scope for this PRD. The implementer of each service runs a /grill-me session against this document and produces a detailed sub-design before coding.
10. Cross-batch concerns
| Concern | Status | Approach |
|---|---|---|
| Building-level solar adjustment | Deferred — future TODO, not implemented today. | The current building_ids block in model_engine is dead-ish; it operates on the in-process batch only. New design preserves that limitation. Future feature: a post-modelling consolidation pass that groups results by building_id across batches and re-optimises. |
| Portfolio aggregation | Dropped. | Front-end computes aggregations dynamically from per-property plans. extract_portfolio_aggregation_data in current engine is dead code (defined, never called) — deleting. |
| Shared upstream data | Handled by orchestrator partitioning + GenericDataRepo. |
Trigger endpoint groups UPRNs by postcode / UPRN-range before SQS chunking so each batch maximises intra-batch sharing. GenericDataRepo caches across batches so first batch pays, subsequent batches hit cache. |
11. Directory layout
Proposal — team to tweak.
ara/ # new top-level package, sibling of backend/
├── domain/
│ ├── __init__.py
│ ├── property.py # Property aggregate
│ ├── properties.py # Properties collection
│ ├── identity.py # PropertyIdentity, AddressLines
│ ├── site_notes.py # SiteNotes (replaces energy_assessment)
│ ├── landlord_overrides.py
│ ├── geospatial.py
│ ├── solar.py
│ ├── recommendations.py # Recommendation, OptimisedPackage
│ ├── predictions.py # BaselinePredictions, ImpactPredictions
│ ├── anomaly_flags.py # EpcAnomalyFlags
│ └── ml/
│ ├── __init__.py
│ ├── transform.py # EpcMlTransform (versioned)
│ └── schema.py # scoring DataFrame schema
│
├── fetchers/
│ ├── __init__.py
│ ├── epc_client.py # alias / re-export of backend/epc_client/
│ ├── geospatial.py
│ ├── solar.py
│ └── site_notes_ingester.py
│
├── repos/
│ ├── __init__.py
│ ├── unit_of_work.py
│ ├── property_repo.py
│ ├── epc_cache_repo.py
│ ├── site_notes_repo.py
│ ├── landlord_overrides_repo.py
│ ├── recommendations_repo.py
│ ├── generic_data_repo.py
│ └── subtask_repo.py
│
├── services/
│ ├── __init__.py
│ ├── epc_remapping.py
│ ├── epc_prediction.py # nearby-similar + anomaly flags
│ ├── feature_builder.py # uses domain.ml.EpcMlTransform
│ ├── kwh_prediction.py
│ ├── impact_prediction.py
│ ├── recommendation.py
│ ├── optimiser.py # wraps recommendations/optimiser/
│ └── results_persister.py
│
├── orchestrators/
│ ├── __init__.py
│ ├── ingestion_pipeline.py
│ ├── modelling_pipeline.py
│ └── refresh_orchestrator.py
│
├── api/
│ ├── __init__.py
│ ├── routers/
│ │ ├── ingestion.py # if two APIs
│ │ └── modelling.py
│ └── schemas/ # request/response Pydantic models
│
└── tests/
├── fakes/ # FakePropertyRepo, FakeEpcClient, etc.
├── unit/ # service tests using fakes only
└── integration/ # real DB + real SQS via localstack
backend/ continues to host the legacy code during phase 1. Once the last portfolio is migrated, backend/engine/, backend/SearchEpc.py, backend/Property.py are deleted.
Reused intact (no rewrite needed):
backend/epc_client/— the new gov API client. Wrapped byara/fetchers/epc_client.py.datatypes/epc/domain/— the new EPC schema.Property.epc: EpcPropertyDatareferences it directly.recommendations/optimiser/— wrapped byara/services/optimiser.py.backend/app/db/— repos delegate intodb_funcs.*until the SQL is rewritten under sub-PRD (iii).
12. Testing strategy
12.1 Unit tests (the bulk)
Every service test injects fake fetchers and fake repos. No DB, no network, no ML lambda. A service test verifies one slice of logic in 5–30 lines.
Example:
def test_epc_prediction_flags_anomalous_wall_type():
neighbours = [_make_epc(wall_construction="solid") for _ in range(5)]
target = _make_property(epc=_make_epc(wall_construction="cavity"))
repo = FakeGenericDataRepo(neighbours_by_postcode={target.identity.postcode: neighbours})
svc = EpcPredictionService(generic_repo=repo)
result = svc.run(Properties([target]))
assert result[0].epc_anomaly_flags.wall_construction == "differs_from_neighbours"
12.2 Integration tests
One per pipeline (Ingestion, Modelling, Refresh). Real Postgres (testcontainers or localstack), fake fetchers (hitting recorded fixtures), fake ML lambdas (returning canned predictions). Catches schema / SQL / transaction issues.
12.3 Contract tests
The transform (EpcMlTransform) has its own test suite:
- Golden file: given a fixed
Property, output matches an expected DataFrame row exactly. - Schema test: the output columns exactly match a checked-in CSV header (so autogluon team sees breakage on PR).
12.4 What is NOT tested
- The autogluon repo's training code — owned there.
- The gov EPC API behaviour — assumed via the official spec.
- Front-end aggregation logic — owned there.
13. Observability
Each pipeline step emits a structured log line at start and end with:
{step, property_id, uprn, portfolio_id, subtask_id, duration_ms, outcome, error?}
Errors propagate with the Property.identity attached, so a portfolio of 100k can be triaged by grep.
The existing task/subtask state machine is preserved — IngestionPipeline and ModellingPipeline update subtask status at start (in progress), end (complete / failed), with the CloudWatch log URL attached as today.
CloudWatch alarms exist on subtask failure rate; thresholds remain unchanged.
14. Data flow: a worked example
A landlord uploads a corrected heating system for UPRN 12345 via the UI.
- UI →
POST /properties/12345/overrides→ writes tolandlord_overridestable viaLandlordOverridesRepo. - RefreshOrchestrator invoked (either automatically on override-write, or by a "re-model" button). Notes: ingestion is not triggered because no external state changed.
- ModellingPipeline invoked on a batch of
[12345]:- Reads
Property(uprn=12345)fromPropertyRepo. Property.effective_epc= epc + landlord_overrides → heating system fields differ from baseline.- Rebaselining triggered:
KwhPredictionServicere-predicts baseline SAP / carbon / heat / kwh. RecommendationServiceregenerates recommendations against the new baseline.OptimiserServicere-picks optimal package.ResultsPersisterwrites new plan under one UoW (old plan is superseded; whether to soft-archive is a sub-PRD (iii) decision).
- Reads
Total external calls: zero. The override write is the only thing that hit a network boundary, and that was the inbound HTTP from the UI.
15. Open questions for team review
- One API vs two (§4.5) — clean interfaces allow either; pick at implementation.
LandlordOverridesshape (§6.2) — flat-Excel-shape for v1, with a flag to revisit after first customer.already_installedandnon_invasive_recommendations(§6.5) — both likely subsumed by overlay, but final call deferred.- Recency tie-break policy (§6.3) — default "newer wins"; team to consider per-portfolio override.
GenericDataRepostorage backend — Postgres table, S3, or DynamoDB. Postgres is the path of least infra change; recommend defaulting to that.- Soft-archive vs hard-overwrite for superseded plans (§14) — affects audit / undo behaviour. Defer to sub-PRD (iii).
- Building-level optimisation as a Phase 2 service (§10) — agreed deferred; flag for roadmap discussion.
- Transform versioning policy (§8.3) — semver chosen; team to confirm bump conventions.
16. Linked sub-PRDs (placeholders)
- Sub-PRD (ii) — ML training pipeline —
docs/sub-prds/ml-training-pipeline.md(TBC) - Sub-PRD (iii) — DB schema migration —
docs/sub-prds/db-schema-migration.md(TBC) - Sub-PRD (iv) — Historical EPC re-mapping —
docs/sub-prds/historical-epc-remap.md(TBC)
Each sub-PRD owner: TBC. Each is independently reviewable but consumes the contracts defined in §5 (Property aggregate), §7 (repos), §8 (ML transform).
17. Next steps
- Team review of this PRD (target: ~1 week).
- Open follow-up grill sessions per service (
/grill-meon each of S1–S8 + F1–F4) before that service is implemented. - Break into issues via
/to-issuesagainst the project tracker. - Stand up the empty
ara/package skeleton + fakes + first integration-test scaffold as PR-1. - Land services in dependency order: domain → repos → fetchers → services → orchestrators → API.
Phase 1 milestone gate: first portfolio routed through new pipeline with parity against old engine (manual spot-check on 5 representative properties).