Design WIP: bulk_upload_finaliser + property_overrides
Status: v1 fully resolved (grilling 2026-06-04). Ready to graduate to ADR(s). v2 (property_overrides population) is deferred to its own session; see the "Input" application-flow item for its entry point. When decisions stabilise this should graduate into a new ADR in docs/adr/ (frontend) and likely a companion ADR in the Model repo, plus a CONTEXT.md update (see "Docs to update").
Goal
Two linked pieces of work:
- New backend application bulk_upload_finaliser (lives in /workspaces/home/github/Model/applications/, DDD-aligned; study /workspaces/home/github/Model/domain). It reads the address-matching / combiner output and the landlord_*_overrides vocabulary tables, then writes Postgres correctly: the property rows (UPRN + address, as the frontend does today) and, later, the new property_overrides rows. Motivation: a property list can be ~40,000 rows, too big for a synchronous Next.js HTTP handler.
- New property_overrides table: the per-Property fact layer that ADR-0004 explicitly deferred ("the per-Property building-part fact layer that consumes multiEntryOrdering and writes main/extension facts at finalise"). One row per (property, building_part, component) carrying the resolved enum value + provenance.
Split into two pieces (decided 2026-06-04):
- v1: async finaliser writes property. Move today's synchronous Next.js /finalize property insert into a dispatched Lambda (bulk_upload_finaliser), because a property list can be ~40,000 rows. Reproduces the exact 9-column insert with onConflictDoNothing, adds the finalising status + async state machine, and shifts terminal-status ownership to the backend. Fully designed and ADR-ready.
- v2: populate property_overrides. The per-Property fact layer. The table already shipped (migration 0221, PR #306), but population is a separate follow-up with its own open input-plumbing questions (see the "Input" item under application-flow questions). Not designed here.
This doc resolves v1 in full; v2 gets its own grilling session against real classifier-CSV / combiner-output samples.
Where this sits in the existing pipeline

    BulkUpload → address matching → combiner → awaiting_review → [Finalise]
                                                                     │
                          (new) bulk_upload_finaliser ───────────────┘
                            reads:  combiner output (S3) + landlord_*_overrides
                            writes: property (+ later property_overrides)
                                    │
                            downstream: Ingestion (EPC/solar fetch)
                                      → PropertyBaseline (stage 2,
                                        re-score-on-override seam,
                                        Model ADR-0011/0012)

Finalise (the user action + state-machine gate) stays in Next.js; the new
application is the worker it dispatches. Downstream PropertyBaseline
already has an override-aware "re-score" seam — property_overrides will feed it.
Decisions locked
| # | Decision |
|---|---|
| Name | Application is bulk_upload_finaliser, in Model/applications/. Finalise stays the Next.js action that triggers it. |
| DDD | Follow the DDD structure under Model/domain. Domain terms discovered as needed. |
| Schema ownership | Drizzle (frontend) owns migrations for both property and the new property_overrides. |
| Backend access | Backend gets a PropertyOverrideRow SQLModel (mirror, like landlord_wall_type_override_table.py) + a repository (see Model/infrastructure/postgres + Model/repositories for examples). PropertyRow must drop its "backend never inserts" invariant and gain insertable columns. |
| Next.js /finalize | Delete it; fully replaced by the Lambda. |
| property_overrides shape | Single polymorphic table, not per-component tables. Accepts losing DB-level pgEnum typing on value. |
| override_value | text: a denormalised snapshot copy (own value per row) of the resolved enum from landlord_*_overrides at materialise time. Own-value (not an FK to the vocabulary) is what lets two properties sharing a description later diverge, and lets re-run recalculate one property's value without touching its siblings. |
| Snapshot, not FK to vocabulary | property_overrides does not foreign-key the originating landlord_*_overrides row. An FK forces every property sharing a description to share one value (forbids divergence); is structurally impossible as a real FK (4 polymorphic target tables, each with its own value enum); and would risk cascade-deleting per-property facts when re-classification prunes a vocabulary row. Lineage is preserved as a natural key — (portfolio_id, override_component, original_spreadsheet_description) re-finds the vocabulary row (its UNIQUE is (portfolio_id, description)) — so deliberate re-sync needs no surrogate FK. |
| Re-run = recalculate | The finaliser write to property_overrides is onConflictDoUpdate on (property_id, override_component, building_part), refreshing override_value + original_spreadsheet_description + updated_at to the latest resolution. Contrast property, which stays onConflictDoNothing (identity rows, don't churn). When per-property source='user' edits exist, the update must guard WHERE source='classifier' to preserve hand-edits (mirrors the Model classifier upsert). |
| building_part | smallint NOT NULL, explicit index: 0 = main building, 1 = extension 1, 2 = extension 2, … (matches ADR-0004 multiEntryOrdering.permutations indexing). |
| Whole-dwelling components | No special case. property_type/built_form are per-part-capable too (an extension — conservatory, summer house — can be a different built form / property type). Today's files only supply them once, so they'll usually be written at building_part = 0 only, but the schema allows per-part with no future migration. |
property_overrides — columns so far
Roughly (subject to remaining open questions):

    property_overrides
      id                                uuid pk (default random)  -- match landlord_* tables
      property_id                       bigint NOT NULL  FK → property.id (FE-owned table)
      portfolio_id                      bigint NOT NULL  FK → portfolio.id
      building_part                     smallint NOT NULL  -- 0 = main, 1 = ext 1, 2 = ext 2, …
      override_component                override_component NOT NULL  -- column name == enum type name; pgEnum {wall_type, roof_type, property_type, built_form_type} (Q6 ✓)
      override_value                    text NOT NULL  -- snapshot copy of landlord_* resolved enum (free text; `override_component` carries the typing)
      -- (no `source`; dropped in Q9): pure value snapshot; add back as nullable column if/when a per-property edit path needs provenance
      original_spreadsheet_description  text NOT NULL  -- raw spreadsheet cell text this snapshot resolved from (Q7 ✓)
      created_at                        timestamptz NOT NULL default now()
      updated_at                        timestamptz NOT NULL default now()
      -- UNIQUE (property_id, override_component, building_part)  -- Q8 ✓ (source NOT in key, mirroring ADR-0004 single-row flip; portfolio_id implied by property_id)
      -- FK property_id → property.id ON DELETE CASCADE; portfolio_id → portfolio.id (Drizzle only; bare bigint in SQLModel mirror); portfolio_id kept (matches property_details_epc / property_targets)

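
The unique key and the "Re-run = recalculate" decision can be illustrated with a minimal sketch. It uses Python's stdlib sqlite3 as a stand-in for Postgres, so the types are simplified (no pgEnum, uuid or FKs). The `materialise` helper and the sample enum values and descriptions are made up for illustration; they are not taken from the real vocabulary tables or the Model repo.

```python
import sqlite3

# SQLite stand-in for the Postgres table (assumption: simplified types, no
# pgEnum/uuid/FKs). Only the conflict key and the refreshed columns matter here.
DDL = """
CREATE TABLE property_overrides (
    property_id        INTEGER NOT NULL,
    portfolio_id       INTEGER NOT NULL,
    building_part      INTEGER NOT NULL,  -- 0 = main, 1 = ext 1, ...
    override_component TEXT    NOT NULL,
    override_value     TEXT    NOT NULL,
    original_spreadsheet_description TEXT NOT NULL,
    updated_at         TEXT    NOT NULL DEFAULT CURRENT_TIMESTAMP,
    UNIQUE (property_id, override_component, building_part)
)
"""

# SQL equivalent of onConflictDoUpdate on the three-column key.
UPSERT = """
INSERT INTO property_overrides
    (property_id, portfolio_id, building_part, override_component,
     override_value, original_spreadsheet_description)
VALUES (?, ?, ?, ?, ?, ?)
ON CONFLICT (property_id, override_component, building_part) DO UPDATE SET
    override_value = excluded.override_value,
    original_spreadsheet_description = excluded.original_spreadsheet_description,
    updated_at = CURRENT_TIMESTAMP
"""


def materialise(conn: sqlite3.Connection, rows: list[tuple]) -> None:
    """Hypothetical helper: upsert one snapshot row per (property, component, part)."""
    conn.executemany(UPSERT, rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute(DDL)

# First run: main building and extension 1 each get their own wall_type row.
materialise(conn, [
    (1, 10, 0, "wall_type", "cavity", "Cavity"),
    (1, 10, 1, "wall_type", "solid_brick", "Solid brick"),
])
# Re-run with a changed resolution for part 0: refreshed in place, no new row.
materialise(conn, [(1, 10, 0, "wall_type", "timber_frame", "Timber frame")])

snapshot = conn.execute(
    "SELECT building_part, override_value FROM property_overrides "
    "ORDER BY building_part"
).fetchall()
print(snapshot)
```

Because building_part is part of the key, the extension row is untouched by the re-run; only the part-0 row is overwritten.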
Open questions (resume here)
- Q6: component discriminator. RESOLVED 2026-06-04. pgEnum override_component (column override_component) with values wall_type, roof_type, property_type, built_form_type. Verified these are the exact keys used both in the frontend (columnFields.ts:30-33) and the backend (ClassifiableColumn.name / handler _build_columns()), so the finaliser maps category → component with no translation. pgEnum over text: small closed set, typos caught at write time, and this is now the only DB-level typing left on a row, since override_value is free text. New component = one-line ALTER TYPE … ADD VALUE (Drizzle-owned). Enum named override_* (not property_*) to sit with override_source and stay visually distinct from the existing value enum property_type.
- Q7: store raw description per override row? RESOLVED 2026-06-04: yes, as original_spreadsheet_description text NOT NULL. Names the source artifact (the spreadsheet cell), not an actor; this sidesteps the Landlord-vs-User conflation the glossary warns against, and aligns with CONTEXT.md's "the source file". Stored because override_value is a denormalised snapshot that deliberately won't refresh on later vocabulary edits; pinning the original text makes each row self-explaining and re-resolvable even after the source landlord_*_overrides row changes. NOT NULL is safe iff every property_overrides row is materialised from a landlord_*_overrides row (whose description is itself NOT NULL); confirm when settling Q9/source semantics.
- Q8: uniqueness + FKs. RESOLVED 2026-06-04. UNIQUE (property_id, override_component, building_part). building_part is in the key (part 0 and part 1 both carry e.g. a wall_type row). source is deliberately not in the key, mirroring ADR-0004's single-row flip (one row, flip source in place; the two-row model was rejected). portfolio_id is not in the key (implied by property_id) but is kept as a column for query ergonomics and consistency with property_details_epc / property_targets, which both denormalise it. FKs: property_id → property.id ON DELETE CASCADE; portfolio_id → portfolio.id ON DELETE CASCADE in the Drizzle migration, but a bare bigint (no FK) in the backend PropertyOverrideRow SQLModel mirror, matching landlord_wall_type_override_table.py.
- Q9: source semantics. RESOLVED 2026-06-04: drop source entirely. property_overrides is a pure snapshot of resolved values. Rationale: there is no per-property override concept today (per ADR-0004 edits happen at the vocabulary/portfolio level, flipped in place), so a copied source would describe the vocabulary mapping's provenance, not this property's. That is a footgun a reader/re-score rule could misread, and no consumer needs it in v1. When a genuine per-property edit path lands (the real use for per-property provenance), source returns as an additive nullable-column migration; no need to carry it now. This also confirms the Q7 NOT NULL contingency: every row is still materialised from a landlord_*_overrides row (description NOT NULL).
- Q-scope: v1 scope. RESOLVED 2026-06-04. v1 = the finaliser reproduces today's exact 9-column property insert (portfolio_id, creation_status='READY', uprn, landlord_property_id ← Internal Reference, address = matched ?? user-inputted, postcode, user_inputted_address, user_inputted_postcode, lexiscore) + onConflictDoNothing on (portfolio_id, uprn) where uprn is not null, not a reduced "UPRN + address". This sizes the "PropertyRow gains insertable columns" decision to all nine columns, creation_status included. The property_overrides table shipped ahead (migration 0221, PR #306) but is not populated in v1; population is follow-up work (and needs a different input source; see combiner-output note below).
Application-flow questions not yet reached
- Trigger + orchestration. RESOLVED 2026-06-04. Mirror the start-address-matching path. Next.js creates a SubTask (service: "finaliser") under the BulkUpload's existing Task, then POSTs to a new FastAPI endpoint, POST /v1/bulk-uploads/trigger-finaliser (auth via validate_token), which enqueues to a new SQS queue; a Lambda runs the finaliser wrapped in @subtask_handler (auto-injected TaskOrchestrator; run_subtask owns the subtask start/complete/fail + Task cascade). Trigger body FinaliserTriggerBody { task_id, sub_task_id, s3_uri (combined output), portfolio_id } (extends SubtaskTriggerBody). Slow work stays outside the txn; persistence in a commit_scope. The synchronous Next.js /finalize route is deleted (locked).
- State machine / who writes complete. RESOLVED 2026-06-04. New status finalising between awaiting_review and complete (mirrors combining before awaiting_review). Lifecycle: awaiting_review → finalising → complete (↘ failed).
  - finalising is written by Next.js at dispatch via a compare-and-swap: UPDATE … SET status='finalising' WHERE id=? AND status='awaiting_review'; 0 rows ⇒ already dispatched ⇒ 409. This is the double-dispatch guard (closes the simultaneous-click race under loadForFinalize's existing precondition).
  - complete / failed are written by the Lambda directly to bulk_address_uploads (new set_finalized_status / set_failed_status, exactly like the combiner's set_combining_status / set_combined_output_s3_uri). markFinalized + the Next.js /finalize route are deleted.
  - CONTEXT.md "Two writers" change: Next.js owns dispatch + the awaiting_review → finalising CAS; the backend owns finalising → complete and → failed (in addition to combining / awaiting_review).
  - UI vs canonical: persisted enum value is finalising (canonical; ties to the Finalise action). The frontend renders it as "Uploading to ARA", a display-layer label only, not the enum name, so UX copy never needs a migration.
- Input: does the combiner output carry the raw description cells? RESOLVED 2026-06-04: NO. This is a v2 problem (deferred). v1 needs only address/UPRN columns, all confirmed present in the combiner output (address2uprn_uprn, address2uprn_address, address2uprn_lexiscore, Internal Reference, Address 1/2/3, postcode). The raw Walls / Roofs / Property Type / Built Form cells are not in the combiner output; they survive only in (a) the {uploadId}-classifier.csv on S3 (original headers) and (b) landlord_*_overrides as resolved values keyed by description. So v2 population must assemble four inputs, not one file:
  - property_id (identity) ← combiner output (portfolio_id, uprn), but no-UPRN rows have no such key;
  - raw cell text ← the classifier CSV (not the combiner output);
  - cell → building-part split ← multiEntryOrdering on bulk_address_uploads;
  - description → override_value ← landlord_*_overrides (normalized description).
  - Two open v2 hazards (entry point for the v2 session): (1) the join key between classifier CSV and combiner output: is there a stable per-row key (Internal Reference?) and is row order preserved through postcode-split + combine? (2) obtaining property_id for unmatched (no-UPRN) rows: v1's onConflictDoNothing returns no ids, so v2 likely needs RETURNING id mapped back to source rows.
- Idempotency / re-run. RESOLVED 2026-06-04 (per-table). property: keep today's onConflictDoNothing on (portfolio_id, uprn) where uprn is not null; existing properties are not churned. property_overrides: onConflictDoUpdate on (property_id, override_component, building_part), recalculating override_value + original_spreadsheet_description + updated_at to the latest resolution, so an existing property whose override changed is refreshed in place. Guard WHERE source='classifier' once a per-property user-edit path exists (until then every row is classifier-derived, so blind overwrite is correct). See the "Re-run = recalculate" and "Snapshot, not FK" locked decisions.
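
The compare-and-swap dispatch guard from the state-machine item can be sketched like this. The real write lives in Next.js; Python with stdlib sqlite3 is used here only to show the semantics. The `try_start_finalising` name and the two-column table are hypothetical.

```python
import sqlite3


def try_start_finalising(conn: sqlite3.Connection, upload_id: int) -> bool:
    """Hypothetical CAS: move awaiting_review -> finalising exactly once.

    Returns True for the caller that won the transition; a False result is
    what the dispatch route would translate into a 409.
    """
    cur = conn.execute(
        "UPDATE bulk_address_uploads SET status = 'finalising' "
        "WHERE id = ? AND status = 'awaiting_review'",
        (upload_id,),
    )
    conn.commit()
    # rowcount is 0 when the row was already moved on by an earlier dispatch.
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bulk_address_uploads (id INTEGER PRIMARY KEY, status TEXT NOT NULL)"
)
conn.execute(
    "INSERT INTO bulk_address_uploads (id, status) VALUES (1, 'awaiting_review')"
)

first = try_start_finalising(conn, 1)   # first click wins the transition
second = try_start_finalising(conn, 1)  # second click finds no matching row
print(first, second)
```

The status check and the write happen in a single statement, so two simultaneous dispatches cannot both observe awaiting_review and both proceed.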
Key code references (from exploration)
Frontend (assessment-model):
- finalize/route.ts: today's synchronous property insert (to be deleted).
- property.ts: property table (property_type, built_form columns exist; uq_property_portfolio_uprn).
- landlord_overrides.ts: the four per-component override tables + all pgEnums (wallTypeEnum, roofTypeEnum, propertyTypeEnum, builtFormTypeEnum, overrideSourceEnum).
- bulk_address_uploads.ts: multiEntryOrdering (permutations, 0=main), multiEntrySummary, verifyAck, combinedOutputS3Uri.
- ADR-0004: defers exactly this fact layer; the building-part ordering model.
- ADR-0002: vocabulary layer.
Backend (Model):
- applications/landlord_description_overrides/handler.py: the worker pattern to mirror (subtask_handler, TaskOrchestrator, trigger body, commit_scope).
- infrastructure/postgres/landlord_wall_type_override_table.py: SQLModel mirror pattern for the new PropertyOverrideRow.
- infrastructure/postgres/landlord_override_enums.py: shared override_source SAEnum pattern.
- infrastructure/postgres/property_table.py: PropertyRow defensive view ("backend never inserts"; to change).
- repositories/landlord_overrides/landlord_override_repository.py: repository pattern for the new override repo.
- orchestration/landlord_description_overrides_orchestrator.py: orchestrator pattern; note it splits cells into an orderless set (discards part order, which is recovered via multiEntryOrdering).
- Downstream: orchestration/property_baseline_orchestrator.py (re-score-on-override seam), orchestration/ingestion_orchestrator.py.
Docs to update (when this lands)
- CONTEXT.md: Property Type / built_form are per-part-capable, not whole-dwelling. Add the per-Property fact layer (property_overrides) to the glossary + relationships. Possibly a building_part index definition.
- New ADR (frontend) for property_overrides + finaliser; companion Model ADR for the cross-repo write, citing ADR-0003/0004.