Design WIP: `bulk_upload_finaliser` + `property_overrides`

Status: v1 fully resolved (grilling 2026-06-04). Ready to graduate to ADR(s). v2 (property_overrides population) deferred to its own session — see the "Input" application-flow item for its entry point. When decisions stabilise this should graduate into a new ADR in docs/adr/ (frontend) and likely a companion ADR in the Model repo, plus a CONTEXT.md update (see "Docs to update").

Goal

Two linked pieces of work:

New backend application bulk_upload_finaliser (lives in /workspaces/home/github/Model/applications/, DDD-aligned — study /workspaces/home/github/Model/domain). It reads the address-matching / combiner output and the landlord_*_overrides vocabulary tables, then writes Postgres correctly: the property rows (UPRN + address, as the frontend does today) and — later — the new property_overrides rows. Motivation: a property list can be ~40,000 rows, too big for a synchronous Next.js HTTP handler.
New property_overrides table — the per-Property fact layer that ADR-0004 explicitly deferred ("the per-Property building-part fact layer that consumes multiEntryOrdering and writes main/extension facts at finalise"). One row per (property, building_part, component) carrying the resolved enum value + provenance.

Split into two pieces (decided 2026-06-04):

v1 — async finaliser writes property. Move today's synchronous Next.js /finalize property-insert into a dispatched Lambda (bulk_upload_finaliser), because a property list can be ~40,000 rows. Reproduces the exact 9-column insert
- onConflictDoNothing, adds the finalising status + async state machine, and shifts terminal-status ownership to the backend. Fully designed — ADR-ready.
v2 — populate property_overrides. The per-Property fact layer. The table already shipped (migration 0221, PR #306), but population is a separate follow-up with its own open input-plumbing questions (see the "Input" item under application-flow questions). Not designed here.

This doc resolves v1 in full; v2 gets its own grilling session against real classifier-CSV / combiner-output samples.

Where this sits in the existing pipeline

BulkUpload → address matching → combiner → awaiting_review → [Finalise]
                                                                  │
                            (new) bulk_upload_finaliser ──────────┘
                              reads: combiner output (S3) + landlord_*_overrides
                              writes: property (+ later property_overrides)
                                                                  │
                                          downstream: Ingestion (EPC/solar fetch)
                                                    → PropertyBaseline (stage 2,
                                                      re-score-on-override seam,
                                                      Model ADR-0011/0012)

Finalise (the user action + state-machine gate) stays in Next.js; the new application is the worker it dispatches. Downstream PropertyBaseline already has an override-aware "re-score" seam — property_overrides will feed it.

Decisions locked

#	Decision
Name	Application is `bulk_upload_finaliser`, in `Model/applications/`. `Finalise` stays the Next.js action that triggers it.
DDD	Follow the DDD structure under `Model/domain`. Domain terms discovered as needed.
Schema ownership	Drizzle (frontend) owns migrations for both `property` and the new `property_overrides`.
Backend access	Backend gets a `PropertyOverrideRow` SQLModel (mirror, like `landlord_wall_type_override_table.py`) + a repository (see `Model/infrastructure/postgres` + `Model/repositories` for examples). `PropertyRow` must drop its "backend never inserts" invariant and gain insertable columns.
Next.js `/finalize`	Delete it — fully replaced by the Lambda.
`property_overrides` shape	Single polymorphic table, not per-component tables. Accepts losing DB-level pgEnum typing on `value`.
`override_value`	`text` — a denormalised snapshot copy (own value per row) of the resolved enum from `landlord__overrides` at materialise time. Own-value (not an FK to the vocabulary) is what lets two properties sharing a description later diverge*, and lets re-run recalculate one property's value without touching its siblings.
Snapshot, not FK to vocabulary	`property_overrides` does not foreign-key the originating `landlord__overrides` row. An FK forces every property sharing a description to share one value (forbids divergence); is structurally impossible as a real FK (4 polymorphic target tables, each with its own value enum); and would risk cascade-deleting per-property facts when re-classification prunes a vocabulary row. Lineage is preserved as a natural key* — `(portfolio_id, override_component, original_spreadsheet_description)` re-finds the vocabulary row (its `UNIQUE` is `(portfolio_id, description)`) — so deliberate re-sync needs no surrogate FK.
Re-run = recalculate	The finaliser write to `property_overrides` is `onConflictDoUpdate` on `(property_id, override_component, building_part)`, refreshing `override_value` + `original_spreadsheet_description` + `updated_at` to the latest resolution. Contrast `property`, which stays `onConflictDoNothing` (identity rows, don't churn). When per-property `source='user'` edits exist, the update must guard `WHERE source='classifier'` to preserve hand-edits (mirrors the Model classifier upsert).
`building_part`	`smallint NOT NULL`, explicit index: `0 = main building, 1 = extension 1, 2 = extension 2, …` (matches ADR-0004 `multiEntryOrdering.permutations` indexing).
Whole-dwelling components	No special case. `property_type`/`built_form` are per-part-capable too (an extension — conservatory, summer house — can be a different built form / property type). Today's files only supply them once, so they'll usually be written at `building_part = 0` only, but the schema allows per-part with no future migration.

`property_overrides` — columns so far

Roughly (subject to remaining open questions):

property_overrides
  id            uuid pk (default random)         -- match landlord_* tables
  property_id   bigint  NOT NULL  FK → property.id   (FE-owned table)
  portfolio_id  bigint  NOT NULL  FK → portfolio.id
  building_part  smallint NOT NULL  -- 0 = main, 1 = ext 1, 2 = ext 2, …
  override_component  override_component NOT NULL  -- column name == enum type name; pgEnum {wall_type, roof_type, property_type, built_form_type} (Q6 ✓)
  override_value text   NOT NULL   -- snapshot copy of landlord_* resolved enum (free text; `override_component` carries the typing)
  -- (no `source`) — dropped Q9: pure value snapshot; add back as nullable column if/when a per-property edit path needs provenance
  original_spreadsheet_description text NOT NULL  -- raw spreadsheet cell text this snapshot resolved from (Q7 ✓)
  created_at    timestamptz NOT NULL default now()
  updated_at    timestamptz NOT NULL default now()
  -- UNIQUE (property_id, override_component, building_part)   -- Q8 ✓ (source NOT in key — mirrors ADR-0004 single-row flip; portfolio_id implied by property_id)
  -- FK property_id → property.id ON DELETE CASCADE; portfolio_id → portfolio.id (Drizzle only; bare bigint in SQLModel mirror); portfolio_id kept (matches property_details_epc / property_targets)

Open questions (resume here)

Q6 — component discriminator. RESOLVED 2026-06-04. pgEnum override_component (column component) with values wall_type, roof_type, property_type, built_form_type. Verified these are the exact keys used both in the frontend (columnFields.ts:30-33) and the backend (ClassifiableColumn.name / handler _build_columns()), so the finaliser maps category → component with no translation. pgEnum over text: small closed set, typos caught at write time — and this is now the only DB-level typing left on a row, since override_value is free text. New component = one-line ALTER TYPE … ADD VALUE (Drizzle-owned). Enum named override_* (not property_*) to sit with override_source and stay visually distinct from the existing value enum property_type.
Q7 — store raw description per override row? RESOLVED 2026-06-04: yes, as original_spreadsheet_description text NOT NULL. Names the source artifact (the spreadsheet cell), not an actor — sidesteps the Landlord-vs-User conflation the glossary warns against, and aligns with CONTEXT.md's "the source file". Stored because override_value is a denormalised snapshot that deliberately won't refresh on later vocabulary edits; pinning the original text makes each row self-explaining and re-resolvable even after the source landlord_*_overrides row changes. NOT NULL is safe iff every property_overrides row is materialised from a landlord_*_overrides row (whose description is itself NOT NULL) — confirm when settling Q9/source semantics.
Q8 — uniqueness + FKs. RESOLVED 2026-06-04. UNIQUE (property_id, override_component, building_part). building_part is in the key (part 0 and part 1 both carry e.g. a wall_type row). source is deliberately not in the key — mirrors ADR-0004's single-row-flip (one row, flip source in place; the two-row model was rejected). portfolio_id is not in the key (implied by property_id) but is kept as a column for query ergonomics and consistency with property_details_epc / property_targets, which both denormalise it. FKs: property_id → property.id ON DELETE CASCADE; portfolio_id → portfolio.id ON DELETE CASCADE in the Drizzle migration, but a bare bigint (no FK) in the backend PropertyOverrideRow SQLModel mirror — matching landlord_wall_type_override_table.py.
Q9 — source semantics. RESOLVED 2026-06-04: drop source entirely. property_overrides is a pure snapshot of resolved values. Rationale: there is no per-property override concept today (per ADR-0004 edits happen at the vocabulary/portfolio level, flipped in place), so a copied source would describe the vocabulary mapping's provenance, not this property's — a footgun a reader/re-score rule could misread, and no consumer needs it in v1. When a genuine per-property edit path lands (the real use for per-property provenance), source returns as an additive nullable-column migration — no need to carry it now. This also confirms the Q7 NOT NULL contingency: every row is still materialised from a landlord_*_overrides row (description NOT NULL).
Q-scope — v1 scope. RESOLVED 2026-06-04. v1 = the finaliser reproduces today's exact 9-column property insert (portfolio_id, creation_status='READY', uprn, landlord_property_id ← Internal Reference, address = matched ?? user-inputted, postcode, user_inputted_address, user_inputted_postcode, lexiscore) + onConflictDoNothing on (portfolio_id, uprn) where uprn is not null — not a reduced "UPRN + address". This sizes the "PropertyRow gains insertable columns" decision to all nine columns plus creation_status. The property_overrides table shipped ahead (migration 0221, PR #306) but is not populated in v1 — population is follow-up work (and needs a different input source; see combiner-output note below).

Application-flow questions not yet reached

Trigger + orchestration. RESOLVED 2026-06-04. Mirror the start-address-matching path. Next.js creates a SubTask (service: "finaliser") under the BulkUpload's existing Task, then POSTs a new FastAPI endpoint POST /v1/bulk-uploads/trigger-finaliser (auth via validate_token), which enqueues to a new SQS queue; a Lambda runs the finaliser wrapped in @subtask_handler (auto-injected TaskOrchestrator; run_subtask owns the subtask start/complete/fail + Task cascade). Trigger body FinaliserTriggerBody { task_id, sub_task_id, s3_uri (combined output), portfolio_id } (extends SubtaskTriggerBody). Slow work stays outside the txn; persistence in a commit_scope. The synchronous Next.js /finalize route is deleted (locked).
State machine / who writes complete. RESOLVED 2026-06-04. New status finalising between awaiting_review and complete (mirrors combining before awaiting_review). Lifecycle: awaiting_review → finalising → complete (↘ failed).
- finalising written by Next.js at dispatch via a compare-and-swap: UPDATE … SET status='finalising' WHERE id=? AND status='awaiting_review' — 0 rows ⇒ already dispatched ⇒ 409. This is the double-dispatch guard (closes the simultaneous-click race under loadForFinalize's existing precondition).
- complete / failed written by the Lambda directly to bulk_address_uploads (new set_finalized_status / set_failed_status, exactly like the combiner's set_combining_status / set_combined_output_s3_uri). markFinalized + the Next.js /finalize route are deleted.
- CONTEXT.md "Two writers" change: Next.js owns dispatch + the awaiting_review → finalising CAS; the backend owns finalising → complete and → failed (in addition to combining / awaiting_review).
- UI vs canonical: persisted enum value is finalising (canonical; ties to the Finalise action). The frontend renders it as "Uploading to ARA" — a display-layer label only, not the enum name, so UX copy never needs a migration.
Input — does the combiner output carry the raw description cells? RESOLVED 2026-06-04: NO. This is a v2 problem (deferred). v1 needs only address/UPRN columns, all confirmed present in the combiner output (address2uprn_uprn, address2uprn_address, address2uprn_lexiscore, Internal Reference, Address 1/2/3, postcode). The raw Walls/Roofs/Property Type/Built Form cells are not in the combiner output — they survive only in (a) the {uploadId}-classifier.csv on S3 (original headers) and (b) landlord_*_overrides as resolved values keyed by description. So v2 population must assemble four inputs, not one file:
- property_id (identity) ← combiner output (portfolio_id, uprn) — but no-UPRN rows have no such key;
- raw cell text ← the classifier CSV (not the combiner output);
- cell → building-part split ← multiEntryOrdering on bulk_address_uploads;
- description → override_value ← landlord_*_overrides (normalized description).
- Two open v2 hazards (entry point for the v2 session): (1) the join key between classifier CSV and combiner output — is there a stable per-row key (Internal Reference?) and is row order preserved through postcode-split + combine? (2) obtaining property_id for unmatched (no-UPRN) rows — v1's onConflictDoNothing returns no ids, so v2 likely needs RETURNING id mapped back to source rows.
Idempotency / re-run. RESOLVED 2026-06-04 (per-table).
- property: keep today's onConflictDoNothing on (portfolio_id, uprn) where uprn is not null — existing properties are not churned.
- property_overrides: onConflictDoUpdate on (property_id, override_component, building_part) — recalculate override_value + original_spreadsheet_description + updated_at to the latest resolution, so an existing property whose override changed is refreshed in place. Guard WHERE source='classifier' once a per-property user-edit path exists (until then every row is classifier-derived, so blind overwrite is correct). See the "Re-run = recalculate" and "Snapshot, not FK" locked decisions.

Key code references (from exploration)

Frontend (assessment-model):

finalize/route.ts — today's synchronous property insert (to be deleted).
property.ts — property table (property_type, built_form columns exist; uq_property_portfolio_uprn).
landlord_overrides.ts — the four per-component override tables + all pgEnums (wallTypeEnum, roofTypeEnum, propertyTypeEnum, builtFormTypeEnum, overrideSourceEnum).
bulk_address_uploads.ts — multiEntryOrdering (permutations, 0=main), multiEntrySummary, verifyAck, combinedOutputS3Uri.
ADR-0004 — defers exactly this fact layer; the building-part ordering model.
ADR-0002 — vocabulary layer.

Backend (Model):

applications/landlord_description_overrides/handler.py — the worker pattern to mirror (subtask_handler, TaskOrchestrator, trigger body, commit_scope).
infrastructure/postgres/landlord_wall_type_override_table.py — SQLModel mirror pattern for the new PropertyOverrideRow.
infrastructure/postgres/landlord_override_enums.py — shared override_source SAEnum pattern.
infrastructure/postgres/property_table.py — PropertyRow defensive view ("backend never inserts" — to change).
repositories/landlord_overrides/landlord_override_repository.py — repository pattern for the new override repo.
orchestration/landlord_description_overrides_orchestrator.py — orchestrator pattern; note it splits cells into an orderless set (discards part order — recovered via multiEntryOrdering).
Downstream: orchestration/property_baseline_orchestrator.py (re-score-on-override seam), orchestration/ingestion_orchestrator.py.

Docs to update (when this lands)

CONTEXT.md: Property Type / built_form are per-part-capable, not whole-dwelling. Add the per-Property fact layer (property_overrides) to the glossary + relationships. Possibly a building_part index definition.
New ADR (frontend) for property_overrides + finaliser; companion Model ADR for the cross-repo write, citing ADR-0003/0004.

18 KiB Raw Blame History

Design WIP: bulk_upload_finaliser + property_overrides