mirror of
https://github.com/Hestia-Homes/assessment-model.git
synced 2026-06-30 12:55:02 +00:00
243 lines
18 KiB
Markdown
243 lines
18 KiB
Markdown
# Design WIP: `bulk_upload_finaliser` + `property_overrides`
|
|
|
|
> **Status:** v1 fully resolved (grilling 2026-06-04). Ready to graduate to ADR(s).
|
|
> v2 (`property_overrides` population) deferred to its own session — see the
|
|
> "Input" application-flow item for its entry point. When decisions stabilise this should
|
|
> graduate into a new ADR in `docs/adr/` (frontend) and likely a companion ADR
|
|
> in the Model repo, plus a CONTEXT.md update (see "Docs to update").
|
|
|
|
## Goal
|
|
|
|
Two linked pieces of work:
|
|
|
|
1. **New backend application `bulk_upload_finaliser`** (lives in
|
|
`/workspaces/home/github/Model/applications/`, DDD-aligned — study
|
|
`/workspaces/home/github/Model/domain`). It reads the address-matching /
|
|
combiner output **and** the `landlord_*_overrides` vocabulary tables, then
|
|
writes Postgres correctly: the `property` rows (UPRN + address, as the
|
|
frontend does today) and — later — the new `property_overrides` rows.
|
|
Motivation: a property list can be ~40,000 rows, too big for a synchronous
|
|
Next.js HTTP handler.
|
|
|
|
2. **New `property_overrides` table** — the per-Property fact layer that
|
|
**ADR-0004 explicitly deferred** ("the per-Property building-part fact layer
|
|
that consumes `multiEntryOrdering` and writes main/extension facts at
|
|
finalise"). One row per `(property, building_part, component)` carrying the
|
|
resolved enum value + provenance.
|
|
|
|
**Split into two pieces (decided 2026-06-04):**
|
|
|
|
- **v1 — async finaliser writes `property`.** Move today's synchronous Next.js
|
|
`/finalize` property-insert into a dispatched Lambda (`bulk_upload_finaliser`),
|
|
because a property list can be ~40,000 rows. Reproduces the exact 9-column insert
|
|
+ `onConflictDoNothing`, adds the `finalising` status + async state machine, and
|
|
shifts terminal-status ownership to the backend. **Fully designed — ADR-ready.**
|
|
- **v2 — populate `property_overrides`.** The per-Property fact layer. The *table*
|
|
already shipped (migration 0221, PR #306), but population is a **separate
|
|
follow-up** with its own open input-plumbing questions (see the "Input" item
|
|
under application-flow questions). Not designed here.
|
|
|
|
This doc resolves **v1 in full**; v2 gets its own grilling session against real
|
|
classifier-CSV / combiner-output samples.
|
|
|
|
## Where this sits in the existing pipeline
|
|
|
|
```
|
|
BulkUpload → address matching → combiner → awaiting_review → [Finalise]
|
|
│
|
|
(new) bulk_upload_finaliser ──────────┘
|
|
reads: combiner output (S3) + landlord_*_overrides
|
|
writes: property (+ later property_overrides)
|
|
│
|
|
downstream: Ingestion (EPC/solar fetch)
|
|
→ PropertyBaseline (stage 2,
|
|
re-score-on-override seam,
|
|
Model ADR-0011/0012)
|
|
```
|
|
|
|
`Finalise` (the user action + state-machine gate) stays in Next.js; the new
|
|
application is the **worker it dispatches**. Downstream `PropertyBaseline`
|
|
already has an override-aware "re-score" seam — `property_overrides` will feed it.
|
|
|
|
## Decisions locked
|
|
|
|
| # | Decision |
|
|
|---|----------|
|
|
| Name | Application is **`bulk_upload_finaliser`**, in `Model/applications/`. `Finalise` stays the Next.js action that triggers it. |
|
|
| DDD | Follow the DDD structure under `Model/domain`. Domain terms discovered as needed. |
|
|
| Schema ownership | **Drizzle (frontend) owns migrations** for both `property` and the new `property_overrides`. |
|
|
| Backend access | Backend gets a **`PropertyOverrideRow` SQLModel** (mirror, like `landlord_wall_type_override_table.py`) + a **repository** (see `Model/infrastructure/postgres` + `Model/repositories` for examples). `PropertyRow` must drop its "backend never inserts" invariant and gain insertable columns. |
|
|
| Next.js `/finalize` | **Delete it** — fully replaced by the Lambda. |
|
|
| `property_overrides` shape | **Single polymorphic table**, not per-component tables. Accepts losing DB-level pgEnum typing on `value`. |
|
|
| `override_value` | `text` — a **denormalised snapshot copy** (own value per row) of the resolved enum from `landlord_*_overrides` at materialise time. Own-value (not an FK to the vocabulary) is what lets two properties sharing a description later **diverge**, and lets re-run recalculate one property's value without touching its siblings. |
|
|
| Snapshot, **not** FK to vocabulary | `property_overrides` does **not** foreign-key the originating `landlord_*_overrides` row. An FK forces every property sharing a description to share one value (forbids divergence); is structurally impossible as a real FK (4 polymorphic target tables, each with its own value enum); and would risk cascade-deleting per-property facts when re-classification prunes a vocabulary row. Lineage is preserved as a **natural key** — `(portfolio_id, override_component, original_spreadsheet_description)` re-finds the vocabulary row (its `UNIQUE` is `(portfolio_id, description)`) — so deliberate re-sync needs no surrogate FK. |
|
|
| Re-run = **recalculate** | The finaliser write to `property_overrides` is `onConflictDoUpdate` on `(property_id, override_component, building_part)`, refreshing `override_value` + `original_spreadsheet_description` + `updated_at` to the latest resolution. Contrast `property`, which stays `onConflictDoNothing` (identity rows, don't churn). When per-property `source='user'` edits exist, the update must guard `WHERE source='classifier'` to preserve hand-edits (mirrors the Model classifier upsert). |
|
|
| `building_part` | **`smallint NOT NULL`**, explicit index: `0 = main building, 1 = extension 1, 2 = extension 2, …` (matches ADR-0004 `multiEntryOrdering.permutations` indexing). |
|
|
| Whole-dwelling components | **No special case.** `property_type`/`built_form` are *per-part-capable* too (an extension — conservatory, summer house — can be a different built form / property type). Today's files only supply them once, so they'll usually be written at `building_part = 0` only, but the schema allows per-part with no future migration. |
|
|
|
|
## `property_overrides` — columns so far
|
|
|
|
Roughly (subject to remaining open questions):
|
|
|
|
```
|
|
property_overrides
|
|
id uuid pk (default random) -- match landlord_* tables
|
|
property_id bigint NOT NULL FK → property.id (FE-owned table)
|
|
portfolio_id bigint NOT NULL FK → portfolio.id
|
|
building_part smallint NOT NULL -- 0 = main, 1 = ext 1, 2 = ext 2, …
|
|
override_component override_component NOT NULL -- column name == enum type name; pgEnum {wall_type, roof_type, property_type, built_form_type} (Q6 ✓)
|
|
override_value text NOT NULL -- snapshot copy of landlord_* resolved enum (free text; `override_component` carries the typing)
|
|
-- (no `source`) — dropped Q9: pure value snapshot; add back as nullable column if/when a per-property edit path needs provenance
|
|
original_spreadsheet_description text NOT NULL -- raw spreadsheet cell text this snapshot resolved from (Q7 ✓)
|
|
created_at timestamptz NOT NULL default now()
|
|
updated_at timestamptz NOT NULL default now()
|
|
-- UNIQUE (property_id, override_component, building_part) -- Q8 ✓ (source NOT in key — mirrors ADR-0004 single-row flip; portfolio_id implied by property_id)
|
|
-- FK property_id → property.id ON DELETE CASCADE; portfolio_id → portfolio.id (Drizzle only; bare bigint in SQLModel mirror); portfolio_id kept (matches property_details_epc / property_targets)
|
|
```
|
|
|
|
## Open questions (resume here)
|
|
|
|
- **Q6 — `component` discriminator. RESOLVED 2026-06-04.** pgEnum
|
|
**`override_component`** (column `component`) with values
|
|
**`wall_type`, `roof_type`, `property_type`, `built_form_type`**. Verified these
|
|
are the *exact* keys used both in the frontend
|
|
([columnFields.ts:30-33](../../src/lib/bulkUpload/columnFields.ts#L30-L33)) and
|
|
the backend (`ClassifiableColumn.name` / handler `_build_columns()`), so the
|
|
finaliser maps category → component with **no translation**. pgEnum over text:
|
|
small closed set, typos caught at write time — and this is now the *only*
|
|
DB-level typing left on a row, since `override_value` is free text. New component
|
|
= one-line `ALTER TYPE … ADD VALUE` (Drizzle-owned). Enum named `override_*`
|
|
(not `property_*`) to sit with `override_source` and stay visually distinct from
|
|
the existing *value* enum `property_type`.
|
|
|
|
- **Q7 — store raw description per override row? RESOLVED 2026-06-04: yes, as
|
|
`original_spreadsheet_description text NOT NULL`.** Names the *source artifact*
|
|
(the spreadsheet cell), not an actor — sidesteps the Landlord-vs-User conflation
|
|
the glossary warns against, and aligns with CONTEXT.md's "the source file". Stored
|
|
because `override_value` is a denormalised snapshot that deliberately won't
|
|
refresh on later vocabulary edits; pinning the original text makes each row
|
|
self-explaining and re-resolvable even after the source `landlord_*_overrides` row
|
|
changes. `NOT NULL` is safe **iff** every `property_overrides` row is materialised
|
|
from a `landlord_*_overrides` row (whose `description` is itself `NOT NULL`) —
|
|
confirm when settling Q9/source semantics.
|
|
|
|
- **Q8 — uniqueness + FKs. RESOLVED 2026-06-04.**
|
|
`UNIQUE (property_id, override_component, building_part)`. `building_part` is in
|
|
the key (part 0 and part 1 both carry e.g. a `wall_type` row). `source` is
|
|
**deliberately not** in the key — mirrors ADR-0004's single-row-flip (one row,
|
|
flip `source` in place; the two-row model was rejected). `portfolio_id` is not in
|
|
the key (implied by `property_id`) but **is kept as a column** for query ergonomics
|
|
and consistency with `property_details_epc` / `property_targets`, which both
|
|
denormalise it. FKs: `property_id → property.id ON DELETE CASCADE`;
|
|
`portfolio_id → portfolio.id ON DELETE CASCADE` in the Drizzle migration, but a
|
|
bare `bigint` (no FK) in the backend `PropertyOverrideRow` SQLModel mirror —
|
|
matching `landlord_wall_type_override_table.py`.
|
|
|
|
- **Q9 — `source` semantics. RESOLVED 2026-06-04: drop `source` entirely.**
|
|
`property_overrides` is a pure **snapshot of resolved values**. Rationale: there
|
|
is no per-property override concept today (per ADR-0004 edits happen at the
|
|
**vocabulary/portfolio** level, flipped in place), so a copied `source` would
|
|
describe the *vocabulary mapping's* provenance, not this property's — a footgun a
|
|
reader/re-score rule could misread, and no consumer needs it in v1. When a genuine
|
|
per-property edit path lands (the real use for per-property provenance), `source`
|
|
returns as an **additive nullable-column migration** — no need to carry it now.
|
|
This also confirms the Q7 `NOT NULL` contingency: every row is still materialised
|
|
from a `landlord_*_overrides` row (`description NOT NULL`).
|
|
|
|
- **Q-scope — v1 scope. RESOLVED 2026-06-04.** v1 = the finaliser reproduces
|
|
today's **exact 9-column `property` insert** (`portfolio_id`,
|
|
`creation_status='READY'`, `uprn`, `landlord_property_id` ← `Internal Reference`,
|
|
`address` = matched ?? user-inputted, `postcode`, `user_inputted_address`,
|
|
`user_inputted_postcode`, `lexiscore`) **+** `onConflictDoNothing` on
|
|
`(portfolio_id, uprn) where uprn is not null` — not a reduced "UPRN + address".
|
|
This sizes the "PropertyRow gains insertable columns" decision to all nine
|
|
columns plus `creation_status`. The `property_overrides` *table* shipped ahead
|
|
(migration 0221, PR #306) but is **not populated** in v1 — population is
|
|
follow-up work (and needs a different input source; see combiner-output note
|
|
below).
|
|
|
|
### Application-flow questions not yet reached
|
|
|
|
- **Trigger + orchestration. RESOLVED 2026-06-04.** Mirror the
|
|
`start-address-matching` path. Next.js creates a `SubTask` (`service:
|
|
"finaliser"`) under the BulkUpload's existing Task, then POSTs a new FastAPI
|
|
endpoint `POST /v1/bulk-uploads/trigger-finaliser` (auth via `validate_token`),
|
|
which enqueues to a **new SQS queue**; a Lambda runs the finaliser wrapped in
|
|
`@subtask_handler` (auto-injected `TaskOrchestrator`; `run_subtask` owns the
|
|
subtask start/complete/fail + Task cascade). Trigger body
|
|
`FinaliserTriggerBody { task_id, sub_task_id, s3_uri (combined output), portfolio_id }`
|
|
(extends `SubtaskTriggerBody`). Slow work stays outside the txn; persistence in a
|
|
`commit_scope`. The synchronous Next.js `/finalize` route is deleted (locked).
|
|
|
|
- **State machine / who writes `complete`. RESOLVED 2026-06-04.** New status
|
|
**`finalising`** between `awaiting_review` and `complete` (mirrors `combining`
|
|
before `awaiting_review`). Lifecycle:
|
|
`awaiting_review → finalising → complete` (↘ `failed`).
|
|
- **`finalising`** written by **Next.js at dispatch** via a **compare-and-swap**:
|
|
`UPDATE … SET status='finalising' WHERE id=? AND status='awaiting_review'` —
|
|
0 rows ⇒ already dispatched ⇒ 409. This is the **double-dispatch guard** (closes
|
|
the simultaneous-click race under `loadForFinalize`'s existing precondition).
|
|
- **`complete` / `failed`** written by the **Lambda** directly to
|
|
`bulk_address_uploads` (new `set_finalized_status` / `set_failed_status`,
|
|
exactly like the combiner's `set_combining_status` /
|
|
`set_combined_output_s3_uri`). `markFinalized` + the Next.js `/finalize` route
|
|
are deleted.
|
|
- **CONTEXT.md "Two writers" change:** Next.js owns dispatch + the
|
|
`awaiting_review → finalising` CAS; the backend owns `finalising → complete`
|
|
and `→ failed` (in addition to `combining` / `awaiting_review`).
|
|
- **UI vs canonical:** persisted enum value is `finalising` (canonical; ties to
|
|
the **Finalise** action). The frontend renders it as **"Uploading to ARA"** — a
|
|
display-layer label only, **not** the enum name, so UX copy never needs a
|
|
migration.
|
|
|
|
- **Input — does the combiner output carry the raw description cells? RESOLVED
|
|
2026-06-04: NO. This is a v2 problem (deferred).** v1 needs only address/UPRN
|
|
columns, all **confirmed present** in the combiner output (`address2uprn_uprn`,
|
|
`address2uprn_address`, `address2uprn_lexiscore`, `Internal Reference`,
|
|
`Address 1/2/3`, `postcode`). The raw `Walls`/`Roofs`/`Property Type`/`Built Form`
|
|
cells are **not** in the combiner output — they survive only in (a) the
|
|
`{uploadId}-classifier.csv` on S3 (original headers) and (b) `landlord_*_overrides`
|
|
as *resolved* values keyed by description. So v2 population must assemble **four
|
|
inputs**, not one file:
|
|
- `property_id` (identity) ← combiner output `(portfolio_id, uprn)` — **but
|
|
no-UPRN rows have no such key**;
|
|
- raw cell text ← the classifier CSV (not the combiner output);
|
|
- cell → building-part split ← `multiEntryOrdering` on `bulk_address_uploads`;
|
|
- description → `override_value` ← `landlord_*_overrides` (normalized description).
|
|
- **Two open v2 hazards (entry point for the v2 session):** (1) the join key
|
|
between classifier CSV and combiner output — is there a stable per-row key
|
|
(`Internal Reference`?) and is row order preserved through postcode-split +
|
|
combine? (2) obtaining `property_id` for unmatched (no-UPRN) rows — v1's
|
|
`onConflictDoNothing` returns no ids, so v2 likely needs `RETURNING id` mapped
|
|
back to source rows.
|
|
|
|
- **Idempotency / re-run. RESOLVED 2026-06-04 (per-table).**
|
|
- `property`: keep today's `onConflictDoNothing` on `(portfolio_id, uprn) where uprn is not null` — existing properties are not churned.
|
|
- `property_overrides`: `onConflictDoUpdate` on `(property_id, override_component, building_part)` — **recalculate** `override_value` + `original_spreadsheet_description` + `updated_at` to the latest resolution, so an existing property whose override changed is refreshed in place. Guard `WHERE source='classifier'` once a per-property user-edit path exists (until then every row is classifier-derived, so blind overwrite is correct). See the "Re-run = recalculate" and "Snapshot, not FK" locked decisions.
|
|
|
|
## Key code references (from exploration)
|
|
|
|
**Frontend (`assessment-model`):**
|
|
- [finalize/route.ts](../../src/app/api/portfolio/[portfolioId]/bulk-uploads/[uploadId]/finalize/route.ts) — today's synchronous property insert (to be deleted).
|
|
- [property.ts](../../src/app/db/schema/property.ts) — `property` table (`property_type`, `built_form` columns exist; `uq_property_portfolio_uprn`).
|
|
- [landlord_overrides.ts](../../src/app/db/schema/landlord_overrides.ts) — the four per-component override tables + all pgEnums (`wallTypeEnum`, `roofTypeEnum`, `propertyTypeEnum`, `builtFormTypeEnum`, `overrideSourceEnum`).
|
|
- [bulk_address_uploads.ts](../../src/app/db/schema/bulk_address_uploads.ts) — `multiEntryOrdering` (permutations, `0=main`), `multiEntrySummary`, `verifyAck`, `combinedOutputS3Uri`.
|
|
- [ADR-0004](../adr/0004-multi-entry-building-part-ordering.md) — defers exactly this fact layer; the building-part ordering model.
|
|
- [ADR-0002](../adr/0002-landlord-override-vocabulary.md) — vocabulary layer.
|
|
|
|
**Backend (`Model`):**
|
|
- `applications/landlord_description_overrides/handler.py` — the worker pattern to mirror (`subtask_handler`, `TaskOrchestrator`, trigger body, `commit_scope`).
|
|
- `infrastructure/postgres/landlord_wall_type_override_table.py` — SQLModel mirror pattern for the new `PropertyOverrideRow`.
|
|
- `infrastructure/postgres/landlord_override_enums.py` — shared `override_source` SAEnum pattern.
|
|
- `infrastructure/postgres/property_table.py` — `PropertyRow` defensive view ("backend never inserts" — to change).
|
|
- `repositories/landlord_overrides/landlord_override_repository.py` — repository pattern for the new override repo.
|
|
- `orchestration/landlord_description_overrides_orchestrator.py` — orchestrator pattern; note it splits cells into an orderless set (discards part order — recovered via `multiEntryOrdering`).
|
|
- Downstream: `orchestration/property_baseline_orchestrator.py` (re-score-on-override seam), `orchestration/ingestion_orchestrator.py`.
|
|
|
|
## Docs to update (when this lands)
|
|
|
|
- **CONTEXT.md**: `Property Type` / `built_form` are **per-part-capable**, not
|
|
whole-dwelling. Add the per-Property fact layer (`property_overrides`) to the
|
|
glossary + relationships. Possibly a `building_part` index definition.
|
|
- **New ADR** (frontend) for `property_overrides` + finaliser; companion Model ADR
|
|
for the cross-repo write, citing ADR-0003/0004.
|