From 744bfa228737816a794a4fce01702bd03f67785e Mon Sep 17 00:00:00 2001 From: Jun-te Kim Date: Thu, 4 Jun 2026 18:21:42 +0000 Subject: [PATCH] Add v2 handover doc for property_overrides population MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Self-contained brief to start a fresh context implementing the per-Property fact layer in the bulk_upload_finaliser: locked decisions, the four-input assembly, the two open hazards (classifier-CSV↔combiner-output join key; property_id for no-UPRN rows), candidate architectures, and a first-steps list. v1 (async finalise writing `property`) is shipped and working. Co-Authored-By: Claude Opus 4.8 (1M context) --- .../bulk-upload-finaliser-v2-handover.md | 208 ++++++++++++++++++ 1 file changed, 208 insertions(+) create mode 100644 docs/design/bulk-upload-finaliser-v2-handover.md diff --git a/docs/design/bulk-upload-finaliser-v2-handover.md b/docs/design/bulk-upload-finaliser-v2-handover.md new file mode 100644 index 00000000..281d4ec4 --- /dev/null +++ b/docs/design/bulk-upload-finaliser-v2-handover.md @@ -0,0 +1,208 @@ +# Handover: `bulk_upload_finaliser` v2 — populate `property_overrides` + +> **Purpose.** Self-contained brief to start a fresh context implementing v2 (the +> per-Property fact layer) in the existing `bulk_upload_finaliser` Lambda. v1 +> (async finalise that writes `property`) is **shipped and working end-to-end**. +> This doc assumes no memory of the v1 session. + +## 1. Where v1 left things (read first) + +v1 made **Finalise** an async dispatched Lambda that writes `property` rows. The +full flow works in dev: dispatch `202` → SQS → Lambda inserts properties + writes +terminal status → UI advances to "Processing complete". + +Authoritative background — **read these before coding**: +- `docs/design/bulk-upload-finaliser.md` — the full grilling/design doc (schema Q6–Q9, snapshot-not-FK, recalculate-on-rerun, the v2 input hazards). +- `docs/adr/0005-async-bulk-upload-finaliser.md` (frontend) — state machine + `property_overrides` shape. +- `/workspaces/home/github/Model/docs/adr/0013-bulk-upload-finaliser-writes-properties.md` (backend) — the Lambda write path + DDD layering. +- `docs/adr/0004-multi-entry-building-part-ordering.md` — **critical for v2**: how building parts and `multiEntryOrdering` work. +- `docs/adr/0002-landlord-override-vocabulary.md` — the vocabulary (`landlord_*_overrides`) layer v2 resolves against. +- `CONTEXT.md` — glossary: **Property override**, **Building part**, **Main building**, **Extension**, **Multi-entry**, **Building-part ordering**, **VocabularyMapping**. + +**Convention that must hold (it was corrected hard in v1):** in the Model repo, +business logic lives in `orchestration/*_orchestrator.py`; the Lambda +`applications/*/handler.py` stays thin (parse trigger, wire infra, delegate). One +repository per aggregate; orchestrators never commit (the handler owns the +transaction via `commit_scope`). See memory `model-ddd-layering`. + +## 2. v2 goal + +Populate `property_overrides` during finalise: for each property, write one row per +`(building_part, override_component)` carrying the **resolved enum snapshot** of the +landlord's description for that part. + +## 3. The target table (already shipped — migration 0221, do NOT re-migrate) + +Drizzle: `src/app/db/schema/property_overrides.ts`. + +``` +property_overrides + id uuid pk + property_id bigint NOT NULL FK → property.id ON DELETE CASCADE + portfolio_id bigint NOT NULL FK → portfolio.id ON DELETE CASCADE + building_part smallint NOT NULL -- 0 = main, 1 = ext 1, 2 = ext 2, … + override_component override_component NOT NULL -- pgEnum {wall_type, roof_type, property_type, built_form_type} + override_value text NOT NULL -- snapshot of the resolved enum value + original_spreadsheet_description text NOT NULL -- raw cell text it resolved from + created_at / updated_at timestamptz NOT NULL + UNIQUE (property_id, override_component, building_part) +``` + +## 4. Design decisions already locked (do not relitigate) + +- **Snapshot, not FK.** `override_value` is a denormalised text copy of the resolved + enum, taken at materialise time — *not* an FK to `landlord_*_overrides`. This is + what lets two properties sharing a description diverge later, and is required + because there are four polymorphic vocabulary tables. Lineage is the natural key + `(portfolio_id, override_component, original_spreadsheet_description)`. +- **Re-run = recalculate.** Write with `onConflictDoUpdate` on + `(property_id, override_component, building_part)`, refreshing `override_value` + + `original_spreadsheet_description` + `updated_at`. (Contrast `property`, which is + `onConflictDoNothing`.) When a per-property user-edit path eventually exists, this + upsert will need a `WHERE source='classifier'` guard — but there is **no `source` + column in v1**; add it as a nullable column only when that path is built. +- **`override_component` values** are exactly the classifier category keys + (`wall_type`, `roof_type`, `property_type`, `built_form_type`) used in both + `src/lib/bulkUpload/columnFields.ts` and the Model `ClassifiableColumn.name` — no + translation. +- **`building_part` indexing**: `0 = Main building`, `1 = Extension 1`, … per ADR-0004. +- **Whole-dwelling components** (`property_type`, `built_form_type`) are per-part- + capable but today's files supply them once → usually written at `building_part = 0`. + +## 5. The hard part: assembling the inputs (this is the real v2 work) + +The combiner output (what the v1 finaliser reads) carries **only** address/UPRN +columns — `Address 1/2/3`, `postcode`, `Internal Reference`, `address2uprn_uprn`, +`address2uprn_address`, `address2uprn_lexiscore`. The **raw `Walls`/`Roofs`/ +`Property Type`/`Built Form` cells are NOT in it.** They live only in: +- the **classifier CSV** on S3 — `bulk_onboarding_inputs/{portfolioId}/{uploadId}-classifier.csv` (original landlord headers), and +- `landlord_*_overrides` in Postgres — the *resolved* values keyed by `(portfolio_id, normalized description)`. + +To write one `property_overrides` row, v2 must assemble **four inputs**: + +| Need | Source | +|---|---| +| `property_id` (identity) | combiner output → `(portfolio_id, uprn)` … **but no-UPRN rows have no key** | +| raw cell text per row | the classifier CSV (not the combiner output) | +| split a multi-valued cell → building parts | `multiEntryOrdering` on `bulk_address_uploads` | +| description → `override_value` | `landlord_*_overrides` (resolve by normalized description) | + +### Two open hazards to resolve first (do these before writing code) + +1. **Join key between the classifier CSV and the combiner output.** Both derive from + the same upload rows, but **row order is NOT preserved** through postcode-split + + combine. So you need a stable per-row key present in *both* files. `Internal + Reference` is the candidate — **verify it survives into both** the address CSV + (→ combiner output) and the classifier CSV. If it doesn't, this is the first thing + to fix. + +2. **`property_id` for unmatched (no-UPRN) rows.** v1's insert is `onConflictDoNothing` + and returns no ids. To attach overrides you need each row's `property.id`. For + UPRN rows you can re-select by `(portfolio_id, uprn)`; **no-UPRN rows can't be + re-found that way.** Likely fix: change the property insert to `RETURNING id` + mapped back to source rows (and decide the dedup/skip semantics for the RETURNING + path, since `onConflictDoNothing` returns nothing for conflicting rows). + +### Two candidate architectures (evaluate against real sample files) + +- **(A) Post-hoc join.** Keep the two files; the finaliser reads the combiner output + (UPRN/identity) and the classifier CSV (descriptions) and joins by `Internal + Reference`. Splits each multi-valued cell into parts via `multiEntryOrdering`, + resolves each part's description against `landlord_*_overrides`, and writes one row + per `(property, part, component)`. Lowest pipeline change; depends entirely on a + reliable join key. +- **(B) Carry descriptions through the pipeline.** Include the description columns in + the *address* CSV at `start-address-matching` so they flow through `address2uprn` + (which preserves input columns via `**row`) into the combiner output. Then the + finaliser reads **one** file with UPRN + descriptions in the same row — no join, no + key hazard. Costs a change to the address-CSV construction (frontend + `start-address-matching` route) and re-verifying `address2uprn`/combiner. Cleaner + long-term; bigger blast radius. **Recommended to seriously consider** — it deletes + hazard #1 entirely. + +## 6. `multiEntryOrdering` — how to split cells into parts + +Persisted on `bulk_address_uploads` (`src/app/db/schema/bulk_address_uploads.ts`): + +```ts +MultiEntryOrdering { permutations: Record; confirmed: boolean } +// permutations[count][k] = the 0-based FILE position holding building part k +// where 0 = Main building, 1..N-1 = Extension 1..N-1. +// e.g. { "2": [1, 0] } => for 2-entry rows, the main building is file position 1. +``` + +A multi-valued cell (e.g. `Walls = "Cavity: …, Solid brick: …"`) splits on commas +into entries by file position; `permutations[count]` maps file position → building +part. **Caveat (ADR-0004):** only the **largest count** permutation is captured this +iteration; other counts need a derivation rule — decide it in v2. +`multiEntrySummary` holds the detected multi-valued columns + **normalized** +description keys (the normalization that matches the classifier's stored keys: +`split → strip → lower`). + +## 7. Resolving description → value (`landlord_*_overrides`) + +Four per-component tables in `src/app/db/schema/landlord_overrides.ts` +(`landlord_wall_type_overrides`, `…_roof_type_…`, `…_property_type_…`, +`…_built_form_type_…`), each `UNIQUE (portfolio_id, description)`, value typed by the +component's pgEnum, plus a `source` (`classifier`|`user`). Resolve a normalized +description → `value`. The frontend already does this read in +`src/lib/bulkUpload/server.ts` (`lookupOverrides`) — mirror that mapping on the +backend. `UNKNOWN` is a legitimate stored value. + +## 8. Backend pieces to build (DDD, mirror v1) + +In `/workspaces/home/github/Model`: +- **`PropertyOverrideRow`** SQLModel mirror → `infrastructure/postgres/property_override_table.py` (mirror the pattern in `property_table.py` / `landlord_*_override_table.py`; reuse a shared `override_component` SAEnum like `landlord_override_enums.py`). +- **Repository** for the override write (one per aggregate): add to + `repositories/property/` (e.g. extend the property repo or a sibling + `property_override` repo), with an `upsert_all` using + `on_conflict_do_update(index_elements=[property_id, override_component, building_part], …)`. +- **Orchestrator logic** in `orchestration/bulk_upload_finaliser_orchestrator.py`: + extend `finalise(...)` (or add a step) to, after inserting properties and getting + ids, build the override rows (join → split by part → resolve) and persist them in + the **same** `commit_scope`. +- **Handler** stays thin — it already wires S3 + engine + repos. It will need the + extra input (classifier CSV and/or `multiEntryOrdering`); decide how those reach + the Lambda (extend `BulkUploadFinaliserTriggerBody`, or read `bulk_address_uploads` + for `multiEntryOrdering` + the classifier S3 URI). The trigger currently carries + `task_id, sub_task_id, s3_uri (combiner output), portfolio_id, bulk_upload_id`. + +Key v1 files to extend (all in the Model repo): +- `applications/bulk_upload_finaliser/handler.py` +- `orchestration/bulk_upload_finaliser_orchestrator.py` +- `repositories/property/property_repository.py` + `property_postgres_repository.py` +- `infrastructure/postgres/property_table.py` (reference for the new mirror) +- `infrastructure/s3/csv_s3_client.py` (`read_rows`) +- Packaging test: `tests/test_lambda_packaging.py` will flag any new top-level import + the Dockerfile doesn't `COPY` (v1 hit this with `datatypes/`). + +## 9. Open questions for v2 to decide + +- Join key confirmed (`Internal Reference` in both files) — or adopt architecture (B)? +- `property_id` for no-UPRN rows: `RETURNING id` strategy + dedup semantics. +- Non-largest-count `multiEntryOrdering` derivation rule (ADR-0004 deferred it). +- Does the trigger body grow, or does the handler read `bulk_address_uploads` + (`multiEntryOrdering`, classifier S3 URI) directly? +- Re-materialise semantics confirmed: recalculate overrides every finalise (snapshot + refreshes), `property` rows untouched. + +## 10. First steps in the new context + +1. Read §1 docs (esp. ADR-0004) + `CONTEXT.md`. +2. Get a **real sample**: the combiner output CSV and the `{uploadId}-classifier.csv` + for one dev upload, and inspect whether `Internal Reference` is in both → settle + hazard #1 / pick architecture (A) vs (B). +3. Decide the `property_id`-for-no-UPRN approach (hazard #2). +4. Build `PropertyOverrideRow` + repository + orchestrator step + handler wiring, + TDD against fakes (mirror `tests/orchestration/test_bulk_upload_finaliser_orchestrator.py`). +5. Update `CONTEXT.md` ("Property override" → populated) and add a v2 ADR if the + join/architecture choice is a real trade-off. + +## 11. Verification notes (environment) + +- Frontend: `npx tsc --noEmit` (was 0 errors at v1 close). +- Model repo: `mypy`/`pytest` need a deps-installed env (the v1 session couldn't run + them locally; `/app` Docker config runs the full suite). `terraform plan` needs the + CLI. Watch `tests/test_lambda_packaging.py` for Dockerfile COPY gaps. +- v1 is committed; dev Lambda + SQS queue are deployed and working + (`FINALISER_SQS_URL` wired in `backend/.env` for local, and in terraform/fast-api).