Add v2 handover doc for property_overrides population

Self-contained brief to start a fresh context implementing the per-Property fact layer in the bulk_upload_finaliser: locked decisions, the four-input assembly, the two open hazards (classifier-CSV↔combiner-output join key; property_id for no-UPRN rows), candidate architectures, and a first-steps list. v1 (async finalise writing `property`) is shipped and working. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-30 12:55:02 +00:00 · 2026-06-04 18:21:42 +00:00 · 2026-06-04 18:21:42 +00:00 · 744bfa2287
commit 744bfa2287
parent b9bbd916b5
1 changed files with 208 additions and 0 deletions
--- a/docs/design/bulk-upload-finaliser-v2-handover.md
+++ b/docs/design/bulk-upload-finaliser-v2-handover.md
@ -0,0 +1,208 @@
+# Handover: `bulk_upload_finaliser` v2 — populate `property_overrides`
+
+> **Purpose.** Self-contained brief to start a fresh context implementing v2 (the
+> per-Property fact layer) in the existing `bulk_upload_finaliser` Lambda. v1
+> (async finalise that writes `property`) is **shipped and working end-to-end**.
+> This doc assumes no memory of the v1 session.
+
+## 1. Where v1 left things (read first)
+
+v1 made **Finalise** an async dispatched Lambda that writes `property` rows. The
+full flow works in dev: dispatch `202` → SQS → Lambda inserts properties + writes
+terminal status → UI advances to "Processing complete".
+
+Authoritative background — **read these before coding**:
+- `docs/design/bulk-upload-finaliser.md` — the full grilling/design doc (schema Q6–Q9, snapshot-not-FK, recalculate-on-rerun, the v2 input hazards).
+- `docs/adr/0005-async-bulk-upload-finaliser.md` (frontend) — state machine + `property_overrides` shape.
+- `/workspaces/home/github/Model/docs/adr/0013-bulk-upload-finaliser-writes-properties.md` (backend) — the Lambda write path + DDD layering.
+- `docs/adr/0004-multi-entry-building-part-ordering.md` — **critical for v2**: how building parts and `multiEntryOrdering` work.
+- `docs/adr/0002-landlord-override-vocabulary.md` — the vocabulary (`landlord_*_overrides`) layer v2 resolves against.
+- `CONTEXT.md` — glossary: **Property override**, **Building part**, **Main building**, **Extension**, **Multi-entry**, **Building-part ordering**, **VocabularyMapping**.
+
+**Convention that must hold (it was corrected hard in v1):** in the Model repo,
+business logic lives in `orchestration/*_orchestrator.py`; the Lambda
+`applications/*/handler.py` stays thin (parse trigger, wire infra, delegate). One
+repository per aggregate; orchestrators never commit (the handler owns the
+transaction via `commit_scope`). See memory `model-ddd-layering`.
+
+## 2. v2 goal
+
+Populate `property_overrides` during finalise: for each property, write one row per
+`(building_part, override_component)` carrying the **resolved enum snapshot** of the
+landlord's description for that part.
+
+## 3. The target table (already shipped — migration 0221, do NOT re-migrate)
+
+Drizzle: `src/app/db/schema/property_overrides.ts`.
+
+```
+property_overrides
+  id                              uuid pk
+  property_id                     bigint NOT NULL  FK → property.id   ON DELETE CASCADE
+  portfolio_id                    bigint NOT NULL  FK → portfolio.id  ON DELETE CASCADE
+  building_part                   smallint NOT NULL          -- 0 = main, 1 = ext 1, 2 = ext 2, …
+  override_component              override_component NOT NULL -- pgEnum {wall_type, roof_type, property_type, built_form_type}
+  override_value                  text NOT NULL              -- snapshot of the resolved enum value
+  original_spreadsheet_description text NOT NULL             -- raw cell text it resolved from
+  created_at / updated_at         timestamptz NOT NULL
+  UNIQUE (property_id, override_component, building_part)
+```
+
+## 4. Design decisions already locked (do not relitigate)
+
+- **Snapshot, not FK.** `override_value` is a denormalised text copy of the resolved
+  enum, taken at materialise time — *not* an FK to `landlord_*_overrides`. This is
+  what lets two properties sharing a description diverge later, and is required
+  because there are four polymorphic vocabulary tables. Lineage is the natural key
+  `(portfolio_id, override_component, original_spreadsheet_description)`.
+- **Re-run = recalculate.** Write with `onConflictDoUpdate` on
+  `(property_id, override_component, building_part)`, refreshing `override_value` +
+  `original_spreadsheet_description` + `updated_at`. (Contrast `property`, which is
+  `onConflictDoNothing`.) When a per-property user-edit path eventually exists, this
+  upsert will need a `WHERE source='classifier'` guard — but there is **no `source`
+  column in v1**; add it as a nullable column only when that path is built.
+- **`override_component` values** are exactly the classifier category keys
+  (`wall_type`, `roof_type`, `property_type`, `built_form_type`) used in both
+  `src/lib/bulkUpload/columnFields.ts` and the Model `ClassifiableColumn.name` — no
+  translation.
+- **`building_part` indexing**: `0 = Main building`, `1 = Extension 1`, … per ADR-0004.
+- **Whole-dwelling components** (`property_type`, `built_form_type`) are per-part-
+  capable but today's files supply them once → usually written at `building_part = 0`.
+
+## 5. The hard part: assembling the inputs (this is the real v2 work)
+
+The combiner output (what the v1 finaliser reads) carries **only** address/UPRN
+columns — `Address 1/2/3`, `postcode`, `Internal Reference`, `address2uprn_uprn`,
+`address2uprn_address`, `address2uprn_lexiscore`. The **raw `Walls`/`Roofs`/
+`Property Type`/`Built Form` cells are NOT in it.** They live only in:
+- the **classifier CSV** on S3 — `bulk_onboarding_inputs/{portfolioId}/{uploadId}-classifier.csv` (original landlord headers), and
+- `landlord_*_overrides` in Postgres — the *resolved* values keyed by `(portfolio_id, normalized description)`.
+
+To write one `property_overrides` row, v2 must assemble **four inputs**:
+
+| Need | Source |
+|---|---|
+| `property_id` (identity) | combiner output → `(portfolio_id, uprn)` … **but no-UPRN rows have no key** |
+| raw cell text per row | the classifier CSV (not the combiner output) |
+| split a multi-valued cell → building parts | `multiEntryOrdering` on `bulk_address_uploads` |
+| description → `override_value` | `landlord_*_overrides` (resolve by normalized description) |
+
+### Two open hazards to resolve first (do these before writing code)
+
+1. **Join key between the classifier CSV and the combiner output.** Both derive from
+   the same upload rows, but **row order is NOT preserved** through postcode-split +
+   combine. So you need a stable per-row key present in *both* files. `Internal
+   Reference` is the candidate — **verify it survives into both** the address CSV
+   (→ combiner output) and the classifier CSV. If it doesn't, this is the first thing
+   to fix.
+
+2. **`property_id` for unmatched (no-UPRN) rows.** v1's insert is `onConflictDoNothing`
+   and returns no ids. To attach overrides you need each row's `property.id`. For
+   UPRN rows you can re-select by `(portfolio_id, uprn)`; **no-UPRN rows can't be
+   re-found that way.** Likely fix: change the property insert to `RETURNING id`
+   mapped back to source rows (and decide the dedup/skip semantics for the RETURNING
+   path, since `onConflictDoNothing` returns nothing for conflicting rows).
+
+### Two candidate architectures (evaluate against real sample files)
+
+- **(A) Post-hoc join.** Keep the two files; the finaliser reads the combiner output
+  (UPRN/identity) and the classifier CSV (descriptions) and joins by `Internal
+  Reference`. Splits each multi-valued cell into parts via `multiEntryOrdering`,
+  resolves each part's description against `landlord_*_overrides`, and writes one row
+  per `(property, part, component)`. Lowest pipeline change; depends entirely on a
+  reliable join key.
+- **(B) Carry descriptions through the pipeline.** Include the description columns in
+  the *address* CSV at `start-address-matching` so they flow through `address2uprn`
+  (which preserves input columns via `**row`) into the combiner output. Then the
+  finaliser reads **one** file with UPRN + descriptions in the same row — no join, no
+  key hazard. Costs a change to the address-CSV construction (frontend
+  `start-address-matching` route) and re-verifying `address2uprn`/combiner. Cleaner
+  long-term; bigger blast radius. **Recommended to seriously consider** — it deletes
+  hazard #1 entirely.
+
+## 6. `multiEntryOrdering` — how to split cells into parts
+
+Persisted on `bulk_address_uploads` (`src/app/db/schema/bulk_address_uploads.ts`):
+
+```ts
+MultiEntryOrdering { permutations: Record<string, number[]>; confirmed: boolean }
+// permutations[count][k] = the 0-based FILE position holding building part k
+//   where 0 = Main building, 1..N-1 = Extension 1..N-1.
+// e.g. { "2": [1, 0] } => for 2-entry rows, the main building is file position 1.
+```
+
+A multi-valued cell (e.g. `Walls = "Cavity: …, Solid brick: …"`) splits on commas
+into entries by file position; `permutations[count]` maps file position → building
+part. **Caveat (ADR-0004):** only the **largest count** permutation is captured this
+iteration; other counts need a derivation rule — decide it in v2.
+`multiEntrySummary` holds the detected multi-valued columns + **normalized**
+description keys (the normalization that matches the classifier's stored keys:
+`split → strip → lower`).
+
+## 7. Resolving description → value (`landlord_*_overrides`)
+
+Four per-component tables in `src/app/db/schema/landlord_overrides.ts`
+(`landlord_wall_type_overrides`, `…_roof_type_…`, `…_property_type_…`,
+`…_built_form_type_…`), each `UNIQUE (portfolio_id, description)`, value typed by the
+component's pgEnum, plus a `source` (`classifier`|`user`). Resolve a normalized
+description → `value`. The frontend already does this read in
+`src/lib/bulkUpload/server.ts` (`lookupOverrides`) — mirror that mapping on the
+backend. `UNKNOWN` is a legitimate stored value.
+
+## 8. Backend pieces to build (DDD, mirror v1)
+
+In `/workspaces/home/github/Model`:
+- **`PropertyOverrideRow`** SQLModel mirror → `infrastructure/postgres/property_override_table.py` (mirror the pattern in `property_table.py` / `landlord_*_override_table.py`; reuse a shared `override_component` SAEnum like `landlord_override_enums.py`).
+- **Repository** for the override write (one per aggregate): add to
+  `repositories/property/` (e.g. extend the property repo or a sibling
+  `property_override` repo), with an `upsert_all` using
+  `on_conflict_do_update(index_elements=[property_id, override_component, building_part], …)`.
+- **Orchestrator logic** in `orchestration/bulk_upload_finaliser_orchestrator.py`:
+  extend `finalise(...)` (or add a step) to, after inserting properties and getting
+  ids, build the override rows (join → split by part → resolve) and persist them in
+  the **same** `commit_scope`.
+- **Handler** stays thin — it already wires S3 + engine + repos. It will need the
+  extra input (classifier CSV and/or `multiEntryOrdering`); decide how those reach
+  the Lambda (extend `BulkUploadFinaliserTriggerBody`, or read `bulk_address_uploads`
+  for `multiEntryOrdering` + the classifier S3 URI). The trigger currently carries
+  `task_id, sub_task_id, s3_uri (combiner output), portfolio_id, bulk_upload_id`.
+
+Key v1 files to extend (all in the Model repo):
+- `applications/bulk_upload_finaliser/handler.py`
+- `orchestration/bulk_upload_finaliser_orchestrator.py`
+- `repositories/property/property_repository.py` + `property_postgres_repository.py`
+- `infrastructure/postgres/property_table.py` (reference for the new mirror)
+- `infrastructure/s3/csv_s3_client.py` (`read_rows`)
+- Packaging test: `tests/test_lambda_packaging.py` will flag any new top-level import
+  the Dockerfile doesn't `COPY` (v1 hit this with `datatypes/`).
+
+## 9. Open questions for v2 to decide
+
+- Join key confirmed (`Internal Reference` in both files) — or adopt architecture (B)?
+- `property_id` for no-UPRN rows: `RETURNING id` strategy + dedup semantics.
+- Non-largest-count `multiEntryOrdering` derivation rule (ADR-0004 deferred it).
+- Does the trigger body grow, or does the handler read `bulk_address_uploads`
+  (`multiEntryOrdering`, classifier S3 URI) directly?
+- Re-materialise semantics confirmed: recalculate overrides every finalise (snapshot
+  refreshes), `property` rows untouched.
+
+## 10. First steps in the new context
+
+1. Read §1 docs (esp. ADR-0004) + `CONTEXT.md`.
+2. Get a **real sample**: the combiner output CSV and the `{uploadId}-classifier.csv`
+   for one dev upload, and inspect whether `Internal Reference` is in both → settle
+   hazard #1 / pick architecture (A) vs (B).
+3. Decide the `property_id`-for-no-UPRN approach (hazard #2).
+4. Build `PropertyOverrideRow` + repository + orchestrator step + handler wiring,
+   TDD against fakes (mirror `tests/orchestration/test_bulk_upload_finaliser_orchestrator.py`).
+5. Update `CONTEXT.md` ("Property override" → populated) and add a v2 ADR if the
+   join/architecture choice is a real trade-off.
+
+## 11. Verification notes (environment)
+
+- Frontend: `npx tsc --noEmit` (was 0 errors at v1 close).
+- Model repo: `mypy`/`pytest` need a deps-installed env (the v1 session couldn't run
+  them locally; `/app` Docker config runs the full suite). `terraform plan` needs the
+  CLI. Watch `tests/test_lambda_packaging.py` for Dockerfile COPY gaps.
+- v1 is committed; dev Lambda + SQS queue are deployed and working
+  (`FINALISER_SQS_URL` wired in `backend/.env` for local, and in terraform/fast-api).