Add v2 handover doc for property_overrides population

Self-contained brief to start a fresh context implementing the per-Property fact
layer in the bulk_upload_finaliser: locked decisions, the four-input assembly,
the two open hazards (classifier-CSV↔combiner-output join key; property_id for
no-UPRN rows), candidate architectures, and a first-steps list. v1 (async
finalise writing `property`) is shipped and working.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
Jun-te Kim 2026-06-04 18:21:42 +00:00
parent b9bbd916b5
commit 744bfa2287

View file

@ -0,0 +1,208 @@
# Handover: `bulk_upload_finaliser` v2 — populate `property_overrides`
> **Purpose.** Self-contained brief to start a fresh context implementing v2 (the
> per-Property fact layer) in the existing `bulk_upload_finaliser` Lambda. v1
> (async finalise that writes `property`) is **shipped and working end-to-end**.
> This doc assumes no memory of the v1 session.
## 1. Where v1 left things (read first)
v1 made **Finalise** an async dispatched Lambda that writes `property` rows. The
full flow works in dev: dispatch `202` → SQS → Lambda inserts properties + writes
terminal status → UI advances to "Processing complete".
Authoritative background — **read these before coding**:
- `docs/design/bulk-upload-finaliser.md` — the full grilling/design doc (schema Q6Q9, snapshot-not-FK, recalculate-on-rerun, the v2 input hazards).
- `docs/adr/0005-async-bulk-upload-finaliser.md` (frontend) — state machine + `property_overrides` shape.
- `/workspaces/home/github/Model/docs/adr/0013-bulk-upload-finaliser-writes-properties.md` (backend) — the Lambda write path + DDD layering.
- `docs/adr/0004-multi-entry-building-part-ordering.md`**critical for v2**: how building parts and `multiEntryOrdering` work.
- `docs/adr/0002-landlord-override-vocabulary.md` — the vocabulary (`landlord_*_overrides`) layer v2 resolves against.
- `CONTEXT.md` — glossary: **Property override**, **Building part**, **Main building**, **Extension**, **Multi-entry**, **Building-part ordering**, **VocabularyMapping**.
**Convention that must hold (it was corrected hard in v1):** in the Model repo,
business logic lives in `orchestration/*_orchestrator.py`; the Lambda
`applications/*/handler.py` stays thin (parse trigger, wire infra, delegate). One
repository per aggregate; orchestrators never commit (the handler owns the
transaction via `commit_scope`). See memory `model-ddd-layering`.
## 2. v2 goal
Populate `property_overrides` during finalise: for each property, write one row per
`(building_part, override_component)` carrying the **resolved enum snapshot** of the
landlord's description for that part.
## 3. The target table (already shipped — migration 0221, do NOT re-migrate)
Drizzle: `src/app/db/schema/property_overrides.ts`.
```
property_overrides
id uuid pk
property_id bigint NOT NULL FK → property.id ON DELETE CASCADE
portfolio_id bigint NOT NULL FK → portfolio.id ON DELETE CASCADE
building_part smallint NOT NULL -- 0 = main, 1 = ext 1, 2 = ext 2, …
override_component override_component NOT NULL -- pgEnum {wall_type, roof_type, property_type, built_form_type}
override_value text NOT NULL -- snapshot of the resolved enum value
original_spreadsheet_description text NOT NULL -- raw cell text it resolved from
created_at / updated_at timestamptz NOT NULL
UNIQUE (property_id, override_component, building_part)
```
## 4. Design decisions already locked (do not relitigate)
- **Snapshot, not FK.** `override_value` is a denormalised text copy of the resolved
enum, taken at materialise time — *not* an FK to `landlord_*_overrides`. This is
what lets two properties sharing a description diverge later, and is required
because there are four polymorphic vocabulary tables. Lineage is the natural key
`(portfolio_id, override_component, original_spreadsheet_description)`.
- **Re-run = recalculate.** Write with `onConflictDoUpdate` on
`(property_id, override_component, building_part)`, refreshing `override_value` +
`original_spreadsheet_description` + `updated_at`. (Contrast `property`, which is
`onConflictDoNothing`.) When a per-property user-edit path eventually exists, this
upsert will need a `WHERE source='classifier'` guard — but there is **no `source`
column in v1**; add it as a nullable column only when that path is built.
- **`override_component` values** are exactly the classifier category keys
(`wall_type`, `roof_type`, `property_type`, `built_form_type`) used in both
`src/lib/bulkUpload/columnFields.ts` and the Model `ClassifiableColumn.name` — no
translation.
- **`building_part` indexing**: `0 = Main building`, `1 = Extension 1`, … per ADR-0004.
- **Whole-dwelling components** (`property_type`, `built_form_type`) are per-part-
capable but today's files supply them once → usually written at `building_part = 0`.
## 5. The hard part: assembling the inputs (this is the real v2 work)
The combiner output (what the v1 finaliser reads) carries **only** address/UPRN
columns — `Address 1/2/3`, `postcode`, `Internal Reference`, `address2uprn_uprn`,
`address2uprn_address`, `address2uprn_lexiscore`. The **raw `Walls`/`Roofs`/
`Property Type`/`Built Form` cells are NOT in it.** They live only in:
- the **classifier CSV** on S3 — `bulk_onboarding_inputs/{portfolioId}/{uploadId}-classifier.csv` (original landlord headers), and
- `landlord_*_overrides` in Postgres — the *resolved* values keyed by `(portfolio_id, normalized description)`.
To write one `property_overrides` row, v2 must assemble **four inputs**:
| Need | Source |
|---|---|
| `property_id` (identity) | combiner output → `(portfolio_id, uprn)`**but no-UPRN rows have no key** |
| raw cell text per row | the classifier CSV (not the combiner output) |
| split a multi-valued cell → building parts | `multiEntryOrdering` on `bulk_address_uploads` |
| description → `override_value` | `landlord_*_overrides` (resolve by normalized description) |
### Two open hazards to resolve first (do these before writing code)
1. **Join key between the classifier CSV and the combiner output.** Both derive from
the same upload rows, but **row order is NOT preserved** through postcode-split +
combine. So you need a stable per-row key present in *both* files. `Internal
Reference` is the candidate — **verify it survives into both** the address CSV
(→ combiner output) and the classifier CSV. If it doesn't, this is the first thing
to fix.
2. **`property_id` for unmatched (no-UPRN) rows.** v1's insert is `onConflictDoNothing`
and returns no ids. To attach overrides you need each row's `property.id`. For
UPRN rows you can re-select by `(portfolio_id, uprn)`; **no-UPRN rows can't be
re-found that way.** Likely fix: change the property insert to `RETURNING id`
mapped back to source rows (and decide the dedup/skip semantics for the RETURNING
path, since `onConflictDoNothing` returns nothing for conflicting rows).
### Two candidate architectures (evaluate against real sample files)
- **(A) Post-hoc join.** Keep the two files; the finaliser reads the combiner output
(UPRN/identity) and the classifier CSV (descriptions) and joins by `Internal
Reference`. Splits each multi-valued cell into parts via `multiEntryOrdering`,
resolves each part's description against `landlord_*_overrides`, and writes one row
per `(property, part, component)`. Lowest pipeline change; depends entirely on a
reliable join key.
- **(B) Carry descriptions through the pipeline.** Include the description columns in
the *address* CSV at `start-address-matching` so they flow through `address2uprn`
(which preserves input columns via `**row`) into the combiner output. Then the
finaliser reads **one** file with UPRN + descriptions in the same row — no join, no
key hazard. Costs a change to the address-CSV construction (frontend
`start-address-matching` route) and re-verifying `address2uprn`/combiner. Cleaner
long-term; bigger blast radius. **Recommended to seriously consider** — it deletes
hazard #1 entirely.
## 6. `multiEntryOrdering` — how to split cells into parts
Persisted on `bulk_address_uploads` (`src/app/db/schema/bulk_address_uploads.ts`):
```ts
MultiEntryOrdering { permutations: Record<string, number[]>; confirmed: boolean }
// permutations[count][k] = the 0-based FILE position holding building part k
// where 0 = Main building, 1..N-1 = Extension 1..N-1.
// e.g. { "2": [1, 0] } => for 2-entry rows, the main building is file position 1.
```
A multi-valued cell (e.g. `Walls = "Cavity: …, Solid brick: …"`) splits on commas
into entries by file position; `permutations[count]` maps file position → building
part. **Caveat (ADR-0004):** only the **largest count** permutation is captured this
iteration; other counts need a derivation rule — decide it in v2.
`multiEntrySummary` holds the detected multi-valued columns + **normalized**
description keys (the normalization that matches the classifier's stored keys:
`split → strip → lower`).
## 7. Resolving description → value (`landlord_*_overrides`)
Four per-component tables in `src/app/db/schema/landlord_overrides.ts`
(`landlord_wall_type_overrides`, `…_roof_type_…`, `…_property_type_…`,
`…_built_form_type_…`), each `UNIQUE (portfolio_id, description)`, value typed by the
component's pgEnum, plus a `source` (`classifier`|`user`). Resolve a normalized
description → `value`. The frontend already does this read in
`src/lib/bulkUpload/server.ts` (`lookupOverrides`) — mirror that mapping on the
backend. `UNKNOWN` is a legitimate stored value.
## 8. Backend pieces to build (DDD, mirror v1)
In `/workspaces/home/github/Model`:
- **`PropertyOverrideRow`** SQLModel mirror → `infrastructure/postgres/property_override_table.py` (mirror the pattern in `property_table.py` / `landlord_*_override_table.py`; reuse a shared `override_component` SAEnum like `landlord_override_enums.py`).
- **Repository** for the override write (one per aggregate): add to
`repositories/property/` (e.g. extend the property repo or a sibling
`property_override` repo), with an `upsert_all` using
`on_conflict_do_update(index_elements=[property_id, override_component, building_part], …)`.
- **Orchestrator logic** in `orchestration/bulk_upload_finaliser_orchestrator.py`:
extend `finalise(...)` (or add a step) to, after inserting properties and getting
ids, build the override rows (join → split by part → resolve) and persist them in
the **same** `commit_scope`.
- **Handler** stays thin — it already wires S3 + engine + repos. It will need the
extra input (classifier CSV and/or `multiEntryOrdering`); decide how those reach
the Lambda (extend `BulkUploadFinaliserTriggerBody`, or read `bulk_address_uploads`
for `multiEntryOrdering` + the classifier S3 URI). The trigger currently carries
`task_id, sub_task_id, s3_uri (combiner output), portfolio_id, bulk_upload_id`.
Key v1 files to extend (all in the Model repo):
- `applications/bulk_upload_finaliser/handler.py`
- `orchestration/bulk_upload_finaliser_orchestrator.py`
- `repositories/property/property_repository.py` + `property_postgres_repository.py`
- `infrastructure/postgres/property_table.py` (reference for the new mirror)
- `infrastructure/s3/csv_s3_client.py` (`read_rows`)
- Packaging test: `tests/test_lambda_packaging.py` will flag any new top-level import
the Dockerfile doesn't `COPY` (v1 hit this with `datatypes/`).
## 9. Open questions for v2 to decide
- Join key confirmed (`Internal Reference` in both files) — or adopt architecture (B)?
- `property_id` for no-UPRN rows: `RETURNING id` strategy + dedup semantics.
- Non-largest-count `multiEntryOrdering` derivation rule (ADR-0004 deferred it).
- Does the trigger body grow, or does the handler read `bulk_address_uploads`
(`multiEntryOrdering`, classifier S3 URI) directly?
- Re-materialise semantics confirmed: recalculate overrides every finalise (snapshot
refreshes), `property` rows untouched.
## 10. First steps in the new context
1. Read §1 docs (esp. ADR-0004) + `CONTEXT.md`.
2. Get a **real sample**: the combiner output CSV and the `{uploadId}-classifier.csv`
for one dev upload, and inspect whether `Internal Reference` is in both → settle
hazard #1 / pick architecture (A) vs (B).
3. Decide the `property_id`-for-no-UPRN approach (hazard #2).
4. Build `PropertyOverrideRow` + repository + orchestrator step + handler wiring,
TDD against fakes (mirror `tests/orchestration/test_bulk_upload_finaliser_orchestrator.py`).
5. Update `CONTEXT.md` ("Property override" → populated) and add a v2 ADR if the
join/architecture choice is a real trade-off.
## 11. Verification notes (environment)
- Frontend: `npx tsc --noEmit` (was 0 errors at v1 close).
- Model repo: `mypy`/`pytest` need a deps-installed env (the v1 session couldn't run
them locally; `/app` Docker config runs the full suite). `terraform plan` needs the
CLI. Watch `tests/test_lambda_packaging.py` for Dockerfile COPY gaps.
- v1 is committed; dev Lambda + SQS queue are deployed and working
(`FINALISER_SQS_URL` wired in `backend/.env` for local, and in terraform/fast-api).