mirror of
https://github.com/Hestia-Homes/assessment-model.git
synced 2026-06-30 12:55:02 +00:00
Add v2 handover doc for property_overrides population
Self-contained brief to start a fresh context implementing the per-Property fact layer in the bulk_upload_finaliser: locked decisions, the four-input assembly, the two open hazards (classifier-CSV↔combiner-output join key; property_id for no-UPRN rows), candidate architectures, and a first-steps list. v1 (async finalise writing `property`) is shipped and working. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
parent
b9bbd916b5
commit
744bfa2287
1 changed files with 208 additions and 0 deletions
208
docs/design/bulk-upload-finaliser-v2-handover.md
Normal file
208
docs/design/bulk-upload-finaliser-v2-handover.md
Normal file
|
|
@ -0,0 +1,208 @@
|
|||
# Handover: `bulk_upload_finaliser` v2 — populate `property_overrides`
|
||||
|
||||
> **Purpose.** Self-contained brief to start a fresh context implementing v2 (the
|
||||
> per-Property fact layer) in the existing `bulk_upload_finaliser` Lambda. v1
|
||||
> (async finalise that writes `property`) is **shipped and working end-to-end**.
|
||||
> This doc assumes no memory of the v1 session.
|
||||
|
||||
## 1. Where v1 left things (read first)
|
||||
|
||||
v1 made **Finalise** an async dispatched Lambda that writes `property` rows. The
|
||||
full flow works in dev: dispatch `202` → SQS → Lambda inserts properties + writes
|
||||
terminal status → UI advances to "Processing complete".
|
||||
|
||||
Authoritative background — **read these before coding**:
|
||||
- `docs/design/bulk-upload-finaliser.md` — the full grilling/design doc (schema Q6–Q9, snapshot-not-FK, recalculate-on-rerun, the v2 input hazards).
|
||||
- `docs/adr/0005-async-bulk-upload-finaliser.md` (frontend) — state machine + `property_overrides` shape.
|
||||
- `/workspaces/home/github/Model/docs/adr/0013-bulk-upload-finaliser-writes-properties.md` (backend) — the Lambda write path + DDD layering.
|
||||
- `docs/adr/0004-multi-entry-building-part-ordering.md` — **critical for v2**: how building parts and `multiEntryOrdering` work.
|
||||
- `docs/adr/0002-landlord-override-vocabulary.md` — the vocabulary (`landlord_*_overrides`) layer v2 resolves against.
|
||||
- `CONTEXT.md` — glossary: **Property override**, **Building part**, **Main building**, **Extension**, **Multi-entry**, **Building-part ordering**, **VocabularyMapping**.
|
||||
|
||||
**Convention that must hold (it was corrected hard in v1):** in the Model repo,
|
||||
business logic lives in `orchestration/*_orchestrator.py`; the Lambda
|
||||
`applications/*/handler.py` stays thin (parse trigger, wire infra, delegate). One
|
||||
repository per aggregate; orchestrators never commit (the handler owns the
|
||||
transaction via `commit_scope`). See memory `model-ddd-layering`.
|
||||
|
||||
## 2. v2 goal
|
||||
|
||||
Populate `property_overrides` during finalise: for each property, write one row per
|
||||
`(building_part, override_component)` carrying the **resolved enum snapshot** of the
|
||||
landlord's description for that part.
|
||||
|
||||
## 3. The target table (already shipped — migration 0221, do NOT re-migrate)
|
||||
|
||||
Drizzle: `src/app/db/schema/property_overrides.ts`.
|
||||
|
||||
```
|
||||
property_overrides
|
||||
id uuid pk
|
||||
property_id bigint NOT NULL FK → property.id ON DELETE CASCADE
|
||||
portfolio_id bigint NOT NULL FK → portfolio.id ON DELETE CASCADE
|
||||
building_part smallint NOT NULL -- 0 = main, 1 = ext 1, 2 = ext 2, …
|
||||
override_component override_component NOT NULL -- pgEnum {wall_type, roof_type, property_type, built_form_type}
|
||||
override_value text NOT NULL -- snapshot of the resolved enum value
|
||||
original_spreadsheet_description text NOT NULL -- raw cell text it resolved from
|
||||
created_at / updated_at timestamptz NOT NULL
|
||||
UNIQUE (property_id, override_component, building_part)
|
||||
```
|
||||
|
||||
## 4. Design decisions already locked (do not relitigate)
|
||||
|
||||
- **Snapshot, not FK.** `override_value` is a denormalised text copy of the resolved
|
||||
enum, taken at materialise time — *not* an FK to `landlord_*_overrides`. This is
|
||||
what lets two properties sharing a description diverge later, and is required
|
||||
because there are four polymorphic vocabulary tables. Lineage is the natural key
|
||||
`(portfolio_id, override_component, original_spreadsheet_description)`.
|
||||
- **Re-run = recalculate.** Write with `onConflictDoUpdate` on
|
||||
`(property_id, override_component, building_part)`, refreshing `override_value` +
|
||||
`original_spreadsheet_description` + `updated_at`. (Contrast `property`, which is
|
||||
`onConflictDoNothing`.) When a per-property user-edit path eventually exists, this
|
||||
upsert will need a `WHERE source='classifier'` guard — but there is **no `source`
|
||||
column in v1**; add it as a nullable column only when that path is built.
|
||||
- **`override_component` values** are exactly the classifier category keys
|
||||
(`wall_type`, `roof_type`, `property_type`, `built_form_type`) used in both
|
||||
`src/lib/bulkUpload/columnFields.ts` and the Model `ClassifiableColumn.name` — no
|
||||
translation.
|
||||
- **`building_part` indexing**: `0 = Main building`, `1 = Extension 1`, … per ADR-0004.
|
||||
- **Whole-dwelling components** (`property_type`, `built_form_type`) are per-part-
|
||||
capable but today's files supply them once → usually written at `building_part = 0`.
|
||||
|
||||
## 5. The hard part: assembling the inputs (this is the real v2 work)
|
||||
|
||||
The combiner output (what the v1 finaliser reads) carries **only** address/UPRN
|
||||
columns — `Address 1/2/3`, `postcode`, `Internal Reference`, `address2uprn_uprn`,
|
||||
`address2uprn_address`, `address2uprn_lexiscore`. The **raw `Walls`/`Roofs`/
|
||||
`Property Type`/`Built Form` cells are NOT in it.** They live only in:
|
||||
- the **classifier CSV** on S3 — `bulk_onboarding_inputs/{portfolioId}/{uploadId}-classifier.csv` (original landlord headers), and
|
||||
- `landlord_*_overrides` in Postgres — the *resolved* values keyed by `(portfolio_id, normalized description)`.
|
||||
|
||||
To write one `property_overrides` row, v2 must assemble **four inputs**:
|
||||
|
||||
| Need | Source |
|
||||
|---|---|
|
||||
| `property_id` (identity) | combiner output → `(portfolio_id, uprn)` … **but no-UPRN rows have no key** |
|
||||
| raw cell text per row | the classifier CSV (not the combiner output) |
|
||||
| split a multi-valued cell → building parts | `multiEntryOrdering` on `bulk_address_uploads` |
|
||||
| description → `override_value` | `landlord_*_overrides` (resolve by normalized description) |
|
||||
|
||||
### Two open hazards to resolve first (do these before writing code)
|
||||
|
||||
1. **Join key between the classifier CSV and the combiner output.** Both derive from
|
||||
the same upload rows, but **row order is NOT preserved** through postcode-split +
|
||||
combine. So you need a stable per-row key present in *both* files. `Internal
|
||||
Reference` is the candidate — **verify it survives into both** the address CSV
|
||||
(→ combiner output) and the classifier CSV. If it doesn't, this is the first thing
|
||||
to fix.
|
||||
|
||||
2. **`property_id` for unmatched (no-UPRN) rows.** v1's insert is `onConflictDoNothing`
|
||||
and returns no ids. To attach overrides you need each row's `property.id`. For
|
||||
UPRN rows you can re-select by `(portfolio_id, uprn)`; **no-UPRN rows can't be
|
||||
re-found that way.** Likely fix: change the property insert to `RETURNING id`
|
||||
mapped back to source rows (and decide the dedup/skip semantics for the RETURNING
|
||||
path, since `onConflictDoNothing` returns nothing for conflicting rows).
|
||||
|
||||
### Two candidate architectures (evaluate against real sample files)
|
||||
|
||||
- **(A) Post-hoc join.** Keep the two files; the finaliser reads the combiner output
|
||||
(UPRN/identity) and the classifier CSV (descriptions) and joins by `Internal
|
||||
Reference`. Splits each multi-valued cell into parts via `multiEntryOrdering`,
|
||||
resolves each part's description against `landlord_*_overrides`, and writes one row
|
||||
per `(property, part, component)`. Lowest pipeline change; depends entirely on a
|
||||
reliable join key.
|
||||
- **(B) Carry descriptions through the pipeline.** Include the description columns in
|
||||
the *address* CSV at `start-address-matching` so they flow through `address2uprn`
|
||||
(which preserves input columns via `**row`) into the combiner output. Then the
|
||||
finaliser reads **one** file with UPRN + descriptions in the same row — no join, no
|
||||
key hazard. Costs a change to the address-CSV construction (frontend
|
||||
`start-address-matching` route) and re-verifying `address2uprn`/combiner. Cleaner
|
||||
long-term; bigger blast radius. **Recommended to seriously consider** — it deletes
|
||||
hazard #1 entirely.
|
||||
|
||||
## 6. `multiEntryOrdering` — how to split cells into parts
|
||||
|
||||
Persisted on `bulk_address_uploads` (`src/app/db/schema/bulk_address_uploads.ts`):
|
||||
|
||||
```ts
|
||||
MultiEntryOrdering { permutations: Record<string, number[]>; confirmed: boolean }
|
||||
// permutations[count][k] = the 0-based FILE position holding building part k
|
||||
// where 0 = Main building, 1..N-1 = Extension 1..N-1.
|
||||
// e.g. { "2": [1, 0] } => for 2-entry rows, the main building is file position 1.
|
||||
```
|
||||
|
||||
A multi-valued cell (e.g. `Walls = "Cavity: …, Solid brick: …"`) splits on commas
|
||||
into entries by file position; `permutations[count]` maps file position → building
|
||||
part. **Caveat (ADR-0004):** only the **largest count** permutation is captured this
|
||||
iteration; other counts need a derivation rule — decide it in v2.
|
||||
`multiEntrySummary` holds the detected multi-valued columns + **normalized**
|
||||
description keys (the normalization that matches the classifier's stored keys:
|
||||
`split → strip → lower`).
|
||||
|
||||
## 7. Resolving description → value (`landlord_*_overrides`)
|
||||
|
||||
Four per-component tables in `src/app/db/schema/landlord_overrides.ts`
|
||||
(`landlord_wall_type_overrides`, `…_roof_type_…`, `…_property_type_…`,
|
||||
`…_built_form_type_…`), each `UNIQUE (portfolio_id, description)`, value typed by the
|
||||
component's pgEnum, plus a `source` (`classifier`|`user`). Resolve a normalized
|
||||
description → `value`. The frontend already does this read in
|
||||
`src/lib/bulkUpload/server.ts` (`lookupOverrides`) — mirror that mapping on the
|
||||
backend. `UNKNOWN` is a legitimate stored value.
|
||||
|
||||
## 8. Backend pieces to build (DDD, mirror v1)
|
||||
|
||||
In `/workspaces/home/github/Model`:
|
||||
- **`PropertyOverrideRow`** SQLModel mirror → `infrastructure/postgres/property_override_table.py` (mirror the pattern in `property_table.py` / `landlord_*_override_table.py`; reuse a shared `override_component` SAEnum like `landlord_override_enums.py`).
|
||||
- **Repository** for the override write (one per aggregate): add to
|
||||
`repositories/property/` (e.g. extend the property repo or a sibling
|
||||
`property_override` repo), with an `upsert_all` using
|
||||
`on_conflict_do_update(index_elements=[property_id, override_component, building_part], …)`.
|
||||
- **Orchestrator logic** in `orchestration/bulk_upload_finaliser_orchestrator.py`:
|
||||
extend `finalise(...)` (or add a step) to, after inserting properties and getting
|
||||
ids, build the override rows (join → split by part → resolve) and persist them in
|
||||
the **same** `commit_scope`.
|
||||
- **Handler** stays thin — it already wires S3 + engine + repos. It will need the
|
||||
extra input (classifier CSV and/or `multiEntryOrdering`); decide how those reach
|
||||
the Lambda (extend `BulkUploadFinaliserTriggerBody`, or read `bulk_address_uploads`
|
||||
for `multiEntryOrdering` + the classifier S3 URI). The trigger currently carries
|
||||
`task_id, sub_task_id, s3_uri (combiner output), portfolio_id, bulk_upload_id`.
|
||||
|
||||
Key v1 files to extend (all in the Model repo):
|
||||
- `applications/bulk_upload_finaliser/handler.py`
|
||||
- `orchestration/bulk_upload_finaliser_orchestrator.py`
|
||||
- `repositories/property/property_repository.py` + `property_postgres_repository.py`
|
||||
- `infrastructure/postgres/property_table.py` (reference for the new mirror)
|
||||
- `infrastructure/s3/csv_s3_client.py` (`read_rows`)
|
||||
- Packaging test: `tests/test_lambda_packaging.py` will flag any new top-level import
|
||||
the Dockerfile doesn't `COPY` (v1 hit this with `datatypes/`).
|
||||
|
||||
## 9. Open questions for v2 to decide
|
||||
|
||||
- Join key confirmed (`Internal Reference` in both files) — or adopt architecture (B)?
|
||||
- `property_id` for no-UPRN rows: `RETURNING id` strategy + dedup semantics.
|
||||
- Non-largest-count `multiEntryOrdering` derivation rule (ADR-0004 deferred it).
|
||||
- Does the trigger body grow, or does the handler read `bulk_address_uploads`
|
||||
(`multiEntryOrdering`, classifier S3 URI) directly?
|
||||
- Re-materialise semantics confirmed: recalculate overrides every finalise (snapshot
|
||||
refreshes), `property` rows untouched.
|
||||
|
||||
## 10. First steps in the new context
|
||||
|
||||
1. Read §1 docs (esp. ADR-0004) + `CONTEXT.md`.
|
||||
2. Get a **real sample**: the combiner output CSV and the `{uploadId}-classifier.csv`
|
||||
for one dev upload, and inspect whether `Internal Reference` is in both → settle
|
||||
hazard #1 / pick architecture (A) vs (B).
|
||||
3. Decide the `property_id`-for-no-UPRN approach (hazard #2).
|
||||
4. Build `PropertyOverrideRow` + repository + orchestrator step + handler wiring,
|
||||
TDD against fakes (mirror `tests/orchestration/test_bulk_upload_finaliser_orchestrator.py`).
|
||||
5. Update `CONTEXT.md` ("Property override" → populated) and add a v2 ADR if the
|
||||
join/architecture choice is a real trade-off.
|
||||
|
||||
## 11. Verification notes (environment)
|
||||
|
||||
- Frontend: `npx tsc --noEmit` (was 0 errors at v1 close).
|
||||
- Model repo: `mypy`/`pytest` need a deps-installed env (the v1 session couldn't run
|
||||
them locally; `/app` Docker config runs the full suite). `terraform plan` needs the
|
||||
CLI. Watch `tests/test_lambda_packaging.py` for Dockerfile COPY gaps.
|
||||
- v1 is committed; dev Lambda + SQS queue are deployed and working
|
||||
(`FINALISER_SQS_URL` wired in `backend/.env` for local, and in terraform/fast-api).
|
||||
Loading…
Add table
Reference in a new issue