docs(adr): ADR-0030 — record S3-hosted Tier-1.5 scale run

Tier-2 (full national bulk streaming) is deferred. The near-term scale
validation is Tier-1.5: an anonymised corpus of a few thousand certs,
stored in S3 (too large to commit, far more statistically stable than the
36-target gate fixture), pulled to a temp dir and run through the same
load_corpus + evaluate_component_accuracy path. It reuses the
committed-fixture machinery wholesale; only the data source differs. One
scorer, three data sources (committed fixture / S3 corpus / bulk stream).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Khalim Conn-Kowlessar 2026-06-14 09:25:58 +00:00
parent e3a2720e5c
commit a622f97d27


@@ -42,9 +42,10 @@ No synthetic SAP-weighted Component Accuracy index: weighting components by SAP
### 4. Two validation tiers, one shared scorer
- **Tier 1 — committed CI gate.** A small, **anonymized**, frozen fixture under `tests/fixtures/` (addresses hashed — the predictor uses address only as a dedup key — `post_town` dropped; postcode + component fields retained; gov data is OGL). A **ratcheting regression gate**: each per-component floor / residual ceiling is the currently-measured value and only ever *tightens* (honouring the repo's no-tolerance-widening ethos); a regression fails the build. End-to-end SAP / carbon / PE thresholds are loose and explicitly **calculator-floored** — gross-regression guards, not targets. Gates when the fixture is present; skips with a message otherwise.
-- **Tier 2 — offline national battle-test.** Built on `harness/epc_bulk.py` (streams the gov **bulk export** via HTTP range requests, filtered by `sap_version`) and `harness/cohort.py` (offline sweep that **captures per-cert raises** instead of aborting). Streams the register and **buckets by postcode** — because bulk is the *whole* register, every postcode is dense, giving national breadth *and* dense cohorts at once. Over tens of thousands of 10.2 targets it emits the Component Accuracy table, the end-to-end MAE, **and a failure taxonomy** (unsupported-schema, mapper raise, calculator raise, no-cohort, no-10.2-target) — the battle-test half. Not committed, not CI-gated; its numbers periodically **re-baseline the Tier-1 floors**.
+- **Tier 1.5 — S3-hosted scale run (near-term).** A few-thousand-cert anonymised corpus stored in **S3** rather than committed to git (too large to commit, but far more statistically stable than the 36-target gate fixture). The integration test pulls it to a temp dir and runs the *same* `load_corpus` + `evaluate_component_accuracy`, then reports / asserts. This is the immediate realization of "validate at scale" — it reuses the committed-fixture machinery wholesale (only the data *source* differs) and needs no bulk-export streaming. Its numbers re-baseline the Tier-1 floors.
+- **Tier 2 — offline national battle-test (deferred).** Built on `harness/epc_bulk.py` (streams the gov **bulk export** via HTTP range requests, filtered by `sap_version`) and `harness/cohort.py` (offline sweep that **captures per-cert raises** instead of aborting). Streams the register and **buckets by postcode** — because bulk is the *whole* register, every postcode is dense, giving national breadth *and* dense cohorts at once. Over tens of thousands of 10.2 targets it emits the Component Accuracy table, the end-to-end MAE, **and a failure taxonomy** (unsupported-schema, mapper raise, calculator raise, no-cohort, no-10.2-target) — the battle-test half. Not committed, not CI-gated; its numbers periodically **re-baseline the Tier-1 floors**.
-Both tiers run the *same* `compare_prediction` + calculator logic — one scorer, two harnesses.
+All tiers run the *same* `evaluate_component_accuracy` / `compare_prediction` logic over `load_corpus` — one scorer, three data sources (committed fixture, S3 corpus, bulk stream).
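
The "one scorer, three data sources" arrangement can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the bodies of `load_corpus` and `evaluate_component_accuracy` are stand-ins with guessed signatures (the repo's real implementations are not shown in this change), and `run_tier`, `fetch_from_dir`, `fetch_from_s3`, the one-JSON-file-per-cert layout, and the `actual` / `predicted` keys are all hypothetical. The S3 fetcher assumes `boto3` is installed and credentials are configured; it is imported lazily so the rest of the sketch runs without it.

```python
import json
import shutil
import tempfile
from pathlib import Path
from typing import Callable

# A "data source" is anything that can materialise corpus files into a directory.
Fetch = Callable[[Path], None]


def load_corpus(root: Path) -> list[dict]:
    """Stand-in loader (assumed layout: one JSON file per cert under root)."""
    return [json.loads(p.read_text()) for p in sorted(root.glob("*.json"))]


def evaluate_component_accuracy(corpus: list[dict]) -> dict[str, float]:
    """Stand-in scorer: per-component fraction of certs where predicted == actual."""
    hits: dict[str, int] = {}
    totals: dict[str, int] = {}
    for cert in corpus:
        for name, actual in cert["actual"].items():
            totals[name] = totals.get(name, 0) + 1
            hits[name] = hits.get(name, 0) + int(cert["predicted"].get(name) == actual)
    return {name: hits[name] / totals[name] for name in totals}


def run_tier(fetch: Fetch) -> dict[str, float]:
    """Pull a corpus into a temp dir, then run the shared loader and scorer."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        fetch(root)
        return evaluate_component_accuracy(load_corpus(root))


def fetch_from_dir(src: Path) -> Fetch:
    """Tier-1 style source: copy a committed fixture directory."""
    def fetch(dest: Path) -> None:
        for p in src.glob("*.json"):
            shutil.copy(p, dest / p.name)
    return fetch


def fetch_from_s3(bucket: str, prefix: str) -> Fetch:
    """Tier-1.5 style source: download every object under an S3 prefix (hypothetical layout)."""
    def fetch(dest: Path) -> None:
        import boto3  # lazy import: only needed when this source is actually used

        s3 = boto3.client("s3")
        paginator = s3.get_paginator("list_objects_v2")
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
            for obj in page.get("Contents", []):
                key = obj["Key"]
                if key.endswith("/"):  # skip folder placeholder objects
                    continue
                s3.download_file(bucket, key, str(dest / Path(key).name))
    return fetch
```

Under these assumptions, a gate test and a scale run would differ only in the fetcher they pass, for example `run_tier(fetch_from_dir(Path("tests/fixtures")))` versus `run_tier(fetch_from_s3(bucket, prefix))`; a bulk-stream source would be a third `Fetch` with the scorer untouched.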
## Consequences