diff --git a/docs/adr/0030-epc-prediction-validation-is-sap-version-aware-and-component-first.md b/docs/adr/0030-epc-prediction-validation-is-sap-version-aware-and-component-first.md index 26eef241..572724aa 100644 --- a/docs/adr/0030-epc-prediction-validation-is-sap-version-aware-and-component-first.md +++ b/docs/adr/0030-epc-prediction-validation-is-sap-version-aware-and-component-first.md @@ -42,9 +42,10 @@ No synthetic SAP-weighted Component Accuracy index: weighting components by SAP ### 4. Two validation tiers, one shared scorer - **Tier 1 — committed CI gate.** A small, **anonymized**, frozen fixture under `tests/fixtures/` (addresses hashed — the predictor uses address only as a dedup key — `post_town` dropped; postcode + component fields retained; gov data is OGL). A **ratcheting regression gate**: each per-component floor / residual ceiling is the currently-measured value and only ever *tightens* (honouring the repo's no-tolerance-widening ethos); a regression fails the build. End-to-end SAP / carbon / PE thresholds are loose and explicitly **calculator-floored** — gross-regression guards, not targets. Gates when the fixture is present; skips with a message otherwise. -- **Tier 2 — offline national battle-test.** Built on `harness/epc_bulk.py` (streams the gov **bulk export** via HTTP range requests, filtered by `sap_version`) and `harness/cohort.py` (offline sweep that **captures per-cert raises** instead of aborting). Streams the register and **buckets by postcode** — because bulk is the *whole* register, every postcode is dense, giving national breadth *and* dense cohorts at once. Over tens of thousands of 10.2 targets it emits the Component Accuracy table, the end-to-end MAE, **and a failure taxonomy** (unsupported-schema, mapper raise, calculator raise, no-cohort, no-10.2-target) — the battle-test half. Not committed, not CI-gated; its numbers periodically **re-baseline the Tier-1 floors**. +- **Tier 1.5 — S3-hosted scale run (near-term).** A few-thousand-cert anonymised corpus stored in **S3** rather than committed to git (too large to commit, but far more statistically stable than the 36-target gate fixture). The integration test pulls it to a temp dir and runs the *same* `load_corpus` + `evaluate_component_accuracy`, then reports / asserts. This is the immediate realization of "validate at scale" — it reuses the committed-fixture machinery wholesale (only the data *source* differs) and needs no bulk-export streaming. Its numbers re-baseline the Tier-1 floors. +- **Tier 2 — offline national battle-test (deferred).** Built on `harness/epc_bulk.py` (streams the gov **bulk export** via HTTP range requests, filtered by `sap_version`) and `harness/cohort.py` (offline sweep that **captures per-cert raises** instead of aborting). Streams the register and **buckets by postcode** — because bulk is the *whole* register, every postcode is dense, giving national breadth *and* dense cohorts at once. Over tens of thousands of 10.2 targets it emits the Component Accuracy table, the end-to-end MAE, **and a failure taxonomy** (unsupported-schema, mapper raise, calculator raise, no-cohort, no-10.2-target) — the battle-test half. Not committed, not CI-gated; its numbers periodically **re-baseline the Tier-1 floors**. -Both tiers run the *same* `compare_prediction` + calculator logic — one scorer, two harnesses. +All tiers run the *same* `evaluate_component_accuracy` / `compare_prediction` logic over `load_corpus` — one scorer, three data sources (committed fixture, S3 corpus, bulk stream). ## Consequences