From cc0e875fd865f75ada24c4b149d89461ac43fe2a Mon Sep 17 00:00:00 2001
From: Jun-te Kim
Date: Thu, 11 Jun 2026 12:06:03 +0000
Subject: [PATCH] Record pre-SAP10 RdSAP family coefficient transfer (ADR-0028)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Documents the inherit-and-validate decision for 18.0: reuse 20.0.0's
0.148 + band multipliers (the corpus can't self-fit — 958/1000 band-1
with no measured band-1 windows), validated against 18.0's own band-4
rich certs (0.223 obs vs 0.148 x 1.51 pred). References ADR-0027
one-way (keeps the accepted ADR immutable).

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 ...sap10-rdsap-family-coefficient-transfer.md | 106 ++++++++++++++++++
 1 file changed, 106 insertions(+)
 create mode 100644 docs/adr/0028-pre-sap10-rdsap-family-coefficient-transfer.md

diff --git a/docs/adr/0028-pre-sap10-rdsap-family-coefficient-transfer.md b/docs/adr/0028-pre-sap10-rdsap-family-coefficient-transfer.md
new file mode 100644
index 00000000..7bdfbc6e
--- /dev/null
+++ b/docs/adr/0028-pre-sap10-rdsap-family-coefficient-transfer.md
@@ -0,0 +1,106 @@
+---
+Status: accepted
+---
+
+# Pre-SAP10 RdSAP coefficients transfer across the family: inherit-and-validate, starting with 18.0
+
+Decided in a `/grill-me` session (2026-06-11). **Extends** [ADR-0027](0027-rdsap-20-0-0-reduced-field-synthesis.md)
+(RdSAP 20.0.0 Reduced-Field Synthesis) from a single spec to the wider **pre-SAP10 RdSAP family**;
+sits inside the **old-schema re-map** half of **Rebaselining** ([CONTEXT.md](../../CONTEXT.md):
+_Effective EPC_, _Rebaselining_, _Reduced-Field Synthesis_, _Validation Cohort_, _Spec Version_).
+Relates to [ADR-0015](0015-mappers-own-cert-normalization.md) (mappers own cert normalization) and
+[ADR-0004](0004-baseline-performance-lodged-effective-pair.md) (lodged-vs-effective pair). Grill spec:
+[docs/grill-sessions/2026-06-10-pre-sap10-mapper-generalization.md](../grill-sessions/2026-06-10-pre-sap10-mapper-generalization.md).
+
+## Context
+
+ADR-0027 proved Reduced-Field Synthesis end-to-end for `RdSAP-Schema-20.0.0`. The pre-SAP10 RdSAP
+family has more orphaned siblings (`19.0`, `18.0`, `17.1`, `17.0`) whose mapper methods exist but are
+unreachable (`from_api_response` never dispatches to them) and whose placeholder schemas over-constrain
+identically. We want each re-mapped to the current `EpcPropertyData` so its historical certs can be
+**Rebaselined**. This ADR records the *family-level* coefficient decision; `18.0` is the first instance
+and the worked example. (Order set by direction 2026-06-11: **18.0 alone, end-to-end, first**; `17.1`
+is a separate later effort.)
+
+ADR-0027 left one question open for the rest of the family: do later pre-SAP10 specs **reuse** 0027's
+fitted coefficients (`0.148 × total_floor_area × band_multiplier`, multipliers
+`{Normal 1.00, More 1.25, Less 0.81, MuchMore 1.51, MuchLess 0.62}`), or **re-fit** per spec? The
+initial direction (2026-06-10) was *re-fit from each new corpus's own data — do not inherit by default*.
+Profiling the harvested `18.0` corpus (1000 certs from `certificates-2018.json`, ~82% of that dump)
+showed why a literal re-fit is **not achievable**, and — more usefully — that it is **not necessary**:
+
+- **The corpus cannot self-fit the glazing/floor ratio.** A reduced schema records `glazed_area` as a
+  *band*, not per-window m². `18.0`'s population is **958/1000 band-1 (Normal)**, and only **10/1000**
+  carry a lodged `sap_windows` array at all. So there is no measured glazing column to regress on for
+  the band that dominates the stock — the exact constraint ADR-0027 anticipated.
+- **The 10 rich certs are systematically the outliers, not a representative sample.** They are
+  **9× band-4 ("Much More Than Typical") + 1× band-5 ("Much Less")**, with **zero band-1**. The
+  dwellings that bother to lodge full per-window geometry are the unusually-glazed ones. A "fit" off
+  these would measure band-4 dwellings, then dividing by the band-4 multiplier (1.51) only reconstructs
+  `0.148` — circular.
+- **Where the corpus *can* be measured, it reproduces ADR-0027's model almost exactly:**
+
+  | Band | 18.0 observed glazing/floor (n) | ADR-0027 predicts (`0.148 × mult`) |
+  |------|---------------------------------|------------------------------------|
+  | 4 (MuchMore, ×1.51) | **0.223** (n=9) | **0.223** |
+  | 5 (MuchLess, ×0.62) | **0.086** (n=1) | **0.092** |
+
+  So the new corpus's own data **validates** the inherited coefficients rather than contradicting them.
+- **Integer code spaces are identical.** `built_form`, `glazed_area`, `glazed_type`, and
+  `mechanical_ventilation` were diffed against `datatypes/epc/domain/epc_codes.csv` for
+  `18.0` / `17.1` / `20.0.0` / `21.0.1`: byte-identical for every code the corpus uses (`glazed_type`
+  1-8 + ND; `built_form` 1-6 + NR; `glazed_area` 1-5 + ND). The cert-side codes never reach 21.0.1's
+  later extensions. So the verified 21.0.1 glazing/sheltered-sides cascades apply verbatim — no per-spec
+  override.
+
+## Decision
+
+For the pre-SAP10 RdSAP family, **inherit ADR-0027's coefficients and validate the transfer per spec —
+do not re-fit by default.** Concretely, for `18.0` (and as the rule for `17.x`/`19.0`):
+
+- **Reuse `0.148` and the band multipliers unchanged.** The corpus structurally cannot self-fit them
+  (96% band-1, zero measured band-1 windows), and where it can be measured it reproduces the inherited
+  model to within rounding. Re-fit a spec **only if** its own rich certs contradict the inherited model;
+  `18.0` does not.
+- **The rich certs are a per-spec Validation Cohort, not a fit set.** Their lodged `window_area` is used
+  **directly** as geometry (the accuracy-where-we-have-it rule from ADR-0027 — synthesise only over the
+  windowless majority, never over real measured data). For `18.0` that is 10 certs direct, 990
+  synthesised.
+- **Route through the existing verified cascades verbatim** (glazing-type, sheltered-sides), per the
+  code-space diff above.
+- **Schema parse fix = ADR-0027's mechanism plus one additive change.** (a) `@dataclass(kw_only=True)` +
+  data-driven required→optional: any field present in <100% of the corpus gets a default (`[]` for
+  lists, `None` otherwise) — for `18.0` that is `lzc_energy_sources`, `glazing_gap`
+  (`Optional[Union[int, str]]` — the corpus lodges str, int, **and** absent), `pvc_window_frames`, and
+  scattered `SapBuildingPart` / `AlternativeImprovement` / `PhotovoltaicSupply` fields; this takes the
+  parse rate from 14/1000 to 1000/1000. (b) **Add a `sap_windows` field** — the placeholder `18.0`
+  schema omits it entirely, so without this the 10 rich certs' lodged geometry is silently dropped at
+  parse time, defeating the direct-use rule.
+
+Because there is still no same-spec ground truth (**Validation Cohort** rule), every synthesis
+assumption is recorded in code comments + test names, exactly as ADR-0027 requires.
+
+## Consequences
+
+- **The coefficients are now shared across specs.** Changing `0.148`, a band multiplier, or the 4-way
+  orientation split moves **every** rebaselined 20.0.0 **and** 18.0 score (and any 17.x/19.0 that later
+  joins). The blast radius of ADR-0027's named-constant block grew; that is the cost of transfer and the
+  reason the constants stay in one place with their derivation recorded.
+- **The transfer is validated, not the absolute fit.** The band-4 match (0.223 obs vs pred) confirms the
+  *model shape* carries from 21.0.1-era stock to 2018-era stock; it does not independently establish the
+  base ratio for band-1, which remains inherited. Revisit if (a) the retired **RdSAP 2012** band→m²
+  formula is sourced, or (b) a same-spec Validation Cohort becomes available.
+- **No cross-spec anchor exists in the current corpora.** A dual-lodged UPRN (same dwelling certified
+  under two specs) would let two re-scores cross-check, but the year-capped corpora have **zero** UPRN
+  overlap (18.0∩20.0.0 = 0). A true anchor would have to be *manufactured* via a targeted dual-lodged
+  harvest (scan the 2018 and 2022 dumps for shared UPRNs) — deferred, not part of landing 18.0.
+- **Acceptance bar matches 20.0.0 (ADR-0027):** the corpus test promotes `RdSAP-Schema-18.0` into the
+  strict **parse + map** guard (1000/1000 return `EpcPropertyData`); it does **not** assert calculator
+  scores. Scoring is spot-checked manually via `scripts/eon/find_epc_data.py`; the formal score-value
+  test stays deferred. Expect wider lodged-vs-recalc deltas than 20.0.0 — the lodged 18.0 figure is on
+  an older SAP version, so it is Lodged Performance, not a target.
+- **Synthesis stays copied for the first instance; the shared helper is deferred.** `18.0` adapts
+  ADR-0027's synthesis inline (one new instance). The shared, spec-parameterised
+  `_synthesise_reduced_field_windows` is extracted when `17.1` lands (the second instance), pulling
+  20.0.0 + 18.0 + 17.1 through one coefficient block — avoiding abstraction from a single example while
+  preventing three divergent copies.