`_synthesise_20_0_0_sap_windows` passed `schema.multiple_glazing_type`
straight into `_api_cascade_glazing_type`, which raised UnmappedApiCode on
the "ND" (Not Defined) string that the 20.0.0 corpus lodges alongside the
1-8 integer codes — failing the mapper-coverage guard on every ND-glazed
20.0.0 cert. Mirror the existing 18.0/19.0/17.x seams: route integer codes
through the cascade, fall the "ND" string back to the DG-modal default
(cascade code 2 → daylight g_L 0.80). Also corrects the 20.0.0 schema
field type `int` → `Union[int, str]` to match the data (as 18.0 already
does), which keeps the isinstance guard pyright-clean.
Pre-existing failure (present before this branch's recent commits), not in
the handover regression gate. Fixes all 15 RdSAP-Schema-20.0.0 ND certs;
test_mapper_corpus 6002/6002 pass. pyright net-zero.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Bumps N_POSTCODES 40 -> 150 for the fetch script. Larger corpus (150
postcodes / 3719 certs) reduces leave-one-out variance and unblocks the
recency-template work (#1223), which regressed the noisier 36-target gate
fixture. Corpus itself stays out of git (gitignored /tmp + persistent
backup at /workspaces/home/epc_prediction_corpus_backup).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Completes `_elmhurst_secondary_fuel_from_sap_code` per SAP 10.2 §12
(PDF p.34: "Secondary heating systems and applicable fuel types are taken
from the room heaters section of Table 4a") + RdSAP 10 §10.4.1. Each
Table 4a room-heater code now resolves to its fuel CATEGORY's modal fuel:
- gas room heaters 601-613 → mains gas (26 → Table 32 1, 3.48 p/kWh)
- liquid room heaters 621-625 → heating oil (28 → Table 32 4, 5.44 p/kWh)
- solid room heaters 631-636 → house coal (11 → Table 32 11, 3.67 p/kWh)
- electric room htrs 691-694/699/701 → None (cascade electricity default)
Previously only the gas (601-613→26) and solid (631-634→11) blocks were
mapped; liquid heaters (621-625) and 635-636 fell through to None →
silently billed as electricity (13.19 p/kWh), a large mis-price for an
oil/solid heater. The prior slice raised on those; this maps them to the
correct category fuel instead, and keeps the raise ONLY for codes inside
the room-heater range (601-701) that are not a recognised Table 4a row.
The specific sub-fuel within a category (mains gas vs LPG vs biogas) is a
SEPARATE lodgement per §10.4.1 and is NOT exported in the Summary, so the
gas block stays the modal mains gas — worksheet "simulated case 37" lodged
its 605 live-effect fire on biogas (7.60 p/kWh), unrecoverable from the
Summary code alone (this is the entire +7 SAP case-37 gap: secondary
energy £131 + a separate biogas standing charge £70; every other line
matches the worksheet exactly, incl. (206) main efficiency 61%).
5 AAA tests, harness 47/47 (0 raised), pyright net-zero, regression clean,
corpus gauge unchanged (Elmhurst-path only).
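The category mapping above can be sketched as follows. This is an illustration under assumptions, not the body of `_elmhurst_secondary_fuel_from_sap_code`: the function name, the `ValueError`, and the behaviour for codes outside 601-701 (returning None) are hypothetical.

```python
from typing import Optional

# Fuel codes quoted in the commit message.
MAINS_GAS, HEATING_OIL, HOUSE_COAL = 26, 28, 11
ELECTRIC_ROOM_HEATERS = {691, 692, 693, 694, 699, 701}


def secondary_fuel_from_sap_code(sap_code: int) -> Optional[int]:
    """Map a Table 4a room-heater code to its category's modal fuel."""
    if 601 <= sap_code <= 613:
        return MAINS_GAS
    if 621 <= sap_code <= 625:
        return HEATING_OIL
    if 631 <= sap_code <= 636:
        return HOUSE_COAL
    if sap_code in ELECTRIC_ROOM_HEATERS:
        return None  # the cascade's electricity default applies
    if 601 <= sap_code <= 701:
        # Inside the room-heater range but not a recognised row: refuse to guess.
        raise ValueError(f"unrecognised room-heater code {sap_code}")
    # Assumption: codes outside the room-heater range are not handled here.
    return None
```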
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
ADR-0029 decision 5: survivors were treated equally; now each neighbour's
vote in the cohort mode decays with its distance from the cohort's physical
centre (floor area from the median, age band from the modal band), so the
mode leans on the most representative neighbours instead of being swayed by
size/era outliers. Scales (size 20 m^2, age weight 0.5) chosen on the
validation corpus; the tight size kernel is load-bearing (looser scales
regress floor_insulation on the fixture).
Corpus (181 SAP-10.2 targets): wall_insulation 83.4->86.2%,
roof_construction 86.2->87.3%, floor_construction 78.8->81.2%,
floor_insulation 92.9->94.1%; net +7.5pp gained vs -1.1pp (two 1-cert dips,
both held on the fixture). Geometry/residuals untouched (template unchanged).
Gate (36-target fixture): zero regression across all 24 floors/ceilings;
ratcheted wall_insulation_type 0.7778->0.8333, floor_construction
0.7500->0.8125, floor_insulation 0.9062->0.9375. Dead _mode/_int_mode
removed (superseded by the weighted variants).
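A rough sketch of a distance-weighted mode, for orientation only. The commit gives the scales (20 m^2, 0.5) but not the kernel, so the exponential decay, the additive distance, the integer age-band indices, and the name `weighted_mode` are all assumptions.

```python
import math
from collections import defaultdict
from statistics import median, mode
from typing import Dict, Hashable, Sequence

SIZE_SCALE_M2 = 20.0  # size scale quoted in the commit
AGE_WEIGHT = 0.5      # age weight quoted in the commit


def weighted_mode(
    values: Sequence[Hashable],
    floor_areas_m2: Sequence[float],
    age_band_indices: Sequence[int],
) -> Hashable:
    """Mode where each vote decays with distance from the cohort centre."""
    centre_area = median(floor_areas_m2)
    centre_band = mode(age_band_indices)
    votes: Dict[Hashable, float] = defaultdict(float)
    for value, area, band in zip(values, floor_areas_m2, age_band_indices):
        # Assumed form: scaled size distance plus weighted age-band distance.
        distance = abs(area - centre_area) / SIZE_SCALE_M2 + AGE_WEIGHT * abs(band - centre_band)
        votes[value] += math.exp(-distance)
    return max(votes, key=lambda value: votes[value])
```

With three "solid" neighbours of which two are far larger than the median, and two "cavity" neighbours near the median, the plain mode picks "solid" while this weighted variant picks "cavity".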
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Adds PredictionConfidence (cohort size + per-component agreement = the
modal value's share among neighbours that lodge one) and
EpcPrediction.confidence(), a compute-only signal so downstream can flag
low-confidence components (ADR-0029 open item: 'confidence signal').
Sanity check on the 40-postcode corpus (1068 component predictions):
agreement is strongly predictive of correctness — pooled hit-rate 21.9%
(<0.5) / 46.7% (0.5-0.7) / 73.6% (0.7-0.9) / 95.5% (>=0.9); point-biserial
corr(agreement, correct) = 0.582. Cohort size tracks too (<6 -> 68.4%,
>=20 -> 96.0%). Surfacing / persistence is a separate HITL follow-up.
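The per-component agreement defined above (the modal value's share among neighbours that lodge one) can be sketched as below. The function name and the None return for an empty cohort are assumptions rather than the PredictionConfidence implementation.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def agreement(neighbour_values: Sequence[Optional[Hashable]]) -> Optional[float]:
    """Share of the modal value among neighbours that lodge a value."""
    lodged = [value for value in neighbour_values if value is not None]
    if not lodged:
        return None  # assumption: no lodged values means no agreement signal
    _, modal_count = Counter(lodged).most_common(1)[0]
    return modal_count / len(lodged)
```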
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The Elmhurst Summary lodges only the secondary heating SAP code (Table 4a
Category 10), never its fuel. `_elmhurst_secondary_fuel_from_sap_code`
mapped the gas block (601-613 → mains gas) and solid block (631-634 →
house coal) to their modal defaults, but returned None for any OTHER
Category-10 code — and None makes the cascade SILENTLY bill the secondary
as electricity (13.19 p/kWh). For a fuel-fired heater (e.g. 621-625
liquid-fuel oil/bioethanol) that is a large, invisible mis-price.
Per the UnmappedElmhurstLabel strict-raise pattern (mirrors the wall_type
/ glazing label raises), a fuel-fired Category-10 code (601-699) outside
the mapped gas/solid blocks now RAISES instead of guessing. Electric room
heaters (691-699) keep returning None — electricity IS their fuel.
The gas block 601-613 still resolves to the modal default mains gas: the
Summary cannot distinguish mains gas from LPG/biogas, so an LPG or biogas
live-effect fire (worksheet "simulated case 37" used biogas at 7.60 p/kWh
vs our 3.48 p/kWh mains-gas default, a +7 SAP gap) is not recoverable from
the Summary export — that is a data-availability limit, not a guess we can
fix here. This commit closes the genuinely-silent-wrong path; the gas
sub-fuel remains the documented modal default.
Worksheet harness 47/47, 0 raised. 3 AAA tests, pyright net-zero,
regression clean, corpus gauge unchanged (Elmhurst-path only; the API path
lodges the secondary fuel explicitly).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Measurement honesty so we optimise SAP-relevant accuracy, not SAP-neutral
misses (ADR-0030 Component Accuracy):
- Add construction_age_band_pm1: an exact-or-adjacent-band hit. Adjacent
RdSAP age bands carry near-identical U-values, so an off-by-one is
~SAP-neutral. Full corpus: exact 78.5% but ±1-band 91.7% (fixture
63.9% -> 83.3%) — most age misses are adjacent.
- Drop window_count from the gate's residual ceilings (cosmetic): the
predicted picture clusters at a mapper-default 4 windows vs actuals 1-21,
but total_window_area (the SAP-relevant signal) stays tight at ~3.4 m2.
Gate: + construction_age_band_pm1 floor 0.8333; window_count no longer gated.
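A small sketch of the exact-or-adjacent-band hit. It assumes age bands are lodged as ordered single letters A to M and that a missing actual is not applicable; the helper name and both assumptions are illustrative.

```python
from typing import Optional

AGE_BANDS = "ABCDEFGHIJKLM"  # assumption: ordered letter bands


def age_band_pm1_hit(predicted: Optional[str], actual: Optional[str]) -> Optional[bool]:
    """True when the predicted band is the actual band or its neighbour."""
    if actual is None:
        return None  # not applicable, excluded from the denominator
    if predicted is None:
        return False
    return abs(AGE_BANDS.index(predicted) - AGE_BANDS.index(actual)) <= 1
```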
Closes #1222
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
End-to-end API-path regression pin for the battery behaviour validated by
the user-simulated Elmhurst worksheet pair (cert 001431 "simulated case
35/36", 5 kWh, export-capable, mains-gas, standard tariff). The official
SAP rating ("10a. Fuel costs - using Table 12 prices") values PV used-in-
dwelling and PV exported identically at 13.19 p/kWh (export code 60 ==
import code 30, ADR-0010), so a battery only redistributes PV between two
equally-priced lines: worksheet PV credit (252) = -455.6458 and SAP (258)
= 88.0859 are IDENTICAL with/without the battery (ΔSAP = 0).
Two tests over the committed RdSAP-21.0.1 corpus:
- standard tariff (meter 2): toggling the battery holds continuous SAP
EXACTLY constant, while at least one cert's primary energy DOES respond
(proving the App-M1 §3c β-split is wired, not a dropped battery).
- off-peak tariff (meter != 2): the battery STRICTLY raises SAP, because
self-consumed PV displaces high-rate import (15.29) above the 13.19
export credit — confirming the standard-tariff neutrality is a price
coincidence, not a no-op.
Guards table_32 export price (code 60) and the battery β-split against
silent regression. Complements the unit-level β tests in
test_photovoltaic.py.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Investigated recency-weighting (weight cohort votes by an exponential decay
in cert age). Key finding: it must be SELECTIVE. On the validation corpus it
HURTS permanent categoricals (wall 91.2->89.5, age 78.5->75.7 — discards
still-valid data) but clearly HELPS time-varying ones, where a recent
neighbour reflects the current physical state:
roof_insulation_thickness 56.7 -> 60.7% corpus (+4pp)
29.4 -> 41.2% fixture (+12pp)
So apply a recency-weighted mode only to roof_insulation_thickness (loft
top-ups happen over time); keep the plain mode for permanent categoricals.
tau = 4yr (~2.8yr half-life); falls back to plain mode when no registration
dates are lodged. Gate floor ratcheted 0.2941 -> 0.4118.
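The recency-weighted mode can be sketched like this, assuming a weight of exp(-age / tau) per neighbour. The function name, the 365.25-day year, and skipping undated neighbours when at least one date exists are assumptions; only tau = 4yr and the plain-mode fallback come from the commit.

```python
import math
from collections import Counter, defaultdict
from datetime import date
from typing import Dict, Hashable, Optional, Sequence

TAU_YEARS = 4.0  # decay constant from the commit; half-life = tau * ln 2


def recency_weighted_mode(
    values: Sequence[Hashable],
    registration_dates: Sequence[Optional[date]],
    today: date,
) -> Hashable:
    """Mode with votes decayed by cert age; plain mode when no dates exist."""
    dated = [(v, d) for v, d in zip(values, registration_dates) if d is not None]
    if not dated:
        return Counter(values).most_common(1)[0][0]
    votes: Dict[Hashable, float] = defaultdict(float)
    for value, registered in dated:
        age_years = (today - registered).days / 365.25
        votes[value] += math.exp(-age_years / TAU_YEARS)
    return max(votes, key=lambda value: votes[value])
```

Two old "100mm" lodgements are outvoted by one recent "300mm" lodgement, which is the loft top-up case the commit describes.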
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
These independent fabric categoricals were template-copied; mode them like
the construction categoricals. Verified mode beats template before applying.
Big fixture win on roof insulation thickness (doubled), floor insulation
neutral-to-positive:
roof_insulation_thickness 14.7% -> 29.4% (gate floor ratcheted up)
floor_insulation 90.6% (unchanged on the fixture)
Glazing type was tried too (+1.6pp on the 40-postcode corpus) but REGRESSED
the 36-target fixture (0.50 -> 0.44) — the gate caught it. Glazing moding is
marginal/noisy, so it's left on the template; revisit with a larger corpus.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Tier-2 (full national bulk streaming) is deferred. The near-term scale
validation is a Tier-1.5: a few-thousand-cert anonymised corpus stored in
S3 (too large to commit, far more stable than the 36-target gate fixture),
pulled to a temp dir and run through the same load_corpus +
evaluate_component_accuracy. Reuses the committed-fixture machinery wholesale
— only the data source differs. One scorer, three data sources (committed
fixture / S3 corpus / bulk stream).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The committed CI gate: run the calculator-free leave-one-out scorer over the
frozen anonymised fixture (36 SAP-10.2 targets) and assert each per-component
classification rate / geometry residual is no worse than a committed baseline.
Prediction is deterministic + the fixture frozen, so the numbers reproduce
exactly — a failure is a real regression, never sample noise.
- 19 rate floors + 5 residual ceilings, seeded at the currently-measured
values; they only ever tighten (no-widening ethos on an aggregate).
- Calculator-FREE — component floors are the real gate; the end-to-end
SAP/carbon/PE guards stay out (their floor is the separate API-path
calculator workstream).
- Skips with a message when the fixture is absent.
24 parametrized assertions, all green.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The committed gate needs frozen, reproducible data without dumping real UK
addresses into the repo. Add:
- harness anonymise_payload + stable_hash: hash street address + cert number
into opaque, dedup-stable tokens; blank secondary address lines + post_town;
keep postcode + all component/lodged fields (gov data is OGL). Unit-tested.
- scripts/build_epc_prediction_fixture.py: curate qualifying postcodes (>=1
SAP 10.2 target + >=2 distinct addresses) from the local scratch corpus,
anonymise, freeze under tests/fixtures/epc_prediction/.
- The frozen fixture: 15 postcodes / 280 certs / 36 SAP-10.2 targets.
Verified no plaintext address_line_1 and post_town all blank.
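One possible shape for the anonymisation step, as a sketch only. The function names match the commit, but the bodies are assumed: SHA-256 truncated to 16 hex characters, case and whitespace normalisation, and the `address_line_2`, `address_line_3` and `certificate_number` keys are all hypothetical.

```python
import hashlib
from typing import Any, Dict


def stable_hash(text: str, length: int = 16) -> str:
    """Opaque token that is identical for the same normalised input."""
    normalised = " ".join(text.upper().split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()[:length]


def anonymise_payload(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy with address and cert number hashed, town blanked."""
    out = dict(payload)
    for key in ("address_line_1", "certificate_number"):
        if out.get(key):
            out[key] = stable_hash(str(out[key]))
    for key in ("address_line_2", "address_line_3", "post_town"):
        if key in out:
            out[key] = ""
    return out  # postcode and component fields are left untouched
```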
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
A "No system present: electric heaters assumed" lodging carries SAP
Table 4a code 699 (electric room heaters) but RdSAP main_heating_category
1, NOT 10. `_table_12a_system_for_main` keyed the direct-acting-electric
routing on category==10 only, so the category-1 form fell through to None
and `_space_heating_fuel_cost_gbp_per_kwh` billed space heating 100% at
the off-peak LOW rate — as if direct-acting room heaters charged overnight
like storage.
Per RdSAP 10 §12 Rule 3 (PDF p.62) electric room heaters (691-694, 699)
route to the 10-hour tariff, and SAP 10.2 Table 12a Grid 1 (PDF p.191)
gives the "other direct-acting electric" row a 0.50 high-rate fraction at
10-hour (1.00 at 7-hour). Route those SAP codes — the same set §12 Rule 3
already uses — to OTHER_DIRECT_ACTING_ELECTRIC alongside the category-10
gate.
Found via the PE/CO2-vs-cost split on the worst over-rater in the /tmp
sample: cert 2958 PE +0% / CO2 -1% (energy correct) but SAP +32.2 — a
pure cost-side bug. Space rate 7.50 -> 11.09 p/kWh; cert 2958 +32.2 ->
+14.7. The committed corpus gauge is unchanged (its 3 non-category-10
code-699 certs are all on Single meters -> STANDARD tariff, so this split
never applies to them); the win is on the unbiased /tmp population's
single worst cert.
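The rate change above is consistent with a simple high/low blend, sketched here. The blend formula is an assumption about how the split is applied, and the high rate is back-solved from the commit's own figures (7.50, 11.09, fraction 0.50) rather than read from Table 12.

```python
def blended_rate_p_per_kwh(high_rate: float, low_rate: float, high_rate_fraction: float) -> float:
    """Unit price when a fraction of use is billed at the high rate."""
    return high_rate_fraction * high_rate + (1.0 - high_rate_fraction) * low_rate


def implied_high_rate(blended: float, low_rate: float, high_rate_fraction: float) -> float:
    """Invert the blend to recover the high rate (fraction must be non-zero)."""
    return (blended - (1.0 - high_rate_fraction) * low_rate) / high_rate_fraction


# Figures from the commit: low rate 7.50, blended 11.09, fraction 0.50.
high = implied_high_rate(11.09, 7.50, 0.50)
before = blended_rate_p_per_kwh(high, 7.50, 0.0)   # 100% low rate, the old path
after = blended_rate_p_per_kwh(high, 7.50, 0.50)   # Table 12a 10-hour split
```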
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
"One scorer, two harnesses" (ADR-0030): the committed gate, the local script,
and the future battle-test must run the *same* scoring. Extract it:
- domain/epc_prediction/validation.py — `iter_predictions` (the single
leave-one-out orchestration: latest-per-address hold-out, SAP-10.2 target
filter, all-vintage source) + `evaluate_component_accuracy` (calculator-free
ComponentAccuracy aggregation, the primary signal). Unit-tested.
- harness/epc_prediction_corpus.py — `load_corpus(dir)` IO: corpus dir ->
Comparable cohorts (maps payloads, carries address + registration_date).
validate_epc_prediction.py now just loads + calls the scorer for the component
section and iterates iter_predictions for the calculator-floored end-to-end.
Identical numbers (181 targets, SAP MAE 6.34) — behaviour-preserving.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Make the leave-one-out runner ADR-0030-compliant:
- Hold out only SAP 10.2 targets (sap_version == 10.2) — the source cohort
keeps every vintage (components are methodology-agnostic).
- Label Component Accuracy as the PRIMARY, calculator-independent section.
- End-to-end vs API-lodged (SECONDARY, calculator-FLOORED): add CO2 (tonnes)
and PEI (kWh/m2) alongside SAP, using the canonical performance.py mapping
(co2_kg/1000; primary_energy_kwh_per_m2).
- Add the attribution readout calc(actual) vs lodged SAP — the calculator
floor the end-to-end can reach.
- Drop the neighbour-mean-of-lodged-SAP baseline (mixes SAP versions —
rejected by ADR-0030).
On the 181 SAP-10.2 targets: component rates are higher than the all-vintage
view (age band 60.9 -> 78.5%, floor_area mean|.| 12.7 -> 8.4). End-to-end SAP
MAE 6.34 vs the calc(actual) floor of 3.25 — ~half the gap is the known
API-path calculator residual, not prediction error.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Heating is the dominant SAP lever (ablating it to actual cut the SAP error
~7 -> ~4.5) yet was entirely unscored. Add the heating group to
compare_prediction's categorical_hits: main fuel / category / control (off
the primary MainHeatingDetail), water-heating fuel / code, has-cylinder,
cylinder insulation, secondary heating (off SapHeating).
Template-copied baseline on the 40-postcode corpus (no predictor change
yet — this just makes the signal visible):
heating_main_fuel 93.4%
heating_main_category 92.7%
water_heating_fuel/code 91.7% / 92.4%
heating_main_control 62.1% <- weak
has_hot_water_cylinder 78.5%
cylinder_insulation_type 35.8% (n=120) <- weak
secondary_heating_type 16.8% (n=125) <- weak
Fuel/category predict well from the template; controls, cylinder, and
secondary heating are poor and now drive the next predictor slices.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
ADR-0030 commits Component Accuracy to ~19 categorical components (5 today
+ 8 heating + glazing/renewables). Flat *_correct dataclass fields don't
scale — each needs manual runner wiring. Collapse them into a single
`categorical_hits: dict[str, Optional[bool]]` keyed by component name, which
also matches the runner's name-keyed aggregation (now generic: it tallies
whatever components the comparison reports). No behaviour change; the
classification rates are identical (wall n 578->575 is the 3 certs whose
actual wall is None, now correctly counted as not-applicable via _classify).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
SAP 10.2 Appendix M1 (PDF p.94): "EPV,ex,m = 0 if the PV system is not
connected to an export-capable meter." The cascade computed the β-split
export stream regardless of `is_dwelling_export_capable`, so a non-export-
capable dwelling was credited the full PV export. In the §10a COST that credit
is priced at the Table 32 import rate (13.19 p/kWh), which dominates the rating.
On 7 Wybourn Terrace S2 5BJ the PE (144 vs lodged 151) and CO2 (27 vs 29)
already matched, yet the phantom export cost credit pulled SAP from ~73 to
92.1 (+19). Zero `epv_exported_monthly_kwh` after the Appendix-G4 diverter
adjustment when not export-capable; the onsite (EPV,dw) consumption and the
diverter HW reduction are unchanged.
Not-export-capable PV cohort (corpus, 4 certs): 7 Wybourn +19.1 -> +6.5,
4 Lime Ave +11.1 -> +0.4, 8 Hatherleigh +7.6 -> -0.2, Flat 5 ~-0.4. Gauge
66.1% -> 66.9%, MAE 1.124 -> 1.039. Floor 0.64 -> 0.65 / ceiling 1.18 -> 1.08.
Worksheet harness 47/47 0 diverge (Summary certs carry export-capable meters).
1 AAA test, pyright net-zero. Found by auditing the worst over-rater without a
worksheet: PE/CO2-match + cost-miss localised it to the PV export credit.
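The Appendix M1 rule reduces to a guard on the monthly export stream. A minimal sketch, assuming the β-split export has already been computed as a list of monthly kWh (the function name and signature are illustrative):

```python
from typing import List


def exported_pv_monthly_kwh(
    beta_split_export_kwh: List[float],
    is_dwelling_export_capable: bool,
) -> List[float]:
    """Zero the export stream when the meter is not export-capable."""
    if not is_dwelling_export_capable:
        return [0.0] * len(beta_split_export_kwh)
    return list(beta_split_export_kwh)
```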
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Records the grilling-session decisions amending ADR-0029's validation:
- Source cohort keeps all cert vintages (components are agnostic of the SAP
methodology that rated them); only the held-out validation TARGET is
restricted to SAP 10.2. Amends ADR-0029 decision 5 ("pre-SAP10 dropped").
- Component Accuracy (predicted vs API actual components) is the primary,
calculator-independent signal. calc(predicted) vs calc(actual) rejected
(circular ground truth, hides calculator error); neighbour-mean-lodged-SAP
baseline rejected (mixes SAP versions). calc(predicted) vs API-lodged
SAP/carbon/PE kept as a secondary, calculator-floored guard.
- Two tiers: committed anonymised fixture (ratcheting CI gate) + bulk-export
national battle-test on harness/epc_bulk.py + harness/cohort.py, emitting
accuracy + a failure taxonomy, re-baselining the gate floors.
CONTEXT.md: Comparable Properties corrected to all-vintage source; new
Component Accuracy term. ADR-0029 Validation section marked superseded.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
RdSAP 10 §10.5 (PDF p.55): "If the actual size is not determined, the size of
a hot-water cylinder is taken as according to Table 28." When a cylinder is
present (has_hot_water_cylinder) but no size descriptor resolves (the gov API
lodges cylinder_size=0, or Exact with no measured volume),
`_hot_water_cylinder_volume_l` returned None, silently dropping BOTH the cylinder's
storage loss and the Table 13 electric-DHW high-rate fraction, under-costing
and over-rating the dwelling. Default such cylinders to the Table 28 baseline
"Normal" 110 L (the value §10.7 also instantiates as the first-row default).
The context-dependent Inaccessible 210/160 values are deliberately NOT applied
here — they are tied to the explicit "Inaccessible" descriptor (code 5) the
assessor lodges, not to an unpopulated size field.
Scope: 7 of 301 cylinder certs in the corpus (2%). Correctness fix — closes a
real spec gap; marginal on the headline (within-0.5 66.1% unchanged, MAE
1.128 -> 1.124) because these certs' residual is dominated by a separate HW-
demand gap, not the cylinder. Worksheet harness 47/47 0 diverge (Summary certs
lodge a real size, so the fallback never fires). 1 AAA test, pyright net-zero.
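The fallback can be sketched as below. This is not `_hot_water_cylinder_volume_l` itself: the simplified signature, which takes an already-resolved volume, is an assumption for illustration.

```python
from typing import Optional

TABLE_28_NORMAL_LITRES = 110.0  # Table 28 "Normal" baseline named in the commit


def cylinder_volume_l(
    has_hot_water_cylinder: bool,
    resolved_volume_l: Optional[float],
) -> Optional[float]:
    """Default an unsized cylinder to the Table 28 Normal volume."""
    if not has_hot_water_cylinder:
        return None  # no cylinder, so no storage loss to model
    if resolved_volume_l is None:
        return TABLE_28_NORMAL_LITRES
    return resolved_volume_l
```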
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
SAP 10.2 Table 12a Grid 1 (PDF p.191): electric storage heater SAP code 408
is an "Integrated (storage + direct-acting) system" with a 0.20 space-heating
high-rate fraction on a 7-hour tariff — NOT the 0.00 of "other storage
heaters". `_table_12a_system_for_main` returned None for all storage codes (an
explicit TODO), so code 408 fell back to the 100%-low-rate path and billed
space heating at the bare 7-hour low rate (5.50 p/kWh) — under-costing →
over-rating. Mapped cat-7 storage: 408 -> INTEGRATED_STORAGE_DIRECT (0.20),
others -> OTHER_STORAGE_HEATERS (0.00, unchanged behaviour). The enum +
fraction rows already existed; this only wires the dispatch, so the split
flows self-consistently to both the §10a cost and the Appendix-M1 D_PV
high-rate fraction.
Corpus: sap408 over-raters +14.6/+12.9/+12.7 -> +7.1/+5.1/+3.4 (two crossed
into within-0.5). Gauge 65.9% -> 66.1%, MAE 1.160 -> 1.128. Floor 0.63 -> 0.64
/ MAE ceiling 1.22 -> 1.18. Worksheet harness 47/47 0 diverge. The residual
+3..+7 is the "all other uses" 0.90 high-rate fraction (lighting/pumps/HW
still billed 100%-low on the off-peak legacy path) — the next slice.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
SAP 10.2 Table 4c(3) (PDF p.169) "Factor for controls and charging method"
multiplies a heat network's heat requirement by 1.05-1.10 for FLAT-RATE
charging (note d: household pays a fixed amount regardless of heat used, so
no incentive to economise), and by 1.0 for charging linked to use. The
worksheet folds it into the heat-network requirement alongside the Table 12c
distribution loss factor:
(307) space = (98c) x (302) x (305) x (306)
(310) DHW = (64) x (305a) x (306)
Our cascade applied (306) DLF but never (305)/(305a), so every flat-rate
community-heating cert under-counted demand -> over-rated SAP.
Folded the factor into the 1/DLF efficiency override at the space-heating
(206) and DHW (water-inherits-from-main) sites. Space column adds +0.05 for
no thermostatic control (2301/2302); DHW column is 1.05 flat-rate / 1.0
linked-to-use.
Corpus (RdSAP-21.0.1, 1000 certs): community cluster median +0.32 -> -0.19,
within-0.5 38% -> 62% (control 2307 +0.83 -> -0.19; 2306 unchanged at factor
1.0 as spec requires). Overall gauge 65.0% -> 65.9%, MAE 1.174 -> 1.160.
Ratcheted the corpus-test floor 0.62 -> 0.63 / MAE ceiling 1.25 -> 1.22.
Also records (corpus-test comment + scripts/decompose_co2_pe_error.py) the
disproof of the prior "CO2/PE +5% is a factor/scope bug" lead: factors are
spec-exact, scope identical, and the bias is per-cert demand fidelity
(corr(SAP-err, PE-diff) = -0.54), not a one-slice factor fix.
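Folding the charging factor into the 1/DLF efficiency override can be sketched as follows, so that demand divided by the override equals demand x factor x DLF. The function name is hypothetical and the DLF of 1.5 in the usage lines is an illustrative value, not a Table 12c figure.

```python
def heat_network_efficiency_override(
    distribution_loss_factor: float,
    charging_factor: float = 1.0,
) -> float:
    """Efficiency such that demand / efficiency = demand * factor * DLF."""
    return 1.0 / (charging_factor * distribution_loss_factor)


# Illustrative values: 100 kWh demand, DLF 1.5, flat-rate factor 1.05.
flat_rate_requirement = 100.0 / heat_network_efficiency_override(1.5, 1.05)
linked_requirement = 100.0 / heat_network_efficiency_override(1.5)
```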
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Bump N_POSTCODES 3 -> 40 as the gradual-growth step from the 3-postcode
smoke. 40 postcodes / 1113 certs / 578 leave-one-out predictions is enough
for stable, trustworthy metrics (the smoke's 2 usable postcodes were
dominated by oddball flats — floor_area mean|.| 52.6 there vs 12.7 here).
Resumable + reproducible (random.seed(2026)); raise again to scale up.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The register lists every historical lodgement, so a postcode cohort
contains the same physical address many times (LS61AA: 15 certs / 11
addresses; NG71AA: 15 / 9 — "FLAT 3" appears 3x in each). Two
consequences:
- Production: a re-lodged neighbour was counting up to 3x towards the
cohort mode. select_comparables now dedupes candidates to the latest
cert per address (one comparable per real neighbour) — Comparable
gains address + registration_date (the register metadata its docstring
already anticipated, read straight off the cached payload).
- Validation: leave-one-out leaked — predicting a flat from a near-
identical re-lodgement of itself. The harness now holds out a whole
address (excludes every sibling cert) and evaluates on the latest cert
per address (the best ground truth).
Removing the leak gives the honest numbers (19 distinct addresses):
wall_construction 93.1% -> 89.5%
construction_age_band 65.5% -> 52.6%
roof_construction 79.3% -> 68.4%
floor_area mean|.| 37.9 -> 52.6 m2
The earlier figures were inflated by self-leakage; these are the real
accuracy to beat.
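Both fixes rest on the same latest-cert-per-address reduction. A minimal sketch, assuming a small `Cert` record (the dataclass and both function names are illustrative, not the `select_comparables` code):

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Cert:
    address: str
    registration_date: date
    cert_id: str


def latest_per_address(certs: Iterable[Cert]) -> List[Cert]:
    """Keep one cert per address: the most recently registered."""
    latest: Dict[str, Cert] = {}
    for cert in certs:
        kept = latest.get(cert.address)
        if kept is None or cert.registration_date > kept.registration_date:
            latest[cert.address] = cert
    return list(latest.values())


def leave_one_address_out(certs: Iterable[Cert], target_address: str) -> List[Cert]:
    """Source cohort for a target: every sibling cert of its address is excluded."""
    return [c for c in latest_per_address(certs) if c.address != target_address]
```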
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Only main wall_construction was set to the cohort mode; the other
homogeneous categoricals (wall insulation, construction age band, roof
construction, floor construction) were left as template-copied, so one
median-size template's quirks set them. Apply the same cohort-mode
mechanism to all of them per ADR-0029 decision 4 — the template still
supplies geometry, only the categorical codes move to the mode.
Verified mode beats (or ties) template-copy per categorical before
applying. Smoke corpus (29 leave-one-out) classification rates:
construction_age_band 55.2% -> 65.5%
roof_construction 72.4% -> 79.3%
floor_construction 46.2% -> 84.6%
wall_insulation_type 93.1% (tie — already template-strong)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The comparison only scored main wall_construction; everything else the
predictor produces (by template-copy) went unmeasured. Extend
compare_prediction to the rest of the ADR-0029 homogeneous categoricals —
wall insulation type, construction age band, roof construction, floor
construction — and aggregate per-categorical classification rates in the
runner. A categorical hit is "not applicable" (None, excluded from the
denominator) when the actual lodges no value, so absent-roof flats don't
score free wins.
Smoke corpus (29 leave-one-out, all but wall are template-copied today):
wall_construction 93.1%
wall_insulation_type 93.1%
construction_age_band 55.2% <- loud; candidate for cohort-mode
roof_construction 72.4%
floor_construction 46.2% (n=13)
These numbers drive the next slice (extend cohort-mode coverage).
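The not-applicable rule and the per-categorical rate can be sketched as below. The names `classify` and `classification_rate` are illustrative; only the behaviour (None excluded from the denominator) comes from the commit.

```python
from typing import Hashable, Optional, Sequence


def classify(predicted: Optional[Hashable], actual: Optional[Hashable]) -> Optional[bool]:
    """Hit, miss, or None when the actual lodges no value."""
    if actual is None:
        return None
    return predicted == actual


def classification_rate(hits: Sequence[Optional[bool]]) -> Optional[float]:
    """Share of hits among applicable comparisons only."""
    scored = [hit for hit in hits if hit is not None]
    if not scored:
        return None
    return sum(scored) / len(scored)
```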
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Template (the comparable whose structure/geometry is copied wholesale)
was members[0] — an arbitrary draw from the API search order. With floor
area varying widely within a property_type cohort (NG71AA houses span
51-340 m2), this made the copied geometry noisy and systematically large.
Pick the member whose floor area is closest to the cohort median instead,
implementing ADR-0029 decision 4's unimplemented "closest on size"
criterion while keeping the structure coherent (it is still one real
property, so floor dims / windows / parts stay internally consistent for
the calculator).
Smoke corpus (29 leave-one-out predictions):
floor_area mean|.| 68.0 -> 37.9 m2 (bias +46.8 -> -3.9)
window_area mean|.| 11.1 -> 7.3 m2
parts mean|.| 1.00 -> 0.38
SAP |pred-calc - calc(actual)| MAE 7.19 -> 4.86
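Choosing the template by closeness to the cohort median is a one-liner over the members' floor areas. A sketch, with a hypothetical function name and first-wins tie-breaking as an assumption:

```python
from statistics import median
from typing import Sequence


def pick_template_index(floor_areas_m2: Sequence[float]) -> int:
    """Index of the member whose floor area is closest to the cohort median."""
    centre = median(floor_areas_m2)
    return min(range(len(floor_areas_m2)), key=lambda i: abs(floor_areas_m2[i] - centre))
```

For areas of 340, 51, 90, 120 and 75 m2 this picks the 90 m2 member, whereas members[0] would have copied the 340 m2 outlier.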
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Pure compare_prediction (TDD): wall-construction classification hit + signed
residuals on floor area, window count, total window area, building-parts count.
Plus validate_epc_prediction.py (IO plumbing): drops each cert from its postcode
cohort, predicts from the rest on guaranteed inputs only, aggregates the metrics,
and reports SAP three ways (pred-calc vs lodged / vs calc-on-actual / vs the
neighbour-mean baseline). Smoke run: wall 90.9%, floor-area mean|·| 42.6 m2 (a
real signal — template-copied floor area is noisy), SAP pred-calc edges baseline.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
predict() copies a representative template comparable's structure (coherent for
the calculator), overrides the homogeneous categorical with the cohort mode
(robust to an atypical template), then applies known Landlord Overrides on top
(a known value wins over the estimate). Proven on wall construction; roof/floor/
insulation/age extend on the same mode+override mechanism, driven next by the
validation harness metrics.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Pure-domain select_comparables: property type is an always-hard filter; built
form and known Landlord Overrides (e.g. solid brick) are conditioning filters on
the filter-then-relax ladder — applied while >= minimum_cohort survive, relaxed
otherwise (the mixed-street border case degrades gracefully). PredictionTarget
(known inputs) + Comparable (epc + register metadata) + ComparableProperties
(selected cohort). Weighting (recency x similarity) follows in the synthesis slice.
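The filter-then-relax ladder can be sketched generically. This is an illustration, not the pure-domain `select_comparables`: the predicate-based signature and dict candidates are assumptions.

```python
from typing import Callable, Dict, List, Sequence

Candidate = Dict[str, str]
Predicate = Callable[[Candidate], bool]


def select_with_ladder(
    candidates: Sequence[Candidate],
    hard_filter: Predicate,
    conditioning_filters: Sequence[Predicate],
    minimum_cohort: int,
) -> List[Candidate]:
    """Apply the hard filter always; keep each conditioning filter only
    while at least minimum_cohort candidates survive it."""
    survivors = [c for c in candidates if hard_filter(c)]
    for keep in conditioning_filters:
        narrowed = [c for c in survivors if keep(c)]
        if len(narrowed) >= minimum_cohort:
            survivors = narrowed
        # Otherwise the filter is relaxed and the wider cohort is kept.
    return survivors
```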
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Adds a committed integration test driving the full API path — raw gov-EPC
response → from_api_response → cert_to_inputs → calculate_sap_from_inputs —
across all 1000 certs in the in-repo RdSAP-21.0.1 corpus, and pins the
aggregate accuracy of our continuous SAP (plus CO2 and primary energy)
against each cert's lodged figures. Mirrors scripts/eval_api_sap_accuracy.py
but runs in CI off the committed corpus (~2s, no /tmp sample needed).
Scoped to RdSAP-21.0.1 — the SAP 10.2-era schema whose lodged rating uses the
same methodology we compute (a fair target). Pre-SAP10 schemas (17.x-20.0.0)
lodge SAP 2012 ratings and are out of scope (guarded for mapping only by
test_mapper_corpus.py).
Current: SAP within-0.5 = 65.0%, MAE = 1.174 (tight floor/ceiling — the
optimised gauge). CO2 MAE = 0.27 t/yr (bias +0.17) and PE MAE = 14.6
kWh/m2/yr (bias +8.9) are reported + loosely guarded: cost is well-calibrated
but CO2/PE both run ~+5-10% high (uniform across fuels — a systematic
CO2/PE-factor or scope gap, not yet investigated). Thresholds ratchet as
slices tighten each metric.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Builds the frozen validation corpus: samples postcodes from the register, then
caches each postcode's full cohort of raw cert payloads (the shape
from_api_response consumes), grouped by postcode, resumably. Reads the token
from backend/.env; cache dir /tmp/epc_prediction_corpus (EPC_PREDICTION_CORPUS
override). IO plumbing, not test-driven. Pairs with the leave-one-out harness.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The mixer-shower hot-water demand (worksheet 42a) divided N_shower by the
count of MIXER outlets only. But SAP 10.2 Appendix J step 1a is explicit:
"Establish how many shower outlets are present in the dwelling, Noutlets
(including in the count any instantaneous electric showers)" — and the
electric-shower step (64a) uses that same Noutlets from step 1a. So a
dwelling with both a mixer and an electric shower assigned the FULL N_shower
to the mixer system AND billed the electric shower on top of it, double-
counting shower demand → over-counted main HW → under-rated the dwelling.
Fix: thread the electric-shower count into the mixer demand so the
denominator is the total outlet count (mixer + electric), iterating the
warm-water draw over the mixer outlets only (per step 1e).
shower_types=1,2 cohort: -0.37 median -> +0.28 (crossed zero); API gauge
68.4% -> 69.0% within-0.5. Golden cert 0300-2747 (1 mixer + 1 electric)
re-pinned: PE +0.93 -> -0.10, CO2 +0.25 -> +0.15 (both toward zero,
confirming the double-count). Worksheet harness 47/47, 0 divergers (the
Elmhurst fixtures have no electric showers).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The no-PCDB MEV fan-electricity path fed the SAP 10.2 Table 4g default SFP
(0.8 W/(l/s)) directly as SFPav. But Table 4g note 3 (PDF p.176) is explicit:
the default SFP values "are to be multiplied by the appropriate in-use factor
for default data from the PCDB" — PCDB Table 329 system_type 10 ("default
data, used when SFP is taken from Table 4g rather than the PCDB"), IUF 2.5
(duct-agnostic per note 2). Table 4h, which previously held these factors, is
retired ("no longer used – data now stored in the PCDB").
Omitting the IUF under-billed the index-less MEV fan electricity by 2.5x
(SFPav 0.8 instead of 0.8 x 2.5 = 2.0), so cost was too low and the cohort
over-rated. This is distinct from the with-index path, which already applies
the tested-product system_type-2 "no scheme" IUF (~1.45) per fan.
Index-less gas-house MEV cohort: +1.37 median -> -0.18 (12% -> 92% within 0.5),
no overshoot — the missing IUF was exactly the over-rate. API gauge 67.7% ->
68.4% within-0.5 (mean|err| 0.992 -> 0.986, signed +0.031 -> +0.006).
Worksheet harness 47/47, 0 divergers (Summary-path MEV certs carry a PCDB
index or are natural, so unaffected).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>