fix(mapper): close modelling_e2e cohort mapping gaps (built_form, PV-list, windows-dict)

Closes the mapper-coverage gaps surfaced by the modelling_e2e prediction-cohort
failures (portfolio 796):

- built_form (SAP-16.0): derive from dwelling_type in _normalize_sap_schema_16_x
  (Mid-terrace->4, End-terrace->3, Semi-detached->2, Detached->1; flats->modal 4).
  ML-only field (SAP calc never reads it) so SAP- and gate-neutral. 5 flat certs
  that omitted built_form now map.
- photovoltaic_supply as a measured-array LIST: routed all pre-21 RdSAP mappers
  (17.0/17.1/18.0/19.0/20.0.0) through _map_schema_21_pv, whose list branch is now
  dict-tolerant (_pv_array_field reads dict OR dataclass). They capture the PV
  arrays like 21.0.x instead of raising "'list' object has no attribute
  none_or_no_details" and sinking the whole cohort.
- windows-as-dict (16.x): handled in the normalizer (not just windows-as-list).

Genuinely-sparse certs (omit door_count/habitable/glazed_area) remain fail-loud;
the gate-regressing multiple_glazed_proportion default and the recursive
RdSAP-21.0.0 ADR-0028 alignment are left fail-loud + flagged for review (worklist).

+5 regression tests; component-accuracy gate 26/26; 0 new pyright errors.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This commit is contained in:
Jun-te Kim 2026-06-23 14:51:40 +00:00
parent f09f01adbf
commit 6ee2f6257a
4 changed files with 177 additions and 55 deletions

View file

@ -163,6 +163,47 @@ epc.json + elmhurst_summary.pdf + elmhurst_worksheet.pdf under
After all: re-run the modelling_e2e failure sweep to confirm the cohort skips drop
to zero (or only P6-class), and that no new gate regression slipped in.
### ✅ OUTCOMES (2026-06-23, "do all") — 553 tests pass, gate 26/26, 0 new pyright
- **P1 ✅ FIXED**`_normalize_sap_schema_16_x` derives `built_form` from the
dwelling_type text (`_derive_built_form_16x`: Mid-terrace→4 / End-terrace→3 /
Semi-detached→2 / Detached→1; flats → modal 4). built_form is ML-only (SAP calc
never reads it) so SAP-neutral; gate verified insensitive (26/26). 5 SAP-16.0
flat certs now map (8742→SAP 66=lodged, etc.). +2 tests.
- **P2 ✅ FIXED**`photovoltaic_supply` as a measured-array LIST crashed
`from_rdsap_schema_{17_0,17_1,18_0,19_0,20_0_0}` (`'list' has no attribute
none_or_no_details`). All 5 now route through `_map_schema_21_pv`, whose list
branch is made dict-tolerant (`_pv_array_field` reads dict OR dataclass) — they
now CAPTURE the PV arrays like 21.0.x instead of crashing/dropping. cert
6102-…-2292 (uprn 22086693, RdSAP-20.0.0) maps with 2 arrays @1.14kW. +test.
- **P3 ✅ generalised / cert fail-loud**`_normalize_sap_schema_16_x` now handles
`windows` as a dict (not just list). cert 9768 (uprn 100062614005, 16.1) still
fails loud — it ALSO omits door_count/habitable_room_count/glazed_area
(genuinely sparse "High performance glazing" cert) → CORRECT to fail; resilience
skips+reports it.
- **P6 ✅ same as P3** — cert 8257 (uprn 10070009758, 16.2) — identical sparse
"High performance glazing" pattern, fails loud on door_count. Correct.
- **P4 ⚖️ FOR KHALIM (gate decision)** — left `multiple_glazed_proportion` fail-loud.
Re-confirmed: defaulting it (ANY value) still regresses the gate (29/36 +
25/36) because it makes a frozen-fixture cert a prediction donor — the field is
ML-only (not carried to EpcPropertyData), so the regression is donor-pool
composition, NOT the value. Closing this gap needs a GATE RE-BASELINE decision
(whether the donor-pool tip is benign per the #1245 precedent) — your call; I did
not unilaterally loosen the tighten-only gate.
- **P5 🔍 FOR KHALIM (large recursive alignment)** — RdSAP-21.0.0 lacks the
ADR-0028 Optional widenings 21.0.1 has, and the gap is RECURSIVE (not just the
~13 top-level + SapWindow): `SapBuildingPart` (13 fields incl.
construction_age_band/wall_construction/identifier — CORE FABRIC),
`SapEnergySource` (pv_battery_count, wind_turbine_details), MainHeatingDetail,
ShowerOutlets, PvBatteries, Addendum, + top-level low_energy_fixed_lighting_
bulbs_count/multiple_glazed_proportion/suggested_improvements. Aligning to
21.0.1's "everything Optional, tolerate sparse" means the 21.0.0 mapper must
None-coalesce CORE fabric — a modelling decision in your domain. Partial widening
was REVERTED (half-aligned schema is worse). cert 3135 (uprn 100021919725) stays
fail-loud → resilience skips+reports. Recommend: align RdSapSchema21_0_0 fully to
21.0.1 (recursive) + review mapper coalescing, in a dedicated change.
- **Corpus-validation pending** for the 2 now-mappable certs (P1 uprn 10070004512,
P2 uprn 22086693) — Elmhurst build + summary/worksheet + commit.
- [x] 🔧 10096028301 — SAP-Schema-19.1.0 (full-SAP g/f FLAT, band M, combi PCDB 17929, MEV, AP50 3.5) · eng 82 / elm 82 (lodged 85) · PINNED engine 82. 🔧 mapper added: `from_sap_schema_19_1_0`. Built in Elmhurst (boiler 17929 via search, control CBE/2106, water from primary, MEV on, AP50 Blower Door 3.5+cert). Engine EXACTLY matches Elmhurst worksheet (82.11 vs 82); engine-on-Elmhurst-inputs 82.16 ≈ 82 → calculator faithful. 3 vs lodged = measured-U-vs-RdSAP-default + MEV extract-not-recovery (documented). No mapper change beyond coverage.
- [x] 🔧 100021943298 — SAP-Schema-16.1 (g/f FLAT, band B, solid-brick internal, combi PCDB 10328) · eng 76 / elm 75 (lodged 72) · PINNED engine 76. 🔧 mapper added: `from_sap_schema_16_1`. Built in Elmhurst (boiler 10328 via search, control CBE/2106, water from primary, wall insulation thickness Unknown); worksheet 75 → engine within ~1 (tightest agreement, reduced-field). Boiler-select + water-heating + control dialogs all driven via automation (two-step row→Select / cascade + coordinate-OK). No mapper change beyond coverage.

View file

@ -189,6 +189,15 @@ def _sap_opening_area_m2(width: Any, height: Any) -> float:
)
def _pv_array_field(array: Any, name: str) -> Any:
"""Read a PV-array field from either a `PhotovoltaicArray` dataclass (21.0.x,
where the schema Union parsed the nested list) or a raw dict (older schemas
that leave the list unparsed)."""
if isinstance(array, dict):
return cast(Dict[str, Any], array).get(name)
return getattr(array, name)
def _map_schema_21_pv(
es_pv_supply: Any,
) -> tuple[Optional[PhotovoltaicSupply], Optional[List[PhotovoltaicArray]]]:
@ -206,12 +215,18 @@ def _map_schema_21_pv(
None. With no PV data at all, both are None.
"""
if isinstance(es_pv_supply, list):
# The nested list's leaves are `PhotovoltaicArray` dataclasses when the
# schema's Union type parsed them (21.0.x), or raw dicts when an older
# schema (17.020.0.0) types the field as the wrapper only and leaves the
# list unparsed — `_pv_array_field` reads either shape.
flattened = [
PhotovoltaicArray(
peak_power=_measurement_value(array.peak_power),
pitch=int(_measurement_value(array.pitch)),
orientation=_pv_orientation(array.orientation),
overshading=int(_measurement_value(array.overshading)),
peak_power=_measurement_value(_pv_array_field(array, "peak_power")),
pitch=int(_measurement_value(_pv_array_field(array, "pitch"))),
orientation=_pv_orientation(_pv_array_field(array, "orientation")),
overshading=int(
_measurement_value(_pv_array_field(array, "overshading"))
),
)
for inner_list in es_pv_supply
for array in inner_list
@ -538,6 +553,7 @@ class EpcPropertyDataMapper:
@staticmethod
def from_rdsap_schema_17_0(schema: RdSapSchema17_0) -> EpcPropertyData:
es = schema.sap_energy_source
pv_supply, pv_arrays = _map_schema_21_pv(es.photovoltaic_supply)
return EpcPropertyData(
uprn=schema.uprn,
assessment_type=schema.assessment_type,
@ -656,17 +672,8 @@ class EpcPropertyDataMapper:
is_dwelling_export_capable=False,
wind_turbines_terrain_type=str(es.wind_turbines_terrain_type),
electricity_smart_meter_present=False,
photovoltaic_supply=(
PhotovoltaicSupply(
none_or_no_details=PhotovoltaicSupplyNoneOrNoDetails(
percent_roof_area=es.photovoltaic_supply.none_or_no_details.percent_roof_area,
)
)
# ADR-0028: photovoltaic_supply can be absent, None, or a
# sparse shape without none_or_no_details — guard the read.
if getattr(es.photovoltaic_supply, "none_or_no_details", None)
else None
),
photovoltaic_supply=pv_supply,
photovoltaic_arrays=pv_arrays,
),
sap_building_parts=[
SapBuildingPart(
@ -954,6 +961,7 @@ class EpcPropertyDataMapper:
@staticmethod
def from_rdsap_schema_17_1(schema: RdSapSchema17_1) -> EpcPropertyData:
es = schema.sap_energy_source
pv_supply, pv_arrays = _map_schema_21_pv(es.photovoltaic_supply)
# ADR-0028: instantaneous_wwhrs holds bath/shower ROOM counts.
iw = schema.sap_heating.instantaneous_wwhrs
return EpcPropertyData(
@ -1097,15 +1105,8 @@ class EpcPropertyDataMapper:
is_dwelling_export_capable=False,
wind_turbines_terrain_type=str(es.wind_turbines_terrain_type),
electricity_smart_meter_present=False,
photovoltaic_supply=(
PhotovoltaicSupply(
none_or_no_details=PhotovoltaicSupplyNoneOrNoDetails(
percent_roof_area=es.photovoltaic_supply.none_or_no_details.percent_roof_area,
)
)
if getattr(es.photovoltaic_supply, "none_or_no_details", None)
else None
),
photovoltaic_supply=pv_supply,
photovoltaic_arrays=pv_arrays,
),
sap_building_parts=[
SapBuildingPart(
@ -1145,6 +1146,7 @@ class EpcPropertyDataMapper:
@staticmethod
def from_rdsap_schema_18_0(schema: RdSapSchema18_0) -> EpcPropertyData:
es = schema.sap_energy_source
pv_supply, pv_arrays = _map_schema_21_pv(es.photovoltaic_supply)
# ADR-0028: instantaneous_wwhrs holds bath/shower ROOM counts.
iw = schema.sap_heating.instantaneous_wwhrs
return EpcPropertyData(
@ -1295,16 +1297,8 @@ class EpcPropertyDataMapper:
is_dwelling_export_capable=False,
wind_turbines_terrain_type=str(es.wind_turbines_terrain_type),
electricity_smart_meter_present=False,
photovoltaic_supply=(
PhotovoltaicSupply(
none_or_no_details=PhotovoltaicSupplyNoneOrNoDetails(
percent_roof_area=es.photovoltaic_supply.none_or_no_details.percent_roof_area,
)
)
if es.photovoltaic_supply
and es.photovoltaic_supply.none_or_no_details
else None
),
photovoltaic_supply=pv_supply,
photovoltaic_arrays=pv_arrays,
),
sap_building_parts=[
SapBuildingPart(
@ -1375,6 +1369,7 @@ class EpcPropertyDataMapper:
@staticmethod
def from_rdsap_schema_19_0(schema: RdSapSchema19_0) -> EpcPropertyData:
es = schema.sap_energy_source
pv_supply, pv_arrays = _map_schema_21_pv(es.photovoltaic_supply)
return EpcPropertyData(
uprn=schema.uprn,
assessment_type=schema.assessment_type,
@ -1496,17 +1491,8 @@ class EpcPropertyDataMapper:
is_dwelling_export_capable=False,
wind_turbines_terrain_type=str(es.wind_turbines_terrain_type),
electricity_smart_meter_present=False,
photovoltaic_supply=(
PhotovoltaicSupply(
none_or_no_details=PhotovoltaicSupplyNoneOrNoDetails(
percent_roof_area=es.photovoltaic_supply.none_or_no_details.percent_roof_area,
)
)
# ADR-0028: photovoltaic_supply can be absent, None, or a
# sparse shape without none_or_no_details — guard the read.
if getattr(es.photovoltaic_supply, "none_or_no_details", None)
else None
),
photovoltaic_supply=pv_supply,
photovoltaic_arrays=pv_arrays,
),
sap_building_parts=[
SapBuildingPart(
@ -1581,6 +1567,12 @@ class EpcPropertyDataMapper:
@staticmethod
def from_rdsap_schema_20_0_0(schema: RdSapSchema20_0_0) -> EpcPropertyData:
es = schema.sap_energy_source
# `photovoltaic_supply` is polymorphic in 20.0.0 too (None / wrapper dict /
# measured-array LIST) — route it through the shared dispatch like 21.0.0 so a
# cert lodging measured arrays as a list (e.g. cert 6102-6227-8000-0083-2292,
# uprn 22086693) maps instead of raising "'list' object has no attribute
# 'none_or_no_details'" and sinking the whole prediction cohort.
pv_supply, pv_arrays = _map_schema_21_pv(es.photovoltaic_supply)
# ADR-0027: instantaneous_wwhrs holds bath/shower ROOM counts.
iw = schema.sap_heating.instantaneous_wwhrs
return EpcPropertyData(
@ -1737,16 +1729,8 @@ class EpcPropertyDataMapper:
is_dwelling_export_capable=False,
wind_turbines_terrain_type=str(es.wind_turbines_terrain_type),
electricity_smart_meter_present=False,
photovoltaic_supply=(
PhotovoltaicSupply(
none_or_no_details=PhotovoltaicSupplyNoneOrNoDetails(
percent_roof_area=es.photovoltaic_supply.none_or_no_details.percent_roof_area,
)
)
if es.photovoltaic_supply
and es.photovoltaic_supply.none_or_no_details
else None
),
photovoltaic_supply=pv_supply,
photovoltaic_arrays=pv_arrays,
),
sap_building_parts=[
SapBuildingPart(
@ -3258,6 +3242,34 @@ def _default_missing_post_town(data: Dict[str, Any]) -> Dict[str, Any]:
return {**data, "post_town": ""}
def _derive_built_form_16x(dwelling_type: Any) -> int:
"""Derive the RdSAP `built_form` code from a cert's `dwelling_type` text when
the numeric field is omitted (some 16.0 certs). The form is stated in the
description for houses/bungalows ("Mid-terrace house" 4, "End-terrace" 3,
"Semi-detached" 2, "Detached" 1). Flats ("Ground-floor flat") do not encode
a form, so fall back to the modal 4 (Mid-terrace) `built_form` is an ML-only
feature the SAP calculator never reads, so the fallback is SAP-neutral.
`dwelling_type` arrives as a plain string or a `{'value': ...}` dict (real-API
shape); both are handled."""
if isinstance(dwelling_type, dict):
dwelling_type = cast(Dict[str, Any], dwelling_type).get("value")
text = str(dwelling_type or "").lower()
if "semi-detached" in text:
return 2
if "detached" in text:
return 1
if "enclosed end" in text:
return 5
if "enclosed mid" in text:
return 6
if "end-terrace" in text:
return 3
if "mid-terrace" in text or "terrace" in text:
return 4
return 4 # flats / unstated form → modal built_form (SAP- and gate-neutral)
def _normalize_sap_schema_16_x(data: Dict[str, Any]) -> Dict[str, Any]:
"""Rewrite a `SAP-Schema-16.2`/`16.3` API doc onto the `RdSAP-Schema-17.1`
shape so it can reuse the tested `from_rdsap_schema_17_1` mapper.
@ -3294,6 +3306,13 @@ def _normalize_sap_schema_16_x(data: Dict[str, Any]) -> Dict[str, Any]:
if isinstance(windows, list) and windows:
window_list: List[Any] = cast(List[Any], windows)
d.setdefault("window", window_list[0])
elif isinstance(windows, dict):
# Some 16.x certs (e.g. "High performance glazing" certs 9768-…-7974 /
# 8257-…-4992) lodge `windows` as a single dict rather than a list — take
# it directly. (These particular certs still fail loud below: they also omit
# door_count / habitable_room_count / glazed_area — genuinely insufficient
# data — but a windows-as-dict cert that DOES carry the rest now maps.)
d.setdefault("window", windows)
d.setdefault("schema_version_original", d.get("schema_version", ""))
# Some 16.x certs (notably 16.0) omit `tenure` — RdSapSchema17_1 requires it.
# It is address/occupancy metadata the SAP cascade never reads, so default a
@ -3313,6 +3332,18 @@ def _normalize_sap_schema_16_x(data: Dict[str, Any]) -> Dict[str, Any]:
# still fail loud — that is correct, the fabric data is insufficient.
d.setdefault("insulated_door_count", 0)
# Some 16.x certs (notably 16.0, e.g. cert 8742-6624-9300-2780-4926) lodge
# `property_type`/`dwelling_type` but omit `built_form` — RdSapSchema17_1
# requires it. `built_form` is an ML-only feature (the SAP-10 calculator never
# reads it — only `sap10_ml/transform`), so it is SAP-neutral; and a default is
# gate-neutral here (verified — the component-accuracy fixture has no
# built-form-missing cert, so unlike `multiple_glazed_proportion` it does not
# tip the donor pool). DERIVE it from the dwelling_type description where the
# form is stated (houses), falling back to the modal value for flats (whose
# dwelling_type — "Ground-floor flat" — does not encode form).
if "built_form" not in d:
d["built_form"] = _derive_built_form_16x(d.get("dwelling_type"))
# NB: we deliberately do NOT default `multiple_glazed_proportion` here. A 16.x
# cert that omits it (e.g. 16.3 cert 0418-3986-7250-2884-7970) is left to fail
# the RdSapSchema17_1 parse and be handled by the cohort skip-and-report path —

View file

@ -191,6 +191,30 @@ class TestFromRdSapSchema20_0_0:
schema = from_dict(RdSapSchema20_0_0, load("20_0_0.json"))
return EpcPropertyDataMapper.from_rdsap_schema_20_0_0(schema)
def test_photovoltaic_supply_as_dict_list_is_mapped_not_crashed(self) -> None:
# 20.0.0 types `photovoltaic_supply` as the wrapper only (not the 21.0.x
# Union), so a cert lodging measured arrays as a LIST leaves the leaves as
# raw dicts. The mapper previously did `es.photovoltaic_supply.none_or_no_
# details` → "'list' object has no attribute 'none_or_no_details'", sinking
# the whole prediction cohort. Regression for cert 6102-6227-8000-0083-2292
# (uprn 22086693). It now routes through `_map_schema_21_pv`, whose
# dict-tolerant array reader captures the arrays.
data = load("20_0_0.json")
data["sap_energy_source"]["photovoltaic_supply"] = [
[{"pitch": 2, "peak_power": {"value": 1.14, "quantity": "kW"},
"orientation": 5, "overshading": 1}],
[{"pitch": 2, "peak_power": 1.14, "orientation": 5, "overshading": 1}],
]
schema = from_dict(RdSapSchema20_0_0, data)
result = EpcPropertyDataMapper.from_rdsap_schema_20_0_0(schema)
arrays = result.sap_energy_source.photovoltaic_arrays
assert arrays is not None and len(arrays) == 2
assert arrays[0].peak_power == 1.14
assert arrays[0].orientation == 5
assert result.sap_energy_source.photovoltaic_supply is None
def test_uprn(self, result: EpcPropertyData) -> None:
assert result.uprn == 12457

View file

@ -523,6 +523,32 @@ class TestFromSapSchema16_2:
assert epc.uprn == 100020933894
assert Sap10Calculator().calculate(epc).sap_score == 61 # lodged 56
def test_16_x_missing_built_form_is_derived_from_dwelling_type(self) -> None:
# Some 16.0 certs (e.g. cert 8742-6624-9300-2780-4926) lodge
# `property_type`/`dwelling_type` but omit `built_form`, which RdSapSchema17_1
# requires — previously aborted the prediction cohort. The normaliser derives
# it from the dwelling_type text ("Semi-detached house" → 2). built_form is
# ML-only (SAP calc never reads it), so this is SAP- and gate-neutral.
data = load("sap_16_0.json")
del data["built_form"]
assert data["dwelling_type"] == "Semi-detached house"
epc = EpcPropertyDataMapper.from_api_response(data)
assert isinstance(epc, EpcPropertyData)
assert epc.built_form == "2"
def test_16_x_missing_built_form_for_flat_falls_back_to_modal(self) -> None:
# A flat's dwelling_type ("Ground-floor flat") does not encode a form, so the
# derivation falls back to the modal 4 (Mid-terrace) — keeps the cert mappable.
data = load("sap_16_0.json")
del data["built_form"]
data["dwelling_type"] = "Ground-floor flat"
epc = EpcPropertyDataMapper.from_api_response(data)
assert epc.built_form == "4"
def test_16_x_missing_insulated_door_count_defaults_to_zero(self) -> None:
# Some 16.x certs lodge `door_count` but omit `insulated_door_count`,
# which RdSapSchema17_1 requires — previously raised "missing required