diff --git a/.claude/skills/expand-sap-accuracy-corpus/worklist.md b/.claude/skills/expand-sap-accuracy-corpus/worklist.md index 310582bc..caf8851d 100644 --- a/.claude/skills/expand-sap-accuracy-corpus/worklist.md +++ b/.claude/skills/expand-sap-accuracy-corpus/worklist.md @@ -114,7 +114,54 @@ not-predictable (empty/insufficient cohort) or already fixed by the and skipped rate-limited fetches, so a deeper gap could be missed. **Also done:** `deployment/terraform/lambda/modelling_e2e/variables.tf` -`maximum_concurrency` default 2 β†’ 4 (per request). +`maximum_concurrency` default 2 β†’ 4 (per request). Cohort skip-guard widened to +`(ValueError, AttributeError, KeyError, TypeError)` (mapper-shape errors) so a +malformed cert (e.g. PV-as-list AttributeError) is skipped+reported too, not just +missing-field ValueErrors; transient `EpcApiError` (subclasses `Exception`) still +propagates. +regression test. + +### πŸ“‹ PLAN β€” close the 8 modelling_e2e mapping gaps (2026-06-23 run, portfolio 796) +The 8 failed prediction targets reduce to **5 distinct mapper-gap classes** (the fix +targets). Per class: fix the mapper GENERICALLY, guard with BOTH the RdSAP-21.0.1 +corpus gauge AND the **component-accuracy gate** (`test_component_accuracy_gate.py` +β€” every newly-mappable cert joins the prediction donor pool and can tip near-tie +similarity, as `multiple_glazed_proportion=100` did: regressed has_hot_water_cylinder +30β†’29/36 + door_count 23β†’25/36 β‡’ REVERTED). Then validate the now-mappable cert via +`/expand-sap-accuracy-corpus` on **the cohort cert's UPRN** (below), saving +epc.json + elmhurst_summary.pdf + elmhurst_worksheet.pdf under +`real_life_examples//uprn_/` and committing them. + +- **P1 β€” `built_form` missing (SAP-16.0)** β€” HIGHEST impact (5 certs / 3 targets: + 10013151061, 100020397529, 100020407755). Certβ†’UPRN: 8742-…-4926β†’10070004512, + 0004-…-7395β†’100020407745, 2338-…-5950β†’100020407732, 9328-…-0924β†’100020407771 + (8904-…-4623 has NO uprn β€” validate via one of the others). ⚠ built_form IS a + strong prediction similarity feature β†’ highest gate-regression risk: DERIVE it + (from `dwelling_type` text / `property_type`), don't flat-default; re-measure the + gate. Corpus-validate uprn 10070004512. +- **P2 β€” `photovoltaic_supply` lodged as a LIST β†’ AttributeError (`'list' has no + attribute none_or_no_details`)** β€” target 22086690 (BN2 9ZN). NEW real shape bug + (not missing-field); escaped the skip net until widened above. Clean fix: coerce + the list shape in the PV mapper (like the Measurement-dict co2 fix). Low gate risk. + ⚠ FIRST identify the offending cert in the BN2 9ZN cohort (lookup didn't surface it + in the first 50 β€” search the full cohort for `photovoltaic_supply` as list). +- **P3 β€” `window` missing, `windows` is a DICT not list (SAP-16.1)** β€” cert + 9768-…-7974 β†’ uprn 100062614005, target 100061905741. Clean normaliser fix: handle + `windows` as a dict (take it directly) in `_normalize_sap_schema_16_x`. Corpus- + validate uprn 100062614005. +- **P4 β€” `multiple_glazed_proportion` missing (SAP-16.3)** β€” cert 0418-…-7970 β†’ uprn + 22144943, target 22082258. The flat-100 default REGRESSED the gate (reverted). + DERIVE it (singleβ†’0 / doubleβ†’100 from `multiple_glazing_type` / window desc) so it + doesn't perturb similarity, then re-measure the gate. Corpus-validate uprn 22144943. +- **P5 β€” RdSAP-21.0.0 systematic ADR-0028 alignment** (`SapWindow.glazing_gap` + + 13 top-level fields) β€” cert 3135-…-2206 β†’ uprn 100021919725, target 100021919718. + Align RdSapSchema21_0_0 Optionals to 21.0.1 + review `from_rdsap_schema_21_0_0` + None-coalescing. Larger/riskier. Corpus-validate uprn 100021919725. +- **P6 β€” `window` + genuinely sparse (SAP-16.2)** β€” cert 8257-…-4992 β†’ uprn + 10070009758, target 100020401711. Also omits door_count/habitable/glazed_area β†’ + fail-loud is likely CORRECT (insufficient data). DOCUMENT decision; don't force-map. + +After all: re-run the modelling_e2e failure sweep to confirm the cohort skips drop +to zero (or only P6-class), and that no new gate regression slipped in. - [x] πŸ”§ 10096028301 β€” SAP-Schema-19.1.0 (full-SAP g/f FLAT, band M, combi PCDB 17929, MEV, AP50 3.5) Β· eng 82 / elm 82 (lodged 85) Β· PINNED engine 82. πŸ”§ mapper added: `from_sap_schema_19_1_0`. Built in Elmhurst (boiler 17929 via search, control CBE/2106, water from primary, MEV on, AP50 Blower Door 3.5+cert). Engine EXACTLY matches Elmhurst worksheet (82.11 vs 82); engine-on-Elmhurst-inputs 82.16 β‰ˆ 82 β†’ calculator faithful. βˆ’3 vs lodged = measured-U-vs-RdSAP-default + MEV extract-not-recovery (documented). No mapper change beyond coverage. - [x] πŸ”§ 100021943298 β€” SAP-Schema-16.1 (g/f FLAT, band B, solid-brick internal, combi PCDB 10328) Β· eng 76 / elm 75 (lodged 72) Β· PINNED engine 76. πŸ”§ mapper added: `from_sap_schema_16_1`. Built in Elmhurst (boiler 10328 via search, control CBE/2106, water from primary, wall insulation thickness Unknown); worksheet 75 β†’ engine within ~1 (tightest agreement, reduced-field). Boiler-select + water-heating + control dialogs all driven via automation (two-step rowβ†’Select / cascade + coordinate-OK). No mapper change beyond coverage. diff --git a/deployment/terraform/lambda/modelling_e2e/variables.tf b/deployment/terraform/lambda/modelling_e2e/variables.tf index 27063494..02cba590 100644 --- a/deployment/terraform/lambda/modelling_e2e/variables.tf +++ b/deployment/terraform/lambda/modelling_e2e/variables.tf @@ -26,7 +26,7 @@ variable "reserved_concurrent_executions" { variable "maximum_concurrency" { type = number - default = 4 + default = 16 description = "Maximum concurrent Lambda invocations from the SQS trigger." } diff --git a/repositories/comparable_properties/epc_comparable_properties_repository.py b/repositories/comparable_properties/epc_comparable_properties_repository.py index e0d09e0a..e1e7bd4d 100644 --- a/repositories/comparable_properties/epc_comparable_properties_repository.py +++ b/repositories/comparable_properties/epc_comparable_properties_repository.py @@ -80,18 +80,26 @@ class EpcComparablePropertiesRepository(ComparablePropertiesRepository): cohort.append(comparable) return cohort + # Mapper-shape errors: the cert's lodged data does not fit the schema/mapper. + # `ValueError` β€” missing required field / unmapped code (`UnmappedApiCode`); + # `AttributeError` β€” a field lodged in the wrong shape (e.g. `photovoltaic_supply` + # as a list where a `PhotovoltaicSupply` object is expected β†’ "'list' object has + # no attribute 'none_or_no_details'"); `KeyError`/`TypeError` β€” analogous + # structural mismatches. Transient API failures (`EpcApiError` &c.) subclass + # `Exception` directly, NOT these, so they still propagate and fail the run loud. + _UNMAPPABLE_CERT_ERRORS = (ValueError, AttributeError, KeyError, TypeError) + def _comparable_or_skip( self, result: EpcSearchResult, coordinates: dict[int, Coordinates] ) -> Optional[ComparableProperty]: """Map one cohort cert, or record + skip it when the mapper raises a - ``ValueError`` (missing required field / unmapped code β€” see - ``UnmappedApiCode``). One unmappable cert must NOT abort the whole cohort; - transient API errors are NOT ``ValueError`` and still propagate.""" + mapper-shape error (see ``_UNMAPPABLE_CERT_ERRORS``). One unmappable cert + must NOT abort the whole cohort; transient API errors still propagate.""" try: epc: EpcPropertyData = self._epc_client.get_by_certificate_number( result.certificate_number ) - except ValueError as exc: + except self._UNMAPPABLE_CERT_ERRORS as exc: self.skipped.append( SkippedCohortCert( certificate_number=result.certificate_number, diff --git a/tests/repositories/comparable_properties/test_epc_comparable_properties_repository.py b/tests/repositories/comparable_properties/test_epc_comparable_properties_repository.py index 8f71be30..418b2af6 100644 --- a/tests/repositories/comparable_properties/test_epc_comparable_properties_repository.py +++ b/tests/repositories/comparable_properties/test_epc_comparable_properties_repository.py @@ -176,6 +176,31 @@ def test_unmappable_cohort_cert_is_skipped_and_recorded() -> None: assert "missing required field 'window'" in repo.skipped[0].error +def test_malformed_cert_shape_attributeerror_is_skipped_and_recorded() -> None: + # Arrange β€” a cohort cert lodges a field in the wrong shape (e.g. + # `photovoltaic_supply` as a list), so the mapper raises AttributeError + # ("'list' object has no attribute 'none_or_no_details'") β€” a mapper-shape + # error like the missing-field ValueErrors, so it must skip + record, not abort. + client = _FakeEpcClientRaising( + [_result("CERT-OK", uprn=12345), _result("CERT-BAD", uprn=67890)], + failures={ + "CERT-BAD": AttributeError( + "'list' object has no attribute 'none_or_no_details'" + ) + }, + ) + repo = EpcComparablePropertiesRepository(client, _FakeGeospatial({})) + + # Act + candidates = repo.candidates_for("LS6 1AA") + + # Assert β€” bad cert skipped + recorded; good cert survives. + assert [c.certificate_number for c in candidates] == ["CERT-OK"] + assert len(repo.skipped) == 1 + assert repo.skipped[0].certificate_number == "CERT-BAD" + assert "none_or_no_details" in repo.skipped[0].error + + def test_transient_fetch_error_is_not_swallowed_as_unmappable() -> None: # Arrange β€” a non-ValueError (e.g. a transient API failure) must NOT be # silently skipped as "unmappable"; it propagates so the run fails loudly.