Adds the (API JSON + Summary PDF) fixtures for cert
0330-2249-8150-2326-4121 — the boiler pilot identified in the
handover. Property: 17 Summerfield Road, MANCHESTER M22 1AE
(mid-terrace house, mains gas boiler PCDB idx 10241, age D).
Source: API JSON fetched via EpcClientService from
https://api.get-energy-performance-data.communities.gov.uk
(OPEN_EPC_API_TOKEN). Summary PDF copied from
`sap worksheets/Additional data with api/0330-2249-8150-2326-4121/
Summary_000897.pdf` (where the user provided the triple).
Worksheet target: SAP 61.5993 (continuous), from `dr87-0001-000897
.pdf` in the same source directory.
Current state on these fixtures (uncommitted before this slice):
- Summary mapper cascade SAP: 62.0660 (Δ +0.4667 vs worksheet)
- API mapper cascade SAP: 63.7446 (Δ +2.1453 vs worksheet)
Both paths RED at 1e-4. Two specific cascade-component gaps
identified in the handover for follow-up slices:
1. Windows HLC +6.71 W/K (API vs Summary) — likely glazing_type=14
not in Slice 93's `_API_GLAZING_TYPE_TO_TRANSMISSION` (only
codes 3 and 13 mapped).
2. HW kWh +1060 (API 3172.65 vs Summary 2112.00) — §4 subsystem
gap; needs occupancy/shower/cylinder probe.
This commit stages the fixtures only — no tests added yet. The
follow-up slice should add a RED Layer 2 test (Summary path 1e-4
vs 61.5993) and proceed slice-by-slice.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
`from_elmhurst_site_notes` hard-coded `extensions_count=0` regardless of
how many extensions the survey lodged. The 6 cohort certs from Slices
47-53 all happened to have 0-2 extensions whose count nothing
load-bearing read, so this latent bug was invisible. Cert 001479
(Summary_001479.pdf, GOV.UK EPB cert 0535-9020-6509-0821-6222) has Main
+ Extension 1 + Extension 2 and is the first cohort cert with a real
API counterpart — accurate `extensions_count` becomes load-bearing the
moment the cross-mapper parity assertion compares API vs Elmhurst
EpcPropertyData side by side.
No SAP-cascade impact (the cascade iterates `sap_building_parts`, not
`extensions_count`) — but a real data-integrity bug surfaced by the
cross-mapper diff. Adds Summary_001479.pdf as a new chain-test fixture
and `_SUMMARY_001479_PDF` constant for follow-up slices that will
land per-bp ages, exposed floors, secondary-heating SAP codes, etc.
All 9 chain tests green; 321 mapper/site-notes/rdsap tests green;
pyright net-zero (35-error baseline preserved on mapper.py).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The §11 Windows table in the Summary PDF doesn't lay out identically
across the cohort. Three new quirks added to the layout-style parser
so the remaining 5 certs can be debugged with windows actually
extracted:
1. `Wood 0.70` combined frame_type+frame_factor line — previously the
parser expected them on separate lines (data+1 / data+2) and
rejected the window when the joined form appeared.
2. Trailing glazing-type on the data line — `1.22 1.76 2.15 Double
pre 2002` is the joined-cell variant in 000516; the W/H/Area
anchor now captures the trailing phrase as an optional 4th group
and feeds it through as `inline_glazing_type`, bypassing the
separate-line glazing-prefix scan.
3. Cross-window gap with no glazing marker — `_partition_after_manuf`
now falls back to "second orientation token in gap" when no
glazing-type-prefix word appears. Covers the 000516 layout where
each window has prefix+suffix orient tokens (no inline orient)
and the glazing-type is joined-to-data.
The 5 remaining Summary PDFs are copied into
`backend/documents_parser/tests/fixtures/` ready for per-cert mapper
work. Mirror pin tests deferred — each cert still has its own diff
to close (handover in NEXT_AGENT_PROMPT.md documents the per-cert
state, e.g. 000477 needs secondary-heating extraction, 000516 needs
roof-window separation).
Current cohort SAP deltas vs the U985 worksheet PDFs (target 1e-4):
000474 0.0000 ✓
000477 +6.3655 secondary heating + lighting
000480 +8.2695 diagnosis pending
000487 +8.1433 extractor still drops windows
000490 +5.6551 diagnosis pending
000516 +5.9812 roof-window separation
Wider regression stays green (754 pass). Pyright net-zero on
touched files.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The 6 worksheet fixtures build EpcPropertyData by hand, validating the cascade in isolation from the mapper. This commit lands the first half of the OTHER validation: Summary_000474.pdf → ElmhurstSiteNotesExtractor → from_elmhurst_site_notes → EpcPropertyData, asserting it produces the same shape as the hand-built fixture. Test is strict-xfail on sap_building_parts count (mapper produces 1, cert lodges 3). Includes a pdftotext-layout preprocessor that converts spatial label/value layout into the Textract-style sequence the existing extractor expects (test-only). Full punch list of 28 mapper-output diffs captured in project memory.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>