`from_elmhurst_site_notes` hard-coded `extensions_count=0` regardless of
how many extensions the survey lodged. The 6 cohort certs from Slices
47-53 each have 0-2 extensions, but nothing load-bearing read the
count, so this latent bug stayed invisible. Cert 001479
(Summary_001479.pdf, GOV.UK EPB cert 0535-9020-6509-0821-6222) has Main
+ Extension 1 + Extension 2 and is the first cohort cert with a real
API counterpart, so an accurate `extensions_count` becomes load-bearing
the moment the cross-mapper parity assertion compares the API and
Elmhurst EpcPropertyData side by side.
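A minimal sketch of deriving the count instead of hard-coding it,
assuming the site notes expose building parts as labels such as `Main`
and `Extension 1`. The `count_extensions` helper and the label format
are hypothetical, not the mapper's actual API:

```python
import re

# Hypothetical label pattern: "Extension 1", "Extension 2", ...
_EXTENSION_LABEL = re.compile(r"^Extension\s+\d+$", re.IGNORECASE)


def count_extensions(part_labels: list[str]) -> int:
    """Count building parts labelled as extensions; "Main" is not counted."""
    return sum(
        1 for label in part_labels if _EXTENSION_LABEL.match(label.strip())
    )


# Cert 001479 lodges Main + Extension 1 + Extension 2.
print(count_extensions(["Main", "Extension 1", "Extension 2"]))  # 2
```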
No SAP-cascade impact (the cascade iterates `sap_building_parts`, not
`extensions_count`), but this is a real data-integrity bug surfaced by
the cross-mapper diff. Adds Summary_001479.pdf as a new chain-test
fixture and a `_SUMMARY_001479_PDF` constant for follow-up slices that
will land per-bp ages, exposed floors, secondary-heating SAP codes, etc.
All 9 chain tests green; 321 mapper/site-notes/rdsap tests green;
pyright net-zero (35-error baseline preserved on mapper.py).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The §11 Windows table in the Summary PDF doesn't lay out identically
across the cohort. The layout-style parser now handles three more
quirks so the remaining 5 certs can be debugged with their windows
actually extracted:
1. `Wood 0.70` combined frame_type+frame_factor line: previously the
   parser expected them on separate lines (data+1 / data+2) and
   rejected the window when the joined form appeared.
2. Trailing glazing-type on the data line:
   `1.22 1.76 2.15 Double pre 2002` is the joined-cell variant in
   000516; the W/H/Area anchor now captures the trailing phrase as an
   optional 4th group and feeds it through as `inline_glazing_type`,
   bypassing the separate-line glazing-prefix scan.
3. Cross-window gap with no glazing marker: `_partition_after_manuf`
   now falls back to "second orientation token in gap" when no
   glazing-type-prefix word appears. This covers the 000516 layout,
   where each window has prefix+suffix orientation tokens (no inline
   orientation) and the glazing-type is joined to the data line.
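Quirks 1 and 2 can be sketched with two regular expressions. This is an
illustration under assumptions, not the parser's real code: the pattern
names, the helper functions, and the returned dict keys (other than
`inline_glazing_type`) are hypothetical.

```python
import re

# Hypothetical W/H/Area anchor: three decimals, then an optional trailing
# glazing-type phrase captured as a 4th group.
_DATA_LINE = re.compile(
    r"^(\d+\.\d+)\s+(\d+\.\d+)\s+(\d+\.\d+)(?:\s+(\S.*?))?\s*$"
)

# Hypothetical pattern for the combined "<frame_type> <frame_factor>" line.
_FRAME_LINE = re.compile(r"^([A-Za-z][A-Za-z /-]*?)\s+(\d\.\d+)$")


def parse_data_line(line: str):
    """Return width/height/area plus any inline glazing type, or None."""
    m = _DATA_LINE.match(line.strip())
    if m is None:
        return None
    width, height, area = (float(g) for g in m.group(1, 2, 3))
    return {
        "width": width,
        "height": height,
        "area": area,
        # None when the glazing type sits on its own line instead.
        "inline_glazing_type": m.group(4),
    }


def parse_frame_line(line: str):
    """Return (frame_type, frame_factor) for a joined line, or None."""
    m = _FRAME_LINE.match(line.strip())
    if m is None:
        return None
    return m.group(1), float(m.group(2))
```

When `inline_glazing_type` is `None`, a caller would fall back to the
separate-line glazing-prefix scan; when `parse_frame_line` returns
`None`, it would fall back to reading frame type and factor from two
separate lines.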
The 5 remaining Summary PDFs are copied into
`backend/documents_parser/tests/fixtures/` ready for per-cert mapper
work. Mirror pin tests deferred — each cert still has its own diff
to close (handover in NEXT_AGENT_PROMPT.md documents the per-cert
state, e.g. 000477 needs secondary-heating extraction, 000516 needs
roof-window separation).
Current cohort SAP deltas vs the U985 worksheet PDFs (target 1e-4):
000474   0.0000  ✓
000477  +6.3655  secondary heating + lighting
000480  +8.2695  diagnosis pending
000487  +8.1433  extractor still drops windows
000490  +5.6551  diagnosis pending
000516  +5.9812  roof-window separation
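For reference, a small sketch of checking the deltas above against the
1e-4 target. The `certs_within_target` helper is hypothetical, and
treating the target as an absolute tolerance is an assumption:

```python
SAP_TOLERANCE = 1e-4  # assumed to be an absolute tolerance

# Deltas copied from the table above.
deltas = {
    "000474": 0.0000,
    "000477": 6.3655,
    "000480": 8.2695,
    "000487": 8.1433,
    "000490": 5.6551,
    "000516": 5.9812,
}


def certs_within_target(deltas: dict, tol: float = SAP_TOLERANCE) -> list:
    """Return the cert ids whose absolute SAP delta is within tolerance."""
    return sorted(cert for cert, delta in deltas.items() if abs(delta) <= tol)


print(certs_within_target(deltas))  # ['000474']
```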
Wider regression stays green (754 pass). Pyright net-zero on
touched files.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The 6 worksheet fixtures build EpcPropertyData by hand, validating the
cascade in isolation from the mapper. This commit lands the first half
of the OTHER validation: Summary_000474.pdf →
ElmhurstSiteNotesExtractor → from_elmhurst_site_notes →
EpcPropertyData, asserting it produces the same shape as the hand-built
fixture. Test is strict-xfail on sap_building_parts count (mapper
produces 1, cert lodges 3). Includes a pdftotext-layout preprocessor
that converts spatial label/value layout into the Textract-style
sequence the existing extractor expects (test-only). Full punch list of
28 mapper-output diffs captured in project memory.
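A field-by-field comparison of the kind that yields such a punch list
might look like the sketch below. `PropertyStub` is a hypothetical
stand-in for EpcPropertyData, `dwelling_type` and the part labels are
invented for illustration, and `diff_fields` is not the test's actual
helper:

```python
from dataclasses import dataclass, fields


@dataclass
class PropertyStub:
    """Hypothetical stand-in for EpcPropertyData (illustrative fields)."""
    dwelling_type: str
    sap_building_parts: tuple


def diff_fields(expected, actual) -> dict:
    """Return {field_name: (expected_value, actual_value)} where they differ."""
    return {
        f.name: (getattr(expected, f.name), getattr(actual, f.name))
        for f in fields(expected)
        if getattr(expected, f.name) != getattr(actual, f.name)
    }


# Hand-built fixture lodges 3 building parts; the mapper output has 1.
hand_built = PropertyStub("house", ("bp1", "bp2", "bp3"))
mapped = PropertyStub("house", ("bp1",))
print(sorted(diff_fields(hand_built, mapped)))  # ['sap_building_parts']
```

In pytest, marking such a comparison with
`@pytest.mark.xfail(strict=True)` makes an unexpected pass fail the
run, so the marker has to be removed once the mapper closes the diff.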
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>