Model/backend/documents_parser
Khalim Conn-Kowlessar a76af2ec2f Slice 99a: Elmhurst extractor — no attachment line for flats
Cert 9501 (Summary_000784.pdf) is a flat. The Elmhurst Summary's
§1.0 "Property type" section lodges the built-form descriptor
("M Mid-Terrace", "D Detached", ...) only for houses — flats have no
attachment line, and the §2.0 "Number of Storeys" header follows
immediately after the "F Flat" property-type value.

The extractor's prior `_extract_attachment` regex captured the line
right after the property-type value unconditionally, so cert 9501
ended up with `attachment="2.0 Number of Storeys:"` — section-header
noise that the mapper surfaced on `EpcPropertyData.built_form`.
Downstream, this broke the cascade's `_dwelling_exposure` routing
(no prefix match → defaulted to fully-exposed houses) and so the
cert 9501 Summary path was Δ -5.25 SAP vs worksheet 68.5252.

Detect section-header noise via the leading `<digit>.<digit> `
pattern and the "Number of Storeys" substring; return "" in that
case so flats produce empty `built_form`. Houses still pick up their
real attachment (cohort 0330's "M Mid-Terrace" remains correct).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 21:16:01 +00:00
..
handler address JTK review comments 2026-04-20 15:11:17 +00:00
tests Slice 99a: Elmhurst extractor — no attachment line for flats 2026-05-26 21:16:01 +00:00
__init__.py Map to RdSapSiteNotes from site notes JSON 🟥 2026-04-16 13:54:03 +00:00
db_writer.py include updating epc_property_data to pashub to ara workflow 2026-04-29 09:55:14 +00:00
elmhurst_extractor.py Slice 99a: Elmhurst extractor — no attachment line for flats 2026-05-26 21:16:01 +00:00
extractor.py Handle wall thickness "Unmeasurable" 🟩 2026-04-30 16:41:16 +00:00
local_runner.py update local runner to work for elmhurst 2026-04-24 14:01:36 +00:00
parser.py load ecmk site notes to db 2026-04-29 11:20:47 +00:00
pdf.py update local runner to work for elmhurst 2026-04-24 14:01:36 +00:00