Model/backend/documents_parser/tests
Khalim Conn-Kowlessar ec9ef0e8bb fix(extractor): drop windows-table header remnant from first window glazing type
Summary PDFs preprocessed from `pdftotext -layout` wrap the windows-table
header across several lines. The third header line's tail ("U value / g
value / Draught Proofed / Permanent Shutters") tokenises to "value value
Proofed Shutters" and lands directly above the FIRST window's data row.

Because the first window in a building part has `before_start = 0`, its
prefix block reaches back into that header remnant. The remnant is
neither an orientation nor a building-part fragment, so it survived the
pops in `_compose_window_descriptors` and leaked into glazing_type as
"value value Proofed Shutters Double between 2002 and 2021" (windows 2-3,
whose prefix starts after the previous window's manufacturer line, were
clean).

Fix: the glazing-type phrase always starts with a glazing-start word
(Single/Double/Triple/Secondary), so trim any prefix fragments preceding
that word before joining the glazing type. Orientation/bp pops still run
on the full prefix, so they are unaffected.

Reproduced from `sap worksheets/Recommendations Elmhurst Files/
cavity_wall_insulation - main wall/before/Summary_001431.pdf`. Added a
regression test driving the real `_extract_windows_from_layout` path with
the verbatim tokenised header+rows. 2306 passed (+4), pyright net-zero.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-02 22:54:49 +00:00
..
fixtures S0380.191: pin simulated 001431 gas-combi end-to-end at 1e-4 (e2e harness) 2026-06-02 22:44:32 +00:00
__init__.py Map to RdSapSiteNotes from site notes JSON 🟥 2026-04-16 13:54:03 +00:00
test_elmhurst_end_to_end.py Slice S0380.17: map Elmhurst §11 glazing-type labels to SAP10 codes 2026-05-27 23:05:52 +00:00
test_elmhurst_extractor.py fix(extractor): drop windows-table header remnant from first window glazing type 2026-06-02 22:54:49 +00:00
test_end_to_end.py P6.1 follow-on: unbox BuildingPartIdentifier at backend boundaries 2026-05-20 09:58:23 +00:00
test_extractor.py Handle wall thickness "Unmeasurable" 🟩 2026-04-30 16:41:16 +00:00
test_heating_systems_corpus.py S0380.185: record CH6 pin-forever proof — distribution-loss is a Summary-export gap 2026-06-02 19:21:28 +00:00
test_pdf.py rename example site notes to PasHub_ and add Elmhurst example 2026-04-24 13:01:51 +00:00
test_summary_pdf_mapper_chain.py Move sap10_calculator tests to tests/domain/sap10_calculator/ for CI 2026-06-02 16:58:00 +00:00