Commit graph

5 commits

Author SHA1 Message Date
Khalim Conn-Kowlessar
ec9ef0e8bb fix(extractor): drop windows-table header remnant from first window glazing type
Summary PDFs preprocessed from `pdftotext -layout` wrap the windows-table
header across several lines. The third header line's tail ("U value / g
value / Draught Proofed / Permanent Shutters") tokenises to "value value
Proofed Shutters" and lands directly above the FIRST window's data row.

Because the first window in a building part has `before_start = 0`, its
prefix block reaches back into that header remnant. The remnant is
neither an orientation nor a building-part fragment, so it survived the
pops in `_compose_window_descriptors` and leaked into glazing_type as
"value value Proofed Shutters Double between 2002 and 2021" (windows 2-3,
whose prefix starts after the previous window's manufacturer line, were
clean).

Fix: the glazing-type phrase always starts with a glazing-start word
(Single/Double/Triple/Secondary), so trim any prefix fragments preceding
that word before joining the glazing type. Orientation/bp pops still run
on the full prefix, so they are unaffected.

Reproduced from `sap worksheets/Recommendations Elmhurst Files/
cavity_wall_insulation - main wall/before/Summary_001431.pdf`. Added a
regression test driving the real `_extract_windows_from_layout` path with
the verbatim tokenised header+rows. 2306 passed (+4), pyright net-zero.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-02 22:54:49 +00:00
Daniel Roth
8f94bb5435 extract window frame details from elmhurst site notes 🟥 2026-04-27 15:50:25 +00:00
Daniel Roth
444eaa0c06 extract energy fields from elmhurst site notes 🟥 2026-04-27 12:10:40 +00:00
Daniel Roth
f61add9544 Extract Elmhurst site notes to dataclass 🟩 2026-04-24 13:32:08 +00:00
Daniel Roth
1a53a8d83e Extract Elmhurst site notes to dataclass 🟥 2026-04-24 13:13:24 +00:00