mirror of
https://github.com/Hestia-Homes/Model.git
synced 2026-06-08 11:17:27 +00:00
Summary PDFs preprocessed from `pdftotext -layout` wrap the windows-table
header across several lines. The third header line's tail ("U value / g
value / Draught Proofed / Permanent Shutters") tokenises to "value value
Proofed Shutters" and lands directly above the FIRST window's data row.
Because the first window in a building part has `before_start = 0`, its
prefix block reaches back into that header remnant. The remnant is
neither an orientation nor a building-part fragment, so it survived the
pops in `_compose_window_descriptors` and leaked into glazing_type as
"value value Proofed Shutters Double between 2002 and 2021" (windows 2-3,
whose prefix starts after the previous window's manufacturer line, were
clean).
Fix: the glazing-type phrase always starts with a glazing-start word
(Single/Double/Triple/Secondary), so trim any prefix fragments preceding
that word before joining the glazing type. Orientation/bp pops still run
on the full prefix, so they are unaffected.
Reproduced from `sap worksheets/Recommendations Elmhurst Files/
cavity_wall_insulation - main wall/before/Summary_001431.pdf`. Added a
regression test driving the real `_extract_windows_from_layout` path with
the verbatim tokenised header+rows. 2306 passed (+4), pyright net-zero.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
||
|---|---|---|
| .. | ||
| handler | ||
| tests | ||
| __init__.py | ||
| db_writer.py | ||
| elmhurst_extractor.py | ||
| extractor.py | ||
| local_runner.py | ||
| parser.py | ||
| pdf.py | ||