Model/backend/documents_parser/tests/test_summary_pdf_mapper_chain.py
Khalim Conn-Kowlessar cb4e31a135 Slice 51: Summary_000516 chain pins SAP at 1e-4; roof-window separation
Three mapper extensions, validated by 000516 closing to 1e-4:

1. Roof-window separation by U-value threshold. Elmhurst Summary PDFs
   pool roof windows into the §11 vertical-window table with no type
   marker. The U-value is the only reliable signal — vertical glazing
   in the cohort tops out at 2.80 W/m²K, while Table 24 roof windows
   start at 3.0+. `_is_elmhurst_roof_window` filters U > 3.0 into
   `sap_roof_windows`; the rest flow through the `sap_windows` path.

2. Table-24 roof-window U-value lookup. The cohort lodges Manufacturer
   U=3.10 for the 000516 roof window, but the worksheet's (27a) line
   (U_eff=2.99) reverse-engineers to a raw U=3.40, the RdSAP10
   Table 24 "Double pre 2002" roof-window default.
   `_elmhurst_roof_window_u_value`, keyed on glazing type, captures
   the +0.3 W/m²K step and falls back to the lodged U for glazing
   types not yet in the table.

3. `SapWindow.window_width × window_height = lodged Area` convention.
   The Elmhurst Summary PDF carries lodged W (2 d.p.) × lodged H
   (2 d.p.) AND a precomputed Area (2 d.p., not always equal to
   product after rounding). The cascade reads only the W×H product
   across §3 / §5 / §6, so flattening to `(area, 1.0)` keeps the
   downstream area aligned with the worksheet's rounded value rather
   than reconstructing W×H with its own rounding drift (e.g. 1.22 ×
   1.76 = 2.1472 m² vs lodged 2.15 m²). The existing
   `test_first_window_*` tests pinning literal W/H were updated to
   pin the area product (the cascade-relevant invariant).
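
The three conventions above can be sketched in isolation. This is a minimal illustration, not the mapper's actual code: the `LodgedWindow` record, the function names, the one-entry Table 24 excerpt, and the sample rows are hypothetical stand-ins. Only the 3.0 threshold, the 3.40 "Double pre 2002" default, the lodged 3.10, and the 1.22 × 1.76 / 2.15 figures come from the notes above. `effective_u` applies the SAP worksheet's 0.04 m²K/W window adjustment, which is how U=3.40 becomes U_eff=2.99.

```python
from __future__ import annotations

from dataclasses import dataclass

# Threshold from the notes: vertical glazing tops out at 2.80 W/m²K.
ROOF_WINDOW_U_THRESHOLD = 3.0

# Hypothetical one-entry excerpt; only this value is stated in the notes.
TABLE_24_ROOF_WINDOW_U = {"Double pre 2002": 3.40}


@dataclass(frozen=True)
class LodgedWindow:
    """Illustrative stand-in for one row of the Summary PDF's §11 table."""

    glazing_type: str
    u_value: float  # W/m²K, as lodged
    width: float    # m, 2 d.p.
    height: float   # m, 2 d.p.
    area: float     # m², 2 d.p., precomputed in the PDF


def is_roof_window(w: LodgedWindow) -> bool:
    """Item 1: the U-value is the only discriminator in the pooled table."""
    return w.u_value > ROOF_WINDOW_U_THRESHOLD


def roof_window_raw_u(w: LodgedWindow) -> float:
    """Item 2: prefer the Table 24 default, else fall back to the lodged U."""
    return TABLE_24_ROOF_WINDOW_U.get(w.glazing_type, w.u_value)


def effective_u(raw_u: float) -> float:
    """SAP worksheet (27)/(27a) adjustment: 1 / (1/U + 0.04)."""
    return 1.0 / (1.0 / raw_u + 0.04)


def flatten_to_area(w: LodgedWindow) -> tuple[float, float]:
    """Item 3: return (width, height) whose product is the lodged area."""
    return (w.area, 1.0)


# Illustrative rows; dimensions of the second row are made up.
rows = [
    LodgedWindow("Double (illustrative)", 2.80, 1.22, 1.76, 2.15),
    LodgedWindow("Double pre 2002", 3.10, 0.80, 1.00, 0.80),
]
roof = [w for w in rows if is_roof_window(w)]
vertical = [w for w in rows if not is_roof_window(w)]

print(round(effective_u(roof_window_raw_u(roof[0])), 2))  # Table 24 raw U
print(round(effective_u(roof[0].u_value), 2))              # lodged U
print(vertical[0].width * vertical[0].height, flatten_to_area(vertical[0]))
```

Using the lodged 3.10 instead of the Table 24 value gives an effective U of about 2.76 rather than 2.99, which is the gap item 2 closes. The last line shows the W×H product (about 2.1472) next to the flattened `(2.15, 1.0)` pair from item 3.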

Cohort state after this slice:

  000474   0.0000  ✓ Slice 47
  000477  +1.1161     Elmhurst floor_ach quirk
  000480   0.0000  ✓ Slice 50
  000487  +1.1844     extractor still drops most §11 windows
  000490   0.0000  ✓ Slice 49
  000516   0.0000  ✓ THIS SLICE

4/6 closed at 1e-4. 756 tests pass; pyright net-zero (35 baseline).
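
The "4/6 closed" figure follows directly from the residuals in the table. A minimal sketch of that tally, with the residuals copied from the table above; the dictionary layout and the names `residuals` and `closed` are illustrative only:

```python
TOLERANCE = 1e-4

# Residual (mapped chain SAP minus worksheet SAP) per cert, from the table.
residuals = {
    "000474": 0.0000,
    "000477": 1.1161,
    "000480": 0.0000,
    "000487": 1.1844,
    "000490": 0.0000,
    "000516": 0.0000,
}

# A cert counts as closed when its residual is inside the tolerance.
closed = sorted(cert for cert, delta in residuals.items() if abs(delta) < TOLERANCE)
print(f"{len(closed)}/{len(residuals)} closed: {', '.join(closed)}")
```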

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 21:16:46 +00:00

"""End-to-end validation for the Elmhurst Summary→EpcPropertyData chain.

The 6 Elmhurst worksheet fixtures in `domain.sap.worksheet.tests`
build their `EpcPropertyData` synthetically, so they validate the
calculator + cascade in isolation from the mapper. This file pins
the OTHER half of the chain: `from_elmhurst_site_notes` must produce
a calculator-equivalent `EpcPropertyData` when fed the Summary PDF
the worksheet was generated from. Together with the worksheet
cascade tests, this closes the loop: extractor + mapper + cascade
+ calculator validated end-to-end against the authoritative
Elmhurst documents.

Status: GREEN. For cert U985-0001-000474, this pipeline reproduces the
worksheet PDF's unrounded SAP `62.2584` (line 257) to within 1e-4, the
tolerance asserted by the full-chain tests below. The cascade itself
reproduces Elmhurst's calculator exactly on hand-built inputs
(handbuilt → 62.2584 to 4 d.p.); fields the Summary PDF doesn't lodge
(e.g. central_heating_pump_age) are non-load-bearing field drift and
do not move the mapped result outside that tolerance.

Preprocessing: the existing `ElmhurstSiteNotesExtractor` was written
against Textract-style output (label\\nvalue pairs in spatial
reading order). We don't have Textract in the test environment, so
this helper converts `pdftotext -layout` output (label-whitespace-
value on a single line) into the Textract-style sequence the
extractor expects. Test-only preprocessing; production runs through
Textract directly.
"""

from __future__ import annotations

import re
import subprocess
from pathlib import Path

from backend.documents_parser.elmhurst_extractor import ElmhurstSiteNotesExtractor
from datatypes.epc.domain.mapper import EpcPropertyDataMapper
from domain.sap.calculator import calculate_sap_from_inputs
from domain.sap.rdsap.cert_to_inputs import SAP_10_2_SPEC_PRICES, cert_to_inputs

_FIXTURES = Path(__file__).parent / "fixtures"
_SUMMARY_000474_PDF = _FIXTURES / "Summary_000474.pdf"
_SUMMARY_000480_PDF = _FIXTURES / "Summary_000480.pdf"
_SUMMARY_000490_PDF = _FIXTURES / "Summary_000490.pdf"
_SUMMARY_000516_PDF = _FIXTURES / "Summary_000516.pdf"


def _summary_pdf_to_textract_style_pages(pdf_path: Path) -> list[str]:
    """Convert a Summary PDF into the per-page text format the existing
    `ElmhurstSiteNotesExtractor` expects (label\\nvalue sequences).

    `pdftotext -layout` preserves the spatial pairing of label and value
    on each line; we split each line on 2+ spaces to surface the
    label/value tokens, then concatenate them back into a single
    newline-delimited stream per page.
    """
    info = subprocess.run(
        ["pdfinfo", str(pdf_path)], capture_output=True, text=True, check=True
    ).stdout
    m = re.search(r"Pages:\s+(\d+)", info)
    if m is None:
        raise RuntimeError(f"Could not parse page count from {pdf_path}")
    page_count = int(m.group(1))

    pages: list[str] = []
    for i in range(1, page_count + 1):
        layout = subprocess.run(
            [
                "pdftotext", "-layout", "-f", str(i), "-l", str(i),
                str(pdf_path), "-",
            ],
            capture_output=True, text=True, check=True,
        ).stdout
        tokens: list[str] = []
        for line in layout.splitlines():
            if not line.strip():
                tokens.append("")
                continue
            parts = [p for p in re.split(r"\s{2,}", line.strip()) if p]
            tokens.extend(parts)
        pages.append("\n".join(tokens))
    return pages


def test_summary_000474_mapper_produces_three_building_parts() -> None:
    # Arrange: cert U985-0001-000474 is a mid-terrace with 3 building
    # parts (Main + 2 extensions) per the hand-built worksheet fixture
    # at packages/domain/src/domain/sap/worksheet/tests/
    # _elmhurst_worksheet_000474.py. Routing the Summary PDF through
    # extractor + mapper must yield the same count.
    pages = _summary_pdf_to_textract_style_pages(_SUMMARY_000474_PDF)
    site_notes = ElmhurstSiteNotesExtractor(pages).extract()

    # Act
    epc = EpcPropertyDataMapper.from_elmhurst_site_notes(site_notes)

    # Assert
    assert len(epc.sap_building_parts) == 3


def test_summary_000474_mapper_extracts_seven_windows() -> None:
    # Arrange: cert U985-0001-000474's §11 table lodges 7 windows
    # across Main + 1st Extension + 2nd Extension. The legacy Textract-
    # style window parser couldn't anchor on the Summary PDF's tabular
    # layout; the new W/H/Area-plus-Manufacturer anchor pair picks them
    # all up.
    pages = _summary_pdf_to_textract_style_pages(_SUMMARY_000474_PDF)
    site_notes = ElmhurstSiteNotesExtractor(pages).extract()

    # Act
    epc = EpcPropertyDataMapper.from_elmhurst_site_notes(site_notes)

    # Assert
    assert len(epc.sap_windows) == 7


def test_summary_000474_full_chain_sap_matches_worksheet_pdf_exactly() -> None:
    # Arrange: the full Summary→ElmhurstSiteNotes→EpcPropertyData→cascade
    # →SAP path against the U985-0001-000474 worksheet PDF's unrounded
    # SAP rating (line 257: SAP value 62.2584, rating (258) = 62).
    # Because the Summary PDF carries the same source-of-truth data that
    # the hand-built worksheet fixture encodes by hand, and because the
    # cascade matches Elmhurst's calculator to 4 d.p. on those hand-
    # built inputs, this end-to-end path MUST produce the same unrounded
    # SAP value. Any non-trivial drift = a real mapper bug dropping
    # information from the Summary PDF.
    pages = _summary_pdf_to_textract_style_pages(_SUMMARY_000474_PDF)
    site_notes = ElmhurstSiteNotesExtractor(pages).extract()
    epc = EpcPropertyDataMapper.from_elmhurst_site_notes(site_notes)

    # Act
    result = calculate_sap_from_inputs(
        cert_to_inputs(epc, prices=SAP_10_2_SPEC_PRICES)
    )

    # Assert: within the same 1e-4 tolerance the other Elmhurst worksheet
    # tests pin against. 0.5 is the API-cert residual tolerance (the API
    # publishes rounded SAP integers, so up to half a SAP point is just
    # rounding); for Elmhurst worksheet inputs the cascade reproduces
    # Elmhurst exactly and we expect identical outputs.
    worksheet_unrounded_sap = 62.2584
    assert abs(result.sap_score_continuous - worksheet_unrounded_sap) < 1e-4


def test_summary_000480_full_chain_sap_matches_worksheet_pdf_exactly() -> None:
    # Arrange: cert U985-0001-000480 is a mid-terrace with main + one
    # extension and a 19.83 m² room-in-roof storey. Worksheet PDF lodges
    # unrounded SAP 61.2986 on line "SAP value". The Detailed §3.10 RR
    # surfaces (2 stud walls @ 0mm + 2 slopes @ 0mm + 1 flat ceiling @
    # 0mm + 2 party gables) plus zero baths drive the chain to 1e-4.
    pages = _summary_pdf_to_textract_style_pages(_SUMMARY_000480_PDF)
    site_notes = ElmhurstSiteNotesExtractor(pages).extract()
    epc = EpcPropertyDataMapper.from_elmhurst_site_notes(site_notes)

    # Act
    result = calculate_sap_from_inputs(
        cert_to_inputs(epc, prices=SAP_10_2_SPEC_PRICES)
    )

    # Assert
    worksheet_unrounded_sap = 61.2986
    assert abs(result.sap_score_continuous - worksheet_unrounded_sap) < 1e-4


def test_summary_000516_full_chain_sap_matches_worksheet_pdf_exactly() -> None:
    # Arrange: cert U985-0001-000516 is a mid-terrace with main bp +
    # 19.02 m² room-in-roof. Worksheet PDF lodges unrounded SAP 62.7937.
    # The §11 table mixes 5 vertical windows (U=2.80) with 1 roof
    # window (U=3.10 in cert, U=3.40 Table 24 raw); the mapper
    # discriminates by `U > 3.0` and routes the high-U entry to
    # `sap_roof_windows` so its solar gains feed §6 with the right
    # pitch (45°) and Table-24 U-value.
    pages = _summary_pdf_to_textract_style_pages(_SUMMARY_000516_PDF)
    site_notes = ElmhurstSiteNotesExtractor(pages).extract()
    epc = EpcPropertyDataMapper.from_elmhurst_site_notes(site_notes)

    # Act
    result = calculate_sap_from_inputs(
        cert_to_inputs(epc, prices=SAP_10_2_SPEC_PRICES)
    )

    # Assert
    worksheet_unrounded_sap = 62.7937
    assert abs(result.sap_score_continuous - worksheet_unrounded_sap) < 1e-4


def test_summary_000490_full_chain_sap_matches_worksheet_pdf_exactly() -> None:
    # Arrange: cert U985-0001-000490 is an end-terrace with main +
    # 1st extension. The worksheet PDF lodges unrounded SAP 57.3979.
    # End-terrace built-form drives sheltered_sides=1 (RdSAP §S5) and
    # the cert's Summary §14.1 Main Heating2 sub-section carries a
    # secondary heating SAP code (691, electric panel); both are required
    # for the mapped chain to reproduce the worksheet to 1e-4.
    pages = _summary_pdf_to_textract_style_pages(_SUMMARY_000490_PDF)
    site_notes = ElmhurstSiteNotesExtractor(pages).extract()
    epc = EpcPropertyDataMapper.from_elmhurst_site_notes(site_notes)

    # Act
    result = calculate_sap_from_inputs(
        cert_to_inputs(epc, prices=SAP_10_2_SPEC_PRICES)
    )

    # Assert
    worksheet_unrounded_sap = 57.3979
    assert abs(result.sap_score_continuous - worksheet_unrounded_sap) < 1e-4
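
The preprocessing step in `_summary_pdf_to_textract_style_pages` is easiest to see on a single page of text. The sketch below applies the same split-on-two-or-more-spaces rule to a made-up `pdftotext -layout` fragment, without shelling out to `pdfinfo` or `pdftotext`. The sample labels and values are invented for illustration and are not taken from any fixture, and `layout_to_textract_style` is a hypothetical name for the helper's inner loop.

```python
import re


def layout_to_textract_style(layout: str) -> str:
    """Turn label<spaces>value lines into a label\\nvalue token stream."""
    tokens: list[str] = []
    for line in layout.splitlines():
        if not line.strip():
            # Blank lines are kept as empty tokens, as in the helper.
            tokens.append("")
            continue
        # Two or more whitespace characters separate columns; a single
        # space stays inside a token ("Property type" is one label).
        tokens.extend(p for p in re.split(r"\s{2,}", line.strip()) if p)
    return "\n".join(tokens)


# Invented sample in the shape `pdftotext -layout` produces.
sample = (
    "Property type            House\n"
    "\n"
    "Built form     Mid-terrace      Storeys   2\n"
)
converted = layout_to_textract_style(sample)
print(converted)
```

One consequence of this rule is that a label and value separated by only a single space would stay fused in one token, so the approach relies on `-layout` padding columns with at least two spaces.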