mirror of
https://github.com/Hestia-Homes/Model.git
synced 2026-06-08 11:17:27 +00:00
Four mapper extensions, validated by 000480 closing to 1e-4 and large
gap reductions across 000477/000487/000516.

1. Room-in-Roof support. `ElmhurstSiteNotes` gains `RoomInRoof` +
   `RoomInRoofSurface` dataclasses; extractor parses §8.1 (Flat
   Ceiling / Stud Wall / Slope / Gable Wall / Common Wall) with
   Length × Height + insulation + gable-type + measured-U cells.
   Mapper produces a `SapRoomInRoof` with `detailed_surfaces`
   attached to the Main bp: Stud Walls / Slopes / Flat Ceilings
   route through Table 17 insulation thickness; Gable Walls split
   between `gable_wall` (Party → Table 4 U=0.25) and
   `gable_wall_external` (Sheltered → assessor-lodged U-value
   override, e.g. 000487 Gable Wall 2 at U=0.86). Empty surfaces
   (0×0 — the cohort lodges a full 5-pair table) and Common Walls
   (handled by cascade's Simplified Type 2 geometry) are dropped.
   `total_floor_area_m2` now includes the RR floor area.

2. Party-wall construction mapping. 000516 lodges "S Solid masonry /
   timber / system build" which routes to SAP10 wall_construction=3
   (Solid Brick → U=0.0 via Table 4). The previous mapper used the
   same wall-type table as `wall_construction`, which lacked the
   "S" code and fell through to None (cascade default 0.25). Split
   into a dedicated `_elmhurst_party_wall_construction_int` keyed
   on the party-wall category codes.

3. Roof "None" insulation. When the §8.0 Roofs subsection lodges
   "Insulation N None" without a separate "Insulation Thickness"
   line, treat thickness as 0 mm so the cascade picks Table 16
   row 0 (U=2.30) rather than the age-band default. Closes the
   29 W/K roof-loss gap on 000516.

4. `number_baths` lodgement. `SapHeating.number_baths` now reads
   `survey.baths_and_showers.number_of_baths`. The cascade defaults
   `None → has-bath` for the modal UK case, but explicit `0` lodged
   on 000477/000480 (bathless dwellings, rare) drops the bath HW
   demand line per Table 1b. Closes 000480's last ~0.3 SAP gap.
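
Items 3 and 4 both hinge on keeping "nothing lodged" (`None`) distinct
from an explicit zero. A minimal sketch of that distinction, assuming
hypothetical helper names (`resolve_roof_thickness_mm`,
`bath_demand_applies`) and a hypothetical non-"N" insulation code; the
real mapper and cascade signatures are not shown here:

```python
from __future__ import annotations

from typing import Optional


def resolve_roof_thickness_mm(
    insulation_code: Optional[str],
    lodged_thickness_mm: Optional[int],
) -> Optional[int]:
    """Hypothetical helper: pick the roof insulation thickness to pass on.

    - A lodged thickness always wins.
    - "N" (None) insulation with no thickness line becomes an explicit 0 mm.
    - Anything else stays None so a downstream default can apply.
    """
    if lodged_thickness_mm is not None:
        return lodged_thickness_mm
    if insulation_code == "N":
        return 0
    return None


def bath_demand_applies(number_baths: Optional[int]) -> bool:
    """Hypothetical helper: None means "assume a bath"; explicit 0 means none."""
    return number_baths is None or number_baths > 0


# "N None" without a thickness line resolves to 0 mm, not to "unknown".
print(resolve_roof_thickness_mm("N", None))
# Nothing lodged at all stays None.
print(resolve_roof_thickness_mm(None, None))
# Explicit zero baths switches the bath demand off; None keeps it on.
print(bath_demand_applies(0), bath_demand_applies(None))
```

The point of the sketch is only that a truthiness check such as
`if not thickness` would merge the two cases and hide an explicit zero.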

Cohort state after this slice (target 1e-4):

  000474   0.0000  ✓ Slice 47
  000477  +1.1161  Elmhurst floor_ach quirk (true vs false despite
                   "T Suspended timber" lodged on all certs)
  000480   0.0000  ✓ THIS SLICE
  000487  +1.1844  extractor still drops most §11 windows on this
                   layout variant
  000490   0.0000  ✓ Slice 49
  000516  +0.1774  roof-window separation by U-value heuristic

3/6 certs now closed at 1e-4. Pyright net-zero (35 baseline). Tests
756 pass (added
`test_summary_000480_full_chain_sap_matches_worksheet_pdf_exactly`).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
181 lines
7.8 KiB
Python
"""End-to-end validation for the Elmhurst Summary→EpcPropertyData chain.

The 6 Elmhurst worksheet fixtures in `domain.sap.worksheet.tests`
build their `EpcPropertyData` synthetically; they validate the
calculator + cascade in isolation from the mapper. This file pins
the OTHER half of the chain: `from_elmhurst_site_notes` must produce
a calculator-equivalent `EpcPropertyData` when fed the Summary PDF
the worksheet was generated from. Together with the worksheet
cascade tests, this closes the loop: extractor + mapper + cascade
+ calculator validated end-to-end against the authoritative
Elmhurst documents.

Status: GREEN for certs U985-0001-000474, -000480 and -000490. The
mapped pipeline reproduces each worksheet PDF's unrounded SAP (e.g.
`62.2584` on line 257 for 000474) to within 1e-4, the same tolerance
the cascade meets on hand-built inputs (handbuilt → 62.2584 to
4 d.p.). Any larger drift on the mapped path indicates the mapper
is dropping information that the Summary PDF lodges.

Preprocessing: the existing `ElmhurstSiteNotesExtractor` was written
against Textract-style output (label\\nvalue pairs in spatial
reading order). We don't have Textract in the test environment, so
the helper `_summary_pdf_to_textract_style_pages` converts
`pdftotext -layout` output (label-whitespace-value on a single line)
into the Textract-style sequence the extractor expects. Test-only
preprocessing; production runs through Textract directly.
"""

from __future__ import annotations

import re
import subprocess
from pathlib import Path

from backend.documents_parser.elmhurst_extractor import ElmhurstSiteNotesExtractor
from datatypes.epc.domain.mapper import EpcPropertyDataMapper
from domain.sap.calculator import calculate_sap_from_inputs
from domain.sap.rdsap.cert_to_inputs import SAP_10_2_SPEC_PRICES, cert_to_inputs

_FIXTURES = Path(__file__).parent / "fixtures"
_SUMMARY_000474_PDF = _FIXTURES / "Summary_000474.pdf"
_SUMMARY_000480_PDF = _FIXTURES / "Summary_000480.pdf"
_SUMMARY_000490_PDF = _FIXTURES / "Summary_000490.pdf"


def _summary_pdf_to_textract_style_pages(pdf_path: Path) -> list[str]:
    """Convert a Summary PDF into the per-page text format the existing
    `ElmhurstSiteNotesExtractor` expects (label\\nvalue sequences).

    `pdftotext -layout` preserves the spatial pairing of label and value
    on each line; we split each line on 2+ spaces to surface the
    label/value tokens, then concatenate them back into a single
    newline-delimited stream per page.
    """
    info = subprocess.run(
        ["pdfinfo", str(pdf_path)], capture_output=True, text=True, check=True
    ).stdout
    m = re.search(r"Pages:\s+(\d+)", info)
    if m is None:
        raise RuntimeError(f"Could not parse page count from {pdf_path}")
    page_count = int(m.group(1))

    pages: list[str] = []
    for i in range(1, page_count + 1):
        layout = subprocess.run(
            [
                "pdftotext", "-layout", "-f", str(i), "-l", str(i),
                str(pdf_path), "-",
            ],
            capture_output=True, text=True, check=True,
        ).stdout
        tokens: list[str] = []
        for line in layout.splitlines():
            if not line.strip():
                tokens.append("")
                continue
            parts = [p for p in re.split(r"\s{2,}", line.strip()) if p]
            tokens.extend(parts)
        pages.append("\n".join(tokens))
    return pages


def test_summary_000474_mapper_produces_three_building_parts() -> None:
    # Arrange — cert U985-0001-000474 is a mid-terrace with 3 building
    # parts (Main + 2 extensions) per the hand-built worksheet fixture
    # at packages/domain/src/domain/sap/worksheet/tests/
    # _elmhurst_worksheet_000474.py. Routing the Summary PDF through
    # extractor + mapper must yield the same count.
    pages = _summary_pdf_to_textract_style_pages(_SUMMARY_000474_PDF)
    site_notes = ElmhurstSiteNotesExtractor(pages).extract()

    # Act
    epc = EpcPropertyDataMapper.from_elmhurst_site_notes(site_notes)

    # Assert
    assert len(epc.sap_building_parts) == 3


def test_summary_000474_mapper_extracts_seven_windows() -> None:
    # Arrange — cert U985-0001-000474's §11 table lodges 7 windows
    # across Main + 1st Extension + 2nd Extension. The legacy Textract-
    # style window parser couldn't anchor on the Summary PDF's tabular
    # layout; the new W/H/Area-plus-Manufacturer anchor pair picks them
    # all up.
    pages = _summary_pdf_to_textract_style_pages(_SUMMARY_000474_PDF)
    site_notes = ElmhurstSiteNotesExtractor(pages).extract()

    # Act
    epc = EpcPropertyDataMapper.from_elmhurst_site_notes(site_notes)

    # Assert
    assert len(epc.sap_windows) == 7


def test_summary_000474_full_chain_sap_matches_worksheet_pdf_exactly() -> None:
    # Arrange — the full Summary→ElmhurstSiteNotes→EpcPropertyData→cascade
    # →SAP path against the U985-0001-000474 worksheet PDF's unrounded
    # SAP rating (line 257: SAP value 62.2584, rating (258) = 62).
    # Because the Summary PDF carries the same source-of-truth data that
    # the hand-built worksheet fixture encodes by hand, and because the
    # cascade matches Elmhurst's calculator to 4 d.p. on those hand-
    # built inputs, this end-to-end path MUST produce the same unrounded
    # SAP value. Any non-trivial drift = a real mapper bug dropping
    # information from the Summary PDF.
    pages = _summary_pdf_to_textract_style_pages(_SUMMARY_000474_PDF)
    site_notes = ElmhurstSiteNotesExtractor(pages).extract()
    epc = EpcPropertyDataMapper.from_elmhurst_site_notes(site_notes)

    # Act
    result = calculate_sap_from_inputs(
        cert_to_inputs(epc, prices=SAP_10_2_SPEC_PRICES)
    )

    # Assert — within the same 1e-4 tolerance the other Elmhurst worksheet
    # tests pin against. 0.5 is the API-cert residual tolerance (the API
    # publishes rounded SAP integers, so up to half a SAP point is just
    # rounding); for Elmhurst worksheet inputs the cascade reproduces
    # Elmhurst exactly and we expect identical outputs.
    worksheet_unrounded_sap = 62.2584
    assert abs(result.sap_score_continuous - worksheet_unrounded_sap) < 1e-4


def test_summary_000480_full_chain_sap_matches_worksheet_pdf_exactly() -> None:
    # Arrange — cert U985-0001-000480 is a mid-terrace with main + one
    # extension and a 19.83 m² room-in-roof storey. Worksheet PDF lodges
    # unrounded SAP 61.2986 on line "SAP value". The Detailed §3.10 RR
    # surfaces (2 stud walls @ 0mm + 2 slopes @ 0mm + 1 flat ceiling @
    # 0mm + 2 party gables) plus zero baths drive the chain to 1e-4.
    pages = _summary_pdf_to_textract_style_pages(_SUMMARY_000480_PDF)
    site_notes = ElmhurstSiteNotesExtractor(pages).extract()
    epc = EpcPropertyDataMapper.from_elmhurst_site_notes(site_notes)

    # Act
    result = calculate_sap_from_inputs(
        cert_to_inputs(epc, prices=SAP_10_2_SPEC_PRICES)
    )

    # Assert
    worksheet_unrounded_sap = 61.2986
    assert abs(result.sap_score_continuous - worksheet_unrounded_sap) < 1e-4


def test_summary_000490_full_chain_sap_matches_worksheet_pdf_exactly() -> None:
    # Arrange — cert U985-0001-000490 is an end-terrace with main +
    # 1st extension. The worksheet PDF lodges unrounded SAP 57.3979.
    # End-terrace built-form drives sheltered_sides=1 (RdSAP §S5) and
    # the cert's Summary §14.1 Main Heating2 sub-section carries a
    # secondary heating SAP code (691, electric panel) — both required
    # for the mapped chain to reproduce the worksheet to 1e-4.
    pages = _summary_pdf_to_textract_style_pages(_SUMMARY_000490_PDF)
    site_notes = ElmhurstSiteNotesExtractor(pages).extract()
    epc = EpcPropertyDataMapper.from_elmhurst_site_notes(site_notes)

    # Act
    result = calculate_sap_from_inputs(
        cert_to_inputs(epc, prices=SAP_10_2_SPEC_PRICES)
    )

    # Assert
    worksheet_unrounded_sap = 57.3979
    assert abs(result.sap_score_continuous - worksheet_unrounded_sap) < 1e-4
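
The label/value split inside `_summary_pdf_to_textract_style_pages` can
be exercised without poppler installed. A minimal sketch, assuming a
made-up two-field page (the labels and values below are illustrative,
not taken from a real Summary PDF) and using
`layout_text_to_token_stream` as a hypothetical name for the inner loop
pulled out into a pure function:

```python
import re


def layout_text_to_token_stream(layout: str) -> str:
    """Turn `pdftotext -layout`-style text into a label\\nvalue stream.

    Hypothetical extraction of the per-page loop: blank lines are kept
    as empty tokens, and every other line is split on runs of two or
    more whitespace characters.
    """
    tokens: list[str] = []
    for line in layout.splitlines():
        if not line.strip():
            tokens.append("")
            continue
        tokens.extend(p for p in re.split(r"\s{2,}", line.strip()) if p)
    return "\n".join(tokens)


# Invented sample: single spaces inside a label survive, wide gaps split.
sample = "Property Type      House\n\nBuilt Form         Mid-Terrace"
print(layout_text_to_token_stream(sample))
```

Splitting on two or more spaces rather than one is what keeps
multi-word labels such as "Property Type" in a single token.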