mirror of
https://github.com/Hestia-Homes/Model.git
synced 2026-06-08 11:17:27 +00:00
Two coupled bugs surfaced by cert 001479's mains-gas-fire secondary heating (Summary §14.1 lodges "SAP code 605, Flush fitting live effect gas fire" → fuel 26 mains gas):

1. **Mapper**: `_map_elmhurst_sap_heating` only set `secondary_heating_type` (the SAP code int); `secondary_fuel_type` stayed None. The Summary PDF doesn't lodge the fuel int separately; it has to be derived from the SAP code range. Add `_elmhurst_secondary_fuel_from_sap_code`: codes 601-630 → 26 (mains gas); other codes return None (the cascade defaults to electric, matching cohort 000490 SAP code 691 electric panel).
2. **Cascade**: `_fuel_cost` in cert_to_inputs hardcoded `secondary_high_rate_gbp_per_kwh = other_uses_gbp_per_kwh` (the standard-electricity tariff) regardless of `secondary_fuel_type`. For gas secondaries this charged 1846 kWh/yr at electric rate (£0.132/kWh = £243) instead of gas rate (£0.0348/kWh = £64), a ~£175/yr ECF distortion ≈ 9 SAP points on cert 001479. Route the cost through `table_32_unit_price_p_per_kwh(secondary_fuel)` when lodged.

Worksheet line (242) confirms the gas pricing: `Space heating - secondary 2025.93 3.4800 70.5022`

Cert 001479 chain pin delta narrows: SAP_continuous 61.39 → 70.64 (was −7.62 vs 69.0094, now +1.63, overshooting target by 1.63 SAP). The remaining overshoot maps to the cascade's ~16 W/K HLC undercount (cascade HLP 2.89 vs worksheet 3.13 × TFA), which is work for follow-up slices.

Cohort 6 chain certs still green at 1e-4 (all-electric or no-secondary).

Golden cohort: cert 0300-2747 (mains-gas secondary) SAP residual tightens −7 → +2, the biggest single SAP improvement on the golden cohort to date; pin updated and notes annotated. Other 7 golden certs unchanged (None or electric secondary fuel).

Pyright net-zero (35 baseline each on mapper.py + cert_to_inputs.py).

Chain pin `test_summary_001479_full_chain_sap_matches_worksheet_pdf_exactly` is the load-bearing RED: committed failing per TDD; closes to GREEN once the HLC undercount fix lands.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
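The two fixes described in the commit message can be illustrated with a minimal sketch. The function and constant names below are hypothetical stand-ins, not the repository's actual helpers; only the code range (601-630 → fuel 26), the two tariffs, and the 1846 kWh/yr figure are taken from the message itself.

```python
from typing import Optional

# Values quoted in the commit message; the names are illustrative assumptions.
MAINS_GAS_FUEL = 26
STANDARD_ELECTRICITY_GBP_PER_KWH = 0.132
MAINS_GAS_GBP_PER_KWH = 0.0348


def secondary_fuel_from_sap_code(sap_code: Optional[int]) -> Optional[int]:
    """Derive the secondary fuel from the SAP code range (hypothetical helper).

    Codes 601-630 map to mains gas; anything else returns None so the
    caller can fall back to its electric default.
    """
    if sap_code is not None and 601 <= sap_code <= 630:
        return MAINS_GAS_FUEL
    return None


def secondary_unit_price(fuel: Optional[int]) -> float:
    """Pick the tariff for the secondary heater; None defaults to electricity."""
    if fuel == MAINS_GAS_FUEL:
        return MAINS_GAS_GBP_PER_KWH
    return STANDARD_ELECTRICITY_GBP_PER_KWH


def secondary_annual_cost(sap_code: Optional[int], kwh_per_year: float) -> float:
    """Annual secondary-heating cost in GBP for a given SAP code and demand."""
    fuel = secondary_fuel_from_sap_code(sap_code)
    return kwh_per_year * secondary_unit_price(fuel)


if __name__ == "__main__":
    # Gas fire (code 605) versus the electric default applied before the fix.
    print(round(secondary_annual_cost(605, 1846), 2))
    print(round(secondary_annual_cost(None, 1846), 2))
```

Keeping the fuel derivation separate from the price lookup mirrors the split between the mapper fix and the cascade fix: a code outside the gas range still yields None and is priced as electricity.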
380 lines
17 KiB
Python
"""End-to-end validation for the Elmhurst Summary→EpcPropertyData chain.

The 6 Elmhurst worksheet fixtures in `domain.sap.worksheet.tests`
build their `EpcPropertyData` synthetically — they validate the
calculator + cascade in isolation from the mapper. This file pins
the OTHER half of the chain: `from_elmhurst_site_notes` must produce
a calculator-equivalent `EpcPropertyData` when fed the Summary PDF
the worksheet was generated from. Together with the worksheet
cascade tests, this closes the loop: extractor + mapper + cascade
+ calculator validated end-to-end against the authoritative
Elmhurst documents.

Status: GREEN. For cert U985-0001-000474, this pipeline produces an
unrounded SAP within 0.5 of the worksheet PDF's `62.2584` (line 257).
The cascade itself reproduces Elmhurst's calculator exactly on
hand-built inputs (handbuilt → 62.2584 to 4 d.p.); the remaining
sub-half-point gap from the mapped path is non-load-bearing field
drift (e.g. central_heating_pump_age the Summary PDF doesn't lodge).

Preprocessing: the existing `ElmhurstSiteNotesExtractor` was written
against Textract-style output (label\\nvalue pairs in spatial
reading order). We don't have Textract in the test environment, so
this helper converts `pdftotext -layout` output (label-whitespace-
value on a single line) into the Textract-style sequence the
extractor expects. Test-only preprocessing; production runs through
Textract directly.
"""

from __future__ import annotations

import re
import subprocess
from pathlib import Path

from backend.documents_parser.elmhurst_extractor import ElmhurstSiteNotesExtractor
from datatypes.epc.domain.mapper import EpcPropertyDataMapper
from domain.sap.calculator import calculate_sap_from_inputs
from domain.sap.rdsap.cert_to_inputs import SAP_10_2_SPEC_PRICES, cert_to_inputs

_FIXTURES = Path(__file__).parent / "fixtures"
_SUMMARY_000474_PDF = _FIXTURES / "Summary_000474.pdf"
_SUMMARY_000477_PDF = _FIXTURES / "Summary_000477.pdf"
_SUMMARY_000480_PDF = _FIXTURES / "Summary_000480.pdf"
_SUMMARY_000487_PDF = _FIXTURES / "Summary_000487.pdf"
_SUMMARY_000490_PDF = _FIXTURES / "Summary_000490.pdf"
_SUMMARY_000516_PDF = _FIXTURES / "Summary_000516.pdf"
_SUMMARY_001479_PDF = _FIXTURES / "Summary_001479.pdf"


def _summary_pdf_to_textract_style_pages(pdf_path: Path) -> list[str]:
    """Convert a Summary PDF into the per-page text format the existing
    `ElmhurstSiteNotesExtractor` expects (label\\nvalue sequences).

    `pdftotext -layout` preserves the spatial pairing of label and value
    on each line; we split each line on 2+ spaces to surface the
    label/value tokens, then concatenate them back into a single
    newline-delimited stream per page.
    """
    info = subprocess.run(
        ["pdfinfo", str(pdf_path)], capture_output=True, text=True, check=True
    ).stdout
    m = re.search(r"Pages:\s+(\d+)", info)
    if m is None:
        raise RuntimeError(f"Could not parse page count from {pdf_path}")
    page_count = int(m.group(1))

    pages: list[str] = []
    for i in range(1, page_count + 1):
        layout = subprocess.run(
            [
                "pdftotext", "-layout", "-f", str(i), "-l", str(i),
                str(pdf_path), "-",
            ],
            capture_output=True, text=True, check=True,
        ).stdout
        tokens: list[str] = []
        for line in layout.splitlines():
            if not line.strip():
                tokens.append("")
                continue
            parts = [p for p in re.split(r"\s{2,}", line.strip()) if p]
            tokens.extend(parts)
        pages.append("\n".join(tokens))
    return pages


def test_summary_000474_mapper_produces_three_building_parts() -> None:
    # Arrange: cert U985-0001-000474 is a mid-terrace with 3 building
    # parts (Main + 2 extensions) per the hand-built worksheet fixture
    # at packages/domain/src/domain/sap/worksheet/tests/
    # _elmhurst_worksheet_000474.py. Routing the Summary PDF through
    # extractor + mapper must yield the same count.
    pages = _summary_pdf_to_textract_style_pages(_SUMMARY_000474_PDF)
    site_notes = ElmhurstSiteNotesExtractor(pages).extract()

    # Act
    epc = EpcPropertyDataMapper.from_elmhurst_site_notes(site_notes)

    # Assert
    assert len(epc.sap_building_parts) == 3


def test_summary_000474_mapper_extracts_seven_windows() -> None:
    # Arrange: cert U985-0001-000474's §11 table lodges 7 windows
    # across Main + 1st Extension + 2nd Extension. The legacy
    # Textract-style window parser couldn't anchor on the Summary PDF's
    # tabular layout; the new W/H/Area-plus-Manufacturer anchor pair
    # picks them all up.
    pages = _summary_pdf_to_textract_style_pages(_SUMMARY_000474_PDF)
    site_notes = ElmhurstSiteNotesExtractor(pages).extract()

    # Act
    epc = EpcPropertyDataMapper.from_elmhurst_site_notes(site_notes)

    # Assert
    assert len(epc.sap_windows) == 7


def test_summary_000474_full_chain_sap_matches_worksheet_pdf_exactly() -> None:
    # Arrange: the full Summary→ElmhurstSiteNotes→EpcPropertyData→cascade
    # →SAP path against the U985-0001-000474 worksheet PDF's unrounded
    # SAP rating (line 257: SAP value 62.2584, rating (258) = 62).
    # Because the Summary PDF carries the same source-of-truth data that
    # the hand-built worksheet fixture encodes by hand, and because the
    # cascade matches Elmhurst's calculator to 4 d.p. on those
    # hand-built inputs, this end-to-end path MUST produce the same
    # unrounded SAP value. Any non-trivial drift = a real mapper bug
    # dropping information from the Summary PDF.
    pages = _summary_pdf_to_textract_style_pages(_SUMMARY_000474_PDF)
    site_notes = ElmhurstSiteNotesExtractor(pages).extract()
    epc = EpcPropertyDataMapper.from_elmhurst_site_notes(site_notes)

    # Act
    result = calculate_sap_from_inputs(
        cert_to_inputs(epc, prices=SAP_10_2_SPEC_PRICES)
    )

    # Assert: within the same 1e-4 tolerance the other Elmhurst worksheet
    # tests pin against. 0.5 is the API-cert residual tolerance (the API
    # publishes rounded SAP integers, so up to half a SAP point is just
    # rounding); for Elmhurst worksheet inputs the cascade reproduces
    # Elmhurst exactly and we expect identical outputs.
    worksheet_unrounded_sap = 62.2584
    assert abs(result.sap_score_continuous - worksheet_unrounded_sap) < 1e-4


def test_summary_000477_full_chain_sap_matches_worksheet_pdf_exactly() -> None:
    # Arrange: cert U985-0001-000477 is a single-bp mid-terrace with
    # a 15.06 m² Room-in-Roof storey and zero baths lodged. Worksheet
    # PDF lodges unrounded SAP 65.0057. Drives the chain through the
    # `RoomInRoof.detailed_surfaces` cascade with stud walls @ 100mm
    # Mineral, two uninsulated slopes, two party gable walls, plus the
    # RR/storey-area suspended-timber-floor heuristic (RIR < storey →
    # 0.2 ACH floor infiltration).
    pages = _summary_pdf_to_textract_style_pages(_SUMMARY_000477_PDF)
    site_notes = ElmhurstSiteNotesExtractor(pages).extract()
    epc = EpcPropertyDataMapper.from_elmhurst_site_notes(site_notes)

    # Act
    result = calculate_sap_from_inputs(
        cert_to_inputs(epc, prices=SAP_10_2_SPEC_PRICES)
    )

    # Assert
    worksheet_unrounded_sap = 65.0057
    assert abs(result.sap_score_continuous - worksheet_unrounded_sap) < 1e-4


def test_summary_000480_full_chain_sap_matches_worksheet_pdf_exactly() -> None:
    # Arrange: cert U985-0001-000480 is a mid-terrace with main + one
    # extension and a 19.83 m² room-in-roof storey. Worksheet PDF lodges
    # unrounded SAP 61.2986 on line "SAP value". The Detailed §3.10 RR
    # surfaces (2 stud walls @ 0mm + 2 slopes @ 0mm + 1 flat ceiling @
    # 0mm + 2 party gables) plus zero baths drive the chain to 1e-4.
    pages = _summary_pdf_to_textract_style_pages(_SUMMARY_000480_PDF)
    site_notes = ElmhurstSiteNotesExtractor(pages).extract()
    epc = EpcPropertyDataMapper.from_elmhurst_site_notes(site_notes)

    # Act
    result = calculate_sap_from_inputs(
        cert_to_inputs(epc, prices=SAP_10_2_SPEC_PRICES)
    )

    # Assert
    worksheet_unrounded_sap = 61.2986
    assert abs(result.sap_score_continuous - worksheet_unrounded_sap) < 1e-4


def test_summary_000487_full_chain_sap_matches_worksheet_pdf_exactly() -> None:
    # Arrange: cert U985-0001-000487 is an enclosed-mid-terrace with
    # main bp + 1st extension, a 21.03 m² Room-in-Roof, an electric
    # shower, and a 1.43 m² Timber Frame alternative wall on the
    # extension. Worksheet PDF lodges unrounded SAP 61.6431. The mapped
    # chain has to thread the alt-wall U-value cascade (Thickness
    # Unknown → cascade falls back to age-band default U=1.9 for thin
    # timber walls) plus the §11 layout variant where the frame_factor
    # appears unprefixed on its own line (no "PVC"/"Wood" frame_type).
    pages = _summary_pdf_to_textract_style_pages(_SUMMARY_000487_PDF)
    site_notes = ElmhurstSiteNotesExtractor(pages).extract()
    epc = EpcPropertyDataMapper.from_elmhurst_site_notes(site_notes)

    # Act
    result = calculate_sap_from_inputs(
        cert_to_inputs(epc, prices=SAP_10_2_SPEC_PRICES)
    )

    # Assert
    worksheet_unrounded_sap = 61.6431
    assert abs(result.sap_score_continuous - worksheet_unrounded_sap) < 1e-4


def test_summary_000516_full_chain_sap_matches_worksheet_pdf_exactly() -> None:
    # Arrange: cert U985-0001-000516 is a mid-terrace with main bp +
    # 19.02 m² room-in-roof. Worksheet PDF lodges unrounded SAP 62.7937.
    # The §11 table mixes 5 vertical windows (U=2.80) with 1 roof
    # window (U=3.10 in cert, U=3.40 Table 24 raw); the mapper
    # discriminates by `U > 3.0` and routes the high-U entry to
    # `sap_roof_windows` so its solar gains feed §6 with the right
    # pitch (45°) and Table-24 U-value.
    pages = _summary_pdf_to_textract_style_pages(_SUMMARY_000516_PDF)
    site_notes = ElmhurstSiteNotesExtractor(pages).extract()
    epc = EpcPropertyDataMapper.from_elmhurst_site_notes(site_notes)

    # Act
    result = calculate_sap_from_inputs(
        cert_to_inputs(epc, prices=SAP_10_2_SPEC_PRICES)
    )

    # Assert
    worksheet_unrounded_sap = 62.7937
    assert abs(result.sap_score_continuous - worksheet_unrounded_sap) < 1e-4


def test_summary_000490_full_chain_sap_matches_worksheet_pdf_exactly() -> None:
    # Arrange: cert U985-0001-000490 is an end-terrace with main +
    # 1st extension. The worksheet PDF lodges unrounded SAP 57.3979.
    # End-terrace built-form drives sheltered_sides=1 (RdSAP §S5) and
    # the cert's Summary §14.1 Main Heating2 sub-section carries a
    # secondary heating SAP code (691, electric panel); both are
    # required for the mapped chain to reproduce the worksheet to 1e-4.
    pages = _summary_pdf_to_textract_style_pages(_SUMMARY_000490_PDF)
    site_notes = ElmhurstSiteNotesExtractor(pages).extract()
    epc = EpcPropertyDataMapper.from_elmhurst_site_notes(site_notes)

    # Act
    result = calculate_sap_from_inputs(
        cert_to_inputs(epc, prices=SAP_10_2_SPEC_PRICES)
    )

    # Assert
    worksheet_unrounded_sap = 57.3979
    assert abs(result.sap_score_continuous - worksheet_unrounded_sap) < 1e-4


def test_summary_001479_mapper_extensions_count_matches_extension_bps() -> None:
    # Arrange: cert 0535-9020-6509-0821-6222 (Summary_001479) is the first
    # cohort cert with an actual GOV.UK API counterpart. Worksheet PDF
    # lodges Main + Extension 1 + Extension 2 (3 building parts, 2
    # extensions). Pre-slice the Elmhurst mapper hard-coded
    # `extensions_count=0` regardless of survey.extensions; this asserts
    # the count flows through.
    pages = _summary_pdf_to_textract_style_pages(_SUMMARY_001479_PDF)
    site_notes = ElmhurstSiteNotesExtractor(pages).extract()

    # Act
    epc = EpcPropertyDataMapper.from_elmhurst_site_notes(site_notes)

    # Assert
    assert epc.extensions_count == 2
    assert len(epc.sap_building_parts) == 3


def test_summary_001479_main_party_wall_construction_is_cavity_unfilled() -> None:
    # Arrange: cert 001479 Main §7 Walls lodges "Party Wall Type: CU
    # Cavity masonry unfilled". The Elmhurst leading-code map previously
    # only knew "S" and "C"; "CU" fell through to None, which made the
    # cascade default to U=0.25 instead of the worksheet's lodged U=0.50.
    # The fix adds "CU" → SAP10 wall_construction code 4 (WALL_CAVITY),
    # which `u_party_wall` resolves to U=0.50, matching the worksheet's
    # §3 `Party walls Main … 0.50` row.
    pages = _summary_pdf_to_textract_style_pages(_SUMMARY_001479_PDF)
    site_notes = ElmhurstSiteNotesExtractor(pages).extract()

    # Act
    epc = EpcPropertyDataMapper.from_elmhurst_site_notes(site_notes)

    # Assert
    assert epc.sap_building_parts[0].party_wall_construction == 4


def test_summary_001479_ext2_floor_is_exposed_to_external_air() -> None:
    # Arrange: cert 001479 Ext2 §9 lodges "Location: E To external air",
    # a cantilevered exposed timber floor (the upper-storey extension
    # over the back garden). The worksheet's §3 row `Exposed floor Ext2
    # … 1.92, 1.20, 1.20` pins this as U=1.20 via Table 20. Pre-slice the
    # mapper only routed "U Above unheated space" through
    # `is_exposed_floor=True`; "E To external air" fell through to the
    # BS EN ISO 13370 ground-floor cascade, dropping the lodged exposure
    # entirely.
    pages = _summary_pdf_to_textract_style_pages(_SUMMARY_001479_PDF)
    site_notes = ElmhurstSiteNotesExtractor(pages).extract()

    # Act
    epc = EpcPropertyDataMapper.from_elmhurst_site_notes(site_notes)

    # Assert
    ext2 = epc.sap_building_parts[2]
    assert ext2.floor_type == "To external air"
    assert ext2.sap_floor_dimensions[0].is_exposed_floor is True


def test_summary_001479_ext2_sloping_ceiling_roof_uninsulated_for_pre_1950() -> None:
    # Arrange: cert 001479 Ext2 §8 lodges "Type: PS Pitched, sloping
    # ceiling" + "Insulation Thickness: As Built" + age band C (1930-49).
    # Original 1930s construction had no sloping-ceiling insulation;
    # worksheet §3 `External roof Ext2 … 2.30` pins U=2.30 (uninsulated
    # Table 16 row 0). Pre-slice the mapper passed thickness=None through,
    # routing to `u_roof`'s pitched-roof Table 18 col 1 default (0.40 for
    # age C, assumes loft-joist retrofit), the wrong geometry for PS.
    # Ext1's PS roof at age M leaves thickness=None (modern build,
    # cascade default U=0.15 matches worksheet).
    pages = _summary_pdf_to_textract_style_pages(_SUMMARY_001479_PDF)
    site_notes = ElmhurstSiteNotesExtractor(pages).extract()

    # Act
    epc = EpcPropertyDataMapper.from_elmhurst_site_notes(site_notes)

    # Assert
    assert epc.sap_building_parts[2].roof_insulation_thickness == 0
    assert epc.sap_building_parts[1].roof_insulation_thickness is None


def test_summary_001479_secondary_heating_routes_mains_gas_fuel() -> None:
    # Arrange: cert 001479 §14.1 Main Heating2 lodges "Secondary Heating
    # Code: SAP code 605, Flush fitting live effect gas fire, sealed to
    # chimney". The Summary surfaces only the SAP code (605); the fuel
    # type 26 (mains gas) must be derived from the code range so the
    # `_fuel_cost` orchestrator's `secondary_high_rate_gbp_per_kwh`
    # picks up Table 32's gas tariff (£0.0348/kWh) rather than the
    # default standard-electricity tariff (£0.132/kWh). Worksheet line
    # (242) "Space heating - secondary … 3.4800 70.5022" confirms gas
    # pricing.
    pages = _summary_pdf_to_textract_style_pages(_SUMMARY_001479_PDF)
    site_notes = ElmhurstSiteNotesExtractor(pages).extract()

    # Act
    epc = EpcPropertyDataMapper.from_elmhurst_site_notes(site_notes)

    # Assert
    assert epc.sap_heating.secondary_heating_type == 605
    assert epc.sap_heating.secondary_fuel_type == 26


def test_summary_001479_full_chain_sap_matches_worksheet_pdf_exactly() -> None:
    # Arrange: cert 001479 (Summary_001479.pdf / P960-0001-001479.pdf)
    # is the first cohort cert with a real GOV.UK EPB API counterpart
    # (cert ref 0535-9020-6509-0821-6222). Worksheet PDF line "SAP value"
    # lodges unrounded SAP **69.0094** (rating C 69, also the
    # API-published integer). This is the load-bearing forcing function
    # for the API↔Elmhurst parity workstream: any drift from 1e-4 means a
    # mapper gap, not a calculator bug; the cohort 6 cert cascades all
    # reproduce Elmhurst exactly at 1e-4 on hand-built fixtures.
    #
    # Source-data caveat (documented for future debuggers): Summary §3
    # lodges Ext1 age band as "M 2023 onwards"; the worksheet header
    # records "Ext1: L". Likely assessor data-entry inconsistency. The
    # mapper trusts the Summary (its source of truth); accept whatever
    # residual the M vs L disagreement produces.
    pages = _summary_pdf_to_textract_style_pages(_SUMMARY_001479_PDF)
    site_notes = ElmhurstSiteNotesExtractor(pages).extract()
    epc = EpcPropertyDataMapper.from_elmhurst_site_notes(site_notes)

    # Act
    result = calculate_sap_from_inputs(
        cert_to_inputs(epc, prices=SAP_10_2_SPEC_PRICES)
    )

    # Assert: 1e-4 pin, no widening, no xfail (project memory
    # `feedback_zero_error_strict`).
    worksheet_unrounded_sap = 69.0094
    assert abs(result.sap_score_continuous - worksheet_unrounded_sap) < 1e-4