Slice S0380.17: map Elmhurst §11 glazing-type labels to SAP10 codes

Closes a systematic +0.02..+0.07 SAP over-prediction on every triple-
glazed cert in cohort 2 (13 of 38) and removes a silent-default
failure mode flagged via cert 3336-2825-9400-0512-8292 (+0.0674 Δ).

Root cause: `_map_elmhurst_window` (datatypes/epc/domain/mapper.py)
was passing the Elmhurst-lodged glazing-type string verbatim into
`SapWindow.glazing_type` (declared `Union[int, str]`). The §5 (66)..
(67) daylight-factor cascade at
`domain/sap10_calculator/worksheet/internal_gains.py:512` requires
`isinstance(w.glazing_type, int)` to look up Table 6b col light g_L —
string lodgings silently fell through to the `_G_LIGHT_DEFAULT = 0.80`
(double-glazed) branch. Cert 3336 (Triple glazed, worksheet "Window,
Triple glazed") got g_L = 0.80 instead of the correct 0.70, lowering
C_daylight from 1.072 to 1.041 → lighting kWh under-predicted by
−4.53 kWh/yr → total fuel cost under by −1.17 GBP → ECF Δ −0.0049 →
SAP continuous over by +0.0674.
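
The fall-through and its knock-on effect can be reproduced in a few
lines. This is a sketch under stated assumptions, not the calculator's
code: the g_L values are the Table 6b figures cited in this message,
the quadratic is assumed to be the SAP 10.2 Appendix L daylighting
correction, the SAP step assumes the linear branch of the rating
(ECF < 3.5), and `G_L_TRIPLE = 0.0489` is a hypothetical aggregate
back-solved from the 1.072 figure rather than read from cert 3336.

```python
from typing import Union

# Table 6b light-transmittance values as cited in this commit message.
G_LIGHT_BY_CODE = {1: 0.90, 2: 0.80, 3: 0.80, 5: 0.80, 6: 0.70, 7: 0.80}
G_LIGHT_DEFAULT = 0.80  # double-glazed fallback


def g_light(glazing_type: Union[int, str]) -> float:
    """Mimic the pre-fix lookup: non-int values silently take the default."""
    if isinstance(glazing_type, int):
        return G_LIGHT_BY_CODE.get(glazing_type, G_LIGHT_DEFAULT)
    return G_LIGHT_DEFAULT


def c_daylight(g_l_ratio: float) -> float:
    """Daylighting correction factor (assumed SAP 10.2 Appendix L form)."""
    if g_l_ratio > 0.095:
        return 0.96
    return 52.2 * g_l_ratio**2 - 9.94 * g_l_ratio + 1.433


# Hypothetical aggregate G_L for an all-triple-glazed dwelling (g_L = 0.70).
G_L_TRIPLE = 0.0489

correct = c_daylight(G_L_TRIPLE)
# G_L is linear in g_L, so the string fall-through scales it by 0.80 / 0.70.
scale = g_light("Triple post or during 2022") / g_light(6)
wrong = c_daylight(G_L_TRIPLE * scale)

# Assumed linear branch of the SAP rating: SAP = 100 - 13.95 * ECF.
sap_delta = -13.95 * -0.0049

print(f"{correct:.3f} {wrong:.3f} {sap_delta:+.3f}")
```

With these assumed inputs the sketch prints `1.072 1.041 +0.068`. The
last figure is close to the +0.0674 quoted above; the small gap is
presumably rounding in the quoted ECF delta.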

Fix: `_ELMHURST_GLAZING_LABEL_TO_SAP10` dict + `_elmhurst_glazing_
type_code` helper translate the Elmhurst Summary §11 lodged strings
to the SAP 10.2 Table U2 integer codes the cascade keys on:

  "Single"                                          → 1
  "Double pre 2002"                                 → 2
  "Double between 2002 and 2021"                    → 3
  "Double with unknown install date"                → 3
  "Double with unknown 16 mm or install date more"  → 3
  "Double post or during 2022"                      → 5
  "Triple post or during 2022"                      → 6
  "Triple post or during"                           → 6  (year-trunc.)
  "Secondary"                                       → 7

Two regex passes strip the layout noise the extractor sometimes folds
into the glazing-type token: a `(?:Part )?value value Proofed Shutters`
prefix (from adjacent column headers) and a ` Summary Information` /
` Alternative wall…` suffix. Verified against the union of cohort-1
(7 certs) + cohort-2 (38 certs) + test-fixture (9 PDFs) glazing
labels: 18 distinct surface forms, all closed by the dict + noise
patterns; one window in cert 2636's Summary_000898.pdf lodged the
year-truncated "Triple post or during" — added as an alias for code 6
per worksheet "Triple glazed" lodging.
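
The two noise passes can be exercised in isolation. The patterns below
are copied from this commit's mapper change; the sample tokens and the
`clean_glazing_label` name are hypothetical, composed from the noise
forms described above rather than taken from any lodged Summary.

```python
import re

# Patterns copied from the mapper change in this commit.
NOISE_PREFIX_RE = re.compile(r"^(?:Part )?value value Proofed Shutters\s+")
NOISE_SUFFIX_RE = re.compile(r"\s+Summary Information$|\s+Alternative wall.*$")


def clean_glazing_label(raw: str) -> str:
    """Strip the column-header prefix, then the trailing-section suffix."""
    without_prefix = NOISE_PREFIX_RE.sub("", raw)
    return NOISE_SUFFIX_RE.sub("", without_prefix).strip()


# Hypothetical noisy tokens (illustration only).
samples = [
    "Part value value Proofed Shutters Double pre 2002",
    "value value Proofed Shutters Triple post or during 2022",
    "Single Summary Information",
    "Triple post or during Alternative wall 1",
    "Secondary",
]
cleaned = [clean_glazing_label(s) for s in samples]
for raw, label in zip(samples, cleaned):
    print(f"{raw!r} -> {label!r}")
```

Each cleaned token is then an exact key of the label dict, which is
why the year-truncated "Triple post or during" needs its own alias.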

Strict-enum gate: `_elmhurst_glazing_type_code` raises
`UnmappedElmhurstLabel("glazing_type", label)` (Slice S0380.15
pattern, extended to the new helper) when the label is not in the
dict, and `UnmappedElmhurstLabel("glazing_type", "<missing>")` when
it is None — surfaces mapper-coverage gaps at extraction time rather
than masking them as a SAP precision floor.
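
A minimal sketch of the gate pattern, for readers who have not seen
the S0380.15 helper. `UnmappedLabelError`, `LABEL_TO_CODE`, and
`strict_code` are hypothetical stand-ins; the real
`UnmappedElmhurstLabel` is only assumed to expose `field` and `value`
attributes, as the new test reads them.

```python
from typing import Dict, Optional


class UnmappedLabelError(ValueError):
    """Hypothetical stand-in for `UnmappedElmhurstLabel`."""

    def __init__(self, field: str, value: str) -> None:
        super().__init__(f"unmapped {field} label: {value!r}")
        self.field = field
        self.value = value


# Two-entry excerpt of the mapping, enough to show the gate.
LABEL_TO_CODE: Dict[str, int] = {
    "Single": 1,
    "Triple post or during 2022": 6,
}


def strict_code(label: Optional[str]) -> int:
    """Return the mapped code, or raise instead of falling back to a default."""
    if label is None:
        raise UnmappedLabelError("glazing_type", "<missing>")
    code = LABEL_TO_CODE.get(label)
    if code is None:
        raise UnmappedLabelError("glazing_type", label)
    return code


try:
    strict_code("Quintuple glazed with helium")
except UnmappedLabelError as exc:
    caught = (exc.field, exc.value)
    print(caught)
```

Raising here trades a hard extraction failure for the guarantee that
no unknown label reaches the cascade as a default.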

Cohort-2 Summary-path delta progression (38 certs):
  bucket          before slice 2    after slice 2
  exact (<1e-4)   11                11
  <0.005          0                 5     ← 9421 +0.0012, 2536 +0.0016, 9370 +0.0017, 0100 +0.0028, 2800 +0.0044
  0.005-0.07      15                10    ← all triple-glazed
  0.07-0.5        5                 5
  0.5-1           4                 4
  1-5             1                 1
  5+              2                 2
  RAISES          0                 0

3336 (user's flag) closes from +0.0674 → +0.0400 — the residual is
the remaining systematic offset the next slice will investigate.

Tests added (3):
- `test_summary_3336_triple_glazed_windows_route_to_code_6` — pins
  the mapper output for the user's flagged cert.
- `test_summary_000474_double_glazed_windows_route_to_code_3` —
  exercises the DG branch + the year-unknown alias mapping.
- `test_summary_mapper_raises_on_unmapped_glazing_type_label` —
  strict-enum coverage gate via mutated site notes.

Tests updated (1):
- `test_first_window_glazing_type` (test_elmhurst_end_to_end.py):
  asserts int code 5 (DG low-E argon — "Double post or during 2022")
  not the string verbatim. The string-passthrough behaviour was
  always a latent bug; this test was the only direct pin on it.

Pyright net-zero per file:
  - datatypes/epc/domain/mapper.py: 32 (baseline 32)
  - backend/documents_parser/tests/test_summary_pdf_mapper_chain.py: 0
  - backend/documents_parser/tests/test_elmhurst_end_to_end.py: 0

Regression baseline: 694 pass + 10 fail (= prior 691 + 10 + 3 new).
Triple-glazed original-cohort certs are now closer to worksheet too;
the ±0.07 chain tests on the original cohort still hold, and a future
slice tightens them once the next-largest residual is closed.

Spec refs:
- SAP 10.2 Table U2 — glazing-type integer enum.
- SAP 10.2 Table 6b col light — light-transmission g_L by glazing
  type (triple 0.70, double-glazed variants 0.80, single 0.90).
- RdSAP 10 §11 Windows — Summary lodging of glazing type as a
  type+install-date phrase.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Khalim Conn-Kowlessar 2026-05-27 23:05:52 +00:00 committed by Jun-te Kim
parent 69668ec634
commit 4cfec00f22
4 changed files with 137 additions and 2 deletions

Binary file not shown.

@@ -222,7 +222,12 @@ class TestWindows:
         assert result.sap_windows[0].orientation == 1
 
     def test_first_window_glazing_type(self, result: EpcPropertyData) -> None:
-        assert result.sap_windows[0].glazing_type == "Double post or during 2022"
+        # SAP 10.2 Table U2 glazing-type code: 5 = double glazed (low-E
+        # argon). The Elmhurst Summary's "Double post or during 2022"
+        # label maps to code 5 via `_ELMHURST_GLAZING_LABEL_TO_SAP10` —
+        # the §5 daylight factor + §6 solar gains key off the integer
+        # not the string.
+        assert result.sap_windows[0].glazing_type == 5
 
     def test_first_window_draught_proofed(self, result: EpcPropertyData) -> None:
         assert result.sap_windows[0].draught_proofed is True

@@ -878,6 +878,72 @@ def test_all_seven_ashp_cohort_certs_extract_without_unmapped_label_raise() -> N
         EpcPropertyDataMapper.from_elmhurst_site_notes(site_notes)
 
 
+def test_summary_3336_triple_glazed_windows_route_to_code_6() -> None:
+    # Arrange — cert 3336-2825-9400-0512-8292's Summary §11 lodges
+    # "Triple post or during 2022" on every window; dr87-0001-000888
+    # confirms "Window, Triple glazed" on every line. The Elmhurst
+    # mapper must surface SAP 10.2 Table U2 code 6 so the §5 (66)..
+    # (67) daylight factor uses Table 6b col light g_L = 0.70 instead
+    # of the default DG g_L = 0.80 — the +0.0274 SAP regression that
+    # this slice closes is driven by the daylight-factor offset that
+    # the default-DG silently masked.
+    pages = _summary_pdf_to_textract_style_pages(
+        _FIXTURES / "Summary_000888.pdf"
+    )
+    site_notes = ElmhurstSiteNotesExtractor(pages).extract()
+
+    # Act
+    epc = EpcPropertyDataMapper.from_elmhurst_site_notes(site_notes)
+
+    # Assert — every window on cert 3336 is triple-glazed → code 6.
+    assert epc.sap_windows, "expected windows on cert 3336"
+    for w in epc.sap_windows:
+        assert w.glazing_type == 6
+
+
+def test_summary_000474_double_glazed_windows_route_to_code_3() -> None:
+    # Arrange — boiler-cohort cert (Summary_000474.pdf) lodges
+    # "Double between 2002 and 2021" / "Double with unknown install
+    # date" on every window. Both route to SAP 10.2 Table U2 code 3
+    # (DG air-filled post-2002) per the
+    # `_ELMHURST_GLAZING_LABEL_TO_SAP10` dict — same Table 6b col
+    # light g_L = 0.80 as the default, so the cascade SAP is unchanged
+    # for these certs, but the integer pin guards against future
+    # cascade consumers that key on the subcode (e.g. a U-value
+    # default lookup for absent `WindowTransmissionDetails`).
+    pages = _summary_pdf_to_textract_style_pages(_SUMMARY_000474_PDF)
+    site_notes = ElmhurstSiteNotesExtractor(pages).extract()
+
+    # Act
+    epc = EpcPropertyDataMapper.from_elmhurst_site_notes(site_notes)
+
+    # Assert
+    assert epc.sap_windows, "expected windows on cert 000474"
+    for w in epc.sap_windows:
+        assert w.glazing_type == 3, (
+            f"expected DG post-2002 code 3, got {w.glazing_type!r}"
+        )
+
+
+def test_summary_mapper_raises_on_unmapped_glazing_type_label() -> None:
+    # Arrange — same strict-coverage gate as the cylinder-size helper
+    # (Slice S0380.15 + S0380.16): silently routing an unknown glazing
+    # variant to a SAP default int hid the +0.05 SAP regression on 13
+    # triple-glazed certs until the cohort-2 first-attempt probe. After
+    # this slice, an unrecognised lodging surfaces immediately at
+    # extraction time.
+    pages = _summary_pdf_to_textract_style_pages(_SUMMARY_000899_PDF)
+    site_notes = ElmhurstSiteNotesExtractor(pages).extract()
+    # Mutate the first window's glazing_type to an unmapped string.
+    site_notes.windows[0].glazing_type = "Quintuple glazed with helium"
+
+    # Act / Assert
+    with pytest.raises(UnmappedElmhurstLabel) as excinfo:
+        EpcPropertyDataMapper.from_elmhurst_site_notes(site_notes)
+    assert excinfo.value.field == "glazing_type"
+    assert excinfo.value.value == "Quintuple glazed with helium"
+
+
 def test_summary_2536_normal_cylinder_routes_to_code_2() -> None:
     # Arrange — cert 2536-2525-0600-0788-2292's Summary §15.1 lodges
     # "Cylinder Size: Normal". The dr87 worksheet lodges "Cylinder

@@ -3201,7 +3201,7 @@ def _map_elmhurst_window(w: ElmhurstWindow) -> SapWindow:
         glazing_gap=w.glazing_gap or "",
         orientation=_elmhurst_orientation_int(w.orientation),
         window_type="Window",
-        glazing_type=w.glazing_type,
+        glazing_type=_elmhurst_glazing_type_code(w.glazing_type),
         # SapWindow's width × height is consumed across §3 (windows_w_per_
         # k), §5 (daylight factor), and §6 (solar gains) — all summed as
         # the area product. The Elmhurst Summary PDF lodges W and H to
@@ -3458,6 +3458,70 @@ def _elmhurst_cylinder_insulation_code(
     return code
 
 
+# Elmhurst Summary §11 "Windows" lodged glazing-type strings mapped to
+# the SAP 10.2 Table U2 glazing-type enum that
+# `domain/sap10_calculator/worksheet/internal_gains._G_LIGHT_BY_GLAZING_CODE`
+# keys ({1: single (g_L=0.90), 2: DG pre-2002 (0.80), 3: DG post-2002
+# (0.80), 5: DG low-E argon (0.80), 6: triple (0.70), 7: secondary
+# (0.80)}). Only "Triple" vs everything-else materially affects the
+# §5 (66)..(67) daylight factor (Table 6b col light: triple 0.70 vs
+# double 0.80) for the Elmhurst path, because the worksheet-lodged
+# U-value and g-value are passed through `WindowTransmissionDetails`
+# directly — but the canonical SAP code is mapped for parity with the
+# API path and forward-compatibility with any future cascade consumer
+# that keys on the code.
+#
+# `_elmhurst_glazing_type_code` strips a layout-noise prefix ("value
+# value Proofed Shutters " or "Part value value Proofed Shutters ")
+# and suffix (" Summary Information", " Alternative wall…") that the
+# extractor occasionally folds into the glazing-type token, then does
+# an exact dict lookup. The folding predates the first cohort-2 probe;
+# fixing the upstream extractor is deferred to a future slice.
+_ELMHURST_GLAZING_LABEL_TO_SAP10: Dict[str, int] = {
+    "Single": 1,
+    "Double pre 2002": 2,
+    "Double between 2002 and 2021": 3,
+    "Double with unknown install date": 3,
+    "Double with unknown 16 mm or install date more": 3,
+    "Double post or during 2022": 5,
+    "Triple post or during 2022": 6,
+    # One window in cert 2636 (Summary_000898.pdf) lodges the year-
+    # truncated form "Triple post or during" — the trailing " 2022 1"
+    # was consumed by an adjacent "Alternative wall" lodging in the
+    # PDF table cell the extractor joined into the glazing-type token.
+    # Treated as the same enum as the full form per worksheet
+    # "Triple glazed" lodging on cert 2636's dr87-0001-000898.pdf.
+    "Triple post or during": 6,
+    "Secondary": 7,
+}
+
+_ELMHURST_GLAZING_LABEL_NOISE_PREFIX_RE: Final[re.Pattern[str]] = re.compile(
+    r"^(?:Part )?value value Proofed Shutters\s+"
+)
+_ELMHURST_GLAZING_LABEL_NOISE_SUFFIX_RE: Final[re.Pattern[str]] = re.compile(
+    r"\s+Summary Information$|\s+Alternative wall.*$"
+)
+
+
+def _elmhurst_glazing_type_code(label: Optional[str]) -> int:
+    """Map an Elmhurst §11 lodged glazing-type label to the SAP 10.2
+    Table U2 integer code. Raises `UnmappedElmhurstLabel` when the
+    label is missing OR present but not in
+    `_ELMHURST_GLAZING_LABEL_TO_SAP10` (the same strict-coverage gate
+    Slice S0380.15 established for cylinder labels: silently routing
+    an unknown variant to a SAP-default int hid the triple-glazed Δ
+    +0.05 SAP regression for 13 cohort-2 certs until extraction was
+    audited end-to-end)."""
+    if label is None:
+        raise UnmappedElmhurstLabel("glazing_type", "<missing>")
+    cleaned = _ELMHURST_GLAZING_LABEL_NOISE_PREFIX_RE.sub("", label)
+    cleaned = _ELMHURST_GLAZING_LABEL_NOISE_SUFFIX_RE.sub("", cleaned).strip()
+    code = _ELMHURST_GLAZING_LABEL_TO_SAP10.get(cleaned)
+    if code is None:
+        raise UnmappedElmhurstLabel("glazing_type", label)
+    return code
+
+
 def _elmhurst_main_heating_category(
     mh: ElmhurstMainHeating, pcdb_index: Optional[int]
 ) -> Optional[int]: