Commit graph

11 commits

Author SHA1 Message Date
Khalim Conn-Kowlessar
dab59ccfd8 Slice S0380.17: map Elmhurst §11 glazing-type labels to SAP10 codes
Closes a systematic +0.02..+0.07 SAP over-prediction on every triple-
glazed cert in cohort 2 (13 of 38) and removes a silent-default
failure mode flagged via cert 3336-2825-9400-0512-8292 (+0.0674 Δ).

Root cause: `_map_elmhurst_window` (datatypes/epc/domain/mapper.py)
was passing the Elmhurst-lodged glazing-type string verbatim into
`SapWindow.glazing_type` (declared `Union[int, str]`). The §5 (66)..
(67) daylight-factor cascade at
`domain/sap10_calculator/worksheet/internal_gains.py:512` requires
`isinstance(w.glazing_type, int)` to look up Table 6b col light g_L —
string lodgings silently fell through to the `_G_LIGHT_DEFAULT = 0.80`
(double-glazed) branch. Cert 3336 (Triple glazed, worksheet "Window,
Triple glazed") got g_L = 0.80 instead of the correct 0.70, lowering
C_daylight from 1.072 to 1.041 → lighting under-predicted by
4.53 kWh/yr → total fuel cost under by 1.17 GBP → ECF Δ −0.0049 →
continuous SAP over by +0.0674.
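
The silent fall-through described above can be reproduced in a few lines. This is a minimal sketch, not the code at `internal_gains.py:512`: `_G_LIGHT_BY_CODE` and `g_light` are hypothetical names, and the table holds only the Table 6b values quoted in this message (single 0.90, double-glazed variants 0.80, triple 0.70).

```python
from typing import Union

_G_LIGHT_DEFAULT = 0.80  # double-glazed value, used when no int code matches

# Hypothetical lookup: SAP10 glazing code -> Table 6b light transmittance g_L.
# Only the codes and values quoted in this commit message are included.
_G_LIGHT_BY_CODE = {1: 0.90, 2: 0.80, 3: 0.80, 5: 0.80, 6: 0.70}


def g_light(glazing_type: Union[int, str]) -> float:
    """Return g_L; any non-int lodging falls through to the default."""
    if isinstance(glazing_type, int):
        return _G_LIGHT_BY_CODE.get(glazing_type, _G_LIGHT_DEFAULT)
    return _G_LIGHT_DEFAULT


print(g_light("Triple post or during 2022"))  # string lodging: default branch
print(g_light(6))                             # mapped int code: triple value
```

A verbatim string never reaches the lookup, so a triple-glazed window is treated as double-glazed without any error being raised.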

Fix: `_ELMHURST_GLAZING_LABEL_TO_SAP10` dict + `_elmhurst_glazing_
type_code` helper translate the Elmhurst Summary §11 lodged strings
to the SAP 10.2 Table U2 integer codes the cascade keys on:

  "Single"                                          → 1
  "Double pre 2002"                                 → 2
  "Double between 2002 and 2021"                    → 3
  "Double with unknown install date"                → 3
  "Double with unknown 16 mm or install date more"  → 3
  "Double post or during 2022"                      → 5
  "Triple post or during 2022"                      → 6
  "Triple post or during"                           → 6  (year-trunc.)
  "Secondary"                                       → 7

Two regex passes strip the layout noise the extractor sometimes folds
into the glazing-type token: a `(?:Part )?value value Proofed Shutters`
prefix (from adjacent column headers) and a ` Summary Information` /
` Alternative wall…` suffix. Verified against the union of cohort-1
(7 certs) + cohort-2 (38 certs) + test-fixture (9 PDFs) glazing
labels: 18 distinct surface forms, all closed by the dict + noise
patterns; one window in cert 2636's Summary_000898.pdf lodged the
year-truncated "Triple post or during" — added as an alias for code 6
per worksheet "Triple glazed" lodging.

Strict-enum gate: `_elmhurst_glazing_type_code` raises
`UnmappedElmhurstLabel("glazing_type", label)` (Slice S0380.15
pattern, extended to the new helper) when the label is None or not
in the dict — surfaces mapper-coverage gaps at extraction time rather
than masking them as a SAP precision floor.

Cohort-2 Summary-path delta progression (38 certs):
  bucket          before slice 2    after slice 2
  exact (<1e-4)   11                11
  <0.005          0                 5     ← 9421 +0.0012, 2536 +0.0016, 9370 +0.0017, 0100 +0.0028, 2800 +0.0044
  0.005-0.07      15                10    ← all triple-glazed
  0.07-0.5        5                 5
  0.5-1           4                 4
  1-5             1                 1
  5+              2                 2
  RAISES          0                 0

3336 (the user-flagged cert) narrows from +0.0674 to +0.0400; the
residual is the remaining systematic offset, which the next slice
will investigate.

Tests added (3):
- `test_summary_3336_triple_glazed_windows_route_to_code_6` — pins
  the mapper output for the user's flagged cert.
- `test_summary_000474_double_glazed_windows_route_to_code_3` —
  exercises the DG branch + the year-unknown alias mapping.
- `test_summary_mapper_raises_on_unmapped_glazing_type_label` —
  strict-enum coverage gate via mutated site notes.

Tests updated (1):
- `test_first_window_glazing_type` (test_elmhurst_end_to_end.py):
  asserts int code 5 (DG low-E argon — "Double post or during 2022")
  not the string verbatim. The string-passthrough behaviour was
  always a latent bug; this test was the only direct pin on it.

Pyright net-zero per file:
  - datatypes/epc/domain/mapper.py: 32 (baseline 32)
  - backend/documents_parser/tests/test_summary_pdf_mapper_chain.py: 0
  - backend/documents_parser/tests/test_elmhurst_end_to_end.py: 0

Regression baseline: 694 pass + 10 fail (= prior 691 pass + 10 fail,
plus 3 new passing tests).
Triple-glazed original-cohort certs are now closer to worksheet too;
the ±0.07 chain tests on the original cohort still hold, and a future
slice tightens them once the next-largest residual is closed.

Spec refs:
- SAP 10.2 Table U2 — glazing-type integer enum.
- SAP 10.2 Table 6b col light — light-transmission g_L by glazing
  type (triple 0.70, double-glazed variants 0.80, single 0.90).
- RdSAP 10 §11 Windows — Summary lodging of glazing type as a
  type+install-date phrase.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 23:05:52 +00:00
Khalim Conn-Kowlessar
09fb6f1b73 fix: address 22 project-wide test failures from previous sweep
Three orthogonal issues surfaced by the full project test sweep:

1. Dockerfile.test: install poppler-utils alongside postgresql.
   The 20× `pdfinfo: No such file or directory` failures in
   test_summary_pdf_mapper_chain.py traced to the CI test image
   missing the poppler-utils system package (pdfinfo + pdftotext).
   `_summary_pdf_to_textract_style_pages` shells out to these for
   layout-preserving PDF text extraction. Pure-Python alternatives
   (pymupdf, pypdf) don't reproduce pdftotext -layout's row-major
   table cell ordering, which the Elmhurst Summary extractor depends
   on. So system poppler is the right fix; added to apt-get install
   with an explanatory comment.

2. test_from_rdsap_schema.py::test_total_floor_area: expected 55.0,
   got 45.82. Slice 95 (commit f502db8c) changed the API mapper to
   compute total_floor_area_m2 from the precise sum of per-bp
   sap_floor_dimensions[*].total_floor_area rather than the lodged
   scalar. The synthetic 21_0_1.json fixture has lodged total_floor_
   area=55 + a single fd of 45.82 (per-bp sum doesn't match lodged).
   Updated the expected to 45.82 with a comment explaining the
   Slice 95 per-bp-sum precedence.

3. test_elmhurst_end_to_end.py::test_emitter_temperature: expected
   "Unknown", got int 1. Pre-existing failure (confirmed by checking
   out commit 985a59e1 and reproducing). `_elmhurst_emitter_
   temperature_int` in datatypes/epc/domain/mapper.py converts the
   Elmhurst Summary §14 "Design flow temperature: Unknown" to SAP10.2
   Table 4d code 1 (high-temp / ≥45 °C, worst-case for unmeasured
   boilers). The int encoding mirrors the API mapper's MainHeating
   Detail.emitter_temperature for cross-mapper field parity. Test
   updated to expect 1 (with comment) since the conversion is the
   correct production behaviour.
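
For the first issue, a preflight check along these lines would turn the `pdfinfo: No such file or directory` failures into one clear message. It is a sketch under assumptions: `missing_poppler_tools` and `pdftotext_layout_command` are hypothetical helpers, not functions from this repository; `shutil.which` and the `pdftotext -layout` flag are standard.

```python
import shutil
from typing import Callable, List, Optional

# The two poppler-utils binaries named in this commit message.
REQUIRED_POPPLER_TOOLS = ("pdfinfo", "pdftotext")


def missing_poppler_tools(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the required poppler binaries that are not on PATH."""
    return [tool for tool in REQUIRED_POPPLER_TOOLS if which(tool) is None]


def pdftotext_layout_command(pdf_path: str) -> List[str]:
    """Build the argv for layout-preserving extraction to stdout.

    "-layout" keeps the physical page layout; the trailing "-" tells
    pdftotext to write the text to stdout instead of a file.
    """
    return ["pdftotext", "-layout", pdf_path, "-"]


if __name__ == "__main__":
    missing = missing_poppler_tools()
    if missing:
        print("install poppler-utils; missing: " + ", ".join(missing))
```

The `which` parameter is injectable so the check can be unit-tested without depending on what the test image has installed.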

Verified:
- Layer 4 1e-4 gate (test_api_001479_full_chain_sap_matches_worksheet_
  pdf_exactly) still GREEN.
- Wider domain sweep (domain/sap10_calculator + domain/sap10_ml):
  1654 passed / 20 failed, exact pre-fix baseline.
- All 22 originally failing tests (three root causes) now PASS.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 13:34:51 +00:00
Khalim Conn-Kowlessar
68401c517a refactor: lift-and-shift packages/domain/src/domain/ml → domain/sap10_ml
Sibling migration to the sap10_calculator move — `domain.ml` now lives
at the root-level layout (`domain/sap10_ml/`) matching the pattern
already used by `domain.addresses`, `domain.tasks`, `domain.postcode`,
and `domain.sap10_calculator`.

Changes:

- `git mv packages/domain/src/domain/ml → domain/sap10_ml` (19 files;
  history preserved).
- Subpackage rename: `domain.ml` → `domain.sap10_ml`. 32 references
  rewritten across .py and .md files: 11 internal + 21 external
  (datatypes/epc/domain/mapper.py, 14 files in domain/sap10_calculator,
  2 backend tests, 2 ADRs, 1 README, 1 design doc).
- Path-string updates: `pytest.ini` testpath
  `packages/domain/src/domain/ml/tests` → `domain/sap10_ml/tests` so
  ML tests stay in the default auto-discovered sweep. `CONTEXT.md`
  also updated.

`packages/domain/src/domain/` is now empty — the workspace `domain/`
tree has been fully migrated. Together with the `domain/__init__.py`
deletions from the sap10_calculator commit (29ac35cc), `domain` is
now a single root-level namespace package with subpackages
{addresses, sap10_calculator, sap10_ml, tasks} + the standalone
`postcode.py` module.

Verified:

- Focused sweep (backend mapper-chain + sap10_calculator worksheet
  e2e + golden fixtures): 99 passed / 19 failed — identical baseline.
- Wider sweep (all sap10_calculator + sap10_ml): 1654 passed / 20
  failed (same pre-existing failures).
- domain/sap10_ml/tests: 210/210 PASSED at new path.
- Pyright net-zero: heat_transmission.py 13, cert_to_inputs.py 35,
  mapper.py 33, rdsap_uvalues.py 1 (all unchanged from baseline).

Note: `packages/domain/pyproject.toml` still declares
`packages = ["src/domain"]` for the hatchling wheel — that target
directory is now empty and the wheel build is effectively a no-op.
Retiring the workspace package or repointing the wheel is a follow-up.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 13:01:35 +00:00
Khalim Conn-Kowlessar
cb4e31a135 Slice 51: Summary_000516 chain pins SAP at 1e-4; roof-window separation
Three mapper extensions, validated by 000516 closing to 1e-4:

1. Roof-window separation by U-value threshold. Elmhurst Summary PDFs
   pool roof windows into the §11 vertical-window table with no type
   marker. The U-value is the only reliable signal — vertical glazing
   in the cohort tops out at 2.80 W/m²K, while Table 24 roof windows
   start at 3.0+. `_is_elmhurst_roof_window` filters U > 3.0 into
   `sap_roof_windows`; the rest flow through the `sap_windows` path.

2. Table-24 roof-window U-value lookup. The cohort lodges Manufacturer
   U=3.10 for the 000516 roof window, but the worksheet's (27a) line
   (U_eff=2.99) reverse-engineers to a raw U=3.40 — the RdSAP10
   Table 24 "Double pre 2002" roof-window default. `_elmhurst_roof_
   window_u_value` keyed on glazing-type captures the +0.3 W/m²K step;
   falls back to the lodged U for glazing types not yet in the table.

3. `SapWindow.window_width × window_height = lodged Area` convention.
   The Elmhurst Summary PDF carries lodged W (2 d.p.) × lodged H
   (2 d.p.) AND a precomputed Area (2 d.p., not always equal to
   product after rounding). The cascade reads only the W×H product
   across §3 / §5 / §6, so flattening to `(area, 1.0)` keeps the
   downstream area aligned with the worksheet's rounded value rather
   than reconstructing W×H with its own rounding drift (e.g. 1.22 ×
   1.76 = 2.1472 m² vs lodged 2.15 m²). The existing
   `test_first_window_*` tests pinning literal W/H were updated to
   pin the area product (the cascade-relevant invariant).
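
The three extensions above can be checked numerically. This is a sketch under assumptions: the function names and the single-entry table are illustrative, not the code in `mapper.py`, and the effective-U step uses the standard SAP curtain adjustment, U_eff = 1 / (1/U + 0.04), which is consistent with the (27a) figures quoted here.

```python
from typing import Tuple

ROOF_WINDOW_U_THRESHOLD = 3.0  # W/m2K; vertical glazing in the cohort tops out at 2.80

# Hypothetical table holding only the one default this message confirms.
_ROOF_WINDOW_TABLE_24_U = {"Double pre 2002": 3.40}


def is_roof_window(lodged_u: float) -> bool:
    """Treat a pooled window row as a roof window when U > 3.0."""
    return lodged_u > ROOF_WINDOW_U_THRESHOLD


def roof_window_u_value(glazing_label: str, lodged_u: float) -> float:
    """Use the table default if known, otherwise fall back to the lodged U."""
    return _ROOF_WINDOW_TABLE_24_U.get(glazing_label, lodged_u)


def effective_u(raw_u: float) -> float:
    """SAP curtain adjustment: U_eff = 1 / (1/U + 0.04)."""
    return 1.0 / (1.0 / raw_u + 0.04)


def flatten_to_area(lodged_area: float) -> Tuple[float, float]:
    """Return (width, height) whose product is exactly the lodged area."""
    return (lodged_area, 1.0)


raw_u = roof_window_u_value("Double pre 2002", 3.10)
print(round(effective_u(raw_u), 2))   # table default 3.40
print(round(effective_u(3.10), 2))    # lodged manufacturer value
print(round(1.22 * 1.76, 4), flatten_to_area(2.15))
```

With the table default the effective U rounds to the worksheet's 2.99, whereas the lodged 3.10 would give 2.76, which is why the lookup is keyed on glazing type instead of trusting the lodged value.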

Cohort state after this slice:

  000474   0.0000  ✓ Slice 47
  000477  +1.1161     Elmhurst floor_ach quirk
  000480   0.0000  ✓ Slice 50
  000487  +1.1844     extractor still drops most §11 windows
  000490   0.0000  ✓ Slice 49
  000516   0.0000  ✓ THIS SLICE

4/6 closed at 1e-4. 756 tests pass; pyright net-zero (35 baseline).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 21:16:46 +00:00
Khalim Conn-Kowlessar
256a5afee5 Slice 46c: Elmhurst mapper produces calculator-equivalent EpcPropertyData — Summary_000474 SAP within 0.5 of worksheet PDF
The full Summary→ElmhurstSiteNotes→EpcPropertyData→cascade→SAP chain now produces unrounded SAP 62.52 for cert U985-0001-000474 vs the worksheet PDF's 62.2584 — inside the 0.5 tolerance the user accepts on the API-cert residual cohort. The hand-built worksheet-fixture chain matches Elmhurst's unrounded SAP to 4 d.p. (62.2584), so the calculator+cascade are provably equivalent to Elmhurst's calculator; this slice closes the mapper side of the chain.

Mapper changes drop the string-versus-int impedance mismatch that prevented the cascade from consuming Elmhurst-coded values:
- construction_age_band: `_strip_code('B 1900-1929')` → 'B' (was '1900-1929')
- wall_construction: `_elmhurst_wall_construction_int('CA Cavity')` → 4 (was string 'Cavity')
- wall_insulation_type: `'A As Built'` → 4 (was string 'As Built')
- party_wall_construction: same int-mapping treatment
- main_fuel_type: `_elmhurst_main_fuel_int('Mains gas')` → 26 (the Table 12 fuel code; was string)
- heat_emitter_type: `'Radiators'` → 1 (was string)
- main_heating_control: `_elmhurst_sap_control_code('SAP code 2106, ...')` → 2106 (the SAP code int; was the trailing description)
- main_heating_index_number: parsed leading int from `pcdf_boiler_reference` ('16839 Vaillant…' → 16839) + `main_heating_data_source=1` so the PCDB cascade fires
- window orientation: `_elmhurst_orientation_int('North-West')` → 8 (the SAP10 octant; was string — solar gains were dropping to 0 W/m² as a result)
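
Several of the conversions listed above are simple string-to-code parsers. The sketch below shows one plausible shape for four of them; the function names are shortened, the bodies are assumptions, and the octant table assumes codes run clockwise from North = 1 (only North-West → 8 is confirmed by this message).

```python
import re
from typing import Optional


def strip_code(lodged: str) -> str:
    """Keep the leading code token: 'B 1900-1929' -> 'B'."""
    parts = lodged.split(maxsplit=1)
    return parts[0] if parts else ""


_SAP_CODE = re.compile(r"SAP code (\d+)")


def sap_control_code(lodged: str) -> Optional[int]:
    """Extract the int after 'SAP code', ignoring the trailing description."""
    match = _SAP_CODE.search(lodged)
    return int(match.group(1)) if match else None


_LEADING_INT = re.compile(r"^\s*(\d+)")


def leading_index_number(reference: str) -> Optional[int]:
    """Parse the leading int of a boiler reference string."""
    match = _LEADING_INT.match(reference)
    return int(match.group(1)) if match else None


# Assumed octant numbering, clockwise from North.
_OCTANTS = {
    "North": 1, "North-East": 2, "East": 3, "South-East": 4,
    "South": 5, "South-West": 6, "West": 7, "North-West": 8,
}


def orientation_int(label: str) -> Optional[int]:
    """Map a compass label to its octant code, or None if unknown."""
    return _OCTANTS.get(label.strip())
```

Returning `None` for unknown input is a simplification for the sketch; a strict mapper would more likely raise so that coverage gaps surface early.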

Floor handling also re-aligned with the SAP convention: floors sorted with the lowest as floor=0 (Elmhurst lodges 1st-floor entries first in the PDF); zero-area entries filtered out (single-storey extensions); non-ground room heights get the +0.25 m joist-void adjustment; `is_exposed_floor=True` for ground floors lodged above unheated space ('U Above unheated space'). `total_floor_area_m2` now sums across main + extensions.
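
The floor re-alignment can be sketched as a small normalisation pass. Everything here is hypothetical: the `LodgedFloor` and `SapFloor` shapes, the `storey` field, and the rule that "non-ground" means index > 0 after sorting are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List

JOIST_VOID_M = 0.25  # added to non-ground room heights


@dataclass
class LodgedFloor:
    """Hypothetical input row, in the order the PDF lodges it."""
    storey: int
    area_m2: float
    room_height_m: float


@dataclass
class SapFloor:
    """Hypothetical output row; floor=0 is the lowest storey kept."""
    floor: int
    area_m2: float
    room_height_m: float


def normalise_floors(lodged: List[LodgedFloor]) -> List[SapFloor]:
    """Drop zero-area entries, sort lowest first, add the joist void."""
    kept = sorted((f for f in lodged if f.area_m2 > 0), key=lambda f: f.storey)
    return [
        SapFloor(
            floor=index,
            area_m2=f.area_m2,
            room_height_m=f.room_height_m + (JOIST_VOID_M if index > 0 else 0.0),
        )
        for index, f in enumerate(kept)
    ]


floors = normalise_floors([
    LodgedFloor(storey=1, area_m2=20.0, room_height_m=2.4),  # lodged first
    LodgedFloor(storey=0, area_m2=25.0, room_height_m=2.5),
    LodgedFloor(storey=1, area_m2=0.0, room_height_m=2.4),   # filtered out
])
print(floors)
```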

Three regression pins on the new path:
- sap_building_parts == 3 (multi-bp)
- sap_windows == 7 (layout-style window parser)
- unrounded SAP within 0.5 of 62.2584 (worksheet PDF line 257)

Existing end-to-end test assertions updated to reflect the spec-correct int codes.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-24 18:32:20 +00:00
Khalim Conn-Kowlessar
883028c89e P6.1 follow-on: unbox BuildingPartIdentifier at backend boundaries
Threads the strict BuildingPartIdentifier type (introduced in a8b443f6)
through the two remaining backend touchpoints:

- EpcBuildingPartModel.from_*: SQLModel column expects a string, so
  unbox the enum with .identifier.value before binding to the DB.
- documents_parser end-to-end tests: swap bare-string equality
  ("main" / "extension_1") for identity checks against the enum
  members (BuildingPartIdentifier.MAIN / EXTENSION_1).
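
Both touchpoints follow from how Python enums behave. The sketch below assumes a plain `Enum` with the two string values mentioned above; the real `BuildingPartIdentifier` may have more members or a different base class, and `to_db_column` is a hypothetical helper.

```python
from enum import Enum


class BuildingPartIdentifier(Enum):
    """Assumed shape: only the two members named in this message."""
    MAIN = "main"
    EXTENSION_1 = "extension_1"


def to_db_column(identifier: BuildingPartIdentifier) -> str:
    """Unbox the enum to the plain string a string-typed column expects."""
    return identifier.value


# A plain Enum member never equals its raw value, so bare-string
# assertions must be replaced by identity checks against the member.
print(BuildingPartIdentifier.MAIN == "main")
print(BuildingPartIdentifier("extension_1") is BuildingPartIdentifier.EXTENSION_1)
print(to_db_column(BuildingPartIdentifier.MAIN))
```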

Documents_parser test pack passes (105/105). No dedicated SQLModel test
covers EpcBuildingPartModel.from_*; the .value line is exercised
transitively via db_writer.py / local_runner.py in production.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-20 09:58:23 +00:00
Daniel Roth
51bd18e0d7 Rename window frame material column 🟩
2026-04-27 16:11:32 +00:00
Daniel Roth
9571ed608c map elmhurst window transmission details to epc property data class 🟥
2026-04-27 14:13:02 +00:00
Daniel Roth
00821c5c23 map elmhurst energy fields to epc property data class 🟥
2026-04-27 12:15:28 +00:00
Daniel Roth
b36c8b884c map remaining Elmhurst fields to EpcPropertyData 🟩
2026-04-24 15:33:59 +00:00
Daniel Roth
15ae46ec92 Map Elmhurst site notes to EpcPropertyData 🟥
2026-04-24 13:37:21 +00:00