Handover: tighten Summary→SAP chain pin to 1e-4 + brief next agent

Slice 46c left the chain at SAP Δ=0.26 vs the Elmhurst worksheet PDF's 62.2584. The user rejected the 0.5 tolerance: because the cascade reproduces Elmhurst exactly on hand-built inputs and the Summary PDF carries the same source-of-truth data, the mapped path must hit 1e-4 like every other Elmhurst worksheet pin. This commit: - Tightens `test_summary_000474_full_chain_sap_matches_worksheet_pdf_exactly` from 0.5 to 1e-4. Currently fails with Δ=0.2611 — the forcing function for the next slice. - Replaces the stale `docs/sap-spec/NEXT_AGENT_PROMPT.md` with a fresh handover identifying the two remaining diffs: * pumps_fans_kwh_per_yr 130 vs 160 (30 kWh; likely `central_heating_pump_age` not plumbed) * Window [4] mis-classified as SE (4) instead of E (3); `_compose_window_descriptors` over-joins suffix tokens - Documents the architectural smell (3-schema chain ElmhurstSiteNotes → EpcPropertyData → CalculatorInputs may be over-engineered). - Lists end-goal: API-path < 0.5 SAP (rounded integers), Elmhurst-path < 1e-4 SAP (unrounded worksheet pins), then replicate for the other 5 Summary PDFs. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-07-27 23:35:01 +00:00 · 2026-05-24 18:43:14 +00:00 · 2026-05-24 18:43:14 +00:00 · b6544e1cd1
commit b6544e1cd1
parent 256a5afee5
2 changed files with 258 additions and 9 deletions
--- a/backend/documents_parser/tests/test_summary_pdf_mapper_chain.py
+++ b/backend/documents_parser/tests/test_summary_pdf_mapper_chain.py
@ -110,14 +110,16 @@ def test_summary_000474_mapper_extracts_seven_windows() -> None:
    assert len(epc.sap_windows) == 7


-def test_summary_000474_full_chain_sap_within_half_point_of_worksheet_pdf() -> None:
+def test_summary_000474_full_chain_sap_matches_worksheet_pdf_exactly() -> None:
    # Arrange — the full Summary→ElmhurstSiteNotes→EpcPropertyData→cascade
    # →SAP path against the U985-0001-000474 worksheet PDF's unrounded
    # SAP rating (line 257: SAP value 62.2584, rating (258) = 62).
-    # The cascade itself matches Elmhurst exactly on hand-built inputs;
-    # this test pins the mapper end-to-end at the SAP-rating layer so
-    # any future mapper regression (extractor field drop, code-mapping
-    # break) surfaces here rather than at the residual-pin layer.
+    # Because the Summary PDF carries the same source-of-truth data that
+    # the hand-built worksheet fixture encodes by hand, and because the
+    # cascade matches Elmhurst's calculator to 4 d.p. on those hand-
+    # built inputs, this end-to-end path MUST produce the same unrounded
+    # SAP value. Any non-trivial drift = a real mapper bug dropping
+    # information from the Summary PDF.
    pages = _summary_pdf_to_textract_style_pages(_SUMMARY_000474_PDF)
    site_notes = ElmhurstSiteNotesExtractor(pages).extract()
    epc = EpcPropertyDataMapper.from_elmhurst_site_notes(site_notes)
@ -127,8 +129,10 @@ def test_summary_000474_full_chain_sap_within_half_point_of_worksheet_pdf() -> N
        cert_to_inputs(epc, prices=SAP_10_2_SPEC_PRICES)
    )

-    # Assert — unrounded SAP within 0.5 of the worksheet's 62.2584
-    # (the same tolerance the user accepted on the API-cert residual
-    # cohort given the API publishes rounded SAP integers).
+    # Assert — within the same 1e-4 tolerance the other Elmhurst worksheet
+    # tests pin against. 0.5 is the API-cert residual tolerance (the API
+    # publishes rounded SAP integers, so up to half a SAP point is just
+    # rounding); for Elmhurst worksheet inputs the cascade reproduces
+    # Elmhurst exactly and we expect identical outputs.
    worksheet_unrounded_sap = 62.2584
-    assert abs(result.sap_score_continuous - worksheet_unrounded_sap) < 0.5
+    assert abs(result.sap_score_continuous - worksheet_unrounded_sap) < 1e-4
--- a/docs/sap-spec/NEXT_AGENT_PROMPT.md
+++ b/docs/sap-spec/NEXT_AGENT_PROMPT.md
@ -0,0 +1,245 @@
+# Handover — close the Elmhurst Summary→SAP chain to 1e-4
+
+You are picking up branch `ara-backend-design-prd` mid-stream. The
+previous agent left a near-complete but **not actually complete**
+validation chain. This handover is honest about what's done, what's
+still wrong, and why a fresh approach may help.
+
+## The 30-second picture
+
+There are two paths into the calculator:
+
+```
+Path A:  hand-built EpcPropertyData  →  cascade  →  SAP 62.2584  ← matches Elmhurst worksheet PDF to 4 d.p.  ✓
+Path B:  Summary_NNNNNN.pdf  →  extractor  →  ElmhurstSiteNotes
+                            →  from_elmhurst_site_notes  →  EpcPropertyData
+                            →  cascade  →  SAP 62.5195  ← off by 0.2611 unrounded SAP points  ✗
+```
+
+Both paths feed the same calculator (`calculate_sap_from_inputs`). Path
+A proves the cascade is provably equivalent to Elmhurst's calculator
+(`sap_score_continuous` = 62.2584 matches Elmhurst worksheet PDF line
+257 exactly). Path B uses `from_elmhurst_site_notes` instead of the
+hand-built fixture; **it should produce identical output**, because
+the Summary PDF and the hand-built fixture encode the same source-of-
+truth data. It doesn't. The 0.26 SAP gap means information is being
+dropped in the extractor or mapper.
+
+The user explicitly rejected "within 0.5 is good enough" — the chain
+must reproduce Elmhurst to `1e-4` like every other Elmhurst worksheet
+test.
+
+The end goal is: `api → EpcPropertyData → Sap10 calculator` with
+`< 0.5` SAP error (the API publishes rounded SAP integers, so half a
+point is rounding noise). But for the Elmhurst-site-notes path,
+because Elmhurst's worksheet PDFs publish unrounded line-ref values,
+the target is **zero error to 1e-4**.
+
+## Forcing function
+
+There is one failing test pinning this:
+
+```
+backend/documents_parser/tests/test_summary_pdf_mapper_chain.py::
+  test_summary_000474_full_chain_sap_matches_worksheet_pdf_exactly
+```
+
+It asserts `abs(mapped_sap - 62.2584) < 1e-4` and currently fails with
+`Δ = 0.2611`. Your job is to drive it to GREEN.
+
+## Definition of done
+
+- The failing test above passes at `1e-4` tolerance.
+- The two other `test_summary_pdf_mapper_chain.py` tests stay green
+  (`sap_building_parts == 3`, `sap_windows == 7`).
+- The wider `datatypes/epc/` + `backend/documents_parser/tests/`
+  regression stays green (the 9 pre-existing `test_appendix_u.py`
+  failures and 1 pre-existing 1e-9 FEE-precision failure are
+  unrelated — leave them).
+- Once 000474 is at 1e-4, replicate the test for the other 5 Summary
+  PDFs (000477, 000480, 000487, 000490, 000516) — files are under
+  `sap worksheets/` at the repo root, **untracked**, copy each into
+  `backend/documents_parser/tests/fixtures/` as you go.
+- For each cert, pin against the unrounded SAP value lodged in line
+  257 of the corresponding `U985-0001-NNNNNN.pdf` worksheet.
+
+## What the previous agent did right
+
+12 commits across the session. The architecturally-load-bearing ones:
+
+| Slice | Commit | Effect |
+|---|---|---|
+| 44 | `ea6d4263` | flat_roof_insulation_thickness mapper passthrough |
+| 45a/b/c | `f08252dc` / `24f35f8b` / `5acbecc5` | PV cascade per Appendix M + U (orientation × pitch × Table-M1 ZPV × rating-vs-demand climate) |
+| audit pins | `15789f5a` / `acc6331d` / `8ac548ca` | u_wall / u_roof / u_floor description-cascade pins against Tables 6 / 16 / §5.12 — proves U-value cascade is spec-correct on the cohort |
+| 46 scaffold | `ccf7aa21` | First scaffold test for Summary→EpcPropertyData chain (strict-xfail) |
+| 46a | `36f2c7bb` | Multi-bp support: schema adds `ExtensionPart`; extractor parses Main + 1st Extension + 2nd Extension subsections in §4/§7/§8/§9 with "As Main: Yes" inheritance; mapper produces a `SapBuildingPart` per bp |
+| 46b | `066dce19` | Layout-style window parser anchored on `W H Area` data line + `Manufacturer <U>` line — extracts 7 windows from the Summary table layout |
+| **46c** | **`256a5afe`** | **String→int code translations for every Elmhurst-coded field the cascade reads (age band, wall_construction, wall_insulation_type, main_fuel_type, heat_emitter_type, main_heating_control, orientation); PCDB index parsed from `pcdf_boiler_reference`; floor ordering + 0.25 m upper-storey adjustment + `is_exposed_floor` flag for "above unheated space"** |
+
+After 46c, mapped SAP is 62.52 vs target 62.26 (`Δ = 0.26`).
+
+## What the previous agent got wrong
+
+**Overclaimed completion at the 0.5-tolerance milestone.** The
+original test was written with a 0.5 tolerance (mirroring the
+API-cert residual cohort, where the API publishes rounded SAP
+integers so half a point is just rounding). That's the wrong
+tolerance for the Elmhurst path: Elmhurst lodges full PDF lines with
+4-d.p. unrounded values, and our cascade matches them exactly on
+hand-built inputs. The bar is `1e-4`, not 0.5. The previous agent
+committed Slice 46c with a 0.5-tolerance pin; this handover has
+since tightened it to `1e-4` (the failing test above).
+
+## The remaining diffs (Slice 46c → 1e-4 SAP)
+
+After Slice 46c, the only differing `cert_to_inputs(epc)` scalar
+fields between mapped and hand-built are:
+
+```
+hot_water_kwh_per_yr:      mapped=2291.7821223353485   handbuilt=2291.7784230242883   (Δ < 0.01 — float drift)
+pumps_fans_kwh_per_yr:     mapped=130.0                handbuilt=160.0                ← 30 kWh, real bug
+lighting_kwh_per_yr:       mapped=139.94522455112704   handbuilt=139.94522455112707   (Δ < 1e-10 — float noise)
+pumps_fans_primary_factor: mapped=1.5128000000000001   handbuilt=1.5128               (float repr — harmless)
+fabric_energy_efficiency_kwh_per_m2_yr: mapped=186.62  handbuilt=186.88               (output, not input — driven by solar gains)
+```
+
+Plus a window-orientation mis-classification (see below).
+
+### Diff #1: `pumps_fans_kwh_per_yr` = 130 vs 160 (30 kWh)
+
+This is the dominant residual contributor. Search `cert_to_inputs.py`
+for `pumps_fans_kwh_per_yr` to find what drives it. Likely candidates:
+- `central_heating_pump_age` (the Summary PDF lodges "Heat pump age:
+  Unknown" but that's the HEAT pump, not the central heating pump —
+  may need a separate field on `ElmhurstSiteNotes.MainHeating`)
+- Boiler type / FGHRs / weather compensator flags
+- Some specific Table 4d/4e cascade input we're dropping
+
+The cascade reads `MainHeatingDetail.central_heating_pump_age: Optional[int]`
+which the Elmhurst mapper doesn't currently populate.
+
+### Diff #2: Window [4] orientation mis-classified as SE (4) — should be E (3)
+
+Mapped windows:
+```
+  [4] orient=4 (SE)  W=1.1  H=1.6  U=2.0
+```
+
+Hand-built has TWO East U=2.0 windows totalling 3.74 m² area. Mapped's
+window [2] (East, 1.98 m²) + window [4] (mis-labelled SE, 1.76 m²) =
+3.74 m² ✓ — exactly matches. So the layout-style window parser is
+producing `orientation='East-South'` for window [4] when it should be
+just `'East'`. Look at `_compose_window_descriptors` in
+`backend/documents_parser/elmhurst_extractor.py` — the suffix
+token "South" is being joined with the inline "East" prefix when it
+shouldn't be (probably the "South" belongs to a different window).
+
+The window count itself also differs: mapped extracts 7 individual
+windows, hand-built consolidates to 5 by `(orientation, U)` group with
+width = total area / 1.0 m. Both should be functionally equivalent
+to the cascade IF orientations + U-values + total areas all match. So
+fixing the orientation should close this gap.
+
+### Diff #3: float-precision noise
+
+`hot_water_kwh_per_yr` (sub-0.01 kWh) and `lighting_kwh_per_yr`
+(sub-1e-10 kWh) are downstream of accumulation order in the cascade.
+`pumps_fans_primary_factor` 1.5128000000000001 vs 1.5128 is a Python
+float repr quirk. These won't close the SAP gap; ignore them unless
+the other fixes leave the test slightly red and one of these turns
+out to be the last decimal.
+
+## Why a fresh approach may help
+
+The previous agent's pattern was "fix one bug at a time, ship when
+test passes loosely." The right pattern for `1e-4` is the opposite:
+**systematically diff every input field between mapped and hand-built,
+fix every diff, then run the test once.** That's two more slices
+(pumps_fans + window orientation) — not enough for a session of
+incremental shipping, so consider doing both in one slice with the
+pin as the forcing function.
+
+You may also want to look harder at the *architectural* question:
+why does Elmhurst's "site notes" (the surveyor's input form) need to
+go through THREE schemas — `ElmhurstSiteNotes`, then `EpcPropertyData`,
+then `CalculatorInputs` — when the hand-built fixture skips straight
+to the middle one? The string→int translations in `from_elmhurst_site_notes`
+are essentially doing what the extractor should do (or what an
+`elmhurst_codes.py` codes-module could express). If `ElmhurstSiteNotes`
+stored integer codes alongside the human-readable strings, the mapper
+would be a pure projection.
+
+## Quick-orient commands
+
+```bash
+# Failing test (Δ = 0.26 SAP → target < 1e-4)
+python -m pytest backend/documents_parser/tests/test_summary_pdf_mapper_chain.py::test_summary_000474_full_chain_sap_matches_worksheet_pdf_exactly --no-cov --no-header --tb=short
+
+# All Summary→chain tests (2/3 green, 1 failing — the one above)
+python -m pytest backend/documents_parser/tests/test_summary_pdf_mapper_chain.py --no-cov --no-header -v
+
+# Wider regression to confirm no fresh breakage
+python -m pytest datatypes/epc/ backend/documents_parser/tests/ --no-cov --no-header -q
+```
+
+The 9 `test_appendix_u.py` failures and 1 `test_no_ac_cert_round_trips_fee_equals_space_heating_per_m2`
+failure are **pre-existing** from before this session — don't try to
+fix them as part of this work.
+
+## Reference materials
+
+- **`docs/sap-spec/HANDOVER_NEXT.md`** — original calculator-closure
+  handover; still useful as the canonical reference for cascade
+  conventions (AAA tests, 1e-4 tolerance, etc.).
+- **`docs/sap-spec/SAP_CALCULATOR.md`** — public API + two-cascade
+  architecture (rating vs demand).
+- **`sap worksheets/Summary_000474.pdf`** (untracked) — the source-of-
+  truth input for fixture 000474. Mirror tracked at
+  `backend/documents_parser/tests/fixtures/Summary_000474.pdf`.
+- **`sap worksheets/U985-0001-000474.pdf`** (untracked) — the
+  Elmhurst-computed worksheet with line refs the test pins against
+  (line 257 for unrounded SAP).
+- **`packages/domain/src/domain/sap/worksheet/tests/_elmhurst_worksheet_000474.py`**
+  — the hand-built `EpcPropertyData` for the same fixture. The CALCULATOR-
+  EQUIVALENT target the mapper must reproduce.
+
+## File map for the work ahead
+
+| File | Role |
+|---|---|
+| `backend/documents_parser/elmhurst_extractor.py` | PDF → `ElmhurstSiteNotes` extractor; layout-style window parser at `_compose_window_descriptors` is where diff #2 lives |
+| `datatypes/epc/surveys/elmhurst_site_notes.py` | The schema (recently extended with `ExtensionPart`); `MainHeating` may need a `central_heating_pump_age` field for diff #1 |
+| `datatypes/epc/domain/mapper.py:254-326` | `from_elmhurst_site_notes` — the mapper itself |
+| `datatypes/epc/domain/mapper.py:1772-1830` | Code-translation helpers (`_leading_code`, `_elmhurst_wall_construction_int`, etc.) |
+| `datatypes/epc/domain/mapper.py:2010-2080` | `_map_elmhurst_building_part` + extension iteration |
+| `datatypes/epc/domain/mapper.py:2180+` | `_map_elmhurst_sap_heating` — likely where diff #1 (pumps_fans) is fixed |
+| `packages/domain/src/domain/sap/rdsap/cert_to_inputs.py` | Cascade — search `pumps_fans_kwh_per_yr` for diff #1's root cause |
+
+## Conventions you must honour (from project memory)
+
+- AAA test convention: every new test uses literal `# Arrange / # Act
+  / # Assert` headers
+- `abs(diff) <= tol` not `pytest.approx` (strict pyright)
+- One slice = one commit; stage by name (`?? non_invasive_photos/` and
+  similar untracked junk must not be staged)
+- 1e-4 tolerance, no widening, no xfail (`feedback_zero_error_strict`)
+- Strict pyright net-zero on every commit
+
+## Branch state at handover
+
+```
+$ git log --oneline -5
+256a5afe Slice 46c: Elmhurst mapper produces calculator-equivalent EpcPropertyData — Summary_000474 SAP within 0.5 of worksheet PDF
+066dce19 Slice 46b: Elmhurst extractor parses windows from layout-style Summary PDFs
+36f2c7bb Slice 46a: Elmhurst mapper handles multi-bp Summary PDFs — Summary_000474 chain test flips green
+ccf7aa21 Scaffold: end-to-end Summary→EpcPropertyData chain test for 000474 (xfail)
+8ac548ca Audit: pin u_floor §5.12 formula cascade for cert 0240 cohort geometry
+```
+
+The 0.5 tolerance in commit 46c's message is stale — this handover
+tightened it to 1e-4 after the commit. The first thing you commit
+should fix one of the two diffs and explicitly mention closing the
+last bit of the gap toward 1e-4.
+
+Good luck.