Handover: tighten Summary→SAP chain pin to 1e-4 + brief next agent

Slice 46c left the chain at SAP Δ=0.26 vs the Elmhurst worksheet PDF's 62.2584. The user rejected the 0.5 tolerance: because the cascade reproduces Elmhurst exactly on hand-built inputs and the Summary PDF carries the same source-of-truth data, the mapped path must hit 1e-4 like every other Elmhurst worksheet pin.

This commit:
- Tightens `test_summary_000474_full_chain_sap_matches_worksheet_pdf_exactly` from 0.5 to 1e-4. Currently fails with Δ=0.2611 — the forcing function for the next slice.
- Replaces the stale `docs/sap-spec/NEXT_AGENT_PROMPT.md` with a fresh handover identifying the two remaining diffs:
  * pumps_fans_kwh_per_yr 130 vs 160 (30 kWh; likely `central_heating_pump_age` not plumbed)
  * Window [4] mis-classified as SE (4) instead of E (3); `_compose_window_descriptors` over-joins suffix tokens
- Documents the architectural smell (3-schema chain ElmhurstSiteNotes → EpcPropertyData → CalculatorInputs may be over-engineered).
- Lists end-goal: API-path < 0.5 SAP (rounded integers), Elmhurst-path < 1e-4 SAP (unrounded worksheet pins), then replicate for the other 5 Summary PDFs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
Khalim Conn-Kowlessar 2026-05-24 18:43:14 +00:00
parent 256a5afee5
commit b6544e1cd1
2 changed files with 258 additions and 9 deletions

View file

@ -110,14 +110,16 @@ def test_summary_000474_mapper_extracts_seven_windows() -> None:
assert len(epc.sap_windows) == 7
def test_summary_000474_full_chain_sap_within_half_point_of_worksheet_pdf() -> None:
def test_summary_000474_full_chain_sap_matches_worksheet_pdf_exactly() -> None:
# Arrange — the full Summary→ElmhurstSiteNotes→EpcPropertyData→cascade
# →SAP path against the U985-0001-000474 worksheet PDF's unrounded
# SAP rating (line 257: SAP value 62.2584, rating (258) = 62).
# The cascade itself matches Elmhurst exactly on hand-built inputs;
# this test pins the mapper end-to-end at the SAP-rating layer so
# any future mapper regression (extractor field drop, code-mapping
# break) surfaces here rather than at the residual-pin layer.
# Because the Summary PDF carries the same source-of-truth data that
# the hand-built worksheet fixture encodes by hand, and because the
# cascade matches Elmhurst's calculator to 4 d.p. on those hand-
# built inputs, this end-to-end path MUST produce the same unrounded
# SAP value. Any non-trivial drift = a real mapper bug dropping
# information from the Summary PDF.
pages = _summary_pdf_to_textract_style_pages(_SUMMARY_000474_PDF)
site_notes = ElmhurstSiteNotesExtractor(pages).extract()
epc = EpcPropertyDataMapper.from_elmhurst_site_notes(site_notes)
@ -127,8 +129,10 @@ def test_summary_000474_full_chain_sap_within_half_point_of_worksheet_pdf() -> N
cert_to_inputs(epc, prices=SAP_10_2_SPEC_PRICES)
)
# Assert — unrounded SAP within 0.5 of the worksheet's 62.2584
# (the same tolerance the user accepted on the API-cert residual
# cohort given the API publishes rounded SAP integers).
# Assert — within the same 1e-4 tolerance the other Elmhurst worksheet
# tests pin against. 0.5 is the API-cert residual tolerance (the API
# publishes rounded SAP integers, so up to half a SAP point is just
# rounding); for Elmhurst worksheet inputs the cascade reproduces
# Elmhurst exactly and we expect identical outputs.
worksheet_unrounded_sap = 62.2584
assert abs(result.sap_score_continuous - worksheet_unrounded_sap) < 0.5
assert abs(result.sap_score_continuous - worksheet_unrounded_sap) < 1e-4

View file

@ -0,0 +1,245 @@
# Handover — close the Elmhurst Summary→SAP chain to 1e-4
You are picking up branch `ara-backend-design-prd` mid-stream. The
previous agent left a near-complete but **not actually complete**
validation chain. This handover is honest about what's done, what's
still wrong, and why a fresh approach may help.
## The 30-second picture
There are two paths into the calculator:
```
Path A: hand-built EpcPropertyData → cascade → SAP 62.2584 ← matches Elmhurst worksheet PDF to 4 d.p. ✓
Path B: Summary_NNNNNN.pdf → extractor → ElmhurstSiteNotes
→ from_elmhurst_site_notes → EpcPropertyData
→ cascade → SAP 62.5195 ← off by 0.2611 unrounded SAP points ✗
```
Both paths feed the same calculator (`calculate_sap_from_inputs`). Path
A proves the cascade is provably equivalent to Elmhurst's calculator
(`sap_score_continuous` = 62.2584 matches Elmhurst worksheet PDF line
257 exactly). Path B uses `from_elmhurst_site_notes` instead of the
hand-built fixture; **it should produce identical output**, because
the Summary PDF and the hand-built fixture encode the same source-of-
truth data. It doesn't. The 0.26 SAP gap means information is being
dropped in the extractor or mapper.
The user explicitly rejected "within 0.5 is good enough" — the chain
must reproduce Elmhurst to `1e-4` like every other Elmhurst worksheet
test.
The end goal is: `api → EpcPropertyData → Sap10 calculator` with
`< 0.5` SAP error (the API publishes rounded SAP integers, so half a
point is rounding noise). But for the Elmhurst-site-notes path,
because Elmhurst's worksheet PDFs publish unrounded line-ref values,
the target is **zero error to 1e-4**.
## Forcing function
There is one failing test pinning this:
```
backend/documents_parser/tests/test_summary_pdf_mapper_chain.py::
test_summary_000474_full_chain_sap_matches_worksheet_pdf_exactly
```
It asserts `abs(mapped_sap - 62.2584) < 1e-4` and currently fails with
`Δ = 0.2611`. Your job is to drive it to GREEN.
## Definition of done
- The failing test above passes at `1e-4` tolerance.
- The two other `test_summary_pdf_mapper_chain.py` tests stay green
(`sap_building_parts == 3`, `sap_windows == 7`).
- The wider `datatypes/epc/` + `backend/documents_parser/tests/`
regression stays green (the 9 pre-existing `test_appendix_u.py`
failures and 1 pre-existing 1e-9 FEE-precision failure are
unrelated — leave them).
- Once 000474 is at 1e-4, replicate the test for the other 5 Summary
PDFs (000477, 000480, 000487, 000490, 000516) — files are under
`sap worksheets/` at the repo root, **untracked**, copy each into
`backend/documents_parser/tests/fixtures/` as you go.
- For each cert, pin against the unrounded SAP value lodged in line
257 of the corresponding `U985-0001-NNNNNN.pdf` worksheet.
## What the previous agent did right
12 commits across the session. The architecturally-load-bearing ones:
| Slice | Commit | Effect |
|---|---|---|
| 44 | `ea6d4263` | flat_roof_insulation_thickness mapper passthrough |
| 45a/b/c | `f08252dc` / `24f35f8b` / `5acbecc5` | PV cascade per Appendix M + U (orientation × pitch × Table-M1 ZPV × rating-vs-demand climate) |
| audit pins | `15789f5a` / `acc6331d` / `8ac548ca` | u_wall / u_roof / u_floor description-cascade pins against Tables 6 / 16 / §5.12 — proves U-value cascade is spec-correct on the cohort |
| 46 scaffold | `ccf7aa21` | First scaffold test for Summary→EpcPropertyData chain (strict-xfail) |
| 46a | `36f2c7bb` | Multi-bp support: schema adds `ExtensionPart`; extractor parses Main + 1st Extension + 2nd Extension subsections in §4/§7/§8/§9 with "As Main: Yes" inheritance; mapper produces a `SapBuildingPart` per bp |
| 46b | `066dce19` | Layout-style window parser anchored on `W H Area` data line + `Manufacturer <U>` line — extracts 7 windows from the Summary table layout |
| **46c** | **`256a5afe`** | **String→int code translations for every Elmhurst-coded field the cascade reads (age band, wall_construction, wall_insulation_type, main_fuel_type, heat_emitter_type, main_heating_control, orientation); PCDB index parsed from `pcdf_boiler_reference`; floor ordering + 0.25 m upper-storey adjustment + `is_exposed_floor` flag for "above unheated space"** |
After 46c, mapped SAP is 62.52 vs target 62.26 (`Δ = 0.26`).
## What the previous agent got wrong
**Overclaimed completion at the 0.5-tolerance milestone.** The
original test was written with a 0.5 tolerance (mirroring the
API-cert residual cohort, where the API publishes rounded SAP
integers so half a point is just rounding). That's the wrong
tolerance for the Elmhurst path: Elmhurst lodges full PDF lines with
4-d.p. unrounded values, and our cascade matches them exactly on
hand-built inputs. The bar is `1e-4`, not 0.5. The previous agent
committed Slice 46c with a 0.5-tolerance pin; this handover has
since tightened it to `1e-4` (the failing test above).
## The remaining diffs (Slice 46c → 1e-4 SAP)
After Slice 46c, the only differing `cert_to_inputs(epc)` scalar
fields between mapped and hand-built are:
```
hot_water_kwh_per_yr: mapped=2291.7821223353485 handbuilt=2291.7784230242883 (Δ < 0.01 float drift)
pumps_fans_kwh_per_yr: mapped=130.0 handbuilt=160.0 ← 30 kWh, real bug
lighting_kwh_per_yr: mapped=139.94522455112704 handbuilt=139.94522455112707 (Δ < 1e-10 float noise)
pumps_fans_primary_factor: mapped=1.5128000000000001 handbuilt=1.5128 (float repr — harmless)
fabric_energy_efficiency_kwh_per_m2_yr: mapped=186.62 handbuilt=186.88 (output, not input — driven by solar gains)
```
Plus a window-orientation mis-classification (see below).
### Diff #1: `pumps_fans_kwh_per_yr` = 130 vs 160 (30 kWh)
This is the dominant residual contributor. Search `cert_to_inputs.py`
for `pumps_fans_kwh_per_yr` to find what drives it. Likely candidates:
- `central_heating_pump_age` (the Summary PDF lodges "Heat pump age:
Unknown" but that's the HEAT pump, not the central heating pump —
may need a separate field on `ElmhurstSiteNotes.MainHeating`)
- Boiler type / FGHRs / weather compensator flags
- Some specific Table 4d/4e cascade input we're dropping
The cascade reads `MainHeatingDetail.central_heating_pump_age: Optional[int]`
which the Elmhurst mapper doesn't currently populate.
### Diff #2: Window [4] orientation mis-classified as SE (4) — should be E (3)
Mapped windows:
```
[4] orient=4 (SE) W=1.1 H=1.6 U=2.0
```
Hand-built has TWO East U=2.0 windows totalling 3.74 m² area. Mapped's
window [2] (East, 1.98 m²) + window [4] (mis-labelled SE, 1.76 m²) =
3.74 m² ✓ — exactly matches. So the layout-style window parser is
producing `orientation='East-South'` for window [4] when it should be
just `'East'`. Look at `_compose_window_descriptors` in
`backend/documents_parser/elmhurst_extractor.py` — the suffix
token "South" is being joined with the inline "East" prefix when it
shouldn't be (probably the "South" belongs to a different window).
The window count itself also differs: mapped extracts 7 individual
windows, hand-built consolidates to 5 by `(orientation, U)` group with
width = total area / 1.0 m. Both should be functionally equivalent
to the cascade IF orientations + U-values + total areas all match. So
fixing the orientation should close this gap.
### Diff #3: float-precision noise
`hot_water_kwh_per_yr` (sub-0.01 kWh) and `lighting_kwh_per_yr`
(sub-1e-10 kWh) are downstream of accumulation order in the cascade.
`pumps_fans_primary_factor` 1.5128000000000001 vs 1.5128 is a Python
float repr quirk. These won't close the SAP gap; ignore them unless
the other fixes leave the test slightly red and one of these turns
out to be the last decimal.
## Why a fresh approach may help
The previous agent's pattern was "fix one bug at a time, ship when
test passes loosely." The right pattern for `1e-4` is the opposite:
**systematically diff every input field between mapped and hand-built,
fix every diff, then run the test once.** That's two more slices
(pumps_fans + window orientation) — not enough for a session of
incremental shipping, so consider doing both in one slice with the
pin as the forcing function.
You may also want to look harder at the *architectural* question:
why does Elmhurst's "site notes" (the surveyor's input form) need to
go through THREE schemas — `ElmhurstSiteNotes`, then `EpcPropertyData`,
then `CalculatorInputs` — when the hand-built fixture skips straight
to the middle one? The string→int translations in `from_elmhurst_site_notes`
are essentially doing what the extractor should do (or what an
`elmhurst_codes.py` codes-module could express). If `ElmhurstSiteNotes`
stored integer codes alongside the human-readable strings, the mapper
would be a pure projection.
## Quick-orient commands
```bash
# Failing test (Δ = 0.26 SAP → target < 1e-4)
python -m pytest backend/documents_parser/tests/test_summary_pdf_mapper_chain.py::test_summary_000474_full_chain_sap_matches_worksheet_pdf_exactly --no-cov --no-header --tb=short
# All Summary→chain tests (2/3 green, 1 failing — the one above)
python -m pytest backend/documents_parser/tests/test_summary_pdf_mapper_chain.py --no-cov --no-header -v
# Wider regression to confirm no fresh breakage
python -m pytest datatypes/epc/ backend/documents_parser/tests/ --no-cov --no-header -q
```
The 9 `test_appendix_u.py` failures and 1 `test_no_ac_cert_round_trips_fee_equals_space_heating_per_m2`
failure are **pre-existing** from before this session — don't try to
fix them as part of this work.
## Reference materials
- **`docs/sap-spec/HANDOVER_NEXT.md`** — original calculator-closure
handover; still useful as the canonical reference for cascade
conventions (AAA tests, 1e-4 tolerance, etc.).
- **`docs/sap-spec/SAP_CALCULATOR.md`** — public API + two-cascade
architecture (rating vs demand).
- **`sap worksheets/Summary_000474.pdf`** (untracked) — the source-of-
truth input for fixture 000474. Mirror tracked at
`backend/documents_parser/tests/fixtures/Summary_000474.pdf`.
- **`sap worksheets/U985-0001-000474.pdf`** (untracked) — the
Elmhurst-computed worksheet with line refs the test pins against
(line 257 for unrounded SAP).
- **`packages/domain/src/domain/sap/worksheet/tests/_elmhurst_worksheet_000474.py`**
— the hand-built `EpcPropertyData` for the same fixture. The CALCULATOR-
EQUIVALENT target the mapper must reproduce.
## File map for the work ahead
| File | Role |
|---|---|
| `backend/documents_parser/elmhurst_extractor.py` | PDF → `ElmhurstSiteNotes` extractor; layout-style window parser at `_compose_window_descriptors` is where diff #2 lives |
| `datatypes/epc/surveys/elmhurst_site_notes.py` | The schema (recently extended with `ExtensionPart`); `MainHeating` may need a `central_heating_pump_age` field for diff #1 |
| `datatypes/epc/domain/mapper.py:254-326` | `from_elmhurst_site_notes` — the mapper itself |
| `datatypes/epc/domain/mapper.py:1772-1830` | Code-translation helpers (`_leading_code`, `_elmhurst_wall_construction_int`, etc.) |
| `datatypes/epc/domain/mapper.py:2010-2080` | `_map_elmhurst_building_part` + extension iteration |
| `datatypes/epc/domain/mapper.py:2180+` | `_map_elmhurst_sap_heating` — likely where diff #1 (pumps_fans) is fixed |
| `packages/domain/src/domain/sap/rdsap/cert_to_inputs.py` | Cascade — search `pumps_fans_kwh_per_yr` for diff #1's root cause |
## Conventions you must honour (from project memory)
- AAA test convention: every new test uses literal `# Arrange / # Act
/ # Assert` headers
- `abs(diff) <= tol` not `pytest.approx` (strict pyright)
- One slice = one commit; stage by name (`?? non_invasive_photos/` and
similar untracked junk must not be staged)
- 1e-4 tolerance, no widening, no xfail (`feedback_zero_error_strict`)
- Strict pyright net-zero on every commit
## Branch state at handover
```
$ git log --oneline -5
256a5afe Slice 46c: Elmhurst mapper produces calculator-equivalent EpcPropertyData — Summary_000474 SAP within 0.5 of worksheet PDF
066dce19 Slice 46b: Elmhurst extractor parses windows from layout-style Summary PDFs
36f2c7bb Slice 46a: Elmhurst mapper handles multi-bp Summary PDFs — Summary_000474 chain test flips green
ccf7aa21 Scaffold: end-to-end Summary→EpcPropertyData chain test for 000474 (xfail)
8ac548ca Audit: pin u_floor §5.12 formula cascade for cert 0240 cohort geometry
```
The 0.5 tolerance in commit 46c's message is stale — this handover
tightened it to 1e-4 after the commit. The first thing you commit
should fix one of the two diffs and explicitly mention closing the
last bit of the gap toward 1e-4.
Good luck.