docs: session-9 close-out + session-10 handover (summary-report-based audit)

Session 9 ran five independent data-driven audits (profiler, dropped-field scan,
CO2/PE reconciliation, cross-provider LIG parity, HW-demand reconciliation) — all
converged on diffuse remaining gap — and shipped glazing Table-24 (+16 certs) +
HW-only heat-network DLF, taking 54.90% -> 56.8% within-0.5. The data-driven seam
is exhausted; session 10 switches to worksheet-level ground truth via the
summary-report-based per-cert audit. New agent prompt at HANDOVER_SUMMARY_AUDIT.md
with method, starter candidate certs, ruled-out list, and conventions.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Khalim Conn-Kowlessar 2026-06-09 14:54:08 +00:00
parent 872bc585f7
commit 590cb97ef6
2 changed files with 151 additions and 0 deletions


@@ -1,5 +1,11 @@
# Handover — API SAP accuracy (session 3): raises cleared, now profile-driven
> **➡️ SESSION 10 STARTS HERE: `docs/HANDOVER_SUMMARY_AUDIT.md`.** HEAD `872bc585`, **56.8%
> within-0.5** (909 computed / 0 raises). Session 9 ran FIVE data-driven audit angles — all
> converged on "remaining gap is diffuse" — and shipped the glazing Table-24 win (+16 certs) +
> HW-only heat-network DLF. The data-driven seam is mined out; **session 10 switches to the
> summary-report-based per-cert worksheet audit.** Read that doc first.
**Branch:** `feature/per-cert-mapper-validation` (long-lived working branch — **NEVER PR to
main**; the user pushes/PRs when ready). **HEAD `a8e5563a`+** (the profiler commit), local-only
ahead of origin.


@@ -0,0 +1,145 @@
# Handover — API SAP accuracy, SESSION 10: summary-report-based per-cert audit
You're continuing API→SAP accuracy work on branch **`feature/per-cert-mapper-validation`** in
`/workspaces/model`, **HEAD `872bc585`**. This is a **long-lived working branch — NEVER PR to
main**; the user pushes/PRs when ready. 31 commits ahead of `origin`, unpushed.
## THE GOAL (measurable, unchanged)
Every API record with a lodged SAP rating should compute to **within 0.5 SAP** of the API's
`energy_rating_current` (target: 100%). Headline gauge:
```
PYTHONPATH=/workspaces/model python scripts/eval_api_sap_accuracy.py
```
**Current: 56.8% within-0.5** (within-1.0 72.2%, within-2.0 84.8%, mean|err| 1.197, median 0.438,
signed 0.229, **909 computed / 0 raises**, 100 unsupported_schema). Writes `_results.csv` to the
cache. Re-profile with `scripts/profile_api_error.py --min-n 12`; component decomposition with
`scripts/decompose_api_cost_error.py`. ~1009 cached API JSONs at `/tmp/epc_2026_sample`
(`EPC_SAMPLE_CACHE` overrides).
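For orientation, the headline figures can be reproduced from a list of per-cert errors with a few lines of Python. This is a minimal sketch, not the gauge script itself: the function name, the sign convention (computed minus lodged), the inclusive `<=` band edges, and the choice of median over absolute errors are all assumptions, and the sample errors are illustrative.

```python
from statistics import mean, median


def accuracy_summary(errors: list[float]) -> dict[str, float]:
    """Summarise signed errors, assumed to be computed SAP minus lodged SAP."""
    if not errors:
        raise ValueError("no computed certs to summarise")
    abs_errors = [abs(e) for e in errors]
    n = len(errors)

    def share_within(tol: float) -> float:
        # Percentage of certs whose absolute error is inside the band (inclusive edge assumed).
        return 100.0 * sum(a <= tol for a in abs_errors) / n

    return {
        "within_0.5": share_within(0.5),
        "within_1.0": share_within(1.0),
        "within_2.0": share_within(2.0),
        "mean_abs_err": mean(abs_errors),
        "median_abs_err": median(abs_errors),
        "signed_mean_err": mean(errors),
    }


# Illustrative errors only, not corpus data.
summary = accuracy_summary([0.2, -0.4, 1.5, -3.0])
print(summary)
```

A negative `signed_mean_err` under this convention would mean the cascade under-rates on average; check the sign convention in `_results.csv` before comparing.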
## ⚠️ THE PIVOT — why this session is DIFFERENT
The previous session (9) ran **FIVE independent data-driven audit angles**, all of which converged
on the same conclusion — the clean systematic levers are harvested and the remaining gap is
**diffuse** (data-fidelity matching the reference software, data-composition, per-cert scatter):
1. **Error-bucket profiler** → scatter, no clean residual bias.
2. **Dropped-field scan** (raw-JSON field present but mapped-None) → every field plumbed.
3. **CO2/PE reconciliation** vs lodged → systematic +15% but it's a **factor-basis** difference
(our SAP 10.2 Table 12 vs the lodged EPC's published basis), NOT demand (cost-SAP matches),
NOT scope (our CO2/PE correctly exclude appliances/cooking per spec line 326). **Off-goal:**
CO2/PE don't feed the cost-based SAP rating.
4. **Cross-provider parity** (LIG-21.0 vs 21.0.1, same builder): LIG under-rates 0.59, cleanest in
cavity (LIG 0.45 vs standard 0.01) — but the wall U computes CORRECTLY; the cause is diffuse
(composition: more solid-brick/system-built/non-PCDB mains, all under-rating in BOTH datasets;
plus per-cert scatter). No recoverable LIG-specific mapping bug.
5. **HW-demand reconciliation** vs lodged HW cost → median residual ≈ £0, well-calibrated. The
high-HW/m² certs are small flats (SAP HW floor effect) and are ACCURATE.
**The data-driven seam is mined out.** The user (correctly — they drove the glazing find) wants to
switch to **worksheet-level ground truth**: the **summary-report-based audit**. Do NOT re-run the
five angles above expecting a new clean bug; pursue per-cert worksheet pins instead.
## THE METHOD — summary-report-based audit (this session's loop)
For a chosen cert, the **user generates two Elmhurst worksheets from the cert's OWN API JSON**
(`/tmp/epc_2026_sample/<cert>.json`): the **P960** (full SAP worksheet, line refs `(1a)..(486)`)
and the **Summary**. Your loop:
1. **Describe the cert field-by-field FIRST** (so the user can reproduce it in Elmhurst): dwelling
type, TFA, age band, every building part (wall/roof/floor construction + insulation + thickness),
windows, the heating system (sap_main_heating_code, category, control, emitter, fuel, PCDB index),
water heating (whc, fuel, cylinder), ventilation, PV. Use the mapper to dump the *mapped*
`EpcPropertyData` so the description matches what we actually compute on.
2. **Pin the cascade to the worksheet line refs at abs=1e-4**: `heat_transmission_section_from_cert`
for §3 (26)..(37), the water-heating/§4, §9a/§10a etc. Localise the divergence to a specific
line ref → extractor / mapper / calculator gap.
3. **VALIDATE BEHAVIOUR against the LODGED SAP, not blindly against the user's repro.** The user's
Elmhurst repros are APPROXIMATE (they often pick a slightly-wrong system / inputs). Confirm the
repro's continuous SAP ≈ the lodged `energy_rating_current` BEFORE trusting its line refs; if the
repro diverges from lodged, the repro is the problem, not our cascade. (See
`reference_elmhurst_only_test_pattern` + the `_elmhurst_worksheet_000565` prototype for the
mapper-driven cascade-fixture shape.)
4. One confirmed cause = one TDD slice = one commit (conventions below).
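Steps 2 and 3 of the loop can be sketched as two small helpers. This is a hedged illustration: `first_divergence` and `repro_matches_lodged` are hypothetical names, the plain `dict` of line refs stands in for whatever the real `HeatTransmission` object exposes, the 0.5 trust tolerance is an assumption, and the numbers are made up.

```python
from typing import Optional


def first_divergence(
    computed: dict[str, float],
    worksheet: dict[str, float],
    tol: float = 1e-4,
) -> Optional[tuple[str, float, float]]:
    """Return (line_ref, computed, worksheet) for the first ref outside tol, else None."""
    for ref, expected in worksheet.items():  # dicts keep worksheet order
        if ref not in computed:
            raise KeyError(f"cascade produced no value for line ref {ref}")
        actual = computed[ref]
        if abs(actual - expected) > tol:
            return ref, actual, expected
    return None


def repro_matches_lodged(repro_sap: float, lodged_rating: int, tol: float = 0.5) -> bool:
    """Step 3 gate: only trust a repro whose continuous SAP is close to the lodged rating."""
    return abs(repro_sap - lodged_rating) <= tol


# Illustrative values only; keys follow the worksheet's "(26)..(37)" style.
worksheet_refs = {"(26)": 3.0, "(27)": 21.5, "(37)": 98.2}
cascade_refs = {"(26)": 3.0, "(27)": 21.5004, "(37)": 98.2}

divergence = first_divergence(cascade_refs, worksheet_refs)
trusted = repro_matches_lodged(repro_sap=67.3, lodged_rating=67)
print(divergence, trusted)
```

Run the lodged-rating gate first: a divergence reported against an untrusted repro says more about the repro than about the cascade.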
### Starter candidate certs
Clean gas, single-building-part, schema 21.0.1 (NOT electric-fabric tail, NOT LIG, NOT deproven).
|err| in 0.7–6, good worksheet targets:
```
8700-1771-0622-8501-3963 -5.77 gas cat2 whc=903 (electric immersion HW on a gas main — odd)
2135-2729-0509-0142-6226 +5.29 sapcode 119
4700-6865-0122-1501-3963 -5.19 detached gas boiler
0700-6754-0922-3505-3963 +4.35 whc=911 (gas boiler/circulator for water only)
9093-3060-2207-6506-0204 +3.60 sapcode 502 cat9 (WARM AIR main) + whc=950 — see SESSION-9 below
0330-2817-5590-2096-7831 -2.95 gas cat2 mid-terrace
```
The full list (81 clean candidates) regenerates from the profiler/`_results.csv`. The user may pick
their own certs from domain knowledge — let them drive selection; your job is the field-by-field
description + the line-ref pin.
## WHAT SHIPPED IN SESSION 9 (don't redo) — 54.90% → 56.8%
- **`a0432977` glazing single/secondary/triple per RdSAP 10 Table 24 (THE BIG WIN, +16 certs).**
`_API_GLAZING_TYPE_TO_TRANSMISSION` only mapped double-glazing [1,2,3,13,14]; single (5/15, U 4.8),
secondary (4/11/12), triple (6/8/9/10) returned None → silent u_window default U=2.5. Single glazing
at half its real heat loss was the killer. Method that found it: profile `sap_windows[].glazing_type`,
decode vs `epc_codes.csv` `glazed_type`.
- **`872bc585` HW-only heat-network DLF (whc 950/951/952).** The Table 12c distribution loss fired
only for `_is_heat_network_main AND whc∈{901,902,914}`; HW-only heat networks missed it entirely.
Added a whc-gated branch `water_eff = plant_eff / DLF` (RdSAP §10, spec p.36). All 3 corpus whc=950
certs improved in |err|; cert **9093 still +3.60** — its residual is the **warm-air main (sapcode
502, cat 9)**, a SEPARATE cause and a good worksheet candidate.
- **`7878a969` fuel strict-raise** at the Table-12 factor boundary (the cert-8536 collision class).
- **`49fb6c1b` glazing g remap** (codes 4/5 → correct cascade g-slots) — correctness, 0 SAP impact.
- **`a7990edb` ROBUSTNESS GUARDS** (forcing functions): `_api_glazing_transmission` +
`_api_cascade_glazing_type` raise `UnmappedApiCode` on present-but-unmapped glazing;
`seasonal_efficiency` + `water_heating_efficiency` raise `UnmappedSapCode` on present-but-unmapped
codes (was the blind 0.80/0.78 default). **0 current-corpus impact (tables complete) — these are
guards.** KEY for this session: **if a worksheet-audit cert RAISES, that's the guard surfacing a
real gap — map the code.** Also re-verify: efficiency table already covers WHC 908 (multi-point
gas) / 950 (HW heat network) — those are NOT unmapped bugs.
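The guard pattern behind the glazing fix can be sketched as a strict lookup: an absent code passes through, a present-but-unmapped code raises. This is a simplified stand-in, not the shipped `_api_glazing_transmission`: the exception class is redefined locally, the category labels are illustrative, and returning `None` for an absent code is an assumption. Only the code groups come from the notes above.

```python
from typing import Optional


class UnmappedApiCode(ValueError):
    """Local stand-in for the project's exception of the same name."""


# Code groups as listed in the session-9 notes; the labels are illustrative.
_GLAZING_GROUPS: dict[str, tuple[int, ...]] = {
    "double": (1, 2, 3, 13, 14),
    "single": (5, 15),
    "secondary": (4, 11, 12),
    "triple": (6, 8, 9, 10),
}
_GLAZING_CATEGORY: dict[int, str] = {
    code: category for category, codes in _GLAZING_GROUPS.items() for code in codes
}


def glazing_category(code: Optional[int]) -> Optional[str]:
    """Map an API glazing_type code to a category, raising on unknown codes."""
    if code is None:
        return None  # field absent: assumed to fall through to the spec default
    if code not in _GLAZING_CATEGORY:
        # Present but unmapped: fail loudly instead of silently defaulting.
        raise UnmappedApiCode(f"glazing_type {code} is present but unmapped")
    return _GLAZING_CATEGORY[code]


print(glazing_category(5))  # single glazing no longer falls through to a default
```

The design choice is that a raise during a worksheet audit is a signal, not a failure: it points at a code the tables do not yet cover.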
## RULED OUT — do NOT re-chase (verified this session + DEPROVEN list in HANDOVER_API_PROFILING.md)
- **The 100 `unsupported_schema` certs are full-SAP NEW BUILDS** (`assessment_type="SAP"`, mean
rating 86, transaction_type 6 = new dwelling). Structurally different (sap_walls/sap_roofs/
sap_openings with measured U-values, DER, construction_year). **Out of scope for a retrofit
product — do NOT build a parallel pipeline.** They're already excluded from the 56.8%.
- **Solid brick** (gas, 0.52): spec-faithful — `u_wall` applies RdSAP §5.7 Table 13 thickness;
direction wrong for a thickness gap. Data-fidelity (old houses outperform as-built).
- **Roof code-8 sloping-ceiling "insulated"-no-thickness** (cert 7921 23): data-fidelity, we ≡
Elmhurst at uninsulated.
- **meter_type=3** (Unknown meter): data-fidelity.
- Orientation code-9 drop: the East/West "fix" HURTS the gauge; conservatory-only spec rule; leave it.
- LIG-21.0 divergence, CO2/PE +15%, HW-demand over-estimate: all diffuse / off-goal (see THE PIVOT).
## CONVENTIONS (non-negotiable)
- One cause = one slice = one commit.
- **Spec citation (page+line) in the message**: the user explicitly asks us to confirm against the
SAP 10.2 / RdSAP 10 PDFs in `domain/sap10_calculator/docs/specs/` before claiming a fix (see
`feedback_spec_citation_in_commits`).
- AAA test headers (`# Arrange / # Act / # Assert`).
- **`abs(x-y)<=tol` not `pytest.approx`** (strict-pyright).
- Private-symbol test imports single-line with `# pyright: ignore[reportPrivateUsage]`.
- **SAP 10.2 only** (ignore the 10.3 PDF).
- No tolerance-widening / xfail.
- RdSAP is deterministic: every fix is a spec rule, apply uniformly even when it unmasks offsetting
errors, **but flag any within-0.5 regression to the user**.
- **Pyright strict net-zero** (baseline-compare via `git stash`; avoid `**dict` unpacking into
`make_minimal_sap10_epc`, which explodes pyright).
- **Stage files BY NAME** (the tree carries unrelated `scripts/` + `sap worksheets/` changes; never
`git add -A`).
- End commit messages with `Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>`.
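As an illustration of the test-shape conventions (AAA headers, explicit `abs(x-y)<=tol`, no `pytest.approx`), here is a minimal sketch. The helper `hw_only_water_efficiency` is hypothetical and only mirrors the `water_eff = plant_eff / DLF` relation from the session-9 notes; the input values are illustrative, not Table 12c figures.

```python
def hw_only_water_efficiency(plant_eff: float, dlf: float) -> float:
    """Hypothetical helper: water efficiency after distribution loss (plant_eff / DLF)."""
    if dlf <= 0:
        raise ValueError("distribution loss factor must be positive")
    return plant_eff / dlf


def test_hw_only_heat_network_divides_plant_efficiency_by_dlf() -> None:
    # Arrange
    plant_eff = 0.9  # illustrative, not a table value
    dlf = 1.5  # illustrative, not a table value
    tol = 1e-4

    # Act
    water_eff = hw_only_water_efficiency(plant_eff, dlf)

    # Assert
    assert abs(water_eff - 0.6) <= tol


# pytest would collect the test by name; calling it directly also works.
test_hw_only_heat_network_divides_plant_efficiency_by_dlf()
```

The explicit tolerance keeps the assertion a plain `bool` expression, which is what keeps strict pyright quiet.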
**Regression gate** after any calc/mapper change (goldens esp. 6035 + 000565 are the gate):
```
PYTHONPATH=/workspaces/model python -m pytest tests/domain/sap10_calculator/ \
domain/sap10_ml/tests/ datatypes/epc/ backend/documents_parser/tests/ -q
```
**IGNORE these pre-existing fails** (not yours): `test_total_floor_area`, the 2 stone-wall U tests
in `test_rdsap_uvalues.py`, the flaky `test_other_client_error_propagates` (passes in isolation).
## ARCHITECTURE (quick map)
API path = `EpcPropertyDataMapper.from_api_response(doc)` → `from_rdsap_schema_21_0_1` →
`cert_to_inputs(epc, prices=SAP_10_2_SPEC_PRICES)` → `calculate_sap_from_inputs(...)`. Fabric U via
`domain/sap10_ml/rdsap_uvalues.py` (`u_wall/u_roof/u_floor/u_window`) feeding
`worksheet/heat_transmission.py` (per-BP loop). HW in `cert_to_inputs` §4 + `worksheet/water_heating.py`.
Efficiency in `domain/sap10_ml/sap_efficiencies.py` (`seasonal_efficiency` / `water_heating_efficiency`,
now strict-raising). Fuel cost/CO2/PE: `tables/table_12.py` + `tables/table_32.py`. SAP equation:
`worksheet/rating.py` (ECF = 0.42·cost/(TFA+45)). The §3 breakdown helper for pins:
`cert_to_inputs.heat_transmission_section_from_cert(epc)` → `HeatTransmission` (every (26)..(37) line
ref). **KEY INSIGHT: the gov-API JSON is the published OUTPUT of RdSAP software, not its input —
route fields Elmhurst doesn't consume to the spec default.**
## READ ALSO
- `docs/HANDOVER_API_PROFILING.md` — the full SESSION-3..9 log + the load-bearing **DEPROVEN** list.
- Auto-memories: `project_per_cert_mapper_validation_state`, `reference_unmapped_sap_code`,
`reference_unmapped_api_code`, `reference_fuel_code_collision`, `feedback_software_no_special_handling`,
`feedback_spec_citation_in_commits`, `feedback_worksheet_not_api_reference`,
`reference_elmhurst_only_test_pattern`.