Model/docs/HANDOVER_API_PROFILING.md

# Handover — API SAP accuracy (session 3): raises cleared, now profile-driven

**Branch:** `feature/per-cert-mapper-validation` (long-lived working branch — **NEVER PR to
main**; the user pushes/PRs when ready). **HEAD `a8e5563a`+** (the profiler commit), local-only
ahead of origin.

**READ ALSO:** the auto-memory `project_per_cert_mapper_validation_state` (full slice log +
disproven approaches + the meter/shower data-fidelity findings), and the earlier
`docs/HANDOVER_API_ACCURACY_S2.md` (session-2 method).

## THE GOAL (unchanged)
Every API record with a lodged SAP should compute to within **0.5 SAP** of the API's
`energy_rating_current`. Headline gauge:
`PYTHONPATH=/workspaces/model python scripts/eval_api_sap_accuracy.py`.

| metric | now (`a8e5563a`) |
|--------|------------------|
| **% \|err\| < 0.5** | **45.1%** |
| % \|err\| < 1.0 | 59.4% |
| mean \|err\| | 1.702 |
| mean signed | −0.006 (balanced) |
| computed / raises | **909 / 0** |
| unsupported_schema | 100 (deferred — see below) |

45% is still poor. The systematic bias is gone; remaining error is per-cert scatter + the
profile-surfaced buckets below.
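
For reference, the four headline figures in the table can be reproduced from a list of per-cert signed errors. This is a minimal sketch, not the code in `scripts/eval_api_sap_accuracy.py`: the function name `headline` and the sign convention (computed minus lodged, so positive = over-rate) are assumptions, and the sample values are made up.

```python
from statistics import mean


def headline(errors: list[float]) -> dict[str, float]:
    """Summarise signed errors (assumed: computed SAP minus lodged SAP)."""
    abs_err = [abs(e) for e in errors]
    n = len(errors)
    return {
        # Share of certs inside the 0.5 SAP target band.
        "pct_within_0_5": 100.0 * sum(a < 0.5 for a in abs_err) / n,
        "pct_within_1_0": 100.0 * sum(a < 1.0 for a in abs_err) / n,
        "mean_abs": mean(abs_err),
        # Near zero means over- and under-rates cancel (no systematic bias).
        "mean_signed": mean(errors),
    }


# Illustrative errors only, not real eval output.
summary = headline([0.2, -0.4, 1.5, -3.0, 0.7])
```

A near-zero `mean_signed` alongside a large `mean_abs` is the "balanced but scattered" state the table describes.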

## WHAT SHIPPED THIS SESSION (7 slices, all green, pyright net-zero)
1. `e41a0bc0` **PCDB heat pump w/o SAP code → Table 12a ASHP_APP_N SH split** (0.80 high-rate).
2. `2bc73fb0` **HP-DHW (WHC 901/902/914 + PCDB HP) → Table 12a WH 0.70 split.** Together (1)+(2)
   killed the cat-4 heat-pump over-rating bias (+1.43 → +0.06).
3. `449d8c5b` **direct-acting electric boiler (191) → zero primary circuit loss** (SAP Table 3
   p.160 zero list names it verbatim).
4. `f4048588` **wall_insulation_thermal_conductivity ignored → §5.8 default λ=0.04.** (See KEY
   INSIGHT below — the gov field is RdSAP *output*, not an input.)
5. `1c5675a0` **floor_heat_loss=8 → no floor heat loss** (extension floor over a heated space;
   RdSAP §3, like code 6).
6. `a8e5563a` **main_heating_category=9 (warm air) → Table 11 secondary fraction 0.10.**
   (4)(5)(6) cleared **all 4 raises** — eval now has zero raises.
7. `(profiler)` **`scripts/profile_api_error.py`** — the new diagnostic (below).

## SESSION-4 UPDATE (HEAD `8741fbdf`) — read before re-working the leads below
- **Lead #1 `floor_codes=3` RESOLVED — the code IS authoritative.** The diagnostic that cracked
  it: join each **single-BP** cert's `floor_heat_loss` code to its independent
  `floors[].description` (the multi-BP tally was contaminated because a cert's `floors[]` summary
  is LOSSY — it drops some BPs' descriptions). Single-BP gives a perfect 1:1 enum: code 1↔"To
  external air"(exposed), 2↔"To unheated space"(semi-exposed), **3↔"(other premises below)"
  (9/9)**, 6↔"(another dwelling below)"(party), 7↔Solid/Suspended(ground). Per RdSAP §3.12
  (p.25) code 3 = "above a partially heated space" (non-domestic premises below) → §5.14 constant
  **U=0.7** (NOT Table-20 semi-exposed, NOT ground). SHIPPED `8741fbdf`.
- **SHIPPED `b40e0f67`:** exposed-floor-on-flats (code 1) area fix — §3.12. A flat's code-1
  floor was area-zeroed by `_dwelling_exposure`; now the per-BP `is_exposed_floor` overrides the
  flat suppression upward (mirrors the "another dwelling below" party override).
- **SHIPPED `8741fbdf`:** code 3 → `is_above_partially_heated_space` (U=0.7) + area override.
  **RE-PINNED golden 7536-3827** — its Ext2(bp3) code-3 floor was mis-read as "ground U=1.12" by
  a prior agent (the lossy floors[] dropped its description), who declared the residual an
  "irreducible register-rounding artifact, DO NOT chase". It was this bug: U 1.12→0.70, PE/CO2
  residuals moved toward 0. **LESSON: "irreducible residual" golden notes are suspect — a real
  mapper bug can hide there.** Eval (both slices): 45.1→45.3%, mean|err| 1.702→1.659, <1.0
  59.5→60.2%. User is generating a fresh `0380-2087-8190-2996-3075` worksheet to independently
  confirm U=0.7 (0380 now −0.63) — validate when it lands.
- **Leads re-checked, NOT clean:** `immersion_type=2` (+1.86) is high-scatter (mean|err| 3.71,
  bidirectional). `main_control=2107` (+1.63) is correctly mapped ("Programmer, TRVs and bypass"
  type 2 Table 4c(2)) — over-rate is diffuse gas-boiler/flat-fabric, not a dispatch bug.
  `roof_codes=1` broad bucket is mean −0.15 (the −1.78 was top-floor-electric-flat outliers
  −29/−25). Remaining gains need per-cert worksheets (start code-3) or the unsupported-schema ticket.
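
The `floor_heat_loss` enum established above (plus code 8 from slice 5) can be written down as a lookup. This is a sketch for orientation only: `FloorExposure`, `FLOOR_HEAT_LOSS` and `floor_exposure` are hypothetical names, not the mapper's real types, and only the code-3 constant U is filled in because it is the only U-value this note states.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FloorExposure:
    label: str
    fixed_u: Optional[float] = None  # W/m2K, only where the note gives a constant
    has_heat_loss: bool = True


# Codes and labels as recorded in this handover; anything else is unmapped here.
FLOOR_HEAT_LOSS = {
    1: FloorExposure("exposed (to external air)"),
    2: FloorExposure("semi-exposed (to unheated space)"),
    3: FloorExposure("above partially heated space (other premises below)", fixed_u=0.7),
    6: FloorExposure("party (another dwelling below)", has_heat_loss=False),
    7: FloorExposure("ground (solid/suspended)"),
    8: FloorExposure("extension floor over a heated space", has_heat_loss=False),
}


def floor_exposure(code: int) -> FloorExposure:
    """Return the exposure for an API floor_heat_loss code, failing loudly if unknown."""
    try:
        return FLOOR_HEAT_LOSS[code]
    except KeyError:
        raise ValueError(f"unmapped floor_heat_loss code: {code}") from None
```

Raising on an unknown code, rather than defaulting, keeps a new enum value visible as a raise in the eval instead of a silent mis-rating.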

## KEY INSIGHT (load-bearing, from the user)
**The gov EPC API JSON is the published OUTPUT of RdSAP software (Elmhurst), not its input.**
So any API field Elmhurst doesn't expose as an *input* is register metadata the RdSAP10 method
does **not** consume — route it to the spec default, don't try to "use" it. This is exactly why
`wall_insulation_thermal_conductivity` (slice 4) → always λ=0.04. Apply the same lens to any
new "extra" API field before wiring it.
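
A minimal sketch of that lens, assuming a hypothetical `OUTPUT_ONLY_DEFAULTS` table and `resolve_field` helper (neither exists in the repo under these names): fields known to be register output are never read from the API document and always resolve to the spec default.

```python
from typing import Any

# Fields treated as RdSAP *output*: ignore the API value, use the spec default.
# Only the slice-4 field is listed; the 0.04 default is the value named above.
OUTPUT_ONLY_DEFAULTS: dict[str, Any] = {
    "wall_insulation_thermal_conductivity": 0.04,  # W/mK
}


def resolve_field(api_doc: dict[str, Any], field: str) -> Any:
    """Return the value the calculation should consume for `field`."""
    if field in OUTPUT_ONLY_DEFAULTS:
        # Deliberately ignore whatever the register published.
        return OUTPUT_ONLY_DEFAULTS[field]
    return api_doc.get(field)


doc = {"wall_insulation_thermal_conductivity": 0.025, "floor_heat_loss": 3}
lam = resolve_field(doc, "wall_insulation_thermal_conductivity")
```

Keeping the output-only fields in one table makes each new "extra" API field an explicit decision rather than an accidental pass-through.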

## THE NEW DIAGNOSTIC — `scripts/profile_api_error.py` (run this first)
`PYTHONPATH=/workspaces/model python scripts/profile_api_error.py` joins each computed cert's
signed error with a rich feature set from its **raw API JSON** (not the mapped EpcPropertyData),
and ranks (feature, value) buckets by error carried + by |mean signed| bias. This is how to find
"silly API-path handling" gaps. `--min-n N` sets the bucket floor.

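The bucket ranking described above can be sketched as follows. This is not the profiler's actual code: `rank_buckets` is a hypothetical name, "error carried" is assumed to mean the sum of absolute errors in the bucket, and feature values are assumed hashable (list-valued fields would need converting to tuples first).

```python
from collections import defaultdict
from statistics import mean


def rank_buckets(rows, min_n=10):
    """rows: iterable of (features: dict, signed_err: float) pairs, one per cert."""
    buckets = defaultdict(list)
    for features, err in rows:
        for name, value in features.items():
            buckets[(name, value)].append(err)

    stats = []
    for key, errs in buckets.items():
        if len(errs) < min_n:  # bucket floor, as with --min-n
            continue
        abs_errs = [abs(e) for e in errs]
        stats.append({
            "bucket": key,
            "n": len(errs),
            "mean_signed": mean(errs),
            "mean_abs": mean(abs_errs),
            "error_carried": sum(abs_errs),  # assumed definition
        })

    # Two views: total error carried, and systematic bias.
    by_carried = sorted(stats, key=lambda s: s["error_carried"], reverse=True)
    by_bias = sorted(stats, key=lambda s: abs(s["mean_signed"]), reverse=True)
    return by_carried, by_bias
```

A small bucket that tops the bias ranking but not the carried ranking is the signature of a mapper or dispatch bug rather than noise.
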
### PRIORITISED LEADS (from the run at `a8e5563a` — verify with the profiler, they'll shift)
Cleanest "API-path handling" candidates first (small, biased buckets = likely a mapper/dispatch
bug, not noise):

1. **`floor_codes=3` → mean signed +5.37 (n=10).** RESOLVED in session 4 (`8741fbdf`, see the
   update above): code 3 is "above a partially heated space", U=0.7. Original lead, kept for
   context: API `floor_heat_loss=3` was mapped to "To unheated space" (same as code 2), and the
   +5.37 over-rate said that was wrong. The plan was to pull the n=10 certs and ask the user for
   the Elmhurst floor dropdown (the API=output lens). **Highest bias, smallest scope, so it was
   the first fix.**
2. **Control-code biases:** `main_control=2306` −2.96 (n=11), `2602` +2.49 (n=14), `2107` +1.65
   (n=38), `2402` +1.14 (n=10), `2307` +0.74 (n=11). Several control codes carry systematic bias
   → Table 4c/4e control dispatch gaps. `2107`/`2602` are the biggest. Check
   `_CONTROL_TYPE_BY_CODE` + the Table 4c efficiency-adjustment / Table 4e control coverage.
3. **`immersion_type=2` (dual immersion) → +2.00 (n=43, mean|err| 3.85).** RdSAP §12 lists "dual
   electric immersion" as an off-peak trigger; the cascade does NOT consume `immersion_heating_type`
   for tariff (verified — only comments reference it). Wiring the §12 dual-immersion → off-peak
   rule for Unknown meters is a clean spec slice. (1=single, 2=dual per the Elmhurst Summary.)
4. **`roof_codes=1` −1.78 (n=27)** (flat roof under-rate) and **`roof_insulation_thickness=None`
   −1.18 (n=52)** — flat-roof / no-thickness roof handling.
5. **`main_data_source=2` / `has_pcdb_main=False` → 28% within 0.5, mean|err| 3.17 (n≈242).**
   Non-PCDB heating systems (SAP-table efficiency) are a big under-rating cluster. Likely
   Table 4b default-efficiency or fabric, but worth a look — it's 1/4 of the sample.

### Big scattered segments (need worksheets, NOT clean single fixes)
- **`whc=903` (electric immersion HW): 13% within 0.5, n=84** — looks like the worst bucket but
  it's the electric **storage(cat-7)+room-heater(cat-10)** segment compounding (worst certs span
  −29…+32, bidirectional). Not one bug.
- **`mains_gas=N` (electric): 21% within 0.5, mean|err| 4.27 (n=145)** — the hardest segment;
  per-cert fabric/tariff scatter.
- **Flats (`property_type=2`): 31% within 0.5 (n=283)** — still the worst dwelling type.
- **cat-7 storage (+0.75) / cat-10 room heaters (+0.75)** — both net over-rate; bidirectional.

## DISPROVEN (do NOT retry; empirically failed in earlier sessions, details in memory)
- Routing **roof `'ND'` → Table 18** (description is load-bearing even with 'ND').
- Broad **"all Unknown(meter 3) electric → off-peak"** (over-credits room heaters). NOTE: the
  meter-3 under-rate is partly an **irreducible data-fidelity artifact** — the register stores
  meter_type=3 ("Unknown") on certs whose lodged rating actually used an off-peak meter (cert
  2474: lodged 78 needs 18-hour, but API says Unknown → spec-faithful ~68). Don't chase those to
  the lodged value.
- **RR shell U Table-17-50mm** (golden 6035 disproves it).
- **Shower enum is settled (non-bug):** API `shower_outlet_type` 1=non-electric(mixer)/2=electric
  (cohort 2636/0330 validate at 1e-4); types 3/4/5 are finer gov-output sub-types (type 3 is all
  on unsupported schema 19.1.0; type 4 already accurate). `shower_wwhrs` 1/2/3/4 = none / inst-
  WWHRS-1 / inst-WWHRS-2 / storage. Low headline value — not worth pursuing.

## THE 100 unsupported_schema CERTS (deferred — bigger ticket)
SAP-Schema-19.1.0 (and other pre-21). The user is planning a separate big piece: map old schemas
→ new + **predict missing fields from similar-looking properties** (needs an EPC-prediction
method). That needs its own grilling session — do NOT start it here.

## WORKSHEET WORKFLOW (the user generates them on request)
For per-cert scatter that needs ground truth, ask the user to generate **P960 + Summary**
worksheets from the cert's OWN API JSON (`/tmp/epc_2026_sample/<cert>.json`). **Describe the cert
field-by-field first** (the user reproduces in Elmhurst; their repros are approximate — confirm
SAP matches lodged before pinning). Worksheets land under `sap worksheets/golden fixture
debugging/simulated case NN/` or `sap worksheets/additional with api 2/<cert>/`. Pin the cascade
to the P960 §3/§4/§9a/§10a line refs at abs=1e-4. **Caveat:** the user's repros often diverge
(wrong system / approximate inputs) — validate the BEHAVIOUR (e.g. λ, no-heat-loss) empirically
against the lodged SAP, don't blindly pin to a non-faithful repro.

## TOOLS & CONVENTIONS (non-negotiable)
- `scripts/eval_api_sap_accuracy.py` — headline + TOP-40 + `_results.csv`.
- `scripts/profile_api_error.py` — raw-API characteristic profiling (NEW, run first).
- `scripts/decompose_api_cost_error.py` — per-component cost decomposition (off-peak caveat: uses
  STANDARD elec price, mis-flags off-peak certs).
- ~1009 cached API JSONs at `/tmp/epc_2026_sample` (`EPC_SAMPLE_CACHE` overrides).
- **One cause = one slice = one commit**, with a **spec citation (page+line)** in the message.
- Tests: AAA headers (`# Arrange/# Act/# Assert`); `abs(x-y)<=tol`, not `pytest.approx`;
  **no tolerance-widening / xfail**.
- **SAP 10.2 only.** RdSAP is **deterministic**: every fix is a spec rule, not a population
  data-fit (the user is firm).
- pyright strict **net-zero** (baseline-compare via `git stash`).
- **Stage files BY NAME** (tree carries unrelated `scripts/` + `sap worksheets/` changes; never
  `git add -A`).
- Commit trailer: `Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>`.
- **REGRESSION after any calc/mapper change:** `tests/domain/sap10_calculator/`,
  `backend/documents_parser/tests/`, `datatypes/epc/`, golden fixtures (esp. **6035**).
- **Pre-existing failures to IGNORE** (fail on the stashed baseline too): `test_total_floor_area`
  and the 2 stone-wall U tests in `domain/sap10_ml/tests/test_rdsap_uvalues.py`.
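
A test in the required shape (AAA headers, explicit absolute tolerance, no `pytest.approx`) might look like the sketch below. The function under test is a hypothetical stand-in so the example runs on its own; a real test would call the cascade and pin against a worksheet line ref.

```python
TOL = 1e-4  # same absolute tolerance used for worksheet pins


def u_value_above_partially_heated_space() -> float:
    """Hypothetical stand-in for the cascade call under test."""
    return 0.7


def test_code_3_floor_uses_constant_u() -> None:
    # Arrange
    expected_u = 0.7  # constant U for a floor above a partially heated space

    # Act
    actual_u = u_value_above_partially_heated_space()

    # Assert
    assert abs(actual_u - expected_u) <= TOL
```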

## ARCHITECTURE NOTES (so you don't re-discover them)
- API path: `EpcPropertyDataMapper.from_api_response(doc)` →
  `cert_to_inputs(epc, prices=SAP_10_2_SPEC_PRICES)` →
  `calculate_sap_from_inputs(...).sap_score_continuous`.
- Cost path uses `inputs.fuel_cost` (Table-32/12a precompute); `_fuel_cost` returns a ZERO
  sentinel for off-peak → calculator falls back to the legacy scalar
  `_space_heating_fuel_cost_gbp_per_kwh` (which DOES carry the off-peak rate). SapResult fuel
  codes are RAW API enums; translate via `table_12.API_FUEL_TO_TABLE_12`.
- Heating efficiency: `_main_heating_detail_efficiency` → PCDB Table 105 winter eff (if PCDB
  index) else `seasonal_efficiency(code, cat, fuel)` (Table 4a/4b, in
  `domain/sap10_ml/sap_efficiencies.py`). Warm-air Table 4a code→eff map already covers 501-520.
- `sap10_ml/` is marked for eventual migration to `sap10_calculator/` but is still the live
  u-value/efficiency path.
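
The cost-path fallback in the notes above can be pictured with the sketch below. It assumes the zero sentinel is a plain `0.0` and that both prices are scalars in £/kWh; `space_heating_price` is a hypothetical name, and the prices in the usage lines are illustrative, not Table 12 values.

```python
ZERO_SENTINEL = 0.0  # assumed form of the "ZERO sentinel" for off-peak


def space_heating_price(precomputed_gbp_per_kwh: float,
                        legacy_gbp_per_kwh: float) -> float:
    """Pick the price the calculator should use for space heating."""
    if precomputed_gbp_per_kwh == ZERO_SENTINEL:
        # Off-peak: the precompute opted out, so use the legacy scalar,
        # which carries the off-peak rate.
        return legacy_gbp_per_kwh
    return precomputed_gbp_per_kwh


# Illustrative prices only.
standard = space_heating_price(0.20, 0.10)   # precompute wins
off_peak = space_heating_price(0.0, 0.10)    # sentinel triggers the fallback
```

The practical consequence is that a tool reading only the precomputed cost sees zero for off-peak certs, which is a likely place for the decomposition script's off-peak caveat to originate.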