docs: session-4 handover — interlock + secondary fixes, robust-audit method, open leads

Updates the headline (45.1 → 47.6%), records the four shipped fixes + the
roof-8 false-lead closure, documents the two methods that worked
(description-vs-code audit + outlier-robust categorical sweep by net skew +
median), and lists the open robust leads (whc=903 immersion HW, cat-7 storage,
dual immersion) with the scatter buckets to avoid.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Khalim Conn-Kowlessar 2026-06-08 14:47:31 +00:00
parent faf29942ba
commit d83c431c7d


@@ -13,17 +13,20 @@ deproven approaches + the meter/shower data-fidelity findings), and the earlier
`energy_rating_current`. Headline gauge:
`PYTHONPATH=/workspaces/model python scripts/eval_api_sap_accuracy.py`.
| metric | now (`a8e5563a`) |
|--------|------------------|
| **% \|err\| < 0.5** | **45.1%** |
| % \|err\| < 1.0 | 59.4% |
| mean \|err\| | 1.702 |
| mean signed | 0.006 (balanced) |
| computed / raises | **909 / 0** |
| unsupported_schema | 100 (deferred — see below) |
| metric | session-3 (`a8e5563a`) | **session-4 (`faf29942`)** |
|--------|------------------|------------------|
| **% \|err\| < 0.5** | 45.1% | **47.6%** |
| % \|err\| < 1.0 | 59.4% | **62.6%** |
| % \|err\| < 2.0 | 77.7% | **79.6%** |
| mean \|err\| | 1.702 | **1.586** |
| computed / raises | 909 / 0 | **909 / 0** |
| unsupported_schema | 100 (deferred) | 100 (deferred) |
45% is still poor. The systematic bias is gone; remaining error is per-cert scatter + the
profile-surfaced buckets below.
**SESSION-4 shipped (45.1 → 47.6%):** four spec-grounded fixes + closed one false lead.
See the `## SESSION-4 …` blocks below and the auto-memory for full detail. The systematic bias
is gone; the winning method this session was the **description-vs-code audit** + an
**outlier-robust categorical sweep** (rank by net directional skew + MEDIAN, not mean — the
mean-based metric is fooled by multi-cause outliers). 47.6% is still the target's halfway point.
## WHAT SHIPPED THIS SESSION (7 slices, all green, pyright net-zero)
1. `e41a0bc0` **PCDB heat pump w/o SAP code → Table 12a ASHP_APP_N SH split** (0.80 high-rate).
@@ -39,7 +42,42 @@ profile-surfaced buckets below.
(4)(5)(6) cleared **all 4 raises** — eval now has zero raises.
7. `(profiler)` **`scripts/profile_api_error.py`** — the new diagnostic (below).
## SESSION-4 UPDATE (HEAD `8741fbdf`) — read before re-working the leads below
## SESSION-4 UPDATE (HEAD `faf29942`) — read before re-working the leads below
### Shipped this session (45.1 → 47.6%)
1. `b40e0f67` **exposed-floor-on-flats** (floor_heat_loss=1) — §3.12; per-BP override of the
dwelling-level flat suppression.
2. `8741fbdf` **floor_heat_loss=3 → above partially heated space, U=0.7** (§3.12/§5.14) +
re-pinned golden 7536 (its "irreducible residual" was THIS bug).
3. `5e7ef5c7` **boiler interlock for TRVs+bypass controls 2107/2111** (§9.4.11) — biggest single
win (+1.6pts). The no-interlock set was keyed off the wrong signal (the "+0.6 °C" annotation);
2107/2111 lack a room thermostat → 5pp + Table 4f ×1.3 pump.
4. `faf29942` **description-lodged secondary heating** (§A.2.2/Table 11) — gas/oil boilers with an
API-description-only secondary ("Portable electric heaters (assumed)", code field None)
dropped the secondary (sec_kWh=0); now `_has_lodged_secondary_description` fires Table 11.
Also added cat-8 (electric underfloor) Table-11 fraction 0.10.
- `560c912c`/`d0f57a0e` docs: **roof_construction=8 lead CLOSED as data-fidelity** (not a bug — see
the roof-8 section below; user worksheet sim-case-29 proved we ≡ Elmhurst).
### The two methods that worked (reuse these)
- **Description-vs-code audit:** join each int code (`floor_heat_loss`, `roof_construction`,
`wall_construction`, secondary type) to its authoritative `…[].description`, **on single-element
certs only** (multi-element `[]` arrays are LOSSY). Mis-maps fall out (floor-3, secondary).
- **Outlier-robust categorical sweep** (`/tmp/cat_audit2.py`): rank field-values by **net
directional skew** (count of certs with err < -0.5 minus count with err > +0.5) + **MEDIAN** error.
The mean-based directionality metric (`/tmp/cat_audit.py`) gets FOOLED by multi-cause outliers
(e.g. "Solid brick no insulation" looked systematic at mean 1.07, but its median is 0.22, i.e.
scatter; 2100 61/RR drove it).
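
A minimal sketch of both methods, for orientation only. It is not the contents of
`/tmp/cat_audit.py` or `/tmp/cat_audit2.py`, which are not reproduced here. Assumptions: certs
and eval rows are plain dicts, the signed error lives under a hypothetical `err` key, "under"
means `err < -threshold`, and the function names, `threshold` and `min_n` are illustrative.

```python
from collections import Counter, defaultdict
from statistics import mean, median


def code_description_audit(certs, code_field, list_field):
    """Tally (int code, description) pairs on single-element certs only.

    Assumes each cert is a dict and cert[list_field] is a list of dicts
    carrying a "description" key (hypothetical record shape).
    """
    pairs = Counter()
    for cert in certs:
        elements = cert.get(list_field) or []
        if len(elements) != 1:
            # Multi-element arrays are lossy, so they are skipped entirely.
            continue
        pairs[(cert.get(code_field), elements[0].get("description"))] += 1
    return pairs


def robust_sweep(rows, field, threshold=0.5, min_n=3):
    """Rank the values of one categorical field by net skew, then median.

    Each row is assumed to carry the field plus a signed "err" value.
    The mean is reported only for comparison; it is not used for ranking.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row[field]].append(row["err"])

    report = []
    for value, errs in groups.items():
        if len(errs) < min_n:
            continue  # too few certs to say anything about direction
        under = sum(e < -threshold for e in errs)
        over = sum(e > threshold for e in errs)
        report.append({
            "value": value,
            "n": len(errs),
            "net_skew": under - over,
            "median": median(errs),
            "mean": mean(errs),
        })
    # Largest one-sided skew first; ties broken by the size of the median.
    report.sort(key=lambda r: (abs(r["net_skew"]), abs(r["median"])), reverse=True)
    return report
```

A bucket with one large outlier shows a high mean but a near-zero `net_skew` and a small
median, so it ranks low. A bucket where most certs sit on the same side of the threshold
ranks high even when its mean is lower.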
### Open robust leads (verify with `/tmp/cat_audit2.py` — they shift; check MEDIAN not mean)
- `whc=903` electric-immersion HW: **median +0.87, n=84** — likely off-peak immersion handling
(the handover noted WHC 903 raises NotImplementedError on the Table-12a off-peak-immersion row).
- `main_heat_cat=7` electric storage: median +1.05, n=41 — over-rate (tariff/cost; partly artifact).
- `immersion_type=2` dual: +1.50, n=43 — we OVER-credit (so §12 dual→off-peak would worsen it).
- `dwelling_type=Top-floor flat`: median 1.24, n=99 — under-rate, mostly fabric scatter/artifacts.
- **Low-dir = SCATTER, do NOT single-fix:** non-PCDB main / data_source=2 (n=242, 28% within-0.5),
mains_gas=N electric (n=145), most flats. These are per-cert/data-fidelity, not one bug.
### Resolved/closed this session (don't re-chase)
- **Lead #1 `floor_codes=3` RESOLVED — the code IS authoritative.** The diagnostic that cracked
it: join each **single-BP** cert's `floor_heat_loss` code to its independent
`floors[].description` (the multi-BP tally was contaminated because a cert's `floors[]` summary