From 14dc4efeedb6dfee3d9f28d8020d8328d96e42f3 Mon Sep 17 00:00:00 2001 From: Jun-te Kim Date: Tue, 9 Jun 2026 17:07:55 +0000 Subject: [PATCH] save so i can continue --- .../2026-06-09-rdsap-20-0-0-remapper.md | 148 ++++++++++++++++++ 1 file changed, 148 insertions(+) create mode 100644 docs/grill-sessions/2026-06-09-rdsap-20-0-0-remapper.md diff --git a/docs/grill-sessions/2026-06-09-rdsap-20-0-0-remapper.md b/docs/grill-sessions/2026-06-09-rdsap-20-0-0-remapper.md new file mode 100644 index 00000000..79bb6989 --- /dev/null +++ b/docs/grill-sessions/2026-06-09-rdsap-20-0-0-remapper.md @@ -0,0 +1,148 @@ +# Grill session — RdSAP-Schema-20.0.0 → EpcPropertyData remapper + +**Date:** 2026-06-09 · **Branch:** `feature/junte+khalim` · **Status:** paused at Q1 (awaiting answer) + +Resume by re-running `/grill-me` and feeding it this file, or just answer **Q1** below and continue. + +--- + +## Goal + +Make `RdSAP-Schema-20.0.0` certs map cleanly to `EpcPropertyData` via +`EpcPropertyDataMapper`, the same way 21.0.1 already does. Two corpora of 1000 +real certs each: + +- `backend/epc_api/json_samples/RdSAP-Schema-21.0.1/corpus.jsonl` — supported, 1000/1000 pass. +- `backend/epc_api/json_samples/RdSAP-Schema-20.0.0/corpus.jsonl` — target, currently 1000/1000 **fail**. + +User notes: +- The current `RdSapSchema20_0_0` schema was a **placeholder generated from a single EPC JSON example** — that's why it over-constrains. +- "Do it correctly for each one" — map field-by-field correctly, not just make it type-check. + +--- + +## Key code locations + +| Thing | Path | +|-------|------| +| Domain target | `datatypes/epc/domain/epc_property_data.py` (`EpcPropertyData` L607-788; `SapWindow` L255-270) | +| Mapper (dispatcher) | `datatypes/epc/domain/mapper.py` — `from_api_response` L1984; dispatch by `schema_type` | +| 20.0.0 mapper method | `datatypes/epc/domain/mapper.py` `from_rdsap_schema_20_0_0` L1078-1275 | +| 21.0.1 mapper method (reference) | `datatypes/epc/domain/mapper.py` `from_rdsap_schema_21_0_1` L1556-1942 | +| Placeholder schema | `datatypes/epc/schema/rdsap_schema_20_0_0.py` (`RdSapSchema20_0_0` L198-283; `SapWindow` L84-89) | +| Corpus test | `infrastructure/epc_client/tests/test_mapper_corpus.py` (20.0.0 is `xfail`, strict=False) | +| Schema validation helper | `datatypes/epc/schema/helpers.py:36` (raises "missing required field") | + +--- + +## What actually fails today (1000/1000) + +Failures are at the **schema-parse layer** (helpers.py:36), *not* the mapper method. +Placeholder declares fields required that the real corpus often omits: + +| Count | Missing required field | +|------:|------------------------| +| 993 | `sap_windows` | +| 6 | `lzc_energy_sources` | +| 1 | `SapBuildingPart.roof_insulation_thickness` | + +Corpus field-presence (20.0.0, n=1000): `window` 1000, `sap_windows` **7**, +`lzc_energy_sources` **68**. + +--- + +## Central finding — windows + +- 20.0.0 has **no `sap_windows` array**. Windows live as scalar aggregates: + `glazed_area` (enum, e.g. 1), `glazing_gap` ("16+"), `multiple_glazing_type` (3), + `multiple_glazed_proportion` (100), `pvc_window_frames` ("true"), plus the + `window` energy element and cert-level `windows_transmission_details` + (`u_value`, `data_source`, `solar_transmittance`). +- **There is NO RdSAP default-window-area formula anywhere in the codebase.** + SAP window area comes entirely from per-window `width × height` in `sap_windows` + (`cert_to_inputs.py:3787` → `_window_total_area_and_avg_u` L1480). 21.0.1 certs + supply that geometry; 20.0.0 supplies none (`glazed_area` is an enum, not m²). +- Therefore `sap_windows=[]` **does not crash** but models the dwelling as + **windowless**: zero solar gain, zero window heat loss. That's the core gap. + +### Downstream consumers of `sap_windows` (from trace) +- `worksheet/heat_transmission.py` L594-627, L711-731 — per-window U branch if all + windows have `window_transmission_details.u_value`, else aggregate fallback; + iterates width/height/location/wall_type for net wall area. +- `worksheet/solar_gains.py` L389-397 — per-window orientation × area × solar_transmittance. +- `worksheet/internal_gains.py` L617-627 — daylight factor; early-returns 1.433 if no windows + no rooflights. +- `sap10_ml/transform.py` L1629-1707 `_window_aggregates` — tolerant of empty (zero-counts). +- `modelling/scoring/overlay_applicator.py:48` — **indexes `sap_windows[index]`** → IndexError only if an overlay targets a window index that doesn't exist. +- `modelling/generators/glazing_recommendation.py` L87-88 — safe on empty. + +**Load-bearing SapWindow fields:** `window_width`, `window_height`, `orientation`, +`glazing_type`, `window_transmission_details` (conditional), `window_location`, +`window_wall_type`. ML-only: `frame_material`, `frame_factor`, `glazing_gap`, +`draught_proofed`, shutters, `window_type`. + +--- + +## Gap map + +| # | Gap | 20.0.0 reality | Placeholder does now | +|---|-----|----------------|----------------------| +| **A** | Schema over-constrained (1-example origin) | `sap_windows` absent 993/1000; `lzc_energy_sources` absent 932; `roof_insulation_thickness` sometimes absent | declares required → all parse fail | +| **B** | **Windows** | aggregates only (see above), no array | requires `sap_windows` array | +| **C** | Lighting | `fixed_lighting_outlets_count`, `low_energy_fixed_lighting_outlets_count`, `low_energy_lighting` (outlets + low-energy bool) | hardcodes led/cfl/incandescent **bulb** counts = 0 | +| **D** | Orphan counts | has `open_fireplaces_count`, `percent_draughtproofed`; no wet-rooms source | `wet_rooms=0`, `open_chimneys=0`, `draughtproofed_door=None` | +| **E** | Energy scores/costs | present (`energy_rating_current`, costs, co2…) | not mapped at all | + +--- + +## Decision tree / open question queue + +### ▶ Q1 (ROOT, PENDING) — What is the bar for "correct"? Who consumes 20.0.0 output? +Decides the windows strategy (B), which everything hangs off. +- **(a) Full SAP cascade parity** — must run `sap10_calculator` accurately ⇒ must synthesize per-window geometry ⇒ must implement RdSAP standard window-area formula (Table S4: area from TFA + habitable rooms, scaled by `glazed_area` enum). Biggest scope. +- **(b) ML/feature parity** — feeds `sap10_ml/transform.py`; tolerates empty windows, leans on aggregates. Synthesis optional. +- **(c) Type-safe ingest only** — parse without crashing + map every field with a real source; windows empty. + +**Recommendation:** target (a) but **stage it**: first ship schema fix + all +directly-mappable fields (→ 1000/1000 *parsing*), then synthesize a **single +aggregate `SapWindow`** from scalars (area = RdSAP standard formula × `glazed_area` +enum; `glazing_type` ← `multiple_glazing_type`; U ← `windows_transmission_details`; +draught/frame ← `pvc_window_frames`/`glazing_gap`). One window keeps total area + U +correct for heat-transmission and ML; only loses per-orientation solar split, which +20.0.0 cannot provide anyway. + +**Sub-question carried with Q1:** agree that 20.0.0 fundamentally cannot give +per-orientation window detail, so a single aggregate window is the ceiling? + +### Queued (depend on Q1) +- **Q2 (B, if synth):** Where does window AREA come from? RdSAP standard area formula (need to confirm exact RdSAP table/coefficients) × `glazed_area` enum interpretation (1/2/3 = typical/more/less?). Is there an existing helper, or net-new? +- **Q3 (B):** Map `multiple_glazing_type` (3) → domain `glazing_type` code space — is the code space identical to 21.0.1's per-window `glazing_type`, or does it need translation? +- **Q4 (B):** U-value — attach `windows_transmission_details.u_value` as the synthesized window's `window_transmission_details` (→ per-window branch) or leave None (→ aggregate fallback)? Both should agree for a single window; pick the simpler. +- **Q5 (A):** Schema required→optional policy. Proposal: data-driven — any field present in <100% of corpus becomes `Optional` with sensible default; confirm the threshold + default-vs-None per field. (`sap_windows` → default `[]`; `lzc_energy_sources` → default `[]`; `roof_insulation_thickness` → Optional.) +- **Q6 (C):** Lighting. 20.0.0 gives outlets + low-energy bool/count, not bulb split. Options: (i) leave bulb counts 0 (current), (ii) map low-energy → cfl, rest → incandescent, (iii) add outlet fields to domain. Which? +- **Q7 (D):** `open_fireplaces_count` → `open_chimneys_count` — same concept? `percent_draughtproofed` → derive `draughtproofed_door_count`? `wet_rooms_count` — any source or stays 0? +- **Q8 (E):** Map the energy scores/costs/co2 fields now (they exist in 20.0.0) or defer? Likely just wire them — they're present and 1:1. +- **Q9 (test):** Flip plan — when 1000/1000 pass, drop `xfail` + keep strict guard (per test file comment L8-12). Add field-level assertions beyond `isinstance`? + +--- + +## How to reproduce the failures + +```bash +# bucket the 20.0.0 mapping errors by type +python - <<'EOF' +import json, collections, traceback +from pathlib import Path +from datatypes.epc.domain.mapper import EpcPropertyDataMapper +certs=[json.loads(l) for l in Path("backend/epc_api/json_samples/RdSAP-Schema-20.0.0/corpus.jsonl").read_text().splitlines() if l.strip()] +b=collections.Counter() +for c in certs: + try: EpcPropertyDataMapper.from_api_response(c) + except Exception as e: + tb=traceback.extract_tb(e.__traceback__)[-1] + b[f"{type(e).__name__}: {str(e)[:80]} @ {tb.filename.split('/')[-1]}:{tb.lineno}"]+=1 +for k,v in b.most_common(): print(v,k) +EOF + +# run the corpus test (force xfail to show real failures) +python -m pytest infrastructure/epc_client/tests/test_mapper_corpus.py -k wip_schema_20 --runxfail -q +```