save so i can continue

This commit is contained in:
Jun-te Kim 2026-06-09 17:07:55 +00:00
parent 3b7d26fe34
commit 14dc4efeed

View file

@ -0,0 +1,148 @@
# Grill session — RdSAP-Schema-20.0.0 → EpcPropertyData remapper
**Date:** 2026-06-09 · **Branch:** `feature/junte+khalim` · **Status:** paused at Q1 (awaiting answer)
Resume by re-running `/grill-me` and feeding it this file, or just answer **Q1** below and continue.
---
## Goal
Make `RdSAP-Schema-20.0.0` certs map cleanly to `EpcPropertyData` via
`EpcPropertyDataMapper`, the same way 21.0.1 already does. Two corpora of 1000
real certs each:
- `backend/epc_api/json_samples/RdSAP-Schema-21.0.1/corpus.jsonl` — supported, 1000/1000 pass.
- `backend/epc_api/json_samples/RdSAP-Schema-20.0.0/corpus.jsonl` — target, currently 1000/1000 **fail**.
User notes:
- The current `RdSapSchema20_0_0` schema was a **placeholder generated from a single EPC JSON example** — that's why it over-constrains.
- "Do it correctly for each one" — map field-by-field correctly, not just make it type-check.
---
## Key code locations
| Thing | Path |
|-------|------|
| Domain target | `datatypes/epc/domain/epc_property_data.py` (`EpcPropertyData` L607-788; `SapWindow` L255-270) |
| Mapper (dispatcher) | `datatypes/epc/domain/mapper.py``from_api_response` L1984; dispatch by `schema_type` |
| 20.0.0 mapper method | `datatypes/epc/domain/mapper.py` `from_rdsap_schema_20_0_0` L1078-1275 |
| 21.0.1 mapper method (reference) | `datatypes/epc/domain/mapper.py` `from_rdsap_schema_21_0_1` L1556-1942 |
| Placeholder schema | `datatypes/epc/schema/rdsap_schema_20_0_0.py` (`RdSapSchema20_0_0` L198-283; `SapWindow` L84-89) |
| Corpus test | `infrastructure/epc_client/tests/test_mapper_corpus.py` (20.0.0 is `xfail`, strict=False) |
| Schema validation helper | `datatypes/epc/schema/helpers.py:36` (raises "missing required field") |
---
## What actually fails today (1000/1000)
Failures are at the **schema-parse layer** (helpers.py:36), *not* the mapper method.
Placeholder declares fields required that the real corpus often omits:
| Count | Missing required field |
|------:|------------------------|
| 993 | `sap_windows` |
| 6 | `lzc_energy_sources` |
| 1 | `SapBuildingPart.roof_insulation_thickness` |
Corpus field-presence (20.0.0, n=1000): `window` 1000, `sap_windows` **7**,
`lzc_energy_sources` **68**.
---
## Central finding — windows
- 20.0.0 has **no `sap_windows` array**. Windows live as scalar aggregates:
`glazed_area` (enum, e.g. 1), `glazing_gap` ("16+"), `multiple_glazing_type` (3),
`multiple_glazed_proportion` (100), `pvc_window_frames` ("true"), plus the
`window` energy element and cert-level `windows_transmission_details`
(`u_value`, `data_source`, `solar_transmittance`).
- **There is NO RdSAP default-window-area formula anywhere in the codebase.**
SAP window area comes entirely from per-window `width × height` in `sap_windows`
(`cert_to_inputs.py:3787``_window_total_area_and_avg_u` L1480). 21.0.1 certs
supply that geometry; 20.0.0 supplies none (`glazed_area` is an enum, not m²).
- Therefore `sap_windows=[]` **does not crash** but models the dwelling as
**windowless**: zero solar gain, zero window heat loss. That's the core gap.
### Downstream consumers of `sap_windows` (from trace)
- `worksheet/heat_transmission.py` L594-627, L711-731 — per-window U branch if all
windows have `window_transmission_details.u_value`, else aggregate fallback;
iterates width/height/location/wall_type for net wall area.
- `worksheet/solar_gains.py` L389-397 — per-window orientation × area × solar_transmittance.
- `worksheet/internal_gains.py` L617-627 — daylight factor; early-returns 1.433 if no windows + no rooflights.
- `sap10_ml/transform.py` L1629-1707 `_window_aggregates` — tolerant of empty (zero-counts).
- `modelling/scoring/overlay_applicator.py:48`**indexes `sap_windows[index]`** → IndexError only if an overlay targets a window index that doesn't exist.
- `modelling/generators/glazing_recommendation.py` L87-88 — safe on empty.
**Load-bearing SapWindow fields:** `window_width`, `window_height`, `orientation`,
`glazing_type`, `window_transmission_details` (conditional), `window_location`,
`window_wall_type`. ML-only: `frame_material`, `frame_factor`, `glazing_gap`,
`draught_proofed`, shutters, `window_type`.
---
## Gap map
| # | Gap | 20.0.0 reality | Placeholder does now |
|---|-----|----------------|----------------------|
| **A** | Schema over-constrained (1-example origin) | `sap_windows` absent 993/1000; `lzc_energy_sources` absent 932; `roof_insulation_thickness` sometimes absent | declares required → all parse fail |
| **B** | **Windows** | aggregates only (see above), no array | requires `sap_windows` array |
| **C** | Lighting | `fixed_lighting_outlets_count`, `low_energy_fixed_lighting_outlets_count`, `low_energy_lighting` (outlets + low-energy bool) | hardcodes led/cfl/incandescent **bulb** counts = 0 |
| **D** | Orphan counts | has `open_fireplaces_count`, `percent_draughtproofed`; no wet-rooms source | `wet_rooms=0`, `open_chimneys=0`, `draughtproofed_door=None` |
| **E** | Energy scores/costs | present (`energy_rating_current`, costs, co2…) | not mapped at all |
---
## Decision tree / open question queue
### ▶ Q1 (ROOT, PENDING) — What is the bar for "correct"? Who consumes 20.0.0 output?
Decides the windows strategy (B), which everything hangs off.
- **(a) Full SAP cascade parity** — must run `sap10_calculator` accurately ⇒ must synthesize per-window geometry ⇒ must implement RdSAP standard window-area formula (Table S4: area from TFA + habitable rooms, scaled by `glazed_area` enum). Biggest scope.
- **(b) ML/feature parity** — feeds `sap10_ml/transform.py`; tolerates empty windows, leans on aggregates. Synthesis optional.
- **(c) Type-safe ingest only** — parse without crashing + map every field with a real source; windows empty.
**Recommendation:** target (a) but **stage it**: first ship schema fix + all
directly-mappable fields (→ 1000/1000 *parsing*), then synthesize a **single
aggregate `SapWindow`** from scalars (area = RdSAP standard formula × `glazed_area`
enum; `glazing_type``multiple_glazing_type`; U ← `windows_transmission_details`;
draught/frame ← `pvc_window_frames`/`glazing_gap`). One window keeps total area + U
correct for heat-transmission and ML; only loses per-orientation solar split, which
20.0.0 cannot provide anyway.
**Sub-question carried with Q1:** agree that 20.0.0 fundamentally cannot give
per-orientation window detail, so a single aggregate window is the ceiling?
### Queued (depend on Q1)
- **Q2 (B, if synth):** Where does window AREA come from? RdSAP standard area formula (need to confirm exact RdSAP table/coefficients) × `glazed_area` enum interpretation (1/2/3 = typical/more/less?). Is there an existing helper, or net-new?
- **Q3 (B):** Map `multiple_glazing_type` (3) → domain `glazing_type` code space — is the code space identical to 21.0.1's per-window `glazing_type`, or does it need translation?
- **Q4 (B):** U-value — attach `windows_transmission_details.u_value` as the synthesized window's `window_transmission_details` (→ per-window branch) or leave None (→ aggregate fallback)? Both should agree for a single window; pick the simpler.
- **Q5 (A):** Schema required→optional policy. Proposal: data-driven — any field present in <100% of corpus becomes `Optional` with sensible default; confirm the threshold + default-vs-None per field. (`sap_windows` default `[]`; `lzc_energy_sources` default `[]`; `roof_insulation_thickness` Optional.)
- **Q6 (C):** Lighting. 20.0.0 gives outlets + low-energy bool/count, not bulb split. Options: (i) leave bulb counts 0 (current), (ii) map low-energy → cfl, rest → incandescent, (iii) add outlet fields to domain. Which?
- **Q7 (D):** `open_fireplaces_count``open_chimneys_count` — same concept? `percent_draughtproofed` → derive `draughtproofed_door_count`? `wet_rooms_count` — any source or stays 0?
- **Q8 (E):** Map the energy scores/costs/co2 fields now (they exist in 20.0.0) or defer? Likely just wire them — they're present and 1:1.
- **Q9 (test):** Flip plan — when 1000/1000 pass, drop `xfail` + keep strict guard (per test file comment L8-12). Add field-level assertions beyond `isinstance`?
---
## How to reproduce the failures
```bash
# bucket the 20.0.0 mapping errors by type
python - <<'EOF'
import json, collections, traceback
from pathlib import Path
from datatypes.epc.domain.mapper import EpcPropertyDataMapper
certs=[json.loads(l) for l in Path("backend/epc_api/json_samples/RdSAP-Schema-20.0.0/corpus.jsonl").read_text().splitlines() if l.strip()]
b=collections.Counter()
for c in certs:
try: EpcPropertyDataMapper.from_api_response(c)
except Exception as e:
tb=traceback.extract_tb(e.__traceback__)[-1]
b[f"{type(e).__name__}: {str(e)[:80]} @ {tb.filename.split('/')[-1]}:{tb.lineno}"]+=1
for k,v in b.most_common(): print(v,k)
EOF
# run the corpus test (force xfail to show real failures)
python -m pytest infrastructure/epc_client/tests/test_mapper_corpus.py -k wip_schema_20 --runxfail -q
```