# SAP calculator — agent guide (start here)

This is the **canonical onboarding doc** for working on the SAP 10.2 / RdSAP 10 calculator. It is meant to get you productive **without reading any historical handover**. The `HANDOVER_*.md` files in this directory are point-in-time session notes (useful for the specific residual they chase, ignore otherwise). For deep architecture/API see [`SAP_CALCULATOR.md`](SAP_CALCULATOR.md).

Three things this doc gives you: (1) the **accuracy bar** for the two input paths, (2) the **debugging loop**, (3) the **tools & pipeline**.

---

## 0. The one-paragraph mental model

A cert's data comes in via one of two front-ends — an **Elmhurst Summary PDF** (site-notes path) or an **EPC-register API JSON** (API path). Both map to the same typed `EpcPropertyData`, which feeds a deterministic cascade that reproduces the RdSAP10 engine. Our **ground truth is the Elmhurst worksheet PDF** (U985 / P960 / dr87) — the per-line `(1)..(286)` calculation, not the rounded values the EPC register lodges. We pin the cascade against the worksheet to **abs = 1e-4 on every line ref**.

---

## 1. Accuracy expectations — site-notes vs API

The worksheet PDF is **always** the target. The EPC register's lodged SAP/CO2/PE are rounded *and* carry Elmhurst's own residual, so matching the lodged values is not the goal — matching the worksheet is.

| Path | Input | When a worksheet PDF exists for the cert | API/site-notes-only (no worksheet) |
|---|---|---|---|
| **Site-notes** | Elmhurst Summary PDF → extractor → `from_elmhurst_site_notes` | **abs = 1e-4** on continuous SAP **and every populated line ref** and cost / CO2 / PE | n/a (we always have the worksheet for site-notes fixtures) |
| **API** | register JSON → `from_api_response` | **abs = 1e-4** on continuous SAP vs the worksheet (same bar as site-notes — the two paths must converge) | **±0.5** SAP vs the lodged register value (fallback only) |

Three rules that fall out of this:

- **Cross-mapper parity.** For a cert that has both an API JSON and an Elmhurst Summary, the two paths must produce SAP within **1e-4 of each other** *and* of the worksheet. The cascade output (not a structural EPC diff) is the equivalence check. A divergence localises to one mapper.
- **No tolerance widening.** A failing 1e-4 pin is a real cascade bug or a fixture defect — diagnose it, don't relax it. No `rel=`, no `xfail`, no adaptive ceilings. ΔSAP = 0.07 is **not** "closed".
- **±0.5 is a fallback, not a destination.** It's only for API-only certs with no worksheet to check against. If you can get a worksheet, the bar is 1e-4.

Two documented, deliberate exceptions to "match the spec literal" live in [`SAP_CALCULATOR.md` §8](SAP_CALCULATOR.md) ("Elmhurst-mirrored spec divergences") — cases where the BRE-approved Elmhurst engine diverges from the SAP 10.2 text and we mirror the engine. Add a §8 row only with ≥2-cert evidence.

---

## 2. The tools & pipeline

### 2.1 The two PDFs per cert

- **`Summary_NNNNNN.pdf`** — the Elmhurst **site notes / input**. This is what the assessor lodged: dimensions, fabric, heating system, controls, cylinder, etc. It is the INPUT, equivalent to the API JSON.
- **The worksheet** — the **ground truth output**, every line ref `(1)..(286)` to 4 d.p. Three families, all the same format:
  - `U985-0001-NNNNNN.pdf` — the 6 gas-combi conformance fixtures.
  - `P960-0001-NNNNNN.pdf` — the heating-systems corpus + community heating.
  - `dr87-0001-NNNNNN.pdf` — the API-paired cohort ("Additional data with api").

### 2.2 The cascade pipeline (site-notes path)

```python
import subprocess, re
from pathlib import Path
from backend.documents_parser.elmhurst_extractor import ElmhurstSiteNotesExtractor
from datatypes.epc.domain.mapper import EpcPropertyDataMapper
from domain.sap10_calculator.rdsap.cert_to_inputs import (
    cert_to_inputs, cert_to_demand_inputs, local_climate_for_cert,
)
from domain.sap10_calculator.calculator import calculate_sap_from_inputs

# 1. Summary PDF -> per-page text (pdftotext -layout, one string per page)
def summary_pdf_to_pages(pdf: Path) -> list[str]:
    n = int(re.search(r"Pages:\s+(\d+)",
        subprocess.run(["pdfinfo", str(pdf)], capture_output=True, text=True).stdout).group(1))
    pages = []
    for i in range(1, n + 1):
        layout = subprocess.run(
            ["pdftotext", "-layout", "-f", str(i), "-l", str(i), str(pdf), "-"],
            capture_output=True, text=True).stdout
        pages.append("\n".join(
            tok for line in layout.splitlines()
            for tok in re.split(r"\s{2,}", line.strip()) if tok))
    return pages

pages = summary_pdf_to_pages(Path("sap worksheets/.../Summary_NNNNNN.pdf"))
site_notes = ElmhurstSiteNotesExtractor(pages).extract()           # -> ElmhurstSiteNotes
epc = EpcPropertyDataMapper.from_elmhurst_site_notes(site_notes)   # -> EpcPropertyData

# 2. Two cascades. RATING = SAP/EI rating (UK-avg climate, region 0).
#    DEMAND = Current Carbon / Current PE / Fuel Bill (postcode climate, PCDB Table 172).
rating = calculate_sap_from_inputs(cert_to_inputs(epc))
demand = calculate_sap_from_inputs(cert_to_demand_inputs(epc))  # climate = local_climate_for_cert(epc)

rating.sap_score_continuous       # un-rounded SAP — pin THIS, not the integer
rating.total_fuel_cost_gbp
rating.co2_kg_per_yr
demand.primary_energy_kwh_per_yr
```

Shortcut: `Sap10Calculator().calculate(epc)` runs the rating cascade (`cert_to_inputs` → `calculate_sap_from_inputs`) in one call.

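To make the page-flattening step inside `summary_pdf_to_pages` concrete, here is the same split-on-two-or-more-spaces comprehension applied to a small string. This is only a sketch: `flatten_layout_page` is a hypothetical name used for illustration, and the sample text is made up rather than taken from a real Summary PDF.

```python
import re


def flatten_layout_page(layout: str) -> str:
    """Split each layout line on runs of 2+ whitespace chars; emit one token per line."""
    return "\n".join(
        tok
        for line in layout.splitlines()
        for tok in re.split(r"\s{2,}", line.strip())
        if tok  # drop the empty token produced by blank lines
    )


# Made-up sample standing in for one page of `pdftotext -layout` output.
sample = "Main heating    Gas boiler\n\n  Cylinder size      Normal  "
flattened = flatten_layout_page(sample)
print(flattened)
```

Single spaces inside a label survive (`Main heating` stays one token), while the column gaps that `-layout` produces become line breaks, so each label and each value ends up on its own line.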
### 2.3 The API path

Identical from `EpcPropertyData` onward — only the front-end changes:

```python
import json
from pathlib import Path
from datatypes.epc.domain.mapper import EpcPropertyDataMapper

data = json.loads(Path("tests/domain/sap10_calculator/rdsap/fixtures/golden/.json").read_text())
epc = EpcPropertyDataMapper.from_api_response(data)   # -> EpcPropertyData
# ... same cert_to_inputs / calculate_sap_from_inputs as above
```

### 2.4 Section helpers — intermediate line refs

Every worksheet section has a `_section_from_cert(epc)` helper returning a typed result with the line-ref values. Use these to inspect where a residual originates **without** running the whole cascade (`postcode_climate=` selects rating vs demand):

```python
from domain.sap10_calculator.rdsap.cert_to_inputs import (
    water_heating_section_from_cert,              # §4   (42)..(65)m
    heat_transmission_section_from_cert,          # §3   (26)..(37)
    internal_gains_section_from_cert,             # §5   (66)..(73)
    mean_internal_temperature_section_from_cert,  # §7   (85)..(94)
    space_heating_section_from_cert,              # §8   (95)..(99)
    fuel_cost_section_from_cert,                  # §10a (240)..(255)
    environmental_section_from_cert,              # §12  (261)..(274)
    primary_energy_section_from_cert,             # §13a (275)..(286)
)

wh = water_heating_section_from_cert(epc)
wh.energy_content_monthly_kwh   # (45)m
wh.output_kwh_per_yr            # (62)/(64)
```

(Full table of helpers + line refs is in [`SAP_CALCULATOR.md` §1.3](SAP_CALCULATOR.md).)

### 2.5 Reading the worksheet from the shell

```bash
# Dump a worksheet line ref (e.g. (217)m water-heater monthly efficiency):
pdftotext -layout "sap worksheets/.../P960-0001-NNNNNN.pdf" - | grep -nE "\(217\)|\(62\)|\(210\)"

# Read a Summary input field (controls, cylinder, fuel):
pdftotext -layout "sap worksheets/.../Summary_NNNNNN.pdf" - | grep -niE "cylinder|control|interlock|fuel"
```

### 2.6 Where the test vectors live

| Set | Location | What |
|---|---|---|
| 6 U985 conformance fixtures | `tests/domain/sap10_calculator/worksheet/_elmhurst_worksheet_NNNNNN.py` (+ Summary PDFs in `backend/documents_parser/tests/fixtures/`) | Gas-combi certs, every line ref transcribed as `LINE_*` / `DEMAND_LINE_*` constants. Pinned in `worksheet/test_section_cascade_pins.py` + `worksheet/test_e2e_elmhurst_sap_score.py`. |
| Heating-systems corpus | `sap worksheets/heating systems examples//` (Summary + P960) | 41 variants of **one property** with only the heating system changed → any residual is attributable to the heating subsystem. Pinned in `backend/documents_parser/tests/test_heating_systems_corpus.py`. |
| API golden fixtures | `tests/domain/sap10_calculator/rdsap/fixtures/golden/.json` | Register JSON for the API path. |
| API + worksheet pairs | `sap worksheets/Additional data with api//` (Summary + dr87) | Certs that have BOTH an API JSON and a worksheet → cross-mapper parity checks. |

---

## 3. The debugging loop

When a cert's SAP/cost/CO2/PE is off, **never guess a fix** — walk it.

1. **Reproduce & decompose.** Build the epc (extractor+mapper, or a fixture's `build_epc()`), run both cascades, and see **which of the four outputs** drifts. Cost/CO2/PE drift with the same sign as energy; isolate the carrier.
2. **Find the section.** Walk the four metrics back to a worksheet section: SAP off but cost EXACT often means a demand/gains issue; cost off but energy EXACT means a price/factor issue; CO2/PE off but cost EXACT means a factor issue. Use the §2.4 section helpers to get the cascade's intermediate line refs.
3. **Per-line compare vs the worksheet.** `pdftotext -layout` the worksheet and compare the cascade's `(45)/(56)/(62)/(210)/(217)m/...` line-by-line against the PDF. The first diverging line ref is the bug.
4. **Localise to a layer.**
   - the value is present in the worksheet but the cascade has 0 or a wrong value → **calculator** gap (a spec rule not wired, or a dispatch gate).
   - the input field the worksheet used isn't in `epc` → **mapper** (mis-mapped) or **extractor** (didn't capture the Summary field). Audit the Summary PDF for the field first — many lodgements are incomplete and the fixture, not the calculator, is wrong.
5. **Cite the spec.** Find the SAP 10.2 / RdSAP 10 rule (page + line) that produces the worksheet's number. Confirm the worksheet matches the spec literal; if it diverges, it's a candidate §8 Elmhurst-mirror (needs ≥2-cert evidence). **SAP 10.2 only — never 10.3.**
6. **Cross-check vs API (when available).** If the cert has an API JSON too, run `from_api_response` through the same cascade. If the API path matches the worksheet but the site-notes path doesn't (or vice-versa), the bug is in **that mapper**, not the calculator. If both diverge identically, it's the **calculator/cascade**.
7. **Fix one cause, re-pin smaller.** TDD: one failing AAA test → one impl → re-pin the (now smaller) residual. A spec-correct fix often **exposes** the next residual that an offsetting bug was masking — that's the next slice, not a regression.

Don't conflate `main_heating_category` (often `None` on Elmhurst Table 4b boilers) with `sap_main_heating_code`.

### Worked shape (real example: oil 6)

Residual +3.05 SAP. (1) HW + space both off. (2) §4 HW efficiency. (3) worksheet (210) space eff = 75 but Table 4b code 126 = 80; (217)m summer = 63 = 68−5 → a −5pp penalty. (4) the Summary lodges control `2101` ("no thermostatic control of room temperature") → no room thermostat → P960 header "Boiler Interlock: No". (5) RdSAP 10 §3 + SAP 10.2 Table 4c(2): no room thermostat ⇒ not interlocked ⇒ −5pp Space+DHW. Fix the `no_interlock` gate → space+HW fuel EXACT, residual collapses to a single exposed pump cause (Table 4f footnote a) ×1.3) → next slice. Two slices, fully closed.

---

## 4. Run the suite

```bash
PYTHONPATH=/workspaces/model python -m pytest \
  tests/domain/sap10_calculator/ \
  backend/documents_parser/tests/ \
  --no-cov -q -p no:cacheprovider
```

Conformance pins only:

```bash
PYTHONPATH=/workspaces/model python -m pytest \
  tests/domain/sap10_calculator/worksheet/test_section_cascade_pins.py \
  tests/domain/sap10_calculator/worksheet/test_e2e_elmhurst_sap_score.py \
  backend/documents_parser/tests/test_heating_systems_corpus.py \
  --no-cov -q
```

Notes:

- `load_cells` tests pin against the gitignored `*.xlsx` reference worksheet at repo root; they **skip** when it's absent (CI), run locally when present.
- All new code passes `pyright` strict, zero errors. Tests use literal `# Arrange / # Act / # Assert` headers and `abs(x - y) <= tol` (not `pytest.approx`, which strict-pyright flags).
- Commit one slice per change, with the spec citation in the message.

---

## 5. Spec PDFs on disk

```
domain/sap10_calculator/docs/specs/
  sap-10-2-full-specification-2025-03-14.pdf   # SAP 10.2 (the methodology)
  RdSAP 10 Specification 10-06-2025.pdf        # RdSAP 10 (the reduced-data rules)
  pcdb10.dat / pcdb_table_*.jsonl              # PCDB (boilers, HPs, postcode weather)
```

Pages worth bookmarking: SAP 10.2 §7 MIT (p.28-32), Table 4b boiler eff (p.168), Table 4c efficiency adjustments (p.169), Table 4e controls (p.171-174), Table 4f auxiliary energy (p.175), Table 12 factors (p.191), Appendix U region tables (p.124-127). RdSAP 10 §10 water heating (p.54-56, incl. §10.7 no-water-heating default), Table 28/29 cylinder defaults, Table 32 prices (p.95).
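
Step 3 of the debugging loop (§3) says the first diverging line ref is the bug. A minimal sketch of that comparison follows, assuming the cascade and worksheet values have already been collected into plain dicts keyed by line ref, and that a lower line ref sits earlier in the calculation. `first_divergence` and both dicts are hypothetical names for illustration, not part of the repo. The (210) values echo the oil 6 example; the other numbers are made up.

```python
from typing import Optional

TOL = 1e-4  # the absolute pin tolerance used throughout this guide


def first_divergence(
    cascade: dict[int, float],
    worksheet: dict[int, float],
    tol: float = TOL,
) -> Optional[tuple[int, Optional[float], float]]:
    """Return (line_ref, cascade_value, worksheet_value) for the lowest
    worksheet line ref that the cascade misses or differs on by more than tol.
    Return None when every populated worksheet line matches."""
    for ref in sorted(worksheet):
        expected = worksheet[ref]
        actual = cascade.get(ref)  # None means the cascade never produced this line
        if actual is None or abs(actual - expected) > tol:
            return ref, actual, expected
    return None


# Illustrative values only: (45) and (62) are invented, (210) mirrors oil 6.
worksheet = {45: 2100.1234, 62: 2650.5, 210: 75.0}
cascade = {45: 2100.1234, 62: 2650.5, 210: 80.0}

result = first_divergence(cascade, worksheet)
print(result)
```

The check uses a plain `abs(x - y) <= tol` comparison, in line with the test convention in §4, so the same helper shape could sit inside a pin test without pulling in `pytest.approx`.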