Replaces the two hardcoded glazing defaults (g⊥=0.63, FF=0.7) in the
cert→inputs mapper with spec-driven lookups:
- g_perpendicular by glazing_type (Table 6b):
single → 0.85, double 2002+ → 0.72, low-E soft → 0.63,
secondary → 0.76, triple → 0.68. Default 0.72 when missing.
- frame_factor by frame_material (Table 6c):
wood/PVC/composite → 0.70, aluminium/steel/metal → 0.83.
Measured values from window_transmission_details / SapWindow.frame_factor
still take precedence. Overshading factor stays at 0.77 ("average") since
RdSAP 10 doesn't lodge a per-window overshading code.
100-cert parity probe:
MAE 5.65 → 5.70 (flat)
exact-match within ±1: 18% → 20%
bias +1.13 → +1.50
Slight bias drift toward over-prediction is expected — bigger solar
gains reduce predicted heating demand. Net: the engine is now more
spec-correct (more exact matches), but composition of errors elsewhere
needs the next slice to bring bias back toward 0.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Maps the Table 9 main_heating_control code to SAP control type 1/2/3:
codes 2101-2104 = type 1, 2105-2109 = type 2, 2110+ = type 3. Default
remains type 2 when code is missing or unrecognised.
Two other fixes tried-and-reverted in this slice based on the 100-cert
parity probe:
- NI-thickness → None (the "wall insulated but thickness unknown,
use 50mm row" path): over-corrected in aggregate because many "NI"
certs are genuinely uninsulated. Reverted to legacy NI→0 with a
note to revisit once wall_insulation_type is used as a stronger
signal.
- boiler-age efficiency rescue (cat 1/2, A-F → 0.74, K-M → 0.85):
same issue — stacked with NI fix it over-shot, on its own it gave
marginal MAE without bias improvement. Dropped pending further
investigation.
100-cert parity probe:
MAE 5.72 → 5.65 (-0.07; control-type-only is a small net win)
RMSE 7.58 → 7.48 (-0.10)
bias +1.20 → +1.13
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Splits the single CalculatorInputs.fuel_unit_cost_gbp_per_kwh into three
end-use lines — space_heating, hot_water, other — to match SAP 10.3 §12
which charges different tariffs per end-use on Economy-7 dwellings.
cert→inputs rule: when sap_main_heating_code is in the electric-storage
(401-409), high-heat-retention storage (421-425), or direct-electric
(191-196) ranges, space heating bills at the 7h-low rate (5.5p/kWh)
while hot water + lighting + pumps stay on standard electricity
(13.19p/kWh). All other fuels use a single rate across all three end-
uses.
100-cert parity probe impact:
MAE 7.53 → 5.72 (-1.81, -24%)
RMSE 11.60 → 7.58 (-4.02, -35%)
worst residual -56 → -25 (Semi-detached bungalow)
within ±10: 85% → 91%
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds services/ml_training_data/src/ml_training_data/sap_parity_probe.py
— samples N certs from the v18a corpus, streams them via BulkZipReader,
runs Sap10Calculator, prints MAE/RMSE/bias + worst-N residuals. Baseline
across 100 certs: MAE 8.41, RMSE 13.98, bias -2.65, 0 errors.
docs/sap-spec/PARITY_FINDINGS.md captures the dominant failure pattern
(flats + bungalows under-predicted, 10 of the worst-15 are flats whose
floor/roof are party with neighbouring dwellings) and the priority-
ordered Session B iteration backlog (S-B-flat-surfaces first).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Pure-function ParityCase / ParityReport / build_parity_report for the
Session B 1000-cert parity check (ADR-0009). Aggregates per-cert
(predicted, actual) sap pairs into global + typical-subset MAE, RMSE,
bias, and the worst-N residuals for spec-iteration. Cert→case mapping
(corpus load, calculator run, actual-sap lookup) sits at a higher
layer; this module is trivial to test so the harder integration code
inherits its testing.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds domain.sap.rdsap.cert_to_inputs.cert_to_inputs(epc) which produces a
typed CalculatorInputs from an EpcPropertyData, and a thin
Sap10Calculator.calculate(epc) entry point that wraps the mapper + the
S-A7a orchestrator. Defaults follow RdSAP 10 (Table 27 for living-area
fraction, Table 5 for ventilation, Table 12 for fuel cost + CO2 factor)
and SAP 10.3 Tables 4a/4b for heating efficiency via the existing
domain.ml.sap_efficiencies cascade.
Deferred to Session B: conservatory modes, room-in-roof, secondary
heating split (Table 11), multi-fuel weighted cost, thermal-mass
parameter from construction type, control-temp adjustment from
main_heating_control code.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Wires SAP 10.3 §§5-13 into a 12-month heat-balance loop driven by a typed
CalculatorInputs aggregate, returning a typed SapResult with the score,
ECF, costs/CO2 totals, and a 12-entry monthly breakdown. Physics
assembly only — the cert→inputs mapper lands in S-A7b. η/T_internal
solved with two-pass iteration per SAP 10.3 §7.3.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Tenth slice of the SAP10 Calculator Session A (ADR-0009). Ships four
pure functions under domain.sap.worksheet.rating implementing the SAP
10.3 rating formulas:
energy_cost_factor(total_cost_gbp, total_floor_area_m2)
-> equation (7): ECF = 0.36 × cost / (TFA + 45)
Deflator 0.36 sourced from Table 12 (page 191).
sap_rating(ecf)
-> equations (8)/(9), continuous (un-rounded) SAP value:
ECF ≥ 3.5: 108.8 − 120.5 × log10(ECF)
ECF < 3.5: 100 − 16.21 × ECF
Naturally rises above 100 for net energy exporters (negative ECF).
sap_rating_integer(ecf)
-> integer SAP value as published on the EPC: round to nearest, clamp
to minimum 1 per §13.
environmental_impact_rating(co2_emissions_kg_per_yr, total_floor_area_m2)
-> equations (10)-(12), continuous EI rating:
CF = CO2 / (TFA + 45)
CF ≥ 28.3: 200 − 95 × log10(CF)
CF < 28.3: 100 − 1.34 × CF
8 AAA cycles cover: ECF formula hand-computed, SAP linear branch (typical
home), SAP log branch (high cost), boundary continuity at ECF=3.5,
net-exporter SAP > 100, integer rounding + min-1 clamp, EI linear branch,
EI log branch.
Orchestrator (S-A7) wires these into Sap10Calculator alongside the monthly
heat balance loop from S-A5e.
Ninth slice of the SAP10 Calculator Session A (ADR-0009). Ships
monthly_heat_requirement_kwh implementing the Table 9c step-10 formula:
L_m = H × (T_i,m − T_e,m) (W)
Q_heat,m = 0.024 × (L_m − η_m × G_m) × n_m (kWh)
with the table's clamp: Q_heat is set to 0 when negative or below 1 kWh
per month (summer months and well-insulated dwellings in shoulder
months).
The orchestrator (S-A6) iterates utilisation factor + mean internal
temperature until they converge before calling this function.
5 AAA cycles cover: typical-winter-month hand-computed worked example,
summer month with gains exceeding losses clamping to 0, gains-scaling
direction check, external-temperature direction check, and the sub-1-kWh
clamp per the Table 9c note.
Eighth slice of the SAP10 Calculator Session A (ADR-0009). Implements
SAP 10.3 mean internal temperature with three public helpers under
domain.sap.worksheet.mean_internal_temperature:
elsewhere_heating_temperature_c(hlp, control_type)
-> Table 9 T_h2 formula:
control type 1: T_h2 = 21 − 0.5 × HLP
control type 2 or 3: T_h2 = 21 − HLP + HLP² / 12
HLP clamped to 6.0 per Table 9 note (e).
off_period_temperature_reduction_c(t_off, T_h, T_e, R, G, H, η, τ)
-> Table 9b u value (°C drop below T_h over an off-period):
t_c = 4 + 0.25·τ
T_sc = (1−R)(T_h−2) + R·(T_e + η·G/H)
quadratic branch when t_off ≤ t_c, linear when t_off > t_c.
mean_internal_temperature_c(...)
-> Table 9c steps 1-8: living-area zone (off 7+8 h, T_h1=21°C) and
elsewhere zone (off 7+8 h for control 1/2 or 9+8 h for control 3,
T_h2 from above), blended by living_area_fraction, plus the
Table 4e control-type temperature adjustment.
Step 9 (re-compute utilisation factor with the new T_i) and step 10
(Q_heat = 0.024 × (L − η·G) × n_m) live in the next slice's monthly loop.
7 AAA cycles cover: T_h2 formulas for control types 1 vs 2, HLP > 6 clamp
per note (e), off-period u quadratic branch (t_off ≤ t_c), off-period u
linear branch (t_off > t_c), full mean_internal_temperature hand-computed
worked example, and control-type-3 longer first off-period dropping mean
temp slightly below control-type-2.
Seventh slice of the SAP10 Calculator Session A (ADR-0009). Ships
utilisation_factor(*, total_gains_w, heat_loss_rate_w, time_constant_h)
implementing SAP 10.3 Table 9a:
a = 1 + τ / 15
γ = G / L
if γ > 0 and γ ≠ 1: η = (1 − γ^a) / (1 − γ^(a+1))
if γ = 1: η = a / (a + 1)
if heat_loss_rate ≤ 0: η = 1 (dwelling in net surplus)
η caps the contribution of internal + solar gains when they outpace the
heat-loss rate. The orchestrator computes time_constant_h = TMP /
(3.6 × HLP) and passes it in here; that's a future slice.
5 AAA cycles cover: small γ → η ≈ 1, γ = 1 special-case formula,
zero/negative heat loss returning η = 1, large γ dropping η well below
0.5, and higher τ (more thermal mass) raising η for the same γ.
Sixth slice of the SAP10 Calculator Session A (ADR-0009). Two layers
under domain.sap.worksheet.solar_gains:
1. surface_solar_flux_w_per_m2(orientation, pitch_deg, region, month)
— implements Appendix U §U3.2 polynomial that converts the horizontal
solar irradiance from Table U3 to per-orientation per-pitch surface
flux:
S(orient, p, m) = S_h,m × R_h-inc
R_h-inc = A cos²(φ-δ) + B cos(φ-δ) + C
where A, B, C are cubics in sin(p/2) with coefficients k1-k9 from
Table U5. Reads latitude φ from Table U4 and solar declination δ
from Table U3 footer (already in domain.sap.climate.appendix_u).
2. window_solar_gain_w(area_m2, surface_flux, g⊥, FF, Z)
— implements §6.1 equation (5): G = 0.9 × A × S × g⊥ × FF × Z.
Orientation enum maps the 8 SAP cardinal codes to the 5 Table U5 columns:
N/S to their own column; NE/NW share; E/W share; SE/SW share.
7 AAA cycles cover: UK average South vertical July hand-computed flux,
rooflight pitch=0 collapses to horizontal Table U3 directly, North-vertical
summer > winter (diffuse signal), NE/NW share constants symmetry, equation
(5) window gain, zero-area edge case, out-of-range region validation.
Tables 6b (g⊥), 6c (frame factor), 6d (overshading Z) defaults deferred
to the cert→inputs mapper slice — callers pass them explicitly here so
the physics stays cert-shape-independent.
Fourth slice of the SAP10 Calculator Session A (ADR-0009). Ports the
per-element conduction HLC logic out of domain.ml.envelope into a typed
HeatTransmission breakdown under domain.sap.worksheet. Aggregates Σ U×A
across walls, roof, floor, party walls, windows, doors, plus thermal-
bridging y × total exposed area, summed across every building part.
The orchestrator can now read walls_w_per_k / roof_w_per_k / floor_w_per_k
etc. directly off the result for audit + monthly-loop wiring, rather than
seeing a single envelope_heat_loss scalar.
U-value cascade still routes through domain.ml.rdsap_uvalues (migrates to
domain.sap.rdsap.cascade_defaults in Session B per ADR-0009 module-layout
plan). domain.ml.envelope stays in place to keep the ML transform's
physics-feature pipeline running until Session B.
6 AAA cycles cover: per-element breakdown for a baseline age-G cavity
mid-terrace, window net-wall subtraction, insulated-door U-value blending,
cavity-party-wall contribution per Table 15, thermal-bridging scaling by
age band per Table 21, and multi-part (main + extension) aggregation.
192 tests pass across domain.sap + domain.ml — no regressions.
Third slice of the SAP10 Calculator Session A (ADR-0009). Ports the SAP
10.2 / RdSAP10 §4.1 air-change-rate worksheet for the no-pressure-test
path. Returns an InfiltrationBreakdown carrying each named worksheet line
so callers can audit per SAP convention:
(8) openings_ach — Table 2.1 rate × count / volume
(10) additional_ach — (storey_count − 1) × 0.1
(11) structural_ach — 0.25 steel/timber-frame, 0.35 masonry
(12) floor_ach — 0.2 unsealed timber / 0.1 sealed / 0
(13) draught_lobby_ach — 0.05 absent, 0.0 present
(15) window_ach — 0.25 − 0.2 × (pct_dp / 100)
(16) total_ach — sum of all of the above
Table 2.1 rates: open chimney 80, open flue 20, closed-fire chimney 10,
solid-fuel-boiler chimney 20, other-heater chimney 35, blocked chimney
20, intermittent fan 10, passive vent 10, flueless gas fire 40 (all
m³/hour per opening).
9 AAA cycles cover the baseline calculation, each Table 2.1 opening
contribution, frame-vs-masonry structural baseline, suspended-timber
floor sealed/unsealed split, draught-lobby presence, window draught-
proofing scale, multi-opening aggregation, and volume_m3 ≤ 0 validation.
Pressure-test override (worksheet lines 17-21) and mechanical-ventilation
adjustments (Table 4g, n_eff formula §2.6.6) are out of scope for this
slice — separate later slices per ADR-0009.
Second slice of the SAP10 Calculator Session A (ADR-0009). Ships a frozen
Dimensions dataclass + dimensions_from_cert(epc) pure function under
domain/sap/worksheet/. Aggregates geometry across every sap_building_parts
entry (main dwelling + each extension): total floor area, volume, storey
count, area-weighted average storey height, ground/top floor area,
ground-floor heat-loss perimeter, gross wall area, party wall area.
Top-level epc.total_floor_area_m2 is the authoritative TFA; per-storey
sums drive the wall-area calculations. Volume = TFA × avg_storey_height.
5 AAA cycles cover: single-storey single-part, two-storey scaling,
main+extension aggregation, empty-cert fallback to default 2.5 m height,
and a non-default-height terrace exercising party-wall scaling.
Edge cases (porches, conservatories, integral garages, RIR storey
treatment) deferred to later slices per ADR-0009 Session A scope.
First slice of the SAP10 Calculator Session A (ADR-0009). Ships the three
SAP 10.3 Appendix U monthly tables across 22 climate regions (region 0 =
UK average; 1-21 named per spec) as a pure-data module under the new
domain/sap/ package:
- Table U1: mean external temperature (°C)
- Table U2: wind speed (m/s)
- Table U3: mean global solar irradiance on horizontal plane (W/m²)
- Table U3 footer: monthly solar declination (°, region-independent)
Lookups validate region (0..21) and month (1..12) and raise ValueError
on out-of-range inputs. 11 AAA tests cover happy-path lookups across
multiple regions/months plus boundary and error cases.
Promotes ADR-0009 from Proposed to Accepted after the grill-with-docs
session resolved all seven open questions. Bundles the SAP 10.3 and
RdSAP 10 specifications under docs/sap-spec/ plus a calculator design
sketch (module layout, monthly-loop pseudo-code, status table).
CONTEXT.md adds three new domain terms parallel to existing performance
language:
- Calculated SAP10 Performance (parallel to Effective / Lodged)
- SAP10 Calculation (process; implemented by Sap10Calculator)
- Measure Application (process; implemented by MeasureApplicator)
ML pipeline is NOT retired — it stays as the residual head once the
calculator reaches parity in Session B. ADR-0009 §"Grill outcomes" carries
the seven binding scope decisions plus three Session-A-scope changes
discovered during the grill (RdSAP §19 EER formula, SAP 10.2 Appendix A
cross-reference, RdSAP Table 29 cascade defaults).
v20a added ventilation_heat_loss_w_per_k as a standalone feature but never
connected it to the HLC inside predicted_space_heating_kwh, so the
downstream physics aggregates (predicted_ecf, predicted_total_fuel_cost,
predicted_log10_ecf — the top-10 model features) never saw the
infiltration signal. Importance for ventilation_heat_loss_w_per_k was rank
58/196 (importance 30) vs envelope's rank 21 (86).
Adds the ventilation column to the envelope-conduction HLC before
applying HDH and efficiency, so chimney + draught-proofing signals flow
through the physics aggregates the model actually uses. Default 0 keeps
backwards compatibility.
Adds SAP10.2 §C tracer-bullet infiltration model as a new physics-as-feature
column alongside envelope_heat_loss_w_per_k. ACH = structural baseline
(0.35 masonry / 0.25 timber-or-system-built) + open chimneys at 40 m³/h each
minus a draught-proofing reduction scaled by window_pct_draught_proofed,
then volumed and converted to W/K. Targets the d0 catastrophic-low-SAP tail
where chimney + leakage signals dominate but envelope conduction alone
under-counts heat loss.
Scope deferred to follow-ups: MVHR/MEV factors (mechanical_ventilation is
100% null in the corpus), pressure-test override (pressure_test also 100%
null - slice 18e mapper fix), open flues / passive vents / flueless gas
fires (sap_ventilation sparsely populated).
Many real certs carry main_heating_category=4 (heat pump) but null
sap_main_heating_code, so seasonal_efficiency() was returning the 0.80
gas-boiler default — a 3x COP under-count that dragged the high-SAP
heat-pump tail. Adds main_heating_category + main_fuel_type fallbacks:
cat=4 -> 2.30, cat=7 -> 1.00, cat=10 routes by fuel
(electric=1.00, gas=0.55, oil=0.65), cat=5 warm air -> 0.76.
Explicit SAP codes still win.
When wall_construction integer is missing or WALL_UNKNOWN, u_wall now
parses the top-level walls[i].description for material keywords
(sandstone/limestone/granite/whinstone/cob/system built/timber frame/
solid brick/cavity) before falling through to the cavity-by-age default.
Explicit construction codes still win. Threaded through
envelope_heat_loss_w_per_k via a joined wall description string off the
top-level walls list.
Table 18 age-band roof defaults assume joist insulation >= 100mm, which
mis-rates heritage roofs the surveyor explicitly described as
uninsulated. u_roof now reads roofs[i].description and routes
"no insulation" / "uninsulated" -> 2.30 W/m^2K and "limited insulation"
-> 1.50 W/m^2K, threaded through envelope_heat_loss_w_per_k via a single
joined description string off the top-level roofs list.
Explicit insulation_thickness_mm still wins over description.
predicted_total_fuel_cost_gbp was silently mispricing every non-gas
property because primary_main_fuel_type / water_heating_fuel store the
gov EPC API enum (26=mains gas, 27=LPG, 28=oil, 29=electricity) and our
_FUEL_UNIT_PRICE dict is keyed by Table 32 codes (1=gas, 4=oil, 30=elec).
Codes 26-29 hit the dict's default 3.48 p/kWh -- silently treating
electric immersion as gas.
Concrete impact on OX1 5LR Sep 2025 cert (worst-predicted SAP=41, model
84): water_heating_fuel=29 (electric immersion). Real DHW cost 2941 kWh
* 13.19p = £388/yr; we computed 2941 * 3.48 = £102 (4x under). Net
predicted_total_fuel_cost £292 vs implied real £2513 -- predicted_ecf
0.49 (~SAP 93) vs real ECF 4.24 (SAP 41).
Effect: every off-gas property's predicted_ecf was systematically too
low, dragging the model's catastrophic-low-SAP predictions toward
mid-band. Expected to substantially reduce decile-0 bias on retrain.
New _API_TO_TABLE32 map covers codes 0-29. 4 new AAA tests; VERSION
2.2.0 -> 2.3.0 (MINOR; behavioural fix to existing column values).
The 17a-baseline residuals showed cylinder_insulation_thickness_mm,
cylinder_size and cylinder_insulation_type at ranks 3/6/9 for hot_water_kwh
because the crude 16d formula didn't use them -- the model had to learn
storage physics from raw features.
Now predicted_hot_water_kwh sums:
useful_demand (existing, unchanged)
+ distribution_loss = useful * 0.15
+ storage_loss = volume * insulation_factor * 365 * 0.6
(volume from cylinder_size, factor from
cylinder_insulation_thickness_mm or age-default)
+ primary_circuit_loss = 245 (age A-J) / 60 (age K-M)
- wwhrs_credit = useful * 0.12 if number_baths_wwhrs > 0
- solar_hw_credit = 250 if solar_water_heating
all / efficiency_water = delivered kWh
Same inputs we already extract; just plumbed through. Expected:
predicted_hot_water_kwh feature usage jumps from rank 10 to top tier,
hot_water_kwh MAPE drops from 7.17%, and predicted_ecf gets tighter for
gas-heat + electric-DHW mid-band homes -> SAP MAPE marginally better.
5 new AAA tests; VERSION 2.1.0 -> 2.2.0 (MINOR; column semantics enriched).
Closes the high-SAP under-prediction gap diagnosed in 16h. 40% of SAP-85+
properties have PV; predicted_ecf was 1.74 mean at that band -> SAP ~88
via the formula, vs label SAP 90+. Inverse: PV homes had HIGHER predicted_ecf
than non-PV at the same band because cost reconstruction had zero export
credit.
New helper: predicted_pv_generation_kwh(kWp, region) -> kWh/yr from a
SAP10.2 Table 6e regional yield factor (UK avg 850 kWh/kWp/yr; Highland
650; Thames 920).
predicted_total_fuel_cost_gbp now subtracts pv_kwh * standard electricity
price (Table 32 code 30, both self-consumption and export at 13.19 p/kWh).
New feature column predicted_pv_generation_kwh exposed alongside the
adjusted cost so the model sees both signals.
VERSION 2.0.0 -> 2.1.0 (MINOR: column added; existing column semantics
shifted but pre-deploy so no consumer break).
train_baseline now returns mae + rmse alongside mape/smape/r2. MAE is the
user-facing metric ("predicted SAP within N points"); RMSE the quadratic
counterpart. Both come straight from sklearn.
New sample_weight_fn parameter: callable(y_train) -> per-row weights.
Threads into LGBMRegressor.fit's sample_weight argument. Default None
preserves existing behaviour.
Default tail strategy exposed as low_sap_tail_weight(y, threshold=58,
weight=3): 3x weight where SAP < 58. Threshold picked from slice 16h's
per-decile residuals — decile 0 (SAP 1-58) carries 17% MAPE vs <5% body.
Three TDD tracers, all AAA.
250k retrain showed objective='mape' loses ~0.6 percentage points of
global sap_score MAPE (3.92% with regression vs 4.50% with mape) and
~0.7 pts on peui_ucl. The mape objective over-weights the low-SAP tail
(weight ~1/y) and drags the body MAPE up by more than it gains in the
tail.
Body MAPE on v16 features is already strong (2.38% on deciles 1-8); the
remaining tail bias at decile 0 (SAP<58, +3.1 bias) needs a different
fix -- sample weights or stratified loss -- queued as slice 16i.
User reverted the transaction_type drop after noting that it doesn't help
detect full-SAP assessments (that's `assessment_type` on the bulk-register
record, filtered out at build_features.py:37).
tenure removal stays; v2.0.0 still MAJOR (a column was removed).
Neither field physically affects SAP rating; they're dataset-side metadata
(owner-occupied vs rented, sale vs marketed) and any correlation with
sap_score is confounded with age/condition that the model already sees
through built_form / property_type / construction_age_band.
Dropping reduces feature count and removes a source of spurious split-gain.
MAJOR per ADR-0007 versioning policy (column removal): 1.0.0 -> 2.0.0.
Per ADR-0008: the v15 baseline reports MAPE but optimises MSE, which
under-weights tail rows. Switching to objective='mape' applies gradient
proportional to 1/|y| and lets the model focus where MAPE penalises.
Targets co2_emissions, space_heating_kwh, hot_water_kwh, and peui_raw
retain the default 'regression' objective (some rows have ~zero CO2 from
heavy PV; MAPE objective destabilises near zero).
Sample weights deferred to slice 16i if slice 16h's per-decile residuals
still show tail bias after the objective switch.
12 columns renamed; extension_2_* not added (88% null on 250k corpus;
envelope_heat_loss_w_per_k already sums extension_2+ via part-iterator).
ADR-0008.
VERSION 0.4.0 -> 1.0.0 (MAJOR per ADR-0007 versioning policy). Coordinated
cutover with AutoGluon repo + scoring lambda required at deploy time.
features_v16.txt is regenerated from transform.schema() at write-parquet time
(data/ml_training is gitignored; not committed).
ECF reconstruction per SAP10 §20.1 (Mid physics, ADR-0008):
total_cost_gbp = (space_kwh*p_space + dhw_kwh*p_dhw + light_kwh*p_elec) / 100
ECF = 0.42 * total_cost / (TFA + 45)
log10_ecf = log10(ECF) [0 for non-positive]
p_* are Table 32 unit prices via fuel_unit_price_p_per_kwh. Standing
charges deliberately omitted (constant fuel-mix offset; ADR-0008).
predicted_sap_score is NOT emitted as a feature (ADR-0008 Mid not Deep):
the model is left to learn the piecewise log/linear transform from
log10_ecf -> SAP itself, keeping the data layer SAP-version-agnostic.
VERSION 0.3.0 -> 0.4.0 (MINOR).
New module domain.ml.envelope sums Sigma(U*A) + y*A_exposed across every
sap_building_part on a cert. U-values come from rdsap_uvalues' cascade
defaults, so the feature is never null.
Per-part inputs: wall / roof / floor / party-wall / windows / doors.
Windows + doors are apportioned to the main part (first in the list)
per RdSAP10 convention.
Wired into EpcMlTransform.to_row; transform VERSION 0.1.0 -> 0.2.0
(MINOR bump for an additive column per the ADR-0007 policy).
7 envelope unit tests + 2 transform-level tests, all AAA. Reference
geometry: 100 m^2 age-G mid-terrace -> ~208 W/K; doubles for two
storeys; drops with better insulation; sums across extensions.
Adds `_per_decile_residuals` and writes `residuals_<target>.json` next to
metrics.json. Buckets test-set rows by deciles of the true target value;
each bucket carries count + MAPE + MAE + mean residual + true_min/max.
Lets us tell whether errors concentrate in the tails of the true distribution
(e.g. SAP<40 / SAP>85) vs the mid-band — which the global MAPE alone hides.
Baseline for slice 16's MAPE-improvement ablations.
Previously kept the full list of EpcPropertyData in memory before calling
EpcMlTransform.to_rows. For the 25k slice that's ~30 MB; for the 580k
full-2026 corpus it OOM-killed the process silently. Now: parse cert ->
to_row -> append dict -> drop EpcPropertyData reference, so memory is
O(row-dict * n) instead of O(EpcPropertyData * n). Same end-of-frame
post-processing (categorical casts, column-order pin).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
15 new features wired through schema -> domain -> mapper -> transform:
Main Dwelling fabric (11):
- wall_insulation_type, wall_insulation_thickness_mm, wall_dry_lined,
wall_thickness_mm, party_wall_construction
- roof_insulation_location, roof_insulation_thickness_mm
- floor_construction, floor_insulation, floor_insulation_thickness_mm,
floor_heat_loss
Dwelling-level scalars (4):
- multiple_glazed_proportion, number_baths, number_baths_wwhrs,
extract_fans_count
Thickness strings like '50mm'/'NI'/'ND' parsed via _parse_thickness_mm; NI
(no insulation) lands as 0mm so the model sees the physical zero rather than
a missing value. Categorical sentinels ('NA'/'NI'/'ND') become None.
Also fixed long-standing typo `multiple_glazed_propertion` -> `_proportion`
in domain dataclass + its lone DB-model usage.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two production fixes surfaced by the live run:
- mapper.from_rdsap_schema_21_0_1 now sets the three ML target scalars
(energy_rating_current, co2_emissions_current, energy_consumption_current).
They were silently None for every cert before, leaving the only labels as
the kWh fields from renewable_heat_incentive.
- train_baseline coerces object-dtype columns to numeric (None -> NaN) and
drops rows with null target per fit, so LightGBM accepts the frame.
E2E on 500 real certs (~1s):
sap_score R^2=0.604 MAPE=0.084
co2_emissions R^2=0.813 MAPE=0.130
peui_raw R^2=0.979 MAPE=0.026
space_heating_kwh R^2=0.823 MAPE=0.213
hot_water_kwh R^2=0.519 MAPE=0.115
peui_ucl excluded: UCL correction still needs wiring.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Currently fails on SapWindow.glazing_gap (first of ~30 fields the dataclass
incorrectly treats as required). Will go GREEN once 14j sweeps Optional.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Bulk entries are NDJSON of wrapper records, not a JSON array. Each wrapper
carries certificate_number, assessment_type, and a stringified document with
the actual EPC schema payload. Filter to RdSAP, unwrap document, then map.
remote_bulk_fetcher: per-entry presigned-URL refresh (30s S3 TTL).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>