4897 commits

Author SHA1 Message Date
Khalim Conn-Kowlessar
05ef54bb02 restore transaction_type; keep tenure dropped (v2.0.0 stands)
User reverted the transaction_type drop after noting that it doesn't help
detect full-SAP assessments (that's `assessment_type` on the bulk-register
record, filtered out at build_features.py:37).

tenure removal stays; v2.0.0 still MAJOR (a column was removed).
2026-05-17 12:41:14 +00:00
Khalim Conn-Kowlessar
6aa3ddfbf4 drop tenure + transaction_type from features (v2.0.0)
Neither field physically affects SAP rating; they're dataset-side metadata
(owner-occupied vs rented, sale vs marketed) and any correlation with
sap_score is confounded with age/condition that the model already sees
through built_form / property_type / construction_age_band.

Dropping reduces feature count and removes a source of spurious split-gain.
MAJOR per ADR-0007 versioning policy (column removal): 1.0.0 -> 2.0.0.
2026-05-17 12:37:52 +00:00
Khalim Conn-Kowlessar
e8b6f19a3a fix(16d): predicted_lighting_kwh handles None bulb counts
EPC bulb-count fields are Optional[int]; 1k-cert sanity-check from slice 16h
hit None + None TypeError. Coerce to 0 before sum.
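
A minimal sketch of the coercion, not the committed code; `total_bulb_count` is a hypothetical helper name:

```python
from typing import Optional


def total_bulb_count(*counts: Optional[int]) -> int:
    """Sum bulb counts, treating a missing (None) count as zero."""
    # `c or 0` maps None to 0, so None + None can no longer raise TypeError.
    return sum(c or 0 for c in counts)
```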
2026-05-17 12:25:59 +00:00
Khalim Conn-Kowlessar
700ff4640c slice 16g: LightGBM objective=mape for sap_score + peui_ucl
Per ADR-0008: the v15 baseline reports MAPE but optimises MSE, which
under-weights tail rows. Switching to objective='mape' applies gradient
proportional to 1/|y| and lets the model focus where MAPE penalises.

Targets co2_emissions, space_heating_kwh, hot_water_kwh, and peui_raw
retain the default 'regression' objective (some rows have ~zero CO2 from
heavy PV; MAPE objective destabilises near zero).
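
A minimal sketch of the per-target split described above. `MAPE_TARGETS` and `objective_for` are hypothetical names; `"mape"` and `"regression"` are LightGBM objective strings, and the returned value is assumed to be passed as the `objective` parameter when fitting each target:

```python
# Targets that switch to the MAPE objective in this slice (from the message above).
MAPE_TARGETS = frozenset({"sap_score", "peui_ucl"})


def objective_for(target: str) -> str:
    """Pick the LightGBM objective string for one target column."""
    # Everything outside MAPE_TARGETS keeps the default L2 'regression' objective.
    return "mape" if target in MAPE_TARGETS else "regression"
```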

Sample weights deferred to slice 16i if slice 16h's per-decile residuals
still show tail bias after the objective switch.
2026-05-17 12:06:13 +00:00
Khalim Conn-Kowlessar
5c20e323da slice 16f: rename secondary_dwelling_* -> extension_1_* (v1.0.0 MAJOR bump)
12 columns renamed; extension_2_* not added (88% null on 250k corpus;
envelope_heat_loss_w_per_k already sums extension_2+ via part-iterator).
ADR-0008.

VERSION 0.4.0 -> 1.0.0 (MAJOR per ADR-0007 versioning policy). Coordinated
cutover with AutoGluon repo + scoring lambda required at deploy time.

features_v16.txt is regenerated from transform.schema() at write-parquet time
(data/ml_training is gitignored; not committed).
2026-05-17 12:05:01 +00:00
Khalim Conn-Kowlessar
cda469dd7d slice 16e: predicted_total_fuel_cost / predicted_ecf / predicted_log10_ecf
ECF reconstruction per SAP10 §20.1 (Mid physics, ADR-0008):

  total_cost_gbp = (space_kwh*p_space + dhw_kwh*p_dhw + light_kwh*p_elec) / 100
  ECF = 0.42 * total_cost / (TFA + 45)
  log10_ecf = log10(ECF)   [0 for non-positive]

p_* are Table 32 unit prices via fuel_unit_price_p_per_kwh. Standing
charges deliberately omitted (constant fuel-mix offset; ADR-0008).
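
The three formulas above, as a minimal sketch. The function and parameter names are assumptions for illustration; prices are in p/kWh, which is why the cost is divided by 100 to give pounds:

```python
import math
from typing import Tuple


def predicted_ecf(
    space_kwh: float,
    dhw_kwh: float,
    light_kwh: float,
    p_space: float,
    p_dhw: float,
    p_elec: float,
    tfa_m2: float,
) -> Tuple[float, float, float]:
    """Return (total_cost_gbp, ECF, log10_ecf) per the formulas in this commit."""
    # Pence to pounds: unit prices are p/kWh.
    total_cost_gbp = (space_kwh * p_space + dhw_kwh * p_dhw + light_kwh * p_elec) / 100
    ecf = 0.42 * total_cost_gbp / (tfa_m2 + 45)
    # log10 is undefined for non-positive ECF; the commit emits 0 in that case.
    log10_ecf = math.log10(ecf) if ecf > 0 else 0.0
    return total_cost_gbp, ecf, log10_ecf
```

Any prices passed in are the caller's responsibility; the sketch does not embed Table 32 values.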

predicted_sap_score is NOT emitted as a feature (ADR-0008 Mid not Deep):
the model is left to learn the piecewise log/linear transform from
log10_ecf -> SAP itself, keeping the data layer SAP-version-agnostic.

VERSION 0.3.0 -> 0.4.0 (MINOR).
2026-05-17 12:00:06 +00:00
Khalim Conn-Kowlessar
eee5421112 slice 16d: predicted_space/hot_water/lighting_kwh + seasonal-efficiency features
New module domain.ml.demand emits crude annual demand approximations
(ADR-0008 "crude annual"):

  predicted_space_heating_kwh = HLC * HDH_region * 1e-3 / efficiency_main
  predicted_hot_water_kwh     = SAP10.2 J simplified (Vd, dT, +10% losses)
  predicted_lighting_kwh      = 9.3 * TFA reduced by LED/CFL share

HDH lookup covers SAP10.2's 22 regions; fallback UK avg = 53,000 K*h/yr.
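
A minimal sketch of the space-heating line only (the hot-water and lighting formulas are not spelled out fully enough here to reproduce). Names are assumptions; the regional HDH lookup is represented by an optional argument that falls back to the UK average quoted above:

```python
from typing import Optional

UK_AVG_HDH = 53_000.0  # K*h/yr, fallback stated in this commit


def predicted_space_heating_kwh(
    hlc_w_per_k: float,
    efficiency_main: float,
    hdh_region: Optional[float] = None,
) -> float:
    """Crude annual demand: HLC * HDH * 1e-3 / efficiency_main."""
    hdh = UK_AVG_HDH if hdh_region is None else hdh_region
    # W/K * K*h/yr gives Wh/yr; 1e-3 converts to kWh/yr.
    return hlc_w_per_k * hdh * 1e-3 / efficiency_main
```

For example, HLC = 208 W/K at the UK-average HDH and an efficiency of 0.80 gives 208 * 53,000 / 1000 / 0.80 = 13,780 kWh/yr.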

Plus two seasonal-efficiency features straight off the Table 4a/4b lookup
from slice 16b (seasonal_efficiency_main_heating /
seasonal_efficiency_water_heating).

Wired into to_row; VERSION 0.2.0 -> 0.3.0 (MINOR).
2026-05-17 11:57:29 +00:00
Khalim Conn-Kowlessar
fca8815991 slice 16c: envelope_heat_loss_w_per_k feature
New module domain.ml.envelope sums Sigma(U*A) + y*A_exposed across every
sap_building_part on a cert. U-values come from rdsap_uvalues' cascade
defaults, so the feature is never null.

Per-part inputs: wall / roof / floor / party-wall / windows / doors.
Windows + doors are apportioned to the main part (first in the list)
per RdSAP10 convention.
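
A minimal sketch of the sum, assuming each part is reduced to a list of (U-value, area) pairs plus a thermal-bridging factor y and an exposed area. `BuildingPart` and its fields are hypothetical, not the repo's `sap_building_part` type:

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class BuildingPart:
    elements: List[Tuple[float, float]]  # (U-value in W/m^2K, area in m^2)
    y: float                             # thermal-bridging factor, W/m^2K
    exposed_area_m2: float


def part_heat_loss_w_per_k(part: BuildingPart) -> float:
    """Sigma(U*A) over the part's elements plus y * A_exposed."""
    fabric = sum(u * a for u, a in part.elements)
    return fabric + part.y * part.exposed_area_m2


def envelope_heat_loss_w_per_k(parts: Iterable[BuildingPart]) -> float:
    """Sum the per-part heat loss across every building part."""
    return sum(part_heat_loss_w_per_k(p) for p in parts)
```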

Wired into EpcMlTransform.to_row; transform VERSION 0.1.0 -> 0.2.0
(MINOR bump for an additive column per the ADR-0007 policy).

7 envelope unit tests + 2 transform-level tests, all AAA. Reference
geometry: 100 m^2 age-G mid-terrace -> ~208 W/K; doubles for two
storeys; drops with better insulation; sums across extensions.
2026-05-17 11:53:43 +00:00
Khalim Conn-Kowlessar
67a4f92d53 slice 16b: sap_efficiencies.py with Table 4a/4b/32 lookups
Encodes SAP10.2 Table 4a (heating-system code -> space-eff %), Table 4b
(gas/oil boiler winter eff %), and Table 32 (fuel-code -> p/kWh).

Helpers:
- seasonal_efficiency(code) -> decimal; unknown -> 0.80 (gas-boiler typical)
- water_heating_efficiency(water_code, main_code) -> decimal; codes
  901/914 inherit the main code's efficiency
- fuel_unit_price_p_per_kwh(fuel_code) -> p/kWh; unknown -> 3.48 (mains gas)
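
A minimal sketch of the fallback behaviour of these helpers. Unlike the committed signatures, the lookup tables are passed in explicitly here so that no Table 4a/4b/32 values have to be invented; the constant names are assumptions, while the 0.80 and 3.48 defaults and the 901/914 rule come from the list above:

```python
from typing import Mapping, Optional

DEFAULT_EFFICIENCY = 0.80        # gas-boiler typical
DEFAULT_PRICE_P_PER_KWH = 3.48   # mains gas
INHERIT_FROM_MAIN = frozenset({901, 914})


def seasonal_efficiency(code: Optional[int], table_pct: Mapping[int, float]) -> float:
    """Percent from the table as a decimal; unknown codes get the default."""
    pct = table_pct.get(code)
    return DEFAULT_EFFICIENCY if pct is None else pct / 100.0


def water_heating_efficiency(
    water_code: Optional[int], main_code: Optional[int], table_pct: Mapping[int, float]
) -> float:
    """Water codes 901/914 inherit the main heating code's efficiency."""
    code = main_code if water_code in INHERIT_FROM_MAIN else water_code
    return seasonal_efficiency(code, table_pct)


def fuel_unit_price_p_per_kwh(
    fuel_code: Optional[int], prices: Mapping[int, float]
) -> float:
    """Unit price in p/kWh; unknown fuel codes get the mains-gas default."""
    return prices.get(fuel_code, DEFAULT_PRICE_P_PER_KWH)
```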

All returns are total. Provides the seasonal-efficiency input to slice 16d
and the price multipliers for slice 16e's cost reconstruction.
2026-05-17 11:45:40 +00:00
Khalim Conn-Kowlessar
8bd8f8a622 slice 16a: rdsap_uvalues.py with cascade-defaulting U-value helpers
Encodes RdSAP10 Tables 6-9 (walls), 15 (party walls), 16+18 (roofs),
19+BS EN ISO 13370 (floors), 20 (upper floors), 21 (thermal bridging),
24 (windows), 26 (doors).

Helpers (u_wall / u_roof / u_floor / u_window / u_door / u_party_wall /
thermal_bridging_y) cascade through cert -> age-band default ->
country default -> mid-range fallback so the envelope-heat-loss feature
is never null. Mirrors the RdSAP "assume as-built if no evidence" rule.
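
The cascade can be sketched as a first-non-None pick with a mandatory final fallback. `cascade` is a hypothetical helper, not one of the committed `u_*` functions:

```python
from typing import Optional


def cascade(*candidates: Optional[float], fallback: float) -> float:
    """Return the first non-None candidate, else the mid-range fallback."""
    # Intended candidate order: cert value, age-band default, country default.
    for value in candidates:
        if value is not None:
            return value
    return fallback
```

Because `fallback` is a required float, the result is never None, which is what keeps the downstream feature non-null.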

Country.from_code collapses EAW/GB/UK/unknown to ENG; SCT/NIR/WAL get
explicit K-M overrides where Tables 7-9 diverge from Table 6 (England).

28 tests, all AAA, cover the reference values and the cascade fallbacks.
2026-05-17 11:36:39 +00:00
Khalim Conn-Kowlessar
f61d74a327 docs: ADR-0008 physics-as-feature + v16.0.0 schema bump
Captures the slice-16 plan decisions before code lands:
- Mid-physics: predicted_ecf + predicted_log10_ecf, NOT predicted_sap_score
- Cost scope: heating + DHW + lighting (no PV/pumps/secondary)
- Crude annual heat-demand calc (HLC * HDH / efficiency)
- Cascade-defaulting U-value imputation
- envelope_heat_loss_w_per_k sums all parts; extension_1 only as discrete features (88% null drops extension_2)
- v16.0.0 MAJOR bump (rename secondary_dwelling_* -> extension_1_*); coordinated cutover with AutoGluon repo + scoring lambda
- LightGBM objective="mape" for sap_score+peui_ucl in 16g; sample weights deferred
2026-05-17 11:20:40 +00:00
Khalim Conn-Kowlessar
fd8d71eb05 slice 15e: per-decile residuals reporting in train_baseline
Adds `_per_decile_residuals` and writes `residuals_<target>.json` next to
metrics.json. Buckets test-set rows by deciles of the true target value;
each bucket carries count + MAPE + MAE + mean residual + true_min/max.
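
A minimal pure-Python sketch of the bucketing, not the committed `_per_decile_residuals`. It assumes residual = predicted minus true, equal-count buckets over the sorted true values, and that rows with a true value of zero are skipped in the MAPE term:

```python
from typing import Dict, List, Optional, Sequence


def per_decile_residuals(
    y_true: Sequence[float], y_pred: Sequence[float], n_buckets: int = 10
) -> List[Dict[str, Optional[float]]]:
    """Bucket rows by rank of the true value and summarise errors per bucket."""
    pairs = sorted(zip(y_true, y_pred))  # ascending by true value
    n = len(pairs)
    buckets = []
    for i in range(n_buckets):
        chunk = pairs[i * n // n_buckets:(i + 1) * n // n_buckets]
        if not chunk:
            continue  # fewer rows than buckets
        residuals = [p - t for t, p in chunk]
        pct_errors = [abs((p - t) / t) for t, p in chunk if t != 0]
        buckets.append({
            "count": len(chunk),
            "mape": sum(pct_errors) / len(pct_errors) if pct_errors else None,
            "mae": sum(abs(r) for r in residuals) / len(chunk),
            "mean_residual": sum(residuals) / len(chunk),
            "true_min": chunk[0][0],
            "true_max": chunk[-1][0],
        })
    return buckets
```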

Lets us tell whether errors concentrate in the tails of the true distribution
(e.g. SAP<40 / SAP>85) vs the mid-band — which the global MAPE alone hides.
Baseline for slice 16's MAPE-improvement ablations.
2026-05-17 11:18:40 +00:00
Khalim Conn-Kowlessar
195336b7e1 slice 15d: +50 features (gap fill + secondary building part); drop 2 derived
Removes:
  - environmental_impact_current (SAP-derived rating, leaks into co2 target)
  - energy_rating_average (average of sap_score + potential, direct leak)

Adds:
  Doors            draughtproofed_door_count, insulated_door_u_value
  Hot water        cylinder_insulation_type, cylinder_thermostat,
                   secondary_heating_type
  Ventilation      mechanical_vent_duct_placement, _duct_insulation,
                   _duct_insulation_level, _measured_installation
  Lighting         low_energy_fixed_lighting_bulbs_count,
                   fixed_lighting_outlets_count,
                   low_energy_fixed_lighting_outlets_count
  Windows          window_avg_glazing_gap_mm, window_avg_frame_factor,
                   window_pct_permanent_shutters_insulated
  Main dwelling    room_in_roof_floor_area_m2, alternative_wall_count,
                   alternative_wall_area_m2, flat_roof_insulation_thickness_mm,
                   wall_thickness_measured
  Element counts   wall_count, roof_count, floor_count,
                   main_heating_count_elements, main_heating_controls_present
  Wind             wind_turbine_hub_height_m, wind_turbine_rotor_diameter_m
  Flat             flat_unheated_corridor_length_m
  Addendum         addendum_stone_walls, addendum_system_build,
                   addendum_numbers_count
  LZC              lzc_energy_sources_count
  Secondary part   secondary_dwelling_present + 11 fabric features
                   (wall/roof/floor construction + insulation + thickness
                   + area + heat-loss perimeter) + other_building_parts_count

Wires through schema -> domain -> mapper: adds Addendum dataclass,
lzc_energy_sources, mechanical_vent_duct_insulation_level. Also fixes
_measurement_value to accept raw dicts (from_dict left some Measurement
fields as dict when they weren't typed as a dataclass).

Results at N=25,000 2026 RdSAP certs:
  sap_score          MAPE=0.043  sMAPE=0.036  R^2=0.891
  co2_emissions      sMAPE=0.106  R^2=0.929
  peui_raw           MAPE=0.087  sMAPE=0.084  R^2=0.860
  peui_ucl           MAPE=0.079  sMAPE=0.076  R^2=0.866
  space_heating_kwh  MAPE=0.112  sMAPE=0.108  R^2=0.947
  hot_water_kwh      MAPE=0.071  sMAPE=0.069  R^2=0.854  (+0.082 R^2 vs 15b)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 10:13:03 +00:00
Khalim Conn-Kowlessar
a1f89b6033 slice 15c: stream build_features so 500k+ cert runs fit memory
Previously kept the full list of EpcPropertyData in memory before calling
EpcMlTransform.to_rows. For the 25k slice that's ~30 MB; for the 580k
full-2026 corpus it OOM-killed the process silently. Now: parse cert ->
to_row -> append dict -> drop EpcPropertyData reference, so memory is
O(row-dict * n) instead of O(EpcPropertyData * n). Same end-of-frame
post-processing (categorical casts, column-order pin).
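
The streaming shape, as a minimal sketch. `build_rows` and its arguments are hypothetical; the point is only that the cert source is consumed lazily and each parsed object is released once its row dict has been appended:

```python
from typing import Any, Callable, Dict, Iterable, List


def build_rows(
    certs: Iterable[Any], to_row: Callable[[Any], Dict[str, Any]]
) -> List[Dict[str, Any]]:
    """Convert certs to row dicts one at a time instead of materialising them all."""
    rows: List[Dict[str, Any]] = []
    for cert in certs:
        # Only the flat row dict is retained; `cert` is rebound on the next
        # iteration, so the parsed object can be garbage-collected.
        rows.append(to_row(cert))
    return rows
```

Passing a generator (rather than a list) as `certs` is what keeps the parsed objects from accumulating.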

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 00:36:53 +00:00
Khalim Conn-Kowlessar
9f6f7608b9 slice 15b: +18 features — heating type code, hot water, windows, flat, supply
Heating: primary_sap_main_heating_code (the SAP10 heating-system enum was the
single biggest missing input), primary_emitter_temperature,
primary_main_heating_fraction.

Hot water: immersion_heating_type, shower_outlet_count.

Windows: window_pct_living, window_pct_external, window_pct_permanent_shutters
(area-weighted shares parallel to existing window aggregates).

Dwelling: conservatory_type, has_heated_separate_conservatory.

Flat-only block (sap_flat_details): flat_level, flat_top_storey,
flat_storey_count, flat_location, flat_heat_loss_corridor (int sentinels
like '20+' coerce to None for the categorical features).

Energy supply: meter_type, pv_connection, wind_turbines_terrain_type.

Also plumbs `air_tightness` EnergyElement, `sap_flat_details` and
`has_heated_separate_conservatory` through the 21.0.1 mapper path (they were
silently None before).

Results at N=25,000 2026 RdSAP certs:
  sap_score          MAPE=0.044  sMAPE=0.038  R^2=0.884  (+0.045 R^2 vs 15a)
  co2_emissions      sMAPE=0.108  R^2=0.925
  peui_raw           MAPE=0.092  sMAPE=0.088  R^2=0.849
  peui_ucl           MAPE=0.081  sMAPE=0.078  R^2=0.860
  space_heating_kwh  MAPE=0.111  sMAPE=0.108  R^2=0.945
  hot_water_kwh      MAPE=0.081  sMAPE=0.079  R^2=0.772

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-17 00:08:11 +00:00
Khalim Conn-Kowlessar
0ffda529ec slice 15a: add wall/floor/roof + demand scalar features for retrofit simulation
15 new features wired through schema -> domain -> mapper -> transform:

Main Dwelling fabric (11):
  - wall_insulation_type, wall_insulation_thickness_mm, wall_dry_lined,
    wall_thickness_mm, party_wall_construction
  - roof_insulation_location, roof_insulation_thickness_mm
  - floor_construction, floor_insulation, floor_insulation_thickness_mm,
    floor_heat_loss

Dwelling-level scalars (4):
  - multiple_glazed_proportion, number_baths, number_baths_wwhrs,
    extract_fans_count

Thickness strings like '50mm'/'NI'/'ND' parsed via _parse_thickness_mm; NI
(no insulation) lands as 0mm so the model sees the physical zero rather than
a missing value. Categorical sentinels ('NA'/'NI'/'ND') become None.
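
A minimal sketch of the thickness rule, not the committed `_parse_thickness_mm`. It assumes 'ND' and any other unparseable string map to None, and that a bare number without the 'mm' suffix is also accepted:

```python
import re
from typing import Optional

_THICKNESS_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(?:MM)?")


def parse_thickness_mm(raw: Optional[str]) -> Optional[float]:
    """Parse strings like '50mm'; 'NI' (no insulation) becomes a physical 0."""
    if raw is None:
        return None
    text = str(raw).strip().upper()
    if text == "NI":
        return 0.0
    match = _THICKNESS_RE.fullmatch(text)
    # 'ND' and anything else unrecognised stay missing rather than guessed.
    return float(match.group(1)) if match else None
```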

Also fixed long-standing typo `multiple_glazed_propertion` -> `_proportion`
in domain dataclass + its lone DB-model usage.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 22:08:27 +00:00
Khalim Conn-Kowlessar
c496f345f8 slice 14l: bigger-run fixes — UCL guard, PV Measurement coercion, sMAPE
Three changes surfaced by the 25k 2026 run:
- transform._peui_ucl returns None for non-positive raw PEUI (net-exporters).
  apply_ucl_correction would otherwise raise ValueError on negative input.
- PhotovoltaicArray scalars (peak_power, pitch, orientation, overshading)
  now accept Measurement | int | float in the schema; mapper coerces via
  _measurement_value.
- train_baseline reports sMAPE alongside MAPE — handles zero-actual rows
  (e.g. co2_emissions for net-zero certs) where MAPE explodes.
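
A minimal sketch of sMAPE, assuming the common definition mean(2|p - t| / (|t| + |p|)); the exact variant used in train_baseline is not stated here. A row where both values are zero contributes 0:

```python
from typing import Sequence


def smape(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Symmetric MAPE in [0, 2]; finite even when the true value is zero."""
    if len(y_true) == 0:
        raise ValueError("smape needs at least one row")
    terms = []
    for t, p in zip(y_true, y_pred):
        denom = abs(t) + abs(p)
        terms.append(0.0 if denom == 0 else 2.0 * abs(p - t) / denom)
    return sum(terms) / len(terms)
```

Plain MAPE divides by |t| alone, so a single zero-actual row makes it undefined; here the same row is capped at 2.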

Results at N=25,000 RdSAP 2026 certs (~32s end-to-end):
  sap_score          MAPE=0.064  sMAPE=0.054  R^2=0.762
  co2_emissions      sMAPE=0.140  R^2=0.890
  peui_raw           MAPE=0.126  sMAPE=0.120  R^2=0.714
  peui_ucl           MAPE=0.114  sMAPE=0.108  R^2=0.736
  space_heating_kwh  MAPE=0.167  sMAPE=0.157  R^2=0.915
  hot_water_kwh      MAPE=0.089  sMAPE=0.086  R^2=0.737

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 21:15:37 +00:00
Khalim Conn-Kowlessar
8fddd25b9a slice 14k: E2E pipeline runs on real 2026 RdSAP certs
Two production fixes surfaced by the live run:
- mapper.from_rdsap_schema_21_0_1 now sets the three ML target scalars
  (energy_rating_current, co2_emissions_current, energy_consumption_current).
  They were silently None for every cert before, leaving the only labels as
  the kWh fields from renewable_heat_incentive.
- train_baseline coerces object-dtype columns to numeric (None -> NaN) and
  drops rows with null target per fit, so LightGBM accepts the frame.

E2E on 500 real certs (~1s):
  sap_score             R^2=0.604  MAPE=0.084
  co2_emissions         R^2=0.813  MAPE=0.130
  peui_raw              R^2=0.979  MAPE=0.026
  space_heating_kwh     R^2=0.823  MAPE=0.213
  hot_water_kwh         R^2=0.519  MAPE=0.115

peui_ucl excluded: UCL correction still needs wiring.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 20:47:41 +00:00
Khalim Conn-Kowlessar
6697a6c76e slice 14j: Optional sweep across schema 21.0.1 + mapper guards
Across 500 real RdSAP-21.0.1 certs from 2026, mapper goes 0% -> 100% success.
Schema-loading + ml-transform + ml_training_data: 146 tests pass.

Mainly affected fields:
- SapHeating: instantaneous_wwhrs, shower_outlets (now Union with List shape)
- SapWindow: glazing_gap, frame_factor, pvc_frame, window_transmission_details
- SapEnergySource: pv_battery_count, wind_turbine_details, pv_batteries (List form)
- SapBuildingPart: all 13 sub-fields now Optional
- SapFloorDimension: Measurement | int | float fallback
- RdSapSchema21_0_1: 16 top-level fields (mechanical_vent_*, lighting counts, ...)

Mapper helpers added: _measurement_value, _first_pv_battery, _first_shower_outlet.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 20:35:28 +00:00
Khalim Conn-Kowlessar
ccb654c230 slice 14i: pin real RdSAP cert as fixture + RED regression test
Currently fails on SapWindow.glazing_gap (first of ~30 fields the dataclass
incorrectly treats as required). Will go GREEN once 14j sweeps Optional.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 20:23:29 +00:00
Khalim Conn-Kowlessar
611c07de94 slice 14h: handle real bulk-JSON shape (NDJSON wrappers + document payload)
Bulk entries are NDJSON of wrapper records, not a JSON array. Each wrapper
carries certificate_number, assessment_type, and a stringified document with
the actual EPC schema payload. Filter to RdSAP, unwrap document, then map.
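
A minimal sketch of the filter-and-unwrap step. The wrapper keys are the ones named above; the literal `"RdSAP"` value for `assessment_type` and the function name are assumptions:

```python
import json
from typing import Any, Dict, Iterable, Iterator, Tuple


def iter_rdsap_documents(lines: Iterable[str]) -> Iterator[Tuple[str, Dict[str, Any]]]:
    """Yield (certificate_number, document) for RdSAP wrappers in an NDJSON stream."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        wrapper = json.loads(line)
        if wrapper.get("assessment_type") != "RdSAP":
            continue
        # `document` is itself a JSON string, so it needs a second decode.
        yield wrapper["certificate_number"], json.loads(wrapper["document"])
```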

remote_bulk_fetcher: per-entry presigned-URL refresh (30s S3 TTL).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 19:45:52 +00:00
Khalim Conn-Kowlessar
9eb70cede1 slice 14g: remote_bulk_fetcher extracts ZIP entries via HTTP Range (no full download)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 19:16:52 +00:00
Khalim Conn-Kowlessar
b676e05d49 slice 14f: train_baseline fits LightGBM per target, emits MAPE/R^2 + importance
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 18:47:49 +00:00
Khalim Conn-Kowlessar
23ba2ef271 slice 14e: write_training_dataset emits parquet + schema.json + manifest.json
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 18:43:31 +00:00
Khalim Conn-Kowlessar
20fd55d5a1 slice 14d: build_features wires bulk reader -> mapper -> EpcMlTransform
ijson use_float fixes Decimal/float coercion when streaming JSON.
pyright extraPaths so the new pkg type-checks against domna-domain.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 18:38:41 +00:00
Khalim Conn-Kowlessar
0ff9d546b8 slice 14c: BulkZipReader streams certs from gov bulk JSON ZIP
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 18:27:24 +00:00
Khalim Conn-Kowlessar
7a6c8b4f24 slice 14b: Storage protocol + LocalStorage impl
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 17:52:54 +00:00
Khalim Conn-Kowlessar
eb42cb88a1 slice 14a: ml_training_data pkg + sample.py (CSV filter + random sample)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 17:39:43 +00:00
Khalim Conn-Kowlessar
3abcee6a53 slice 13: to_rows(properties) returns pd.DataFrame
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 16:43:28 +00:00
Khalim Conn-Kowlessar
ebceb4bf2b slice 12: ventilation flat features
Four ventilation features: mechanical_ventilation (categorical
SAP10 code, 0=natural through 6=positive-input-from-outside per
epc_codes.csv mechanical_ventilation enum), mechanical_vent_duct_type
(categorical), blocked_chimneys_count (int), and pressure_test
(int — air-tightness SAP10 code).

Pulled from top-level EpcPropertyData fields; ventilation on SAP10
API EPCs sits on the certificate directly, not on the
sap_ventilation block (which is site-notes-only).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 16:09:53 +00:00
Khalim Conn-Kowlessar
559a2128b9 slice 11b: PV battery, wind turbine, energy source flags
Nine more energy-source features land: has_pv_battery,
pv_battery_count, pv_battery_capacity_kwh (count × per-unit
capacity from pv_batteries.pv_battery, nullable when count=0),
has_wind_turbine, wind_turbine_count, mains_gas (the dominant
fuel-deduction signal), and the three smart-meter / export
booleans (electricity_smart_meter_present, gas_smart_meter_present,
is_dwelling_export_capable).

Closes the PV/solar feature group started in slice 11a.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 16:07:17 +00:00
Khalim Conn-Kowlessar
706d1b5b66 slice 11a: PV array aggregates + capacity_source flag
Fifteen PV features land: has_pv (bool), pv_capacity_source (str
categorical: measured / estimated_from_roof_area / none),
pv_array_count, pv_total_peak_power_kw, eight peak-power-by-octant
columns (pv_peak_power_kw_{N..NW}), peak-power-weighted
pv_avg_pitch and pv_avg_overshading (nullable), and
pv_percent_roof_area (nullable — populated only on the estimated
branch).

Dispatches on the SAP10 EpcPropertyData.SapEnergySource shapes added
in slice 10.5: photovoltaic_arrays populated → measured;
photovoltaic_supply.none_or_no_details.percent_roof_area > 0 →
estimated; everything else → none. percent_roof_area == 0 is the
canonical no-PV payload and surfaces as 'none', not 'estimated'.
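
The dispatch can be sketched as below. The function signature is hypothetical, and treating an empty array list as "not populated" is an assumption:

```python
from typing import Optional, Sequence


def pv_capacity_source(
    photovoltaic_arrays: Optional[Sequence[object]],
    percent_roof_area: Optional[float],
) -> str:
    """Classify a cert as measured / estimated_from_roof_area / none."""
    if photovoltaic_arrays:
        return "measured"
    # percent_roof_area == 0 is the canonical no-PV payload, so require > 0.
    if percent_roof_area is not None and percent_roof_area > 0:
        return "estimated_from_roof_area"
    return "none"
```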

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 16:04:15 +00:00
Khalim Conn-Kowlessar
b050348927 slice 10.5: PhotovoltaicArray on SAP10 schema + EpcPropertyData
SAP10 EPCs with measured PV carry photovoltaic_supply as a nested
list of arrays (peak_power, pitch, orientation, overshading) rather
than the legacy unmeasured wrapper {none_or_no_details:
{percent_roof_area: N}}. The schema-21 dataclasses now accept both
shapes via Union[PhotovoltaicSupply, List[List[PhotovoltaicArray]]],
and from_dict._coerce now dispatches list values onto list type
variants of multi-type Unions.

EpcPropertyData.SapEnergySource gains
photovoltaic_arrays: Optional[List[PhotovoltaicArray]] — populated
when the measured shape is present, otherwise None. The legacy
photovoltaic_supply field is preserved for the fallback case.
Both schema-21.0.0 and 21.0.1 mappers dispatch via the new
_map_schema_21_pv helper.

Unblocks Slice 11 (PV feature aggregation in EpcMlTransform).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 16:00:25 +00:00
Khalim Conn-Kowlessar
fff6ef3352 slice 10: heating system features (primary + water + secondary)
Fifteen heating features land via hybrid Top-1 + flat fields: the
primary heating slot from main_heating_details[0] gives
main_fuel_type, heat_emitter_type, main_heating_control,
main_heating_category, has_fghrs, fan_flue_present, boiler_flue_type
and central_heating_pump_age (all int-categorical for the SAP10
codes); main_heating_count carries the aggregate. Water heating
adds water_heating_code, water_heating_fuel, cylinder_size, and
cylinder_insulation_thickness_mm. Secondary heating is summarised
by has_secondary_heating (derived) and secondary_fuel_type.

Fuel codes follow the gov api enums in epc_codes.csv (44 main_fuel
values shared with water_heating_fuel). Union[int, str] fields
coerce to int when the value is int, else None.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 15:50:05 +00:00
Khalim Conn-Kowlessar
fb773fa635 slice 9: building parts with main-dwelling carve-out
Thirteen building-parts features land: five cross-all-parts physical
aggregates (count, total_heat_loss_perimeter_m,
total_party_wall_length_m, total_floor_area_from_parts_m2,
avg_room_height_m) and eight Main-Dwelling-specific columns
(heat_loss_perimeter, party_wall_length, total_floor_area,
avg_room_height, has_room_in_roof, construction_age_band,
wall_construction, roof_construction). Main-Dwelling columns are
None when no part has identifier == 'Main Dwelling' — honest about
data quality rather than silently falling back to the first part.
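
A minimal sketch of the carve-out, assuming parts are plain dicts with an `identifier` key (the real parts are dataclasses); `main_dwelling` is a hypothetical name:

```python
from typing import Any, Dict, Optional, Sequence


def main_dwelling(parts: Sequence[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
    """Return the part labelled 'Main Dwelling', or None if no part carries it."""
    for part in parts:
        if part.get("identifier") == "Main Dwelling":
            return part
    # Deliberately no fallback to parts[0]: a missing label stays missing.
    return None
```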

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 15:45:21 +00:00
Khalim Conn-Kowlessar
079e6f9a68 slice 8b: window glazed_type and pvc_frame shares
Adds seventeen window-categorical-share features: one float per
SAP10 glazed_type code (1-15) plus a `_other` bucket for anything
outside the enum, and a single `window_pct_pvc_frame` for the
area-weighted PVC-frame share. All shares are area-weighted over
total window area; null pvc_frame share for window-less properties.
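
A minimal sketch of an area-weighted share, assuming windows are dicts with an `area_m2` key; the helper name and window shape are illustrative only:

```python
from typing import Any, Callable, Dict, Optional, Sequence


def area_weighted_share(
    windows: Sequence[Dict[str, Any]],
    predicate: Callable[[Dict[str, Any]], bool],
) -> Optional[float]:
    """Share of total window area for which `predicate` holds; None if no area."""
    total = sum(w["area_m2"] for w in windows)
    if total <= 0:
        return None  # window-less property: share is undefined
    matching = sum(w["area_m2"] for w in windows if predicate(w))
    return matching / total
```

With a 3 m^2 PVC-framed window and a 1 m^2 non-PVC window, the PVC share is 3 / 4 = 0.75.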

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 15:36:05 +00:00
Khalim Conn-Kowlessar
dba254e316 slice 8a: window physics and orientation aggregates
Thirteen window-aggregate features land on the transform: count,
total area, eight SAP-octant area columns (N/NE/E/SE/S/SW/W/NW),
area-weighted draught-proofing pct, and area-weighted u_value +
solar transmittance (nullable, populated only when windows carry
transmission_details). Windows with orientation outside 1-8 (0,
NR) contribute to count and total area but no octant.
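
A minimal sketch of the octant rule. It assumes orientation codes 1-8 map to N, NE, E, SE, S, SW, W, NW in that order, and that windows arrive as (orientation, area) pairs; both are illustrative assumptions:

```python
from typing import Dict, Iterable, Tuple, Union

OCTANTS = ("N", "NE", "E", "SE", "S", "SW", "W", "NW")


def window_octant_areas(
    windows: Iterable[Tuple[Union[int, str, None], float]]
) -> Tuple[int, float, Dict[str, float]]:
    """Return (count, total_area, area per octant)."""
    areas = {name: 0.0 for name in OCTANTS}
    count = 0
    total = 0.0
    for orientation, area in windows:
        # Every window counts towards count and total area...
        count += 1
        total += area
        # ...but only codes 1-8 are assigned to an octant (0 and 'NR' are not).
        if isinstance(orientation, int) and 1 <= orientation <= 8:
            areas[OCTANTS[orientation - 1]] += area
    return count, total, areas
```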

Also: epc codes CSV (gov api /api/codes export, RdSAP-Schema-21.x +
older versions) moved next to EpcPropertyData as epc_codes.csv —
canonical SAP enum source for upcoming categorical-share slices.
.gitignore exception added so the reference CSV is tracked.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 15:32:45 +00:00
Khalim Conn-Kowlessar
9c8aa75469 slice 7: flat categoricals + ColumnSpec.categorical flag
Adds seven flat categorical features (dwelling_type, tenure,
transaction_type, property_type, built_form, region_code,
country_code) emitted as raw strings. New ColumnSpec.categorical
bool tells the parquet writer to cast these to pd.Categorical at the
I/O boundary, keeping pandas out of the domain/schema module.
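
A minimal sketch of the flag. Only `categorical` is taken from this commit; the other `ColumnSpec` fields and the `categorical_columns` helper are assumptions:

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class ColumnSpec:
    name: str
    dtype: str
    nullable: bool = True
    categorical: bool = False  # writer casts these columns at the I/O boundary


def categorical_columns(schema: Iterable[ColumnSpec]) -> List[str]:
    """Names the parquet writer should cast; the schema itself stays pandas-free."""
    return [col.name for col in schema if col.categorical]
```

The writer could then apply `df[name].astype("category")` to each returned name, so pandas is only imported on the I/O side.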

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 15:14:30 +00:00
Khalim Conn-Kowlessar
e4f9e9e1db slice 6: flat booleans and optional integer indicators
Adds three non-nullable booleans (solar_water_heating,
has_hot_water_cylinder, has_fixed_air_conditioning) and three
optional integer indicators (percent_draughtproofed,
energy_rating_average, environmental_impact_current). All direct
EpcPropertyData field reads.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 15:06:02 +00:00
Khalim Conn-Kowlessar
e9b4dbbfe5 slice 5: room, door and lighting count features
Ten flat int counts added to the transform — door_count,
habitable/heated/wet/insulated_door counts, extensions, open
chimneys, and the three fixed-lighting bulb counts (CFL/LED/
incandescent). All non-nullable; direct EpcPropertyData field reads.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 15:03:58 +00:00
Khalim Conn-Kowlessar
aa00259b1a slice 4: total_floor_area_m2 feature
First feature column lands on the transform: schema() advertises
total_floor_area_m2 as a non-nullable float; to_row() emits the value
from EpcPropertyData.total_floor_area_m2 alongside the six targets.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 14:47:25 +00:00
Khalim Conn-Kowlessar
375b0e895e add missing ucl.py and _fixtures.py from slices 2-3
Previous slice commits used `git commit -a`, which skips untracked files,
so these new files were missed; imports in transform.py and
test_transform.py would dangle on a
fresh checkout. Re-running pytest after this commit covers all four
EpcMlTransform tests cleanly.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-16 14:43:09 +00:00
Khalim Conn-Kowlessar
81f6163295 added ucl corrected peui
2026-05-16 14:39:24 +00:00
Khalim Conn-Kowlessar
a64e7e74c5 adding kwh fields to EpcPropertyData and testing to_row
2026-05-16 14:33:25 +00:00
Khalim Conn-Kowlessar
611ff24eb6 scaffolding for ml pipeline
2026-05-16 14:15:56 +00:00
Khalim Conn-Kowlessar
dfe9e3ddbe added potential file scaffolding
2026-05-15 10:56:53 +00:00
Khalim Conn-Kowlessar
acb2518235 second grill session updating prd + context
2026-05-15 10:41:47 +00:00
Khalim Conn-Kowlessar
8d6c770da8 grilling session updates to prd
2026-05-14 16:36:22 +00:00
Khalim Conn-Kowlessar
f8bd13cb63 editing per portfolio feature flag
2026-05-14 07:39:18 +00:00
Khalim Conn-Kowlessar
02df38e207 note kwh service not needing predictions
2026-05-13 21:52:02 +00:00