Retrofit Property Data Onboarding
This repository contains an ETL pipeline that transforms raw retrofit property data from external source systems (currently Parity) into a standardised internal format compatible with both address2uprn and engine.
The pipeline is designed to:
- Run as an AWS Lambda triggered by SQS
- Read raw CSV/XLSX files from S3
- Perform rule-based mappings
- Infer as-built property attributes from the property's age where they are not recorded
- Write the processed CSV back to S3, to be consumed by address2uprn
Structure
SQS → Lambda handler → OnboarderFactory → System-specific Onboarder → Mapping → CSV to S3
Each source system implements its own Onboarder, while sharing a common base and mapping process.
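The dispatch step in the flow above can be sketched as a small registry-based factory. This is a minimal illustration rather than the code in factory.py: the `create` method, the `_registry` attribute, the `transform` signature and the `source_system` field are all assumptions made for the example.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class OnboarderBase(ABC):
    """Hypothetical stand-in for the shared base class in base.py."""

    @abstractmethod
    def transform(self, rows: list[dict]) -> list[dict]:
        """Convert raw source rows into the standardised internal format."""


class ParityOnboarder(OnboarderBase):
    def transform(self, rows: list[dict]) -> list[dict]:
        # Placeholder body: the real Parity logic lives in parity.py.
        return [dict(row, source_system="parity") for row in rows]


class OnboarderFactory:
    # Registry keyed by the "system" value carried in the event payload.
    _registry: dict[str, type[OnboarderBase]] = {"parity": ParityOnboarder}

    @classmethod
    def create(cls, system: str) -> OnboarderBase:
        try:
            return cls._registry[system.lower()]()
        except KeyError:
            raise ValueError(f"Unsupported source system: {system!r}") from None
```

Under this layout, supporting a new source system would mean adding one subclass and one registry entry, with the handler left unchanged.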
Repository Structure
onboarders/
├── handler.py # Lambda entrypoint
├── factory.py # Onboarder factory
├── base.py # Shared onboarding base class
├── parity.py # Parity-specific transformation logic
├── mappings/
│   └── parity/ # Parity domain mappings & classifiers
│       ├── age_band.py
│       ├── property_type.py
│       ├── built_form.py
│       ├── walls.py
│       ├── roof.py
│       ├── floor.py
│       ├── glazing.py
│       ├── heating.py
│       ├── as_built_wall_classifiers.py
│       ├── as_built_roof_classifiers.py
│       └── as_built_floor_classifiers.py
├── tests/
├── requirements.txt
└── README.md
Lambda Entry Point (handler.py)
The Lambda handler:
- Consumes messages from the SQS queue
- Validates the payload
- Instantiates the correct onboarder via OnboarderFactory
- Runs the transformation
- Writes the transformed CSV back to S3
Expected Event Payload
{
"s3_uri": "s3://bucket/path/to/input.xlsx",
"system": "parity",
"format": "xlsx",
"sheet_name": "Sustainability"
}
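One way the handler might validate this payload is sketched below. An SQS-triggered Lambda receives each message body as a JSON string under `Records[].body`; everything else here is an assumption for illustration, including the `OnboardingRequest` and `parse_record` names, which keys are treated as required, and `sheet_name` being optional.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Assumed rules: the README does not state which keys are mandatory.
REQUIRED_KEYS = ("s3_uri", "system", "format")
SUPPORTED_FORMATS = ("csv", "xlsx")


@dataclass(frozen=True)
class OnboardingRequest:
    s3_uri: str
    system: str
    format: str
    sheet_name: Optional[str] = None  # only meaningful for XLSX input


def parse_record(record: dict) -> OnboardingRequest:
    """Parse and validate one SQS record from the Lambda event."""
    payload = json.loads(record["body"])
    missing = [key for key in REQUIRED_KEYS if key not in payload]
    if missing:
        raise ValueError(f"Payload is missing keys: {missing}")
    if not payload["s3_uri"].startswith("s3://"):
        raise ValueError(f"Not an S3 URI: {payload['s3_uri']!r}")
    if payload["format"] not in SUPPORTED_FORMATS:
        raise ValueError(f"Unsupported format: {payload['format']!r}")
    return OnboardingRequest(
        s3_uri=payload["s3_uri"],
        system=payload["system"],
        format=payload["format"],
        sheet_name=payload.get("sheet_name"),
    )


def parse_event(event: dict) -> list:
    """Return one validated request per SQS record in the event."""
    return [parse_record(record) for record in event.get("Records", [])]
```

Failing fast on a malformed payload keeps bad messages from reaching the onboarder, at which point the queue's retry or dead-letter configuration can take over.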
Onboarder Base (base.py)
OnboarderBase provides shared functionality across all systems.
Responsibilities
- Reading CSV/XLSX files from S3
- Writing transformed CSVs to S3
- Defining canonical output column names
- Providing validation helpers
- Defining a common output: for the moment, onboarders are expected to return a CSV
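The column and validation responsibilities above could look roughly like the following. The column names are hypothetical placeholders (the real canonical list is defined in base.py), and `validate_columns` and `to_csv` are illustrative helper names, not the project's API. The S3 upload itself is left out so the sketch stays dependency-free.

```python
import csv
import io

# Hypothetical canonical columns; the real list is defined in base.py.
OUTPUT_COLUMNS = ("address", "postcode", "property_type", "wall_construction")


def validate_columns(rows: list) -> None:
    """Raise if any row lacks one of the canonical output columns."""
    for index, row in enumerate(rows):
        missing = [column for column in OUTPUT_COLUMNS if column not in row]
        if missing:
            raise ValueError(f"Row {index} is missing columns: {missing}")


def to_csv(rows: list) -> str:
    """Serialise rows to CSV text using the canonical column order."""
    validate_columns(rows)
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=OUTPUT_COLUMNS,
        extrasaction="ignore",  # drop any non-canonical keys
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Fixing the column order in one place means every onboarder produces the same header, regardless of how its source data was laid out.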
Parity Onboarder (parity.py)
ParityOnboarder contains all Parity-specific transformation logic.
Responsibilities
- Map raw Parity fields to internal EPC-aligned enums
- Infer “as-built” constructions using age bands when insulation data is missing
- Resolve energy efficiency ratings deterministically
- Normalise output into a fixed schema
The transform() method orchestrates the transformation process.
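The as-built inference can be pictured as a fallback lookup: use the recorded construction when it exists, otherwise assume a construction from the age band. The sketch below is illustrative only. The band labels, the assumed constructions and the `resolve_wall` name are placeholders, not the rules in as_built_wall_classifiers.py.

```python
from typing import Optional

# Illustrative placeholders only: not the project's actual bands or rules.
AS_BUILT_WALL_BY_AGE_BAND = {
    "pre-1930": "Solid brick, as built",
    "1930-1982": "Cavity wall, as built, no insulation",
    "post-1982": "Cavity wall, as built, insulated",
}


def resolve_wall(recorded: Optional[str], age_band: Optional[str]) -> str:
    """Prefer the recorded wall description; fall back to the age band."""
    if recorded and recorded.strip():
        return recorded.strip()
    # A plain dictionary lookup keeps the result deterministic:
    # the same inputs always resolve to the same output.
    return AS_BUILT_WALL_BY_AGE_BAND.get(age_band or "", "Unknown")
```

For example, `resolve_wall(None, "pre-1930")` falls back to the age-band entry, while `resolve_wall("Cavity filled", "pre-1930")` keeps the recorded value.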
TODOs
- In backend/onboarders/mappings/parity/glazing.py we currently map the Parity descriptions to tuples of descriptions and efficiency ratings. This is fine for the moment, but we may consider using a data class, given how error-prone tuples are.
- The same applies to the heating mappings in backend/onboarders/mappings/parity/heating.py
- Implement an AI-enabled version to replace the standardised asset list
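As a possible shape for the data class mentioned in the first two TODOs, the sketch below replaces positional tuples with named fields. The `GlazingMapping` name, its field names and the sample entries are hypothetical and do not reflect the current contents of glazing.py.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GlazingMapping:
    """One mapped glazing value: a description plus its efficiency rating."""

    description: str
    efficiency_rating: str


# Hypothetical entries, keyed by the raw source description.
GLAZING_MAPPINGS = {
    "double glazed": GlazingMapping(
        description="Double glazing", efficiency_rating="Average"
    ),
    "single glazed": GlazingMapping(
        description="Single glazing", efficiency_rating="Very Poor"
    ),
}


def map_glazing(raw: str) -> GlazingMapping:
    """Look up a raw description, ignoring case and surrounding whitespace."""
    return GLAZING_MAPPINGS[raw.strip().lower()]
```

Named fields make it harder to swap the description and the rating by accident, and `frozen=True` stops a shared mapping entry from being mutated at runtime.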