diff --git a/backend/onboarders/README.md b/backend/onboarders/README.md
new file mode 100644
index 00000000..063fee20
--- /dev/null
+++ b/backend/onboarders/README.md
@@ -0,0 +1,102 @@
+# Retrofit Property Data Onboarding
+
+This repository contains an ETL pipeline for transforming raw retrofit property data from external source systems
+(currently Parity) into a standardised internal format compatible with both address2uprn and engine.
+
+The pipeline is designed to:
+
+- Run as an AWS Lambda triggered by SQS
+- Read raw CSV/XLSX files from S3
+- Perform rule-based mappings
+- Infer as-built property attributes based on age
+- Write a processed CSV back to S3 for address2uprn to consume
+
+### Structure
+
+SQS → Lambda handler → OnboarderFactory → System-specific Onboarder → Mapping → CSV to S3
+
+Each source system implements its own **Onboarder**, while sharing a common base and mapping process.
+
+---
+
+### Repository Structure
+
+onboarders/
+├── `handler.py` # Lambda entrypoint \
+├── `factory.py` # Onboarder factory \
+├── `base.py` # Shared onboarding base class \
+├── `parity.py` # Parity-specific transformation logic \
+├── `mappings/` \
+│   └── `parity/` # Parity domain mappings & classifiers \
+│       ├── `age_band.py` \
+│       ├── `property_type.py` \
+│       ├── `built_form.py` \
+│       ├── `walls.py` \
+│       ├── `roof.py` \
+│       ├── `floor.py` \
+│       ├── `glazing.py` \
+│       ├── `heating.py` \
+│       ├── `as_built_wall_classifiers.py` \
+│       ├── `as_built_roof_classifiers.py` \
+│       └── `as_built_floor_classifiers.py` \
+├── `tests/` \
+├── `requirements.txt` \
+└── `README.md`
+
+
+---
+
+### Lambda Entry Point (`handler.py`)
+
+The Lambda handler:
+
+1. Consumes messages from the SQS queue
+2. Validates the payload
+3. Instantiates the correct onboarder via `OnboarderFactory`
+4. Runs the transformation
+5. Writes the transformed CSV back to S3
+
+### Expected Event Payload
+
+```json
+{
+  "s3_uri": "s3://bucket/path/to/input.xlsx",
+  "system": "parity",
+  "format": "xlsx",
+  "sheet_name": "Sustainability"
+}
+
+```
+
+### Onboarder Base (`base.py`)
+
+`OnboarderBase` provides shared functionality across all systems.
+
+*Responsibilities*
+
+- Reading CSV/XLSX files from S3
+- Writing transformed CSVs to S3
+- Defining canonical output column names
+- Providing validation helpers
+- Common output: for the moment, onboarders are expected to return a CSV
+
+### Parity Onboarder (`parity.py`)
+
+`ParityOnboarder` contains all Parity-specific transformation logic.
+
+*Responsibilities*
+
+- Map raw Parity fields to internal EPC-aligned enums
+- Infer “as-built” constructions using age bands when insulation data is missing
+- Resolve energy efficiency ratings deterministically
+- Normalise output into a fixed schema
+
+The `transform()` method orchestrates the transformation process.
+
+### TODOs
+
+- In `backend/onboarders/mappings/parity/glazing.py` we currently map the Parity descriptions
+  to tuples of descriptions and efficiency ratings. This is okay for the moment, but we may consider
+  using a data class, given how error-prone tuples are.
+- The same applies to the heating mappings in `backend/onboarders/mappings/parity/heating.py`
+- Implement an AI-enabled version to replace the standardised asset list
\ No newline at end of file
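
Note on the glazing and heating TODOs: the README suggests replacing the tuples of description and efficiency rating with a data class. A minimal sketch of what that might look like follows. `GlazingMapping`, `GLAZING_MAPPINGS`, `map_glazing`, and the example descriptions and ratings are all hypothetical placeholders, not taken from `glazing.py`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GlazingMapping:
    """One mapped glazing entry; named fields replace positional tuple access."""

    description: str
    efficiency_rating: str


# Hypothetical raw-description -> mapping table. The keys and values are
# placeholders for illustration, not the real Parity vocabulary.
GLAZING_MAPPINGS = {
    "double glazed": GlazingMapping(
        description="Double glazing", efficiency_rating="Average"
    ),
    "single glazed": GlazingMapping(
        description="Single glazing", efficiency_rating="Very Poor"
    ),
}


def map_glazing(raw: str) -> Optional[GlazingMapping]:
    """Look up a raw description, ignoring case and surrounding whitespace."""
    return GLAZING_MAPPINGS.get(raw.strip().lower())
```

Constructing entries with keyword arguments makes a swapped description and rating visible at the definition site, which a bare two-string tuple does not. `frozen=True` also prevents entries from being mutated after the table is built. The same shape could be reused for the heating mappings.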