mirror of https://github.com/Hestia-Homes/Model.git synced 2026-07-27 23:35:01 +00:00


Khalim Conn-Kowlessar 3a864c6c21 Added the drop columns for roof		2023-09-19 11:20:54 +01:00
core	fixed but which is dropping roof description	2023-09-17 22:20:23 +01:00
Dockerfiles	cleaned up docker file for local testing	2023-09-04 14:46:33 +01:00
handlers	Fixing issue with sap model parsing event body	2023-09-12 18:48:33 +01:00
MLModel	lambda can only write to /tmp folder to using /tmp for model download	2023-09-05 12:43:51 +01:00
model_build_data/change_data/rdsap_full	begin boto3 chagne	2023-09-01 18:19:07 +01:00
preprocessed_data	add a test data generation script	2023-08-12 19:11:22 +00:00
requirements	cleaned up docker file for local testing	2023-09-04 14:46:33 +01:00
.gitignore	fixed minor bug with reading in model	2023-09-05 12:18:42 +01:00
.pre-commit-config.yaml	add pre-commit hook, no customisation on black	2023-08-29 17:28:48 +01:00
__init__.py	Adding database connection to fastapi	2023-07-31 11:02:22 +01:00
area_data.py	basic process for creating of area dataset complete	2023-09-04 12:08:55 +01:00
docker-compose.yml	Update docker-compose.yml	2023-09-04 23:38:34 +01:00
energy_predictor.py	Recolved merge conflicts	2023-08-25 12:15:19 +01:00
generate_rdsap_change.py	Added the drop columns for roof	2023-09-19 11:20:54 +01:00
Makefile	Add extra steps to readme and makefile	2023-08-29 18:16:46 +01:00
predictions.py	lambda can only write to /tmp folder to using /tmp for model download	2023-09-05 12:43:51 +01:00
README.md	Add extra steps to readme and makefile	2023-08-29 18:16:46 +01:00
regenerate_metrics.py	Added hnalder and dockerfiles	2023-08-31 14:46:10 +01:00
requirements.txt	add requirements for seaborn	2023-08-19 13:57:45 +00:00
test_data_generation.py	Added the drop columns for roof	2023-09-19 11:20:54 +01:00
training.py	cleaned up docker file for local testing	2023-09-04 14:46:56 +01:00

README.md

Simulation System

Starter Readme: Steps for pipeline:

(WIP) Set up the training development environment
- Change directory to this folder (simulation_system)
- Run the following command: make env PYTHON_VERSION=3.10.12
- This installs the specified Python version using pyenv and selects it as the global Python version
- It will install all training packages as specified in the training-dev.txt requirements file, including the pre-commit hooks
- Run source .training_env/bin/activate to use this environment
(WIP) Use the Makefile to start up a mock S3 service
- Running make init runs docker-compose build and docker-compose up -d, which spins up a mock S3 service
- Docker Compose runs in detached mode (-d), so it will not output anything to the terminal
Once the Minio service is running, you can run the training.py file to start a model build process
- This will output a model for a given target column and give it a model name composed of some of the hyperparameters
- An example of running this file is:
  - python3 training.py --train-filepath ./model_build_data/change_data/rdsap_full/train_validation_data.parquet --test-filepath ./model_build_data/change_data/rdsap_full/test_data.parquet
- Outputs of the pipeline are:
  - A model directory bucket
  - A target variable prefix (e.g. RDSAP_CHANGE or HEAT_DEMAND_CHANGE)
  - A model type prefix (e.g. autogluon, tensorflow etc)
  - A model name prefix (e.g. rdsap_change_medium_quality_60_TIMESTAMP)
    - This model name is made up of target variable, quality, time spent training and timestamp
    - Within this prefix, there are three folders:
      - model
        
        The model path that can be loaded in the codebase
      - deployment
        
        The optimised model that can be deployed (may or may not need this)
      - metrics
        
        The metrics generated from the model (may or may not need this as this can be contained in the registry)
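
The naming and prefix layout described above can be sketched as follows. This is a minimal illustration, not the code in training.py: the function names, the timestamp format and the use of "/" as the prefix separator are all assumptions made for the example.

```python
from datetime import datetime, timezone


def build_model_name(target: str, quality: str, time_limit_s: int, when: datetime) -> str:
    """Join target variable, quality, training time and timestamp into one name."""
    # The timestamp format is an assumption; the README only says "TIMESTAMP".
    timestamp = when.strftime("%Y%m%d%H%M%S")
    return f"{target.lower()}_{quality}_{time_limit_s}_{timestamp}"


def build_model_prefixes(target: str, model_type: str, model_name: str) -> dict:
    """Return the model, deployment and metrics prefixes under the model name."""
    # Hypothetical layout: <target>/<model type>/<model name>/<folder>/
    base = f"{target}/{model_type}/{model_name}"
    return {folder: f"{base}/{folder}/" for folder in ("model", "deployment", "metrics")}


name = build_model_name(
    "RDSAP_CHANGE",
    "medium_quality",
    60,
    datetime(2023, 9, 4, 12, 0, 0, tzinfo=timezone.utc),
)
prefixes = build_model_prefixes("RDSAP_CHANGE", "autogluon", name)
print(name)
print(prefixes["model"])
```

Keeping the name builder separate from the prefix builder means the same model name can be reused as a registry key without re-deriving it from the storage path.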
Once the model build is finished, you can run the predictions.py file to generate predictions
- By default, the prediction pipeline selects the best model from the model registry, based on mean absolute error
  - This can be overridden by specifying a model_path, which will load an alternative model
- There are two ways of getting data into the pipeline:
  - Using the --data argument:
    - This is a JSON string which can be passed as python3 predictions.py --data '{"TOTAL_FLOOR_AREA": 1}'
      - Note the single and double quotation marks, as this affects the ingestion
  - Using the --data-path argument:
    - This is a path to a data file (in future we might want to pull data from S3 or a database instead)
- An example of running the file is:
  - python3 predictions.py --data-path ../simulation_system/model_build_data/change_data/rdsap_full/test_data.parquet
- Outputs of the pipeline are:
  - prediction bucket
  - a target variable prefix (e.g. RDSAP_CHANGE or HEAT_DEMAND_CHANGE)
  - a uprn prefix (e.g. 0123456789)
  - a prediction.json
  - a metadata.json
    - This is all the metadata from the model (can change this if needed)
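
As a rough sketch of the prediction inputs and the default model choice described above, the example below parses the two data arguments and picks the registry entry with the lowest mean absolute error unless a model path is given. It is not the code in predictions.py: the --model-path flag spelling, the mutually exclusive handling of --data and --data-path, and the registry structure (a list of dicts with model_path and mean_absolute_error keys) are assumptions made for illustration.

```python
import argparse
import json
from typing import List, Optional


def parse_args(argv: List[str]) -> argparse.Namespace:
    """Parse the prediction inputs; exactly one data source must be given."""
    parser = argparse.ArgumentParser(description="Prediction input sketch")
    # Treating the two data sources as mutually exclusive is an assumption.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--data", help="JSON string of feature values")
    source.add_argument("--data-path", help="Path to a data file")
    # Hypothetical flag name for the model_path override.
    parser.add_argument("--model-path", default=None)
    return parser.parse_args(argv)


def select_model_path(registry: List[dict], model_path: Optional[str] = None) -> str:
    """Use the override if given, else the entry with the lowest MAE."""
    if model_path is not None:
        return model_path
    best = min(registry, key=lambda entry: entry["mean_absolute_error"])
    return best["model_path"]


# On the command line the JSON is wrapped in single quotes so that the shell
# passes the inner double quotes through unchanged.
args = parse_args(["--data", '{"TOTAL_FLOOR_AREA": 1}'])
record = json.loads(args.data)

# Made-up registry entries, for illustration only.
registry = [
    {"model_path": "models/a", "mean_absolute_error": 4.2},
    {"model_path": "models/b", "mean_absolute_error": 3.1},
]
chosen = select_model_path(registry, args.model_path)
print(record, chosen)
```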
NOTE: If you wish to change any settings, these are currently all in the Settings.py file
- These will be separated out eventually, but for now keeping them in one file makes it easy to track anything that we might want to respecify.
  - For example, the model hyperparameters are in here but will move into a separate configuration file

TODO:

Structure/ MLOps:
- Add configuration files (dev, staging, prod), including hyperparameters
- Add precommit hooks (linters, branch names, etc)
- Sphinx documentation
- Sort out local mock up services
- Sort out Model Registry
- Sort out Data version control
- pre-commit hooks:
  - The types of hooks that we want (safety, bandit, iso8 etc)
  - The customisations we require
Data Science:
- Implement a metrics class, to hold all metrics
- Rebuild metrics script (Could be a one off but good to have)
- Determine metrics
- Implement and test custom model (Tensorflow Decision Trees etc)
Orchestration:
- Lambda handler for the pipeline