mirror of https://github.com/Hestia-Homes/Model.git synced 2026-07-27 23:35:01 +00:00

Michael Duong a95795c1f9 remove residual output at base folder		2023-08-19 17:04:57 +00:00
core	Add requirements file	2023-08-18 12:46:13 +01:00
MLModel	Add requirements file	2023-08-18 12:46:13 +01:00
model_build_data/change_data/rdsap_full	Added split training data	2023-08-15 23:13:34 +00:00
preprocessed_data	add a test data generation script	2023-08-12 19:11:22 +00:00
__init__.py	Adding database connection to fastapi	2023-07-31 11:02:22 +01:00
docker-compose.yml	Add readme and a better folder output structure. Starter code for mock up s3	2023-08-18 10:18:17 +01:00
energy_predictor.py	Add readme and a better folder output structure. Starter code for mock up s3	2023-08-18 10:18:17 +01:00
generate_rdsap_change.py	added mlmodel, prediction and training files	2023-08-17 16:07:22 +01:00
Makefile	Add readme and a better folder output structure. Starter code for mock up s3	2023-08-18 10:18:17 +01:00
predictions.py	Add requirements file	2023-08-18 12:46:13 +01:00
README.md	cleaned up settings	2023-08-18 11:09:21 +01:00
requirements.txt	add requirements for seaborn	2023-08-19 13:57:45 +00:00
test_data_generation.py	added mlmodel, prediction and training files	2023-08-17 16:07:22 +01:00
training.py	remove residual output at base folder	2023-08-19 17:04:57 +00:00

README.md

Simulation System

Starter Readme: Steps for pipeline:

(WIP) Use the Makefile to start the mock S3 service
- Running make init runs docker-compose build and docker-compose up -d, which spins up a mock S3 service (MinIO)
- Docker Compose runs in detached mode (-d), so it will not output anything to the terminal
Once the MinIO service is running, you can run the training.py file to start a model build process
- This will output a model for a given target column, with a model name composed of some of the hyperparameters
- An example of running this file is:
  - python3 training.py --train-filepath ./model_build_data/change_data/rdsap_full/train_validation_data.parquet --test-filepath ./model_build_data/change_data/rdsap_full/test_data.parquet
- Outputs of the pipeline are:
  - A model directory bucket
  - A target variable prefix (e.g. RDSAP_CHANGE or HEAT_DEMAND_CHANGE)
  - A model type prefix (e.g. autogluon, tensorflow)
  - A model name prefix (e.g. rdsap_change_medium_quality_60_TIMESTAMP)
    - This model name is made up of the target variable, quality, time spent training and a timestamp
    - Within this prefix, there are three folders:
      - model
        
        The model path that can be loaded in the codebase
      - deployment
        
        The optimised model that can be deployed (may or may not be needed)
      - metrics
        
        The metrics generated from the model (may or may not be needed, as these can be held in the registry)
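
The output layout described above can be sketched as a pair of small helpers. This is only an illustration and not the code in training.py: the function names, the bucket name models, the timestamp format and the "/" separator are all assumptions.

```python
from datetime import datetime


def build_model_name(target: str, quality: str, time_limit: int, now: datetime) -> str:
    """Compose a name such as rdsap_change_medium_quality_60_<timestamp>."""
    # Assumed timestamp format; the README only says "TIMESTAMP".
    timestamp = now.strftime("%Y%m%dT%H%M%S")
    return f"{target.lower()}_{quality}_{time_limit}_{timestamp}"


def build_output_prefixes(bucket: str, target: str, model_type: str, model_name: str) -> dict:
    """Return the three folders that sit under the model name prefix."""
    base = f"{bucket}/{target}/{model_type}/{model_name}"
    return {folder: f"{base}/{folder}" for folder in ("model", "deployment", "metrics")}


# Hypothetical values, chosen to match the example name in the README.
name = build_model_name("RDSAP_CHANGE", "medium_quality", 60, datetime(2023, 8, 19, 17, 4, 57))
prefixes = build_output_prefixes("models", "RDSAP_CHANGE", "autogluon", name)
print(prefixes["metrics"])
```

Keeping the target variable and model type ahead of the model name means all models for one target can be listed with a single prefix query.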
Once the model build is finished, you can run the predictions.py file to generate predictions
- By default, the prediction pipeline selects the best model from the model registry, based on mean absolute error
  - This can be overridden by specifying a model_path, which loads an alternative model
- There are two ways of getting data into the pipeline:
  - Using the --data argument:
    - This is a JSON string, passed as python3 predictions.py --data '{"TOTAL_FLOOR_AREA": 1}'
      - Note the single quotation marks around the string and the double quotation marks inside it, as the quoting affects the ingestion
  - Using the --data-path argument:
    - This is a filepath (in future we might want to pull data from S3 or a database)
- An example of running the file is:
  - python3 predictions.py --data-path ../simulation_system/model_build_data/change_data/rdsap_full/test_data.parquet
- Outputs of the pipeline are:
  - A prediction bucket
  - A target variable prefix (e.g. RDSAP_CHANGE or HEAT_DEMAND_CHANGE)
  - A UPRN prefix (e.g. 0123456789)
  - A prediction.json
  - A metadata.json
    - This contains all the metadata from the model (can be changed if needed)
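
As a rough sketch of the input handling and model selection described above, the following uses argparse and a plain list standing in for the model registry. It is not the code in predictions.py: the --model-path flag spelling, the mutual exclusivity of --data and --data-path, and the registry keys model_path and mean_absolute_error are assumptions made for illustration.

```python
import argparse
import json


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Prediction input sketch")
    # Assumption: exactly one of the two data sources must be given.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--data", type=json.loads, help="JSON string holding one record")
    source.add_argument("--data-path", help="Path to a data file, e.g. a parquet file")
    # Assumption: the model_path override is exposed as --model-path.
    parser.add_argument("--model-path", default=None, help="Load this model instead of the best one")
    return parser.parse_args(argv)


def select_model_path(registry, model_path=None):
    """Use the explicit model_path if given, else the entry with the lowest MAE."""
    if model_path is not None:
        return model_path
    best = min(registry, key=lambda entry: entry["mean_absolute_error"])
    return best["model_path"]


# Hypothetical registry contents.
registry = [
    {"model_path": "model_a", "mean_absolute_error": 4.2},
    {"model_path": "model_b", "mean_absolute_error": 3.1},
]
args = parse_args(["--data", '{"TOTAL_FLOOR_AREA": 1}'])
print(args.data, select_model_path(registry, args.model_path))
```

Passing json.loads as the argparse type means a malformed --data string is rejected with a usage error before any model is loaded.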
NOTE: If you wish to change any settings, these are currently all in the Settings.py file
- These will be separated out eventually, but for now the file keeps track of anything that we might want to respecify.
  - E.g. the hyperparameters for models are in here, but will move into a separate configuration file
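
One possible shape for that future configuration, sketched with a frozen dataclass. The class name, field names and default values are hypothetical and are not taken from Settings.py; they only mirror the pieces that appear in the example model name (target variable, model type, quality, training time).

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TrainingSettings:
    # Hypothetical defaults, mirroring the example model name in this README.
    target: str = "RDSAP_CHANGE"
    model_type: str = "autogluon"
    quality: str = "medium_quality"
    time_limit: int = 60  # unit is not stated in the README


def load_settings(overrides: dict) -> TrainingSettings:
    """Apply per-environment overrides on top of the defaults.

    An unknown key raises TypeError, so typos in a config file fail loudly.
    """
    return replace(TrainingSettings(), **overrides)


# A dev, staging or prod file could supply only the values that differ.
prod = load_settings({"quality": "high_quality", "time_limit": 600})
print(prod)
```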

TODO:

Structure/MLOps:
- Add configuration files (dev, staging, prod), including hyperparameters
- Add pre-commit hooks (linters, branch names, etc)
- Sphinx documentation
- Sort out local mock up services
- Sort out Model Registry
- Sort out Data version control
Data Science:
- Implement a metrics class to hold all metrics
- Rebuild metrics script (could be a one-off, but good to have)
- Determine metrics
- Implement and test custom model (Tensorflow Decision Trees etc)
Orchestration:
- Lambda handler for the pipeline