mirror of
https://github.com/Hestia-Homes/Model.git
synced 2026-06-08 11:17:27 +00:00
Simulation System
Starter Readme: Steps for pipeline:
- (WIP) Set up the training development environment
  - Change directory to this folder (`simulation_system`)
  - Run the following command: `make env PYTHON_VERSION=3.10.12`
    - This installs the specified Python version using `pyenv` and selects it as the global Python version
    - It installs all training packages specified in the `training-dev.txt` requirements file, including the pre-commit hooks
  - Run `source .training_env/bin/activate` to use this environment
- (WIP) Use the Makefile to start up the mock S3 service
  - Running `make init` runs `docker-compose build` and `docker-compose up -d`, which spins up an S3 service
  - Docker Compose runs in detached mode (`-d`), so it will not output anything to the terminal
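Because the service starts in detached mode and prints nothing, it can be handy to check from Python that it is actually up. The following is a minimal sketch, not part of the repository: it assumes the mock S3 service is MinIO listening on `http://localhost:9000` (the real host and port depend on the docker-compose file), and the function names `health_url` and `s3_mock_is_live` are hypothetical. It queries MinIO's liveness endpoint, `/minio/health/live`.

```python
import urllib.error
import urllib.request


def health_url(endpoint: str) -> str:
    """Build the MinIO liveness probe URL for a given endpoint."""
    return f"{endpoint.rstrip('/')}/minio/health/live"


def s3_mock_is_live(endpoint: str = "http://localhost:9000", timeout: float = 2.0) -> bool:
    """Return True if the liveness probe answers with HTTP 200.

    The default endpoint is an assumption; adjust it to match the port
    mapping in the docker-compose file.
    """
    try:
        with urllib.request.urlopen(health_url(endpoint), timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False
```

Calling `s3_mock_is_live()` shortly after `make init` should return `True` once the container has finished starting, and `False` if nothing is listening on the assumed endpoint.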
- Once the MinIO service is running, you can run the `training.py` file to start a model build process
  - This will output a model for a given target column, with a model name composed of some of the hyperparameters
  - An example of running this file is:
    `python3 training.py --train-filepath ./model_build_data/change_data/rdsap_full/train_validation_data.parquet --test-filepath ./model_build_data/change_data/rdsap_full/test_data.parquet`
  - Outputs of the pipeline are:
    - A model directory bucket
    - A target variable prefix (e.g. RDSAP_CHANGE or HEAT_DEMAND_CHANGE)
    - A model type prefix (e.g. autogluon, tensorflow)
    - A model name prefix (e.g. rdsap_change_medium_quality_60_TIMESTAMP)
      - This model name is made up of the target variable, quality, time spent training and timestamp
    - Within this prefix, there are three folders:
      - model
        - The model path that can be loaded in the codebase
      - deployment
        - The optimised model that can be deployed (may or may not be needed)
      - metrics
        - The metrics generated from the model (may or may not be needed, as these can be held in the registry)
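The output layout described above can be sketched as follows. This is an illustration only, not the code in `training.py`: the helper names `build_model_name` and `artifact_prefixes` are hypothetical, the timestamp format is an assumption (the README only shows `TIMESTAMP`), and the ordering of the prefixes is inferred from the list above.

```python
from datetime import datetime, timezone
from typing import Dict, Optional

# The three folders the README lists under each model name prefix.
ARTIFACT_FOLDERS = ("model", "deployment", "metrics")


def build_model_name(
    target: str,
    quality: str,
    time_limit: int,
    timestamp: Optional[datetime] = None,
) -> str:
    """Compose a name from target variable, quality, training time and timestamp."""
    moment = timestamp or datetime.now(timezone.utc)
    # Assumed timestamp format; the real pipeline may use a different one.
    stamp = moment.strftime("%Y%m%dT%H%M%S")
    return f"{target.lower()}_{quality}_{time_limit}_{stamp}"


def artifact_prefixes(target: str, model_type: str, model_name: str) -> Dict[str, str]:
    """Return the key prefix for each artifact folder inside the model bucket."""
    base = f"{target}/{model_type}/{model_name}"
    return {folder: f"{base}/{folder}/" for folder in ARTIFACT_FOLDERS}
```

For example, `build_model_name("RDSAP_CHANGE", "medium_quality", 60)` gives a name of the form `rdsap_change_medium_quality_60_<timestamp>`, and `artifact_prefixes("RDSAP_CHANGE", "autogluon", name)` maps `model`, `deployment` and `metrics` to their prefixes under that name.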
- Once the model build is finished, you can run the `predictions.py` file to generate predictions
  - By default, the prediction pipeline selects the best model from the model registry, based on mean absolute error
  - This can be overridden by specifying a `model_path`, which loads an alternative model
  - There are two ways of getting data into the pipeline:
    - Using the `--data` argument:
      - This is a JSON string, which can be passed as `python3 predictions.py --data '{"TOTAL_FLOOR_AREA": 1}'`
      - Note the single quotation marks around the string and the double quotation marks inside it, as this affects the ingestion
    - Using the `--data-path` argument:
      - This can be a filepath (we might later want to pull data from S3 or a database)
  - An example of running the file is:
    `python3 predictions.py --data-path ../simulation_system/model_build_data/change_data/rdsap_full/test_data.parquet`
  - Outputs of the pipeline are:
    - A prediction bucket
    - A target variable prefix (e.g. RDSAP_CHANGE or HEAT_DEMAND_CHANGE)
    - A UPRN prefix (e.g. 0123456789)
    - A `prediction.json`
    - A `metadata.json`
      - This is all the metadata from the model (can be changed if needed)
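The input handling and model selection described in this step could look roughly like the sketch below. It is not the actual prediction script: the `--model-path` flag name, the mutual exclusivity of `--data` and `--data-path`, and the registry entry keys (`model_path`, `mean_absolute_error`) are all assumptions made for illustration. It also shows why the quoting matters: the shell's single quotes keep the inner double quotes intact, so the argument reaches `json.loads` as valid JSON.

```python
import argparse
import json
from typing import Any, Dict, List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Parser accepting inline JSON or a file path (assumed to be mutually exclusive)."""
    parser = argparse.ArgumentParser(description="Prediction pipeline input sketch")
    source = parser.add_mutually_exclusive_group(required=True)
    # json.loads turns the quoted JSON string into a dict at parse time.
    source.add_argument("--data", type=json.loads, help="Inline JSON string")
    source.add_argument("--data-path", help="Path to a data file")
    # Hypothetical flag name for the model_path override.
    parser.add_argument("--model-path", default=None, help="Load this model instead")
    return parser


def select_model_path(
    registry: List[Dict[str, Any]],
    model_path: Optional[str] = None,
) -> str:
    """Use the override if given, else the registry entry with the lowest MAE."""
    if model_path is not None:
        return model_path
    if not registry:
        raise ValueError("model registry is empty and no model_path was given")
    best = min(registry, key=lambda entry: entry["mean_absolute_error"])
    return best["model_path"]
```

With this sketch, `build_parser().parse_args(["--data", '{"TOTAL_FLOOR_AREA": 1}'])` yields `args.data` as a dictionary, and `select_model_path(registry, args.model_path)` falls back to the lowest-error registry entry when no override is passed.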
- NOTE: If you wish to change any settings, these are currently all in the `Settings.py` file
  - It will be separated out eventually, but for now it works to keep track of anything that we might want to respecify
    - E.g. the hyperparameters for models are in here but will move into a separate configuration file
TODO:
- Structure/ MLOps:
- Add configuration files (dev, staging, prod), including hyperparameters
- Add pre-commit hooks (linters, branch names, etc.)
- Sphinx documentation
- Sort out local mock up services
- Sort out Model Registry
- Sort out Data version control
- pre-commit hooks:
- The types of hooks that we want (safety, bandit, iso8 etc)
- The customisations we require
- Add sphinx documentation
- Data Science:
- Implement a metrics class to hold all metrics
- Rebuild metrics script (Could be a one off but good to have)
- Determine metrics
- Implement and test custom model (Tensorflow Decision Trees etc)
- Orchestration:
- Lambda handler for the pipeline