Model/model_data/simulation_system/README.md
2023-08-29 18:16:46 +01:00

# Simulation System
Starter README

Steps for the pipeline:
- (WIP) Set up the training development environment
    - Change directory to this folder (`simulation_system`)
    - Run `make env PYTHON_VERSION=3.10.12`
        - This installs the specified Python version using `pyenv` and selects it as the global Python version
        - It installs all training packages specified in the `training-dev.txt` requirements file, including the pre-commit hooks
    - Run `source .training_env/bin/activate` to use this environment
- (WIP) Use the Makefile to start the mock S3 service
    - Running `make init` runs `docker-compose build` and `docker-compose up -d`, which spins up a mock S3 service (Minio)
    - Docker Compose runs in detached mode (`-d`), so it will not output anything to the terminal
- Once the Minio service is running, you can run the `training.py` file to start a model build process
    - This outputs a model for a given target column, with a model name composed of some of the hyperparameters
    - An example of running this file is:
        - `python3 training.py --train-filepath ./model_build_data/change_data/rdsap_full/train_validation_data.parquet --test-filepath ./model_build_data/change_data/rdsap_full/test_data.parquet`
    - Outputs of the pipeline are:
        - A model directory bucket
            - A target variable prefix (e.g. RDSAP_CHANGE or HEAT_DEMAND_CHANGE)
            - A model type prefix (e.g. autogluon, tensorflow etc.)
            - A model name prefix (e.g. rdsap_change_medium_quality_60_TIMESTAMP)
                - The model name is made up of the target variable, quality, time spent training and a timestamp
                - Within this prefix, there are three folders:
                    - model
                        - The model path that can be loaded in the codebase
                    - deployment
                        - The optimised model that can be deployed (may or may not be needed)
                    - metrics
                        - The metrics generated from the model (may or may not be needed, as these can be held in the registry)
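
The naming and layout described above can be sketched in Python. This is a minimal illustration, not the code in `training.py`: the function names `build_model_name` and `model_output_prefixes`, the underscore separator, the timestamp format and the order of the prefixes are assumptions. Only the name components (target variable, quality, training time, timestamp) and the three folder names come from the description above.

```python
from datetime import datetime, timezone

# Folder names taken from the README; everything else here is hypothetical.
MODEL_SUBFOLDERS = ("model", "deployment", "metrics")


def build_model_name(target, quality, time_limit, timestamp=None):
    """Compose a model name from target variable, quality, training time and timestamp."""
    if timestamp is None:
        # Assumed timestamp format; the real pipeline may use a different one.
        timestamp = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"{target.lower()}_{quality}_{time_limit}_{timestamp}"


def model_output_prefixes(target, model_type, model_name):
    """Return the three folder prefixes under <target>/<model_type>/<model_name>/."""
    base = f"{target}/{model_type}/{model_name}"
    return {folder: f"{base}/{folder}/" for folder in MODEL_SUBFOLDERS}


name = build_model_name("RDSAP_CHANGE", "medium_quality", 60, timestamp="TIMESTAMP")
print(name)  # rdsap_change_medium_quality_60_TIMESTAMP

for folder, prefix in model_output_prefixes("RDSAP_CHANGE", "autogluon", name).items():
    print(folder, prefix)
```

Passing the timestamp explicitly, as in the usage lines, keeps the name reproducible when testing; leaving it out falls back to the current UTC time.
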
- Once the model build is finished, you can run the `prediction.py` file to generate predictions
    - By default, the prediction pipeline selects the best model from the model registry, based on **mean absolute error**
    - This can be overridden by specifying a `model_path`, which loads an alternative model
    - There are two ways of getting data into the pipeline:
        - Using the `--data` argument:
            - This is a JSON string, which can be passed as `python3 predictions.py --data '{"TOTAL_FLOOR_AREA": 1}'`
            - Note the single quotes around the JSON string and the double quotes inside it, as these affect the ingestion
        - Using the `--data-path` argument:
            - This can be a filepath (we may later want to pull data from S3 or a database)
            - An example of running the file is:
                - `python3 predictions.py --data-path ../simulation_system/model_build_data/change_data/rdsap_full/test_data.parquet`
    - Outputs of the pipeline are:
        - A prediction bucket
            - A target variable prefix (e.g. RDSAP_CHANGE or HEAT_DEMAND_CHANGE)
            - A UPRN prefix (e.g. 0123456789)
                - A `prediction.json`
                - A `metadata.json`
                    - This is all the metadata from the model (this can be changed if needed)
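
The prediction flow above can be sketched as follows. This is a hedged illustration rather than the actual prediction script: the function names, the `--model-path` flag spelling, the registry record shape (`model_path`, `mean_absolute_error`), the key order in the output paths, and the choice to make `--data` and `--data-path` mutually exclusive are all assumptions. The registry entries are made-up values for illustration only.

```python
import argparse
import json


def parse_args(argv=None):
    """Parse the two input options; this sketch assumes exactly one is given."""
    parser = argparse.ArgumentParser(description="Prediction input sketch")
    source = parser.add_mutually_exclusive_group(required=True)
    # json.loads turns the quoted JSON string into a dict; invalid JSON is rejected.
    source.add_argument("--data", type=json.loads, help="Record as a JSON string")
    source.add_argument("--data-path", help="Path to a file of records")
    # Optional override of the registry's best model (flag name is assumed).
    parser.add_argument("--model-path", default=None)
    return parser.parse_args(argv)


def select_best_model(registry):
    """Return the registry entry with the lowest mean absolute error."""
    return min(registry, key=lambda entry: entry["mean_absolute_error"])


def prediction_output_keys(target, uprn):
    """Return the two object keys written under <target>/<uprn>/."""
    base = f"{target}/{uprn}"
    return [f"{base}/prediction.json", f"{base}/metadata.json"]


args = parse_args(["--data", '{"TOTAL_FLOOR_AREA": 1}'])
print(args.data)  # {'TOTAL_FLOOR_AREA': 1}

# Made-up registry entries, for illustration only.
registry = [
    {"model_path": "model_a", "mean_absolute_error": 4.2},
    {"model_path": "model_b", "mean_absolute_error": 3.1},
]
model_path = args.model_path or select_best_model(registry)["model_path"]
print(model_path)  # model_b
print(prediction_output_keys("RDSAP_CHANGE", "0123456789"))
```

Using `type=json.loads` means a badly quoted `--data` value fails at argument parsing, before any model is loaded.
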
- NOTE: If you wish to change any settings, these are currently all in the `Settings.py` file
    - They will be separated out eventually, but for now it works to keep everything that we might want to respecify in one place
    - E.g. the model hyperparameters are in here but will move into a separate configuration file
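
One possible shape for the planned split between code defaults and a separate configuration file is sketched below. It does not reflect the current contents of `Settings.py`: the `TrainingSettings` class, its field names and default values, and the `load_settings` helper are all hypothetical, loosely based on the model name components listed earlier in this README.

```python
import json
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class TrainingSettings:
    # Hypothetical fields and defaults; the real settings may differ.
    target: str = "RDSAP_CHANGE"
    quality: str = "medium_quality"
    time_limit: int = 60


def load_settings(config_text, defaults=TrainingSettings()):
    """Override the defaults with values from a JSON configuration string."""
    overrides = json.loads(config_text)
    # replace() raises TypeError on unknown keys, which catches typos in the config.
    return replace(defaults, **overrides)


settings = load_settings('{"time_limit": 120}')
print(settings)
```

Keeping the defaults in a frozen dataclass means a configuration file only needs to list the values that differ per environment.
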
# TODO:
- Structure/MLOps:
    - Add configuration files (dev, staging, prod), including hyperparameters
    - Add pre-commit hooks (linters, branch names, etc.)
    - Sphinx documentation
    - Sort out local mock-up services
    - Sort out the model registry
    - Sort out data version control
- Pre-commit hooks:
    - The types of hooks that we want (safety, bandit, iso8 etc.)
    - The customisations we require
- Add sphinx documentation
- Data Science:
    - Implement a metrics class to hold all metrics
    - Rebuild metrics script (could be a one-off, but good to have)
    - Determine metrics
    - Implement and test custom models (Tensorflow Decision Trees etc.)
- Orchestration:
    - Lambda handler for the pipeline