# Simulation System Starter Readme:

Steps for pipeline:

- (WIP) Use the Makefile to start up the mock S3 service
  - Running `make init` runs `docker-compose build` and `docker-compose up -d`, which spins up an S3 service
  - Docker Compose runs in detached mode (`-d`), so it will not output anything to the terminal
- Once the MinIO service is running, you can run the `training.py` file to start a model build process
  - This outputs a model for a given target column, with a model name composed of some of the hyperparameters
  - An example of running this file is:
    - `python3 training.py --train-filepath ./model_build_data/change_data/rdsap_full/train_validation_data.parquet --test-filepath ./model_build_data/change_data/rdsap_full/test_data.parquet`
  - Outputs of the pipeline are:
    - A model directory bucket
      - A target variable prefix (e.g. RDSAP_CHANGE or HEAT_DEMAND_CHANGE)
        - A model type prefix (e.g. autogluon, tensorflow etc.)
          - A model name prefix (e.g. rdsap_change_medium_quality_60_TIMESTAMP)
            - This model name is made up of the target variable, quality, time spent training and timestamp
            - Within this prefix, there are three folders:
              - model
                - The model path that can be loaded in the codebase
              - deployment
                - The optimised model that can be deployed (may or may not need this)
              - metrics
                - The metrics generated from the model (may or may not need this, as this can be contained in the registry)
- Once the model build is finished, you can run the `prediction.py` file to generate predictions
  - By default, the prediction pipeline selects the best model from the model registry based on **mean absolute error**
    - This can be overridden by specifying a model_path, which loads an alternative model
  - There are two ways of getting data into the pipeline:
    - Using the `--data` argument:
      - This is a JSON string, which can be passed as `python3 predictions.py --data '{"TOTAL_FLOOR_AREA": 1}'`
      - Note the single quotation marks around the whole string and the double quotation marks inside it, as this affects the ingestion
    - Using the `--data-path` argument:
      - This can be a filepath (we might later want to pull data from S3/DB)
  - An example of running the file is:
    - `python3 predictions.py --data-path ../simulation_system/model_build_data/change_data/rdsap_full/test_data.parquet`
  - Outputs of the pipeline are:
    - A prediction bucket
      - A target variable prefix (e.g. RDSAP_CHANGE or HEAT_DEMAND_CHANGE)
        - A uprn prefix (e.g. 0123456789)
          - A `prediction.json`
          - A `metadata.json`
            - This is all the metadata from the model (can change this if needed)
- NOTE: If you wish to change any settings, these are currently all in the `Settings.py` file
  - It will be separated out eventually, but for now it works to keep track of anything that we might want to respecify
  - E.g. the hyperparameters for models are in here but will move into a separate configuration file

# TODO:

- Structure/MLOps:
  - Add configuration files (dev, staging, prod), including hyperparameters
  - Add pre-commit hooks (linters, branch names, etc.)
  - Sphinx documentation
  - Sort out local mock up services
  - Sort out Model Registry
  - Sort out Data version control
- Data Science:
  - Implement a metrics class to hold all metrics
  - Rebuild metrics script (could be a one-off but good to have)
  - Determine metrics
  - Implement and test custom model (TensorFlow Decision Trees etc.)
- Orchestration:
  - Lambda handler for the pipeline
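
As an illustration of the quotation mark note on the `--data` argument in the prediction step, here is a minimal sketch of how the two input options could be parsed. It is not the actual `predictions.py` code: the helper names `build_parser` and `parse_inline_data` are hypothetical, and treating `--data` and `--data-path` as mutually exclusive is an assumption. Only the two argument names come from this readme.

```python
import argparse
import json


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser: exactly one of --data / --data-path must be given (assumption)."""
    parser = argparse.ArgumentParser(description="Sketch of the prediction inputs")
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--data", help="Inline JSON object passed as a single string")
    source.add_argument("--data-path", help="Path to a data file, such as a parquet file")
    return parser


def parse_inline_data(raw: str) -> dict:
    """Decode the --data string. JSON only accepts double-quoted keys and strings."""
    record = json.loads(raw)
    if not isinstance(record, dict):
        raise ValueError("--data must be a JSON object")
    return record


# The shell strips the outer single quotes, so the script receives valid JSON.
args = build_parser().parse_args(["--data", '{"TOTAL_FLOOR_AREA": 1}'])
record = parse_inline_data(args.data)
print(record)  # {'TOTAL_FLOOR_AREA': 1}
```

If the quotation marks are swapped (double quotes outside, single quotes inside), the script receives `{'TOTAL_FLOOR_AREA': 1}`, which is not valid JSON, and `json.loads` raises `json.JSONDecodeError`.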