# Simulation System

Starter Readme:
Steps for pipeline:

- (WIP) Set up the training development environment
    - Change directory to this folder (`simulation_system`)
    - Run `make env PYTHON_VERSION=3.10.12`
    - This installs the specified Python version using `pyenv` and sets it as the global Python version
    - It also installs all training packages listed in the `training-dev.txt` requirements file, including the pre-commit hooks
    - Run `source .training_env/bin/activate` to activate this environment

- (WIP) Use the Makefile to start the mock S3 service
    - Running `make init` runs `docker-compose build` and `docker-compose up -d`, which spins up an S3 service
    - Docker Compose runs in detached mode (`-d`), so it will not output anything to the terminal
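
Because the detached containers print nothing, it can help to confirm the mock S3 service is actually up before training. The following is a minimal sketch, assuming the service is MinIO and is reachable on its default port 9000 at `localhost`; the function name `is_minio_live` and the endpoint are illustrative and not taken from this repository's compose file.

```python
import urllib.error
import urllib.request


def is_minio_live(endpoint: str = "http://localhost:9000", timeout: float = 2.0) -> bool:
    """Return True if the MinIO liveness probe answers with HTTP 200.

    The default endpoint is an assumption; adjust it to match the port
    published by the docker-compose file.
    """
    url = f"{endpoint.rstrip('/')}/minio/health/live"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout or HTTP error: treat as "not ready yet"
        return False
```

Calling `is_minio_live()` after `make init` returns `False` until the container accepts connections, so it can be polled in a short loop before starting `training.py`.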

- Once the MinIO service is running, you can run the `training.py` file to start a model build process
    - This will output a model for a given target column, with a model name composed of some of the hyperparameters
    - An example of running this file is:
        - `python3 training.py --train-filepath ./model_build_data/change_data/rdsap_full/train_validation_data.parquet --test-filepath ./model_build_data/change_data/rdsap_full/test_data.parquet`
    - Outputs of the pipeline are:
        - A model directory bucket
        - A target variable prefix (e.g. RDSAP_CHANGE or HEAT_DEMAND_CHANGE)
        - A model type prefix (e.g. autogluon, tensorflow)
        - A model name prefix (e.g. rdsap_change_medium_quality_60_TIMESTAMP)
            - This model name is made up of the target variable, quality, time spent training and a timestamp
            - Within this prefix, there are three folders:
                - model
                    - The model path that can be loaded in the codebase
                - deployment
                    - The optimised model that can be deployed (may or may not need this)
                - metrics
                    - The metrics generated from the model (may or may not need this, as they can be stored in the registry)
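
The output layout described above can be sketched as a pair of helpers. This is an illustration only: the function names, the timestamp format and the `/` separator are assumptions, not the code in `training.py`.

```python
from datetime import datetime, timezone
from typing import Optional


def build_model_name(target: str, quality: str, time_limit: int,
                     now: Optional[datetime] = None) -> str:
    """Compose a model name from target, quality, training time and timestamp."""
    now = now or datetime.now(timezone.utc)
    timestamp = now.strftime("%Y%m%d%H%M%S")  # timestamp format is an assumption
    return f"{target.lower()}_{quality}_{time_limit}_{timestamp}"


def build_model_prefixes(target: str, model_type: str, model_name: str) -> dict:
    """Return the three folder prefixes that sit under one model name."""
    base = f"{target}/{model_type}/{model_name}"
    return {folder: f"{base}/{folder}/" for folder in ("model", "deployment", "metrics")}
```

For example, `build_model_name("RDSAP_CHANGE", "medium_quality", 60)` yields a name of the form `rdsap_change_medium_quality_60_TIMESTAMP`, and `build_model_prefixes` places the `model`, `deployment` and `metrics` folders beneath the target and model type prefixes.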

- Once the model build is finished, you can run the `predictions.py` file to generate predictions
    - By default, the prediction pipeline will select the best model based on **mean absolute error** from the model registry
        - This can be overridden by specifying a `model_path`, which loads an alternative model
    - There are two ways of getting data into the pipeline:
        - Using the `--data` argument:
            - This is a JSON string, passed as `python3 predictions.py --data '{"TOTAL_FLOOR_AREA": 1}'`
                - Note the single quotes around the string and the double quotes inside it: JSON requires double quotes, and the outer single quotes stop the shell from stripping them
        - Using the `--data-path` argument:
            - This can be a filepath (we may later want to pull data from S3 or a database)
    - An example of running the file is:
        - `python3 predictions.py --data-path ../simulation_system/model_build_data/change_data/rdsap_full/test_data.parquet`
    - Outputs of the pipeline are:
        - A prediction bucket
        - A target variable prefix (e.g. RDSAP_CHANGE or HEAT_DEMAND_CHANGE)
        - A UPRN prefix (e.g. 0123456789)
        - A `prediction.json`
        - A `metadata.json`
            - This contains all the metadata from the model (this can be changed if needed)

- NOTE: If you wish to change any settings, these are currently all in the `Settings.py` file
    - These will be separated out eventually, but for now this keeps everything we might want to respecify in one place
        - E.g. the model hyperparameters live here but will move into a separate configuration file


# TODO:
- Structure/ MLOps:
    - Add configuration files (dev, staging, prod), including hyperparameters
    - Add pre-commit hooks (linters, branch names, etc)
    - Sphinx documentation
    - Sort out local mock up services
    - Sort out Model Registry
    - Sort out Data version control
    - pre-commit hooks:
        - The types of hooks that we want (safety, bandit, iso8 etc)
        - The customisations we require
- Data Science:
    - Implement a metrics class to hold all metrics
    - Rebuild metrics script (Could be a one off but good to have)
    - Determine metrics
    - Implement and test a custom model (TensorFlow decision trees etc)
- Orchestration:
    - Lambda handler for the pipeline