Simulation System

Starter Readme: Steps for pipeline:

Set up the training development environment
- Change directory to this folder (simulation_system)
- Run the following command: make env PYTHON_VERSION=3.10.12
- This will install the specified python version using pyenv and select this version as the global python version
- It will install all training packages as specified in the training-dev.txt requirements file, including the pre-commit hooks
- Run source .training_env/bin/activate to use this environment
(WIP) Use the Makefile to start up the mock S3 service
- Running make init runs docker-compose build and docker-compose up -d, which spins up a local S3 service (MinIO)
- docker-compose runs in detached mode (-d), so it will not output anything to the terminal
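Because the containers start in detached mode, it can help to confirm that the mock S3 endpoint is accepting connections before running the pipeline. The following is a minimal sketch using only the standard library. `wait_for_port` is a hypothetical helper, not part of this repo, and the host and port you pass in are assumptions (9000 is MinIO's default API port, but the compose file may map a different one).

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Nothing is listening yet: give up at the deadline, else retry.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For example, `wait_for_port("localhost", 9000)` could be called after make init. A True result only shows that the port is open, not that any buckets exist.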
Once the MinIO service is running, you can run the training.py file to start a model build process
- This will output a model for a given target column, with a model name composed of some of the hyperparameters
- An example of running this file is:
  - python3 training.py --train-filepath ./model_build_data/change_data/rdsap_full/train_validation_data.parquet --test-filepath ./model_build_data/change_data/rdsap_full/test_data.parquet
- Outputs of the pipeline are:
  - A model directory bucket
  - A target variable prefix (e.g. RDSAP_CHANGE or HEAT_DEMAND_CHANGE)
  - A model type prefix (e.g. autogluon, tensorflow)
  - A model name prefix (e.g. rdsap_change_medium_quality_60_TIMESTAMP)
    - This model name is made up of the target variable, quality, time spent training and timestamp
    - Within this prefix, there are three folders:
      - model: the model path that can be loaded in the codebase
      - deployment: the optimised model that can be deployed (may or may not be needed)
      - metrics: the metrics generated from the model (may or may not be needed, as these can be held in the registry)
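The naming and prefix layout described above can be sketched as follows. This is an illustration under stated assumptions, not the code in training.py: both function names are hypothetical, and the timestamp format and the "/" separator are guesses, since the README only shows TIMESTAMP as a placeholder.

```python
from datetime import datetime, timezone
from typing import Dict, Optional


def build_model_name(target: str, quality: str, time_limit_s: int,
                     timestamp: Optional[datetime] = None) -> str:
    """Compose a model name from target, quality, training time and timestamp."""
    ts = timestamp or datetime.now(timezone.utc)
    # Assumption: the real timestamp format is not shown in the README.
    return f"{target.lower()}_{quality}_{time_limit_s}_{ts.strftime('%Y%m%d%H%M%S')}"


def model_prefixes(target: str, model_type: str, model_name: str) -> Dict[str, str]:
    """Return the three folder prefixes stored under one model name."""
    base = f"{target}/{model_type}/{model_name}"
    return {folder: f"{base}/{folder}/" for folder in ("model", "deployment", "metrics")}
```

With these helpers, `build_model_name("RDSAP_CHANGE", "medium_quality", 60)` yields a name of the form rdsap_change_medium_quality_60_ followed by the timestamp, and `model_prefixes` lists where the model, deployment and metrics folders would sit inside the model directory bucket.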
Once the model build is finished, you can run the predictions.py file to generate predictions
- By default, the prediction pipeline selects the best model from the model registry, based on mean absolute error
  - This can be overridden by specifying a model_path, which loads an alternative model
- There are two ways of getting data into the pipeline:
  - Using the --data argument:
    - This is a JSON string, which can be passed as python3 predictions.py --data '{"TOTAL_FLOOR_AREA": 1}'
      - Note the single quotes around the double-quoted JSON: the shell then passes the string through unchanged, so it is ingested as valid JSON
  - Using the --data-path argument:
    - This can be a filepath (in future we might want to pull data from S3 or a database)
- An example of running the file is:
  - python3 predictions.py --data-path ../simulation_system/model_build_data/change_data/rdsap_full/test_data.parquet
- Outputs of the pipeline are:
  - A prediction bucket
  - A target variable prefix (e.g. RDSAP_CHANGE or HEAT_DEMAND_CHANGE)
  - A UPRN prefix (e.g. 0123456789)
  - A prediction.json
  - A metadata.json
    - This is all the metadata from the model (can change this if needed)
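The two input routes and the output layout above can be sketched as follows. This is a hedged illustration rather than the actual predictions.py: `parse_args` and `prediction_keys` are hypothetical names, making the two arguments mutually exclusive is an assumption, and the "/" separated key layout is inferred from the list of outputs.

```python
import argparse
import json
from pathlib import Path
from typing import Dict, Optional, Sequence


def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
    """Accept either an inline JSON record (--data) or a file path (--data-path)."""
    parser = argparse.ArgumentParser(description="Prediction input sketch")
    # Assumption: exactly one input source is given per run.
    source = parser.add_mutually_exclusive_group(required=True)
    # json.loads turns the quoted JSON string into a dict; invalid JSON is rejected.
    source.add_argument("--data", type=json.loads, help="Record as a JSON string")
    source.add_argument("--data-path", type=Path, help="Path to a data file")
    return parser.parse_args(argv)


def prediction_keys(target: str, uprn: str) -> Dict[str, str]:
    """Return the object keys written under the target and UPRN prefixes."""
    prefix = f"{target}/{uprn}"
    return {
        "prediction": f"{prefix}/prediction.json",
        "metadata": f"{prefix}/metadata.json",
    }
```

Passing the JSON inside single quotes on the command line keeps the inner double quotes intact, which is what `json.loads` needs in order to parse the record.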
NOTE: If you wish to change any settings, these are currently all in the Settings.py file
- These will be separated out eventually, but for now it is useful to keep everything that we might want to respecify in one place
  - For example, the hyperparameters for models are in here, but will move into a separate configuration file

Data Workflow (DVC):

We can store artifacts (data/models) in S3, and add versioning to these artifacts by leveraging DVC (Not Just Data Version Control)
How this works:
- (Initial run): Use the dvc init command to turn a git repo into a dvc repo
  - This will add a DVC config file and a .gitignore file
    - Use git to commit these files
- For any data/artifact file that is generated, we use the dvc add <FILE/Directory> command to track the file
  - This creates a <FILE/directory>.dvc file that has the metadata of the corresponding artifact
- Next, add a remote location where the files (or cached versions of the data) should be stored. This is done using dvc remote add REMOTE-NAME s3://REMOTE-LOCATION
  - The two remotes set up for this repo are build-data-remote and etl-data-remote, both pointing to different folders in s3
- You can use dvc push to move the data files to the remote storage locations
- Now, if you ever delete your data, you can run dvc pull and this will download all data back to the relative folders that have .dvc files
In this repo, this has already been set up, so all you need to do is run dvc pull to get all the latest data
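For reference, the first-time setup sequence described above can be sketched as a small script. This is an illustration only: `dvc_commands` and `run_all` are hypothetical helpers, the artifact path is an example, s3://REMOTE-LOCATION is the same placeholder used above, and passing the remote with -r on dvc push is an assumption (no default remote is configured in this sketch).

```python
import subprocess
from typing import List


def dvc_commands(artifact: str, remote_name: str, remote_url: str) -> List[List[str]]:
    """Return the DVC commands for a first-time setup, in order."""
    return [
        ["dvc", "init"],                                    # turn the git repo into a DVC repo
        ["dvc", "add", artifact],                           # track the artifact, creates <artifact>.dvc
        ["dvc", "remote", "add", remote_name, remote_url],  # register the storage location
        ["dvc", "push", "-r", remote_name],                 # upload tracked files to that remote
    ]


def run_all(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command, and execute it unless dry_run is set."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
```

Calling `run_all(dvc_commands("model_build_data", "build-data-remote", "s3://REMOTE-LOCATION"))` only prints the commands. The files that dvc init and dvc add generate still need to be committed with git between steps, and in this repo none of this is needed day to day since dvc pull is enough.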

TODO:

Structure/ MLOps:
- Add configuration files (dev, staging, prod), including hyperparameters
- Add precommit hooks (linters, branch names, etc)
- Sphinx documentation
- Sort out local mock up services
- Sort out Model Registry
- Sort out Data version control
- pre-commit hooks:
  - The types of hooks that we want (safety, bandit, iso8 etc)
  - The customisations we require
- Add sphinx documentation
Data Science:
- Implement a metrics class, to hold all metrics
- Rebuild metrics script (Could be a one off but good to have)
- Determine metrics
- Implement and test custom model (Tensorflow Decision Trees etc)
Orchestration:
- Lambda handler for the pipeline
