Model/model_data/simulation_system/README.md
2023-09-04 17:48:25 +01:00


Simulation System

Starter Readme: Steps for pipeline:

  • Set up the training development environment

    • Change directory to this folder (simulation_system)
    • Run the following command: make env PYTHON_VERSION=3.10.12
    • This will install the specified Python version using pyenv and select it as the global Python version
    • It will install all training packages as specified in the training-dev.txt requirements file, including the pre-commit hooks
    • Run source .training_env/bin/activate to use this environment
  • (WIP) Use the Makefile to start up a mock S3 service

    • Running make init runs docker-compose build and docker-compose up -d, which spins up an S3 service
    • Docker Compose runs in detached mode (-d), so it will not output anything to the terminal
  • Once the MinIO service is running, you can run the training.py file to start a model build process

    • This will output a model for a given target column, with a model name composed of some of the hyperparameters
    • An example of running this file is:
      • python3 training.py --train-filepath ./model_build_data/change_data/rdsap_full/train_validation_data.parquet --test-filepath ./model_build_data/change_data/rdsap_full/test_data.parquet
    • Outputs of the pipeline are:
      • A model directory bucket
      • A target variable prefix (e.g. RDSAP_CHANGE or HEAT_DEMAND_CHANGE)
      • A model type prefix (e.g. autogluon, tensorflow)
      • A model name prefix (e.g. rdsap_change_medium_quality_60_TIMESTAMP)
        • This model name is made up of target variable, quality, time spent training and timestamp
        • Within this prefix, there are three folders:
          • model
            • The model path that can be loaded in the codebase
          • deployment
            • The optimised model that can be deployed (we may or may not need this)
          • metrics
            • The metrics generated from the model (we may or may not need this, as they can be held in the registry)
  • Once the model build is finished, you can run the predictions.py file to generate predictions

    • By default, the prediction pipeline will select the best model based on mean absolute error from the model registry
      • This can be overridden by specifying a model_path, which will load an alternative model
    • There are two ways of getting data into the pipeline:
      • Using the --data argument:
        • This is a JSON string which can be passed as python3 predictions.py --data '{"TOTAL_FLOOR_AREA": 1}'
          • Note the single and double quotation marks, as this affects the ingestion
      • Using the --data-path argument:
        • This can be a filepath (we may later want to pull data from S3 or a database instead)
    • An example of running the file is:
      • python3 predictions.py --data-path ../simulation_system/model_build_data/change_data/rdsap_full/test_data.parquet
    • Outputs of the pipeline are:
      • A prediction bucket
      • A target variable prefix (e.g. RDSAP_CHANGE or HEAT_DEMAND_CHANGE)
      • A UPRN prefix (e.g. 0123456789)
      • A prediction.json
      • A metadata.json
        • This is all the metadata from the model (can change this if needed)
  • NOTE: If you wish to change any settings, these are currently all in the Settings.py file

    • These will be separated out eventually, but for now it works as a single place to keep track of anything we might want to respecify.
      • E.g. the hyperparameters for models are in here but will move into a separate configuration file
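
The note above about single and double quotation marks in the --data argument is easier to see in code. The following is a minimal sketch, not the actual prediction script: only the flag names --data and --data-path come from this README, while the function names, the mutually exclusive group and the error handling are assumptions made for illustration.

```python
import argparse
import json


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser: only the flag names come from this README.
    parser = argparse.ArgumentParser(description="Prediction input sketch")
    # Assumption: exactly one of the two input routes must be supplied.
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("--data", help="JSON object mapping feature name to value")
    group.add_argument("--data-path", help="Path to a file of input rows")
    return parser


def load_records(args: argparse.Namespace) -> list[dict]:
    """Return the input rows as a list of dicts (illustrative only)."""
    if args.data is not None:
        # JSON requires double-quoted keys and strings.
        record = json.loads(args.data)
        if not isinstance(record, dict):
            raise ValueError("--data must be a JSON object")
        return [record]
    # File loading is deliberately left out of this sketch.
    raise NotImplementedError(f"Loading {args.data_path} is not sketched here")


if __name__ == "__main__":
    args = build_parser().parse_args(["--data", '{"TOTAL_FLOOR_AREA": 1}'])
    print(load_records(args))
```

On the command line, the outer single quotes stop the shell from interpreting the inner double quotes, so the JSON arrives intact. Swapping them, as in "{'TOTAL_FLOOR_AREA': 1}", hands json.loads single-quoted keys, which it rejects with a JSONDecodeError.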

Data Workflow (DVC):

  • We can store artifacts (data/models) in S3, and add versioning to these artifacts by leveraging DVC (Data Version Control, which covers more than just data)

  • How does this work:

    • (Initial run): Use the dvc init command to turn a Git repo into a DVC repo
      • This will add a DVC config file and a .gitignore file
        • Use Git to commit these files
    • For any data/artifact file that is generated, we use the dvc add <FILE/Directory> command to track the file
      • This creates a <FILE/directory>.dvc file that has the metadata of the corresponding artifact
    • Now we need to add a remote location where the files (or a cached version of the data) should be stored. This is done using dvc remote add REMOTE-NAME s3://REMOTE-LOCATION
      • The two remotes set up for this repo are build-data-remote and etl-data-remote, each pointing to a different folder in S3
    • You can use dvc push to move the data files to the remote storage locations
    • Now if you ever delete your data, you can run dvc pull and this will download all data back into the folders that have .dvc files
  • In this repo, this has already been set up, so all you need to do is run dvc pull to get the latest data
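
The DVC steps above can be collected into a small helper that lists the commands in order. This is a rough sketch under stated assumptions: the function names, the artifact path and the bucket URL are hypothetical, and passing -r to push and pull (to pick a named remote rather than a default one) is an addition not described in this README. It defaults to a dry run that only prints the commands.

```python
import shlex
import subprocess


def dvc_workflow(artifact: str, remote_name: str, remote_url: str) -> list[list[str]]:
    """Return the DVC commands from the steps above, in order."""
    return [
        ["dvc", "init"],                                    # one-off: turn the Git repo into a DVC repo
        ["dvc", "add", artifact],                           # creates <artifact>.dvc holding the metadata
        ["dvc", "remote", "add", remote_name, remote_url],  # register where the data is stored
        ["dvc", "push", "-r", remote_name],                 # upload tracked data to that remote
        ["dvc", "pull", "-r", remote_name],                 # later: download the data again
    ]


def run(commands: list[list[str]], dry_run: bool = True) -> None:
    """Print each command and, unless dry_run is set, execute it."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical artifact path and bucket URL, for illustration only.
    run(dvc_workflow("model_build_data", "build-data-remote", "s3://example-bucket/build-data"))
```

Since this repo is already initialised with its remotes configured, only the final pull step is needed day to day; the earlier commands are shown to make the sequence explicit.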

TODO:

  • Structure/ MLOps:
    • Add configuration files (dev, staging, prod), including hyperparameters
    • Add precommit hooks (linters, branch names, etc)
    • Sphinx documentation
    • Sort out local mock up services
    • Sort out Model Registry
    • Sort out Data version control
    • pre-commit hooks:
      • The types of hooks that we want (safety, bandit, iso8 etc)
      • The customisations we require
    • Add sphinx documentation
  • Data Science:
    • Implement a metrics class, to hold all metrics
    • Rebuild metrics script (Could be a one off but good to have)
    • Determine metrics
    • Implement and test a custom model (TensorFlow Decision Trees etc.)
  • Orchestration:
    • Lambda handler for the pipeline