Model/model_data/simulation_system
2023-08-19 17:04:57 +00:00
core Add requirements file 2023-08-18 12:46:13 +01:00
MLModel Add requirements file 2023-08-18 12:46:13 +01:00
model_build_data/change_data/rdsap_full Added split training data 2023-08-15 23:13:34 +00:00
preprocessed_data add a test data generation script 2023-08-12 19:11:22 +00:00
__init__.py Adding database connection to fastapi 2023-07-31 11:02:22 +01:00
docker-compose.yml Add readme and a better folder output structure. Starter code for mock up s3 2023-08-18 10:18:17 +01:00
energy_predictor.py Add readme and a better folder output structure. Starter code for mock up s3 2023-08-18 10:18:17 +01:00
generate_rdsap_change.py added mlmodel, prediction and training files 2023-08-17 16:07:22 +01:00
Makefile Add readme and a better folder output structure. Starter code for mock up s3 2023-08-18 10:18:17 +01:00
predictions.py Add requirements file 2023-08-18 12:46:13 +01:00
README.md cleaned up settings 2023-08-18 11:09:21 +01:00
requirements.txt add requirements for seaborn 2023-08-19 13:57:45 +00:00
test_data_generation.py added mlmodel, prediction and training files 2023-08-17 16:07:22 +01:00
training.py remove residual output at base folder 2023-08-19 17:04:57 +00:00

Simulation System

Starter README. Steps for the pipeline:

  • (WIP) Use the Makefile to start the mock S3 service

    • Running make init runs docker-compose build and docker-compose up -d, which spins up a mock S3 service (MinIO)
    • Docker Compose runs in detached mode (-d), so it will not output anything to the terminal
  • Once the MinIO service is running, you can run the training.py file to start a model build process

    • This will output a model for a given target column, with a model name composed of some of the hyperparameters
    • An example of running this file is:
      • python3 training.py --train-filepath ./model_build_data/change_data/rdsap_full/train_validation_data.parquet --test-filepath ./model_build_data/change_data/rdsap_full/test_data.parquet
    • Outputs of the pipeline are:
      • A model directory bucket
      • A target variable prefix (e.g. RDSAP_CHANGE or HEAT_DEMAND_CHANGE)
      • A model type prefix (e.g. autogluon, tensorflow)
      • A model name prefix (e.g. rdsap_change_medium_quality_60_TIMESTAMP)
        • This model name is made up of the target variable, quality, time spent training and a timestamp
        • Within this prefix, there are three folders:
          • model
            • The model path that can be loaded in the codebase
          • deployment
            • The optimised model that can be deployed (may or may not be needed)
          • metrics
            • The metrics generated from the model (may or may not be needed, as these can be held in the registry)
  • Once the model build is finished, you can run the predictions.py file to generate predictions

    • By default, the prediction pipeline selects the best model from the model registry, based on mean absolute error
      • This can be overridden by specifying a model_path, which loads an alternative model
    • There are two ways of getting data into the pipeline:
      • Using the --data argument:
        • This is a JSON string, passed as python3 predictions.py --data '{"TOTAL_FLOOR_AREA": 1}'
          • Note the quotation marks: the outer single quotes stop the shell from stripping the double quotes that JSON requires
      • Using the --data-path argument:
        • This is a filepath (in future we might want to pull data from S3 or a database)
    • An example of running the file is:
      • python3 predictions.py --data-path ../simulation_system/model_build_data/change_data/rdsap_full/test_data.parquet
    • Outputs of the pipeline are:
      • A prediction bucket
      • A target variable prefix (e.g. RDSAP_CHANGE or HEAT_DEMAND_CHANGE)
      • A UPRN prefix (e.g. 0123456789)
      • A prediction.json
      • A metadata.json
        • This contains all the metadata from the model (can be changed if needed)
  • NOTE: If you wish to change any settings, these are currently all in the Settings.py file

    • These will be split out eventually, but for now a single file keeps track of anything that we might want to respecify
      • E.g. the hyperparameters for the models live here for now, but will move into a separate configuration file
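
The output layouts described in the training and prediction steps above can be sketched as a few path helpers. This is only an illustration: the function names, the bucket names and the timestamp format are assumptions, not taken from the codebase, and the real pipeline may compose these differently.

```python
from pathlib import PurePosixPath

# Sub-folders listed in the README under each model name prefix.
MODEL_SUBFOLDERS = ("model", "deployment", "metrics")


def build_model_name(target, quality, time_limit, timestamp):
    """Compose a model name from target, quality, training time and timestamp.

    Hypothetical helper: the README only states which parts make up the name.
    """
    return f"{target.lower()}_{quality}_{time_limit}_{timestamp}"


def model_prefixes(bucket, target, model_type, model_name):
    """Return the three prefixes written by a model build (assumed layout)."""
    base = PurePosixPath(bucket) / target / model_type / model_name
    return {name: base / name for name in MODEL_SUBFOLDERS}


def prediction_keys(bucket, target, uprn):
    """Return the two object keys written by a prediction run (assumed layout)."""
    base = PurePosixPath(bucket) / target / uprn
    return {
        "prediction": base / "prediction.json",
        "metadata": base / "metadata.json",
    }


# Example values: the bucket names and timestamp format are placeholders.
name = build_model_name("RDSAP_CHANGE", "medium_quality", 60, "20230819T170457")
for folder, prefix in model_prefixes("models", "RDSAP_CHANGE", "autogluon", name).items():
    print(folder, prefix)
print(prediction_keys("predictions", "RDSAP_CHANGE", "0123456789")["prediction"])
```

Keeping the layout in one place like this would make it easier to change the prefixes later without touching the training and prediction scripts separately.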

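The prediction inputs and the default model choice described above could be handled roughly as follows. This is a minimal sketch under stated assumptions: the --model-path flag spelling, treating --data and --data-path as mutually exclusive, and the registry being a list of dicts with model_path and mean_absolute_error keys are all hypothetical and not confirmed by predictions.py.

```python
import argparse
import json


def parse_args(argv=None):
    """Parse the prediction inputs (sketch; the flag handling is assumed)."""
    parser = argparse.ArgumentParser(description="Prediction pipeline inputs")
    # Assumption: exactly one of the two data sources must be given.
    source = parser.add_mutually_exclusive_group(required=True)
    # json.loads turns the JSON string into a dict; invalid JSON raises
    # ValueError, which argparse reports as a usage error.
    source.add_argument("--data", type=json.loads, help="JSON string of features")
    source.add_argument("--data-path", help="Path to a data file")
    parser.add_argument("--model-path", default=None, help="Optional model override")
    return parser.parse_args(argv)


def select_best_model(registry, model_path=None):
    """Return the override if given, else the entry with the lowest MAE.

    `registry` is a hypothetical list of dicts standing in for the model registry.
    """
    if model_path is not None:
        return model_path
    best = min(registry, key=lambda entry: entry["mean_absolute_error"])
    return best["model_path"]


# Equivalent to: python3 predictions.py --data '{"TOTAL_FLOOR_AREA": 1}'
args = parse_args(["--data", '{"TOTAL_FLOOR_AREA": 1}'])

# Placeholder registry entries, for illustration only.
registry = [
    {"model_path": "models/a", "mean_absolute_error": 2.5},
    {"model_path": "models/b", "mean_absolute_error": 1.0},
]
print(args.data, select_best_model(registry, args.model_path))
```

Passing an explicit argument list to parse_args, as in the example, also makes the parsing easy to unit test without going through the shell.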
TODO:

  • Structure/MLOps:
    • Add configuration files (dev, staging, prod), including hyperparameters
    • Add pre-commit hooks (linters, branch names, etc.)
    • Sphinx documentation
    • Sort out local mock up services
    • Sort out Model Registry
    • Sort out Data version control
  • Data Science:
    • Implement a metrics class to hold all metrics
    • Rebuild metrics script (Could be a one off but good to have)
    • Determine metrics
    • Implement and test a custom model (TensorFlow Decision Trees etc.)
  • Orchestration:
    • Lambda handler for the pipeline