ML-pipeline

This is a generic ML-pipeline, consisting of:

  • DVC tracking for version control of data and models
  • GTO for the model registry
  • docs, built with Sphinx (via pre-commit hooks)
  • tests for unit, integration and end-to-end testing

Within the src folder, the structure is as follows:

  • pipeline folder, which contains all the code for the generic pipeline
    • The pipeline can track multiple models through DVC and the GTO model registry
  • Deployment files:
    • Prediction.Dockerfile - build instructions for the prediction deployment image
    • Training.Dockerfile - build instructions for the training image (e.g. for remote training on EC2/Fargate)
  • Docker development environment:
    • Available if you wish to develop inside a Docker container.
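
As a rough illustration of how the deployment images listed above might be built, here is a minimal sketch. It assumes the command is run from the repository root and that src is a suitable build context. The image tag prefix ml-pipeline- and the DRY_RUN switch are invented for this example and are not part of the repository.

```shell
#!/bin/sh
# Sketch only: build context (src) and image tags are assumptions.
# DRY_RUN is a hypothetical switch: 1 = print the command, 0 = run it.
DRY_RUN="${DRY_RUN:-1}"

build_image() {
    name="$1"   # Dockerfile prefix, e.g. Prediction
    # Docker image tags must be lowercase
    tag="ml-pipeline-$(echo "$name" | tr '[:upper:]' '[:lower:]')"
    # Reuse the positional parameters to hold the full command
    set -- docker build -f "src/$name.Dockerfile" -t "$tag" src
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

build_image Prediction
```

The training image would presumably follow the same pattern with its own Dockerfile prefix. Set DRY_RUN=0 once the printed command looks right for your checkout.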

How to develop using this pipeline:

Run make init, which will:

  • Download pyenv (Python version management)
  • Download the Python version defined in the Makefile (currently 3.10.12)
  • Create a virtual environment with this version of Python
  • Install packages in the training and version control directories in the pipeline folder (dev version if applicable)
  • Install pre-commit to enable pre-commit hooks

To use the environment, run source .dev_env_pipeline/bin/activate.
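
To confirm that activation worked, a small check like the following may help. This is a sketch: check_dev_env is a hypothetical helper, not something the Makefile provides, and it assumes the environment is a standard venv/virtualenv whose activate script exports VIRTUAL_ENV.

```shell
#!/bin/sh
# Typical session, run from the repository root:
#   make init                           # one-time setup
#   . .dev_env_pipeline/bin/activate    # same as "source" in bash/zsh
#
# check_dev_env is a hypothetical helper, not part of the repository.
# It relies on VIRTUAL_ENV, which a standard "activate" script exports.
check_dev_env() {
    expected_name="${1:-.dev_env_pipeline}"
    if [ -z "${VIRTUAL_ENV:-}" ]; then
        echo "no virtual environment is active"
        return 1
    fi
    if [ "$(basename "$VIRTUAL_ENV")" != "$expected_name" ]; then
        echo "unexpected environment: $VIRTUAL_ENV"
        return 1
    fi
    echo "ok: $VIRTUAL_ENV"
}
```

Calling check_dev_env after activation should print the path of the active environment. Pass a different folder name as the first argument if your environment is named differently.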

To enable the virtual environment in VS Code:

  • Open settings
  • Search 'env'
  • Under the extensions tab, there will be a Venv Path setting
  • Copy the path of the '.dev_env' folder into it
  • When you select a kernel, click through create environment and refresh
  • The virtual environment should now be listed

To use the Docker environment for coding in VS Code:

  • Open the "pipeline" folder
  • Open with remote container
  • Select the Dockerfile
  • Add the Git extension (for dvc)

When running experiments, everything is cached. The workflow is:

  • dvc repro to regenerate the current experiment
  • Change parameters if needed
  • Use dvc exp run
  • Cache the results by using dvc push -r REMOTE_NAME
  • Repeat as needed
  • When happy with results, use dvc exp apply EXPERIMENT_NAME
  • Repeat as needed
  • When happy with results, use dvc exp apply EXPERIMENT_NAME
  • Use dvc pull
  • Commit code
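
The steps above can be sketched as a single script, in the order listed. REMOTE_NAME and EXPERIMENT_NAME are placeholders to replace with your own values. The DRY_RUN switch, the run helper and the git commit message are assumptions added for this example. By default the script only prints the commands so they can be reviewed first.

```shell
#!/bin/sh
# Sketch of the experiment loop; placeholders must be replaced before real use.
REMOTE_NAME="${REMOTE_NAME:-REMOTE_NAME}"
EXPERIMENT_NAME="${EXPERIMENT_NAME:-EXPERIMENT_NAME}"
# DRY_RUN is a hypothetical switch: 1 = print commands, 0 = execute them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

experiment_workflow() {
    # Each step runs only if the previous one succeeded.
    run dvc repro &&                          # regenerate the current experiment
    # (change parameters here if needed, then:)
    run dvc exp run &&                        # run a new experiment
    run dvc push -r "$REMOTE_NAME" &&         # push cached results to the remote
    run dvc exp apply "$EXPERIMENT_NAME" &&   # bring the chosen experiment into the workspace
    run dvc pull &&
    run git commit -am "Apply experiment $EXPERIMENT_NAME"
}

experiment_workflow
```

In practice the dvc exp run and dvc push steps are repeated several times before dvc exp apply, so it may be more convenient to run those two by hand and use the script only as a checklist.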