# ML-pipeline

This is a generic ML pipeline, consisting of:
- DVC tracking for version control (data and models)
- GTO for the model registry
- docs, built with Sphinx (in pre-commit hooks)
- tests for unit, integration and end-to-end testing

Within the `src` folder, the structure is as follows:
- `pipeline` folder, which contains all the codebase for the generic pipeline
    - The pipeline can track multiple models through DVC and the GTO model registry
- Deployment files:
    - `Prediction.Dockerfile` - builds the prediction deployment image
    - `Training.Dockerfile` - builds the training image (i.e. for remote training on EC2/Fargate)
- Docker development environment:
    - For use if you wish to develop inside a Docker container.
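
As a rough illustration of tracking a model through the GTO registry, the sketch below registers a version and assigns it to a stage using the standard GTO CLI. The helper names `run` and `register_and_assign`, the `DRY_RUN` convention, and any model, version, or stage names you pass in are assumptions for illustration; this repository may wrap these calls differently.

```shell
# Hypothetical helper: print the command instead of running it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical helper: register a model version, assign it to a stage,
# then list the registry state for that model.
# Usage: register_and_assign MODEL_NAME VERSION STAGE
register_and_assign() {
    model="$1"
    version="$2"
    stage="$3"

    run gto register "$model" --version "$version" || return 1
    run gto assign "$model" --version "$version" --stage "$stage" || return 1
    run gto show "$model"
}
```

GTO records registrations and stage assignments as Git tags, so they are only visible to others once the tags are pushed. Setting `DRY_RUN=1` before calling the function previews the commands without creating any tags.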

# How to develop using this pipeline:

Run `make init`, which will:
- Download pyenv (Python version management)
- Download the Python version defined in the `make` file (currently 3.10.12)
- Create a virtual environment with this version of Python
- Install packages in the training and version control directories in the pipeline folder (dev version if applicable)
- Install pre-commit to enable pre-commit hooks

To use the environment, run `source .dev_env_pipeline/bin/activate`.
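
The setup steps above can be sketched as one shell function that also confirms the expected interpreter is active. The function name `setup_dev_env` is hypothetical; the `make init` target, the `.dev_env_pipeline` path, and the version 3.10.12 are taken from this README, and the function is assumed to be called from the repository root.

```shell
# Hypothetical helper: create the environment, activate it, and check the
# Python version. Call from the repository root.
# Usage: setup_dev_env [EXPECTED_VERSION]
setup_dev_env() {
    expected_version="${1:-3.10.12}"   # default is the version quoted above

    make init || return 1

    # Activate the virtual environment created by `make init`
    . .dev_env_pipeline/bin/activate || return 1

    # `python --version` prints e.g. "Python 3.10.12"; keep the second field
    actual_version="$(python --version 2>&1 | awk '{print $2}')"
    if [ "$actual_version" != "$expected_version" ]; then
        echo "Expected Python $expected_version, found $actual_version" >&2
        return 1
    fi
    echo "Python $actual_version is active"
}
```

Because the function activates the environment in the current shell, it should be defined and called in an interactive session rather than run as a separate script.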

To enable the virtual environment in VS Code:
- Open settings
- Search for 'env'
- Under the extensions tab, there will be **Venv path**
- Copy the path of the '.dev_env' folder into there
- When you select a kernel, click through "create environment" and refresh
- The virtual environment should then be listed

To use the Docker environment for coding in VS Code:
- Open the "pipeline" folder
- Open with remote container
- Select the Dockerfile
- Add the Git extension (for dvc)

When running experiments, everything is cached. The workflow is:
- `dvc repro` to regenerate the current experiment
- Change parameters if needed
- Use `dvc exp run`
- Cache the results using `dvc push -r REMOTE_NAME`
- Repeat as needed
- When happy with the results, use `dvc exp apply EXPERIMENT_NAME`
- Use `dvc pull`
- Commit code
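
The workflow above can be sketched as two small shell functions, one for each experiment iteration and one for keeping a result. The helper names (`run`, `run_experiment`, `keep_experiment`), the `DRY_RUN` convention, passing the parameter change via `--set-param` rather than editing the params file by hand, and using `git commit -am` for the final commit are all assumptions for illustration.

```shell
# Hypothetical helper: print the command instead of running it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# One iteration: reproduce the current state, run an experiment with a
# parameter override, then push the cached results to the remote.
# Usage: run_experiment REMOTE_NAME PARAM=VALUE
run_experiment() {
    remote="$1"
    param_override="$2"

    run dvc repro || return 1
    run dvc exp run --set-param "$param_override" || return 1
    run dvc push -r "$remote" || return 1
}

# Keep a result: apply the chosen experiment to the workspace, pull the
# data, and commit the tracked changes.
# Usage: keep_experiment EXPERIMENT_NAME COMMIT_MESSAGE
keep_experiment() {
    experiment="$1"
    message="$2"

    run dvc exp apply "$experiment" || return 1
    run dvc pull || return 1
    run git commit -am "$message"
}
```

For example, `DRY_RUN=1 run_experiment REMOTE_NAME train.lr=0.01` prints the three commands without executing them, where `train.lr` stands in for whichever parameter your pipeline defines.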