change layout

Michael Duong 2023-09-18 19:51:16 +01:00
parent 3223b002c9
commit 40bb9d6f0a
50 changed files with 62 additions and 48 deletions


@ -20,6 +20,6 @@ repos:
hooks:
- id: dvc-push-experiment
name: DVC - Push to experiment to remote location (experiments)
-entry: bash -c 'cd modules/ml-pipeline/src/pipeline/src && dvc push -r experiments || echo "Up to date!"'
+entry: bash -c 'cd modules/ml-pipeline/src/pipeline && dvc push -r experiments || echo "Up to date!"'
language: system
verbose: true


@ -11,7 +11,7 @@ dev-pyenv:
pyenv install ${PYTHON_VERSION} || echo "Python version already installed"
pyenv global ${PYTHON_VERSION}
python3 -m venv .dev_env_pipeline
-. .dev_env_pipeline/bin/activate && pip install --upgrade pip && pip install -r src/pipeline/src/requirements/training/requirements-dev.txt && pip install -r src/pipeline/src/requirements/version_control/requirements.txt && pre-commit install
+. .dev_env_pipeline/bin/activate && pip install --upgrade pip && pip install -r src/pipeline/requirements/training/requirements-dev.txt && pip install -r src/pipeline/requirements/version_control/requirements.txt && pre-commit install
echo "TO ACTIVATE ENVIRONMENT, USE THE FOLLOWING COMMAND"
echo "source .dev_env_pipeline/bin/activate"


@ -1,16 +1,30 @@
# ML-pipeline
This is a dummy ML-pipeline, consisting of:
This is a generic ML-pipeline, consisting of:
- dvc tracking for version control (data and models)
- gto for model registry
- docs, created via sphinx (in pre-commit hooks)
- tests for unit, integration and end to end testing
Within `src` folder, the structure is as follows:
- multiple pipelines can be defined
- i.e. for a product, we might require multiple pipelines to deliver a result
- i.e. multiple models
- these models can be all tracked within the same gto model registry
- `pipeline` folder, which contains all the codebase for the generic pipeline
- The pipeline can track multiple models through dvc and gto model registry
- Deployment files:
- Prediction.Dockerfile - code to create the prediction deployment image
- Training.Dockerfile - code to create the training image (i.e. for remote training on EC2/Fargate)
- Docker development environment:
- If you wish to develop within a Docker container.
# How to develop using this pipeline:
Run `make init`, which will:
- Download pyenv (Python version management)
- Download Python 3.X.X as defined in the `make` file - currently 3.10.12
- Create a virtual environment with this version of python
- Install packages in the training and version control directories in the pipeline folder (dev version if applicable)
- Install pre-commit to enable pre-commit hooks
To use the environment, run `source .dev_env_pipeline/bin/activate`.
To enable the virtual environment created above in VS Code:
- Open settings


@ -0,0 +1,3 @@
# The generic reproducible ML-pipeline
Pipeline required to build a model to produce an output


@ -1,3 +1,39 @@
# Pipeline 1
# Training
Pipeline required to build a model to produce an output
This folder contains the code base for training experimentation.
To understand the pipeline, run `dvc dag`
There are 4 main steps:
- Preparing data
- This is loading data (locally or from s3)
- Splitting the data into train and validation
- Creating additional features (if needed)
- **Data is cached**
- This will be down to the dvc remote location
- Build model
- For the prepared data, we build a model using our configurations
- Model is saved (locally or s3)
- **Model and fit metrics are cached**
- This will be down to the dvc remote location
- Generate Predictions
- For the given model, we generate predictions on validation data/test data
- **Predictions are cached**
- This will be down to the dvc remote location
- Generate Metrics
- For the given model, we generate metrics on validation data/test data
- **Metrics are cached**
- This will be down to the dvc remote location
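As a rough sketch, the four steps above could be registered as DVC stages with `dvc stage add`. The stage names, script names, and paths here are hypothetical and are not taken from this repository's `dvc.yaml`. The `run` helper only prints each command unless `DRY_RUN=0` is set.

```shell
#!/usr/bin/env bash
# Sketch only: stage names, scripts, and paths are hypothetical placeholders.
set -eu

# Print each command by default; set DRY_RUN=0 to actually execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

define_stages() {
  # 1. Prepare data: load, split into train/validation, add features.
  run dvc stage add -n prepare -d data/raw -o data/prepared python prepare.py
  # 2. Build model: the model and its fit metrics are tracked outputs.
  run dvc stage add -n build_model -d data/prepared \
    -o models/model.pkl -m metrics/fit.json python train.py
  # 3. Generate predictions from the trained model.
  run dvc stage add -n predict -d models/model.pkl -d data/prepared \
    -o predictions/validation.csv python predict.py
  # 4. Generate metrics from the predictions.
  run dvc stage add -n evaluate -d predictions/validation.csv \
    -m metrics/eval.json python evaluate.py
}

define_stages
```

Because each stage declares its dependencies (`-d`) and outputs (`-o`, `-m`), DVC can cache every output and skip stages whose inputs have not changed.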
Workflow:
- Use `dvc metrics show` to view current metrics score
- Adjust parameters/codebase
- When happy with changes, use `dvc exp run` to trigger an experiment
- Due to the cache, only the needed stages are re-run
- Use `dvc metrics diff` to check the change in metrics
- Use `dvc exp show` to view all experiments
- NOTE: the last experiment will always be applied to the workspace!
- After running experiments, you can apply the best model to the workspace using `dvc exp apply [EXPERIMENT_NAME]`
- This experiment will have the corresponding .dvc files for the hashed model and data
- Use version control as normal
- git add, git commit etc
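The workflow above can be sketched as a small shell script. The experiment name `my-experiment` and the choice to stage `dvc.lock` are assumptions for illustration; which files you commit depends on the repository. The `run` helper prints each command unless `DRY_RUN=0` is set.

```shell
#!/usr/bin/env bash
# Sketch only: the experiment name and staged file are hypothetical.
set -eu

# Print each command by default; set DRY_RUN=0 to actually execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# One iteration of the experiment loop.
experiment_cycle() {
  run dvc metrics show   # view the current metrics
  # ... adjust parameters or code here ...
  run dvc exp run        # trigger an experiment; cached stages are skipped
  run dvc metrics diff   # check the change in metrics
  run dvc exp show       # list all experiments
}

# Apply the chosen experiment to the workspace and commit as normal.
promote_experiment() {
  local name="$1"
  run dvc exp apply "$name"
  run git add dvc.lock   # assumption: stage whichever tracked files changed
  run git commit -m "Apply experiment $name"
}

experiment_cycle
promote_experiment "my-experiment"
```

Run it once with the default dry-run setting to review the commands before executing them with `DRY_RUN=0`.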

Binary file not shown.


@ -1 +0,0 @@
{}


@ -1,35 +0,0 @@
# Training
This folder contains the code base for training experimentation.
To understand the pipeline, run `dvc dag`
There are 3 main steps:
- Preparing data
- This is loading data (locally or from s3)
- Splitting the data into train and validation
- Creating additional features (if needed)
- **Data is cached**
- This will be down to the dvc remote location
- Build model
- For the prepared data, we build a model using our configurations
- Model is saved (locally or s3)
- **Model is cached**
- This will be down to the dvc remote location
- Generate Metrics
- For the given model, we generate metrics on validation data/test data
- **Metrics are cached**
- This will be down to the dvc remote location
Workflow:
- Use `dvc metrics show` to view current metrics score
- Adjust parameters/codebase
- When happy with changes, use `dvc exp run` to trigger an experiment
- Due to the cache, only the needed stages are re-run
- Use `dvc metrics diff` to check the change in metrics
- Use `dvc exp show` to view all experiments
- NOTE: the last experiment will always be applied to the workspace!
- After running experiments, you can apply the best model to the workspace using `dvc exp apply [EXPERIMENT_NAME]`
- This experiment will have the corresponding .dvc files for the hashed model and data
- Use version control as normal
- git add, git commit etc


@ -1,3 +0,0 @@
# Pipeline2
PLACEHOLDER PIPELINE IF NEEDED