mirror of
https://github.com/Hestia-Homes/ML.git
synced 2026-06-08 11:17:25 +00:00
change layout
parent 3223b002c9
commit 40bb9d6f0a
50 changed files with 62 additions and 48 deletions
@@ -20,6 +20,6 @@ repos:
 hooks:
 - id: dvc-push-experiment
 name: DVC - Push to experiment to remote location (experiments)
-entry: bash -c 'cd modules/ml-pipeline/src/pipeline/src && dvc push -r experiments || echo "Up to date!"'
+entry: bash -c 'cd modules/ml-pipeline/src/pipeline && dvc push -r experiments || echo "Up to date!"'
 language: system
 verbose: true
@@ -11,7 +11,7 @@ dev-pyenv:
 pyenv install ${PYTHON_VERSION} || echo "Python version already installed"
 pyenv global ${PYTHON_VERSION}
 python3 -m venv .dev_env_pipeline
-. .dev_env_pipeline/bin/activate && pip install --upgrade pip && pip install -r src/pipeline/src/requirements/training/requirements-dev.txt && pip install -r src/pipeline/src/requirements/version_control/requirements.txt && pre-commit install
+. .dev_env_pipeline/bin/activate && pip install --upgrade pip && pip install -r src/pipeline/requirements/training/requirements-dev.txt && pip install -r src/pipeline/requirements/version_control/requirements.txt && pre-commit install
 echo "TO ACTIVATE ENVIRONMENT, USE THE FOLLOWING COMMAND"
 echo "source .dev_env_pipeline/bin/activate"
 
@@ -1,16 +1,30 @@
 # ML-pipeline
 
-This is a dummy ML-pipeline, consisting of:
+This is a generic ML-pipeline, consisting of:
 - dvc tracking for version control (data and models)
 - gto for model registry
 - docs, created via sphinx (in pre-commit hooks)
 - tests for unit, integration and end to end testing
 
 Within `src` folder, the structure is as follows:
-- multiple pipelines can be defined
-- i.e. for a product, we might require multiple pipelines to deliver a result
-- i.e. multiple models
-- these models can be all tracked within the same gto model registry
+- `pipeline` folder, which contains all the codebase for the generic pipeline
+- The pipeline can track multiple models through dvc and gto model registry
+- Deployment files:
+- Prediction.Dockerfile - code to create the prediction deployment image
+- Training.Dockerfile - code to create the training image (i.e. for remote training on EC2/Fargate)
+- Docker development environment:
+- If you wish to develop within a Docker container.
+
+# How to develop using this pipeline:
+
+Run `make init`, which will:
+- Download pyenv (Python version management)
+- Download Python 3.X.X as defined in the `make` file - currently 3.10.12
+- Create a virtual environment with this version of python
+- Install packages in the training and version control directories in the pipeline folder (dev version if applicable)
+- Install pre-commit to enable pre-commit hooks
+
+To use the environment, run `source .dev_env_pipeline/bin/activate`.
 
 To enable the virtual environment created in vscode:
 - Open settings
3
modules/ml-pipeline/src/README.md
Normal file
@@ -0,0 +1,3 @@
+# The generic reproducible ML-pipeline
+
+Pipeline required to build a model to produce an output
@@ -1,3 +1,39 @@
-# Pipeline 1
+# Training
 
-Pipeline required to build a model to produce an output
+This folder contains the code base for training experimentation.
+
+To understand the pipeline, run `dvc dag`
+
+There are 4 main steps:
+- Preparing data
+- This is loading data (locally or from s3)
+- Splitting the data into train and validation
+- Creating additional features (if needed)
+- **Data is cached**
+- This will be down to the dvc remote location
+- Build model
+- For the prepared data, we build a model using our configurations
+- Model is saved (locally or s3)
+- **Model and fit metrics are cached**
+- This will be down to the dvc remote location
+- Generate Predictions
+- For the given model, we generate predictions on validation/test data
+- **Predictions are cached**
+- This will be down to the dvc remote location
+- Generate Metrics
+- For the given model, we generate metrics on validation data/test data
+- **Metrics are cached**
+- This will be down to the dvc remote location
+
+Workflow:
+- Use `dvc metrics show` to view current metrics score
+- Adjust parameters/codebase
+- When happy with changes, use `dvc exp run` to trigger an experiment
+- Due to cache, only the needed stages are re-run
+- Use `dvc metrics diff` to check the change in metrics
+- Use `dvc exp show` to view all experiments
+- NOTE: the last experiment will always be applied to the workspace!
+- After running experiments, you can apply the best model to the workspace using `dvc exp apply [EXPERIMENT_NAME]`
+- This experiment will have the corresponding .dvc files for the hashed model and data
+- Use version control as normal
+- git add, git commit etc
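
The workflow listed in this README can be sketched as two small shell functions. This is only an illustration under stated assumptions: `run`, `experiment_cycle`, `promote_experiment` and the `DRY_RUN` variable are hypothetical names that do not come from the repository, and staging everything with `git add --all` is one possible reading of "use version control as normal". The `dvc` subcommands are the ones the README names.

```shell
#!/usr/bin/env bash
# Sketch only: `run`, `experiment_cycle` and `promote_experiment` are
# hypothetical helper names, not part of this repository or of DVC.

# Print the command when DRY_RUN=1; otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# One experiment iteration: inspect metrics, run, compare, list.
experiment_cycle() {
  run dvc metrics show || return 1   # current metrics
  run dvc exp run      || return 1   # cached stages are not re-run
  run dvc metrics diff || return 1   # change in metrics
  run dvc exp show     || return 1   # all experiments so far
}

# Apply a chosen experiment to the workspace, then commit the result.
promote_experiment() {
  local name="$1"
  run dvc exp apply "$name" || return 1
  run git add --all         || return 1
  run git commit -m "Apply experiment ${name}"
}
```

After sourcing the file, `DRY_RUN=1 experiment_cycle` only prints the commands, which is a cheap way to review the sequence before running it for real from the pipeline directory.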
BIN
modules/ml-pipeline/src/pipeline/src/.DS_Store
vendored
Binary file not shown.
@@ -1 +0,0 @@
-{}
@@ -1,35 +0,0 @@
-# Training
-
-This folder contains the code base for training experimentation.
-
-To understand the pipeline, run `dvc dag`
-
-There are 3 main steps:
-- Preparing data
-- This is loading data (locally or from s3)
-- Splitting the data into train and validation
-- Creating additional features (if needed)
-- **Data is cached**
-- This will be down to the dvc remote location
-- Build model
-- For the prepared data, we build a model using our configurations
-- Model is saved (locally or s3)
-- **Model is cached**
-- This will be down to the dvc remote location
-- Generate Metrics
-- For the given model, we generate metrics on validation data/test data
-- **Metrics are cached**
-- This will be down to the dvc remote location
-
-Workflow:
-- Use `dvc metrics show` to view current metrics score
-- Adjust parameters/codebase
-- When happy with changes, use `dvc exp run` to trigger an experiment
-- Due to cache, only the needed stages are re-run
-- Use `dvc metrics diff` to check the change in metrics
-- Use `dvc exp show` to view all experiments
-- NOTE: the last experiment will always be applied to the workspace!
-- After running experiments, you can apply the best model to the workspace using `dvc exp apply [EXPERIMENT_NAME]`
-- This experiment will have the corresponding .dvc files for the hashed model and data
-- Use version control as normal
-- git add, git commit etc
@@ -1,3 +0,0 @@
-# Pipeline2
-
-PLACEHOLDER PIPELINE IF NEEDED