changed readme

Michael Duong 2023-09-04 17:48:25 +01:00
parent 98ff1b6fc5
commit 91605127f2
3 changed files with 20 additions and 1 deletions


@@ -3,7 +3,7 @@
Starter Readme:
Steps for pipeline:
- (WIP) Set up the training development environment
- Set up the training development environment
- Change directory to this folder (simulation_system)
- Run the following command `make env PYTHON_VERSION=3.10.12`
- This will install the specified Python version using `pyenv` and select it as the global Python version
@@ -56,6 +56,21 @@ Steps for pipeline:
- I.e. the hyperparameters for models are in here but will move into a separate configuration file
Data Workflow (DVC):
- We can store artifacts (data/models) in S3 and add versioning to these artifacts by leveraging DVC (Not Just Data Version Control)
- How does this work:
- (Initial run): Use the `dvc init` command to turn a git repo into a DVC repo
- This will add a DVC config file and a `.gitignore` file
- Use git to commit these files
- For any data/artifact file that is generated, we use the `dvc add <FILE/DIRECTORY>` command to track the file
- This creates a `<FILE/DIRECTORY>.dvc` file that holds the metadata of the corresponding artifact
- Now we need to add a remote location where the files (or a cached version of the data) should be stored. This is done using the `dvc remote add REMOTE-NAME s3://REMOTE-LOCATION` command
- The two remotes set up for this repo are `build-data-remote` and `etl-data-remote`, each pointing to a different folder in S3
- You can use `dvc push` to move the data files to the remote storage locations
- Now if you ever delete your data, you can run `dvc pull` and this will download all data back to the relative folders that have `.dvc` files
- In this repo, this has already been set up, so all you need to do is run `dvc pull` to get all the latest data
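The first-time setup steps above can be sketched as a small Python helper that lists the commands in order and, optionally, runs them. This is only an illustration under stated assumptions: the artifact path `data/train.csv` and the bucket `s3://example-bucket/build-data` are hypothetical placeholders, the helper names `dvc_workflow` and `run` are not part of this repo, and `-r` is used on `dvc push` to name the remote explicitly since no default remote is assumed.

```python
import shlex
import subprocess
from typing import List


def dvc_workflow(artifact: str, remote_name: str, remote_url: str) -> List[List[str]]:
    """Return the first-time DVC commands from the steps above, in order."""
    return [
        ["dvc", "init"],                                     # turn the git repo into a DVC repo
        ["git", "commit", "-m", "Initialise DVC"],           # commit the files dvc init created
        ["dvc", "add", artifact],                            # writes <artifact>.dvc metadata
        ["dvc", "remote", "add", remote_name, remote_url],   # register the S3 location
        ["dvc", "push", "-r", remote_name],                  # upload tracked data to that remote
    ]


def run(commands: List[List[str]], dry_run: bool = True) -> List[str]:
    """Print each command; execute it only when dry_run is False."""
    rendered = [shlex.join(cmd) for cmd in commands]
    for cmd, line in zip(commands, rendered):
        print(line)
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failing command
    return rendered


if __name__ == "__main__":
    # Hypothetical artifact path and bucket; only the remote name comes from this repo.
    run(dvc_workflow("data/train.csv", "build-data-remote",
                     "s3://example-bucket/build-data"), dry_run=True)
```

With `dry_run=True` nothing is executed, so the sequence can be reviewed before touching the repo or S3. Once the remotes exist, as they do here, day-to-day use reduces to `dvc pull`.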
# TODO:
- Structure/MLOps:
- Add configuration files (dev, staging, prod), including hyperparameters


@@ -3,3 +3,5 @@ pandas==1.5.3
seaborn==0.12.2
s3fs==2023.6.0
pre-commit==3.3.3
dvc
dvc[s3]


@@ -2,3 +2,5 @@ autogluon==0.8.2
pandas==1.5.3
seaborn==0.12.2
s3fs==2023.6.0
dvc
dvc[s3]