diff --git a/model_data/simulation_system/README.md b/model_data/simulation_system/README.md
index e98767ea..ca7e90ee 100644
--- a/model_data/simulation_system/README.md
+++ b/model_data/simulation_system/README.md
@@ -3,7 +3,7 @@ Starter Readme:
 
 Steps for pipeline:
 
-- (WIP) Set up the training development environment
+- Set up the training development environment
     - Change directory to this folder (simulation_system)
     - Run the following command `make env PYTHON_VERSION=3.10.12`
         - This will install the specified python version using `pyenv` and select this version as the global python version
@@ -56,6 +56,21 @@ Steps for pipeline:
     - I.e. the hyperparameters for models are in here but will move into a separate configuration file
 
 
+Data Workflow (DVC):
+- We store artifacts (data/models) in S3 and add versioning to them with DVC (Data Version Control)
+- How does this work:
+    - (Initial run): Use the `dvc init` command to turn a git repo into a DVC repo
+        - This will add a DVC config file and a gitignore file
+        - Use git to commit these files
+    - For any data/artifact file that is generated, we use the `dvc add` command to track the file
+        - This creates a `.dvc` file that holds the metadata of the corresponding artifact
+    - Now we need to add a remote location where the files (the cached versions of the data) should be stored. This is done using `dvc remote add REMOTE-NAME s3://REMOTE-LOCATION`
+        - The two remotes set up for this repo are `build-data-remote` and `etl-data-remote`, each pointing to a different folder in S3
+    - You can use `dvc push` to move the data files to the remote storage locations
+    - Now if you ever delete your data, you can run `dvc pull` and this will download all the data back into the folders that contain `.dvc` files
+
+- In this repo, this has already been set up, so all you need to do is run `dvc pull` to get the latest data
+
 # TODO:
 - Structure/ MLOps:
     - Add configuration files (dev, staging, prod), including hyperparameters
diff --git a/model_data/simulation_system/requirements/training/training-dev.txt b/model_data/simulation_system/requirements/training/training-dev.txt
index bcba4f18..a8136586 100644
--- a/model_data/simulation_system/requirements/training/training-dev.txt
+++ b/model_data/simulation_system/requirements/training/training-dev.txt
@@ -3,3 +3,5 @@ pandas==1.5.3
 seaborn==0.12.2
 s3fs==2023.6.0
 pre-commit==3.3.3
+dvc
+dvc[s3]
diff --git a/model_data/simulation_system/requirements/training/training.txt b/model_data/simulation_system/requirements/training/training.txt
index cb2e1f87..33d86ede 100644
--- a/model_data/simulation_system/requirements/training/training.txt
+++ b/model_data/simulation_system/requirements/training/training.txt
@@ -2,3 +2,5 @@ autogluon==0.8.2
 pandas==1.5.3
 seaborn==0.12.2
 s3fs==2023.6.0
+dvc
+dvc[s3]
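
The DVC workflow described in the README hunk can be sketched as a small Python helper that lists the commands in order. This is only an illustration, not part of the change: the function names, the artifact path `data/train.csv`, and the commit message are hypothetical, and `s3://REMOTE-LOCATION` is the placeholder from the README, not a real bucket. The remote is registered right after `dvc init` here so that a single commit captures the DVC config, which is an assumption about ordering rather than something the README prescribes. Commands are printed by default and executed only when `dry_run=False`.

```python
import shlex
import subprocess
from typing import List


def dvc_workflow_commands(artifact: str, remote: str, remote_url: str,
                          first_run: bool = False) -> List[List[str]]:
    """Return the git/DVC commands for tracking and pushing one artifact."""
    commands: List[List[str]] = []
    if first_run:
        # One-time setup: turn the git repo into a DVC repo and register the remote.
        commands += [
            ["dvc", "init"],
            ["dvc", "remote", "add", remote, remote_url],
            ["git", "add", ".dvc"],
            ["git", "commit", "-m", "Initialize DVC"],  # hypothetical message
        ]
    # Per-artifact steps: track the file, stage its .dvc metadata, upload the data.
    commands += [
        ["dvc", "add", artifact],
        ["git", "add", f"{artifact}.dvc"],
        ["dvc", "push", "-r", remote],
    ]
    return commands


def run(commands: List[List[str]], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical artifact path; the remote name is one of the two in the README.
    run(dvc_workflow_commands("data/train.csv", "build-data-remote",
                              "s3://REMOTE-LOCATION", first_run=True))
```

For a fresh clone of this repo none of the setup steps are needed, since the remotes are already configured; per the README, `dvc pull` alone restores the data.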