A Perfect Combo to Quickly Iterate on Your DS Project
Imagine your data pipeline looks similar to the graph below.
The pink box represents a stage, which is an individual data process. Dependencies are the files that a stage depends on, such as parameters, Python scripts, or input data.
Now imagine `Dependencies 2` changes. The standard approach is to rerun the entire pipeline.
This approach works but is inefficient. Wouldn’t it be better to run only the stages whose dependencies changed?
That is when the combination of DVC and GitHub Actions comes in handy. In this article, you will learn how to:
- Use GitHub Actions to run a workflow when you push a commit
- Use DVC to run stages with modified dependencies
Ultimately, combining these two tools will help reduce the friction and the amount of time needed to experiment with different parameters, code, or data.
The code used in this article can be found here:
DVC is a system for data version control. It is essentially like Git but is used for data.
DVC pipelines allow you to specify the individual data processes (called stages) that produce a final result.
Pipeline Stages
Let’s create a DVC pipeline by defining two stages in the file `dvc.yaml`. In summary:
- The `process_data` stage processes the raw data
- The `train` stage trains a model on the processed data
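The original code block is not reproduced here; the sketch below shows what such a `dvc.yaml` could look like. Apart from `src/segment.py`, which the article mentions later, the script, data, and output paths are hypothetical:

```yaml
stages:
  process_data:
    cmd: python src/process_data.py   # hypothetical script name
    deps:
      - data/raw                      # hypothetical raw-data directory
      - src/process_data.py
    params:
      - process_data                  # parameter group in params.yaml
    outs:
      - data/intermediate             # hypothetical output directory
  train:
    cmd: python src/segment.py
    deps:
      - data/intermediate
      - src/segment.py
    params:
      - train
    outs:
      - model                         # hypothetical model directory
    plots:
      - image/cluster.png             # hypothetical plot output
```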
Details of the code above:
- `cmd` specifies the command to run the stage
- `deps` specifies the files the stage depends on
- `params` specifies a special kind of dependency: parameters
- `outs` specifies the directory for the outputs of the stage
- `plots` specifies a special kind of output: plots
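The `params` entries refer to a parameters file, which is `params.yaml` by default in DVC. A minimal sketch, with hypothetical parameter groups and values matching the stage names above:

```yaml
# params.yaml — default parameters file read by DVC
process_data:
  test_size: 0.2    # hypothetical parameter
train:
  n_clusters: 4     # hypothetical parameter
```

Changing a value under a stage's parameter group marks that stage as modified, so `dvc repro` will rerun it.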
Reproduce
To run the pipeline in `dvc.yaml`, type:
dvc repro
Output:
The first time you run this command, DVC:
- Runs every stage in the pipeline
- Caches the run’s results
- Creates the `dvc.lock` file, which describes what data to use and which commands to run to generate the pipeline results
Now let’s say we change the `src/segment.py` file, which is a dependency of the `train` stage. When you run `dvc repro` again, you will see the following:
From the output, we can see that DVC only runs the `train` stage because it:
- Detects changes in the `train` stage’s dependencies
- Doesn’t detect changes in the `process_data` stage’s dependencies
This prevents us from wasting time on unnecessary reruns.
To track the changes in the pipeline with Git, run:
git add dvc.lock
To send the updates to the remote storage, type:
dvc push
GitHub Actions allows you to automate your workflows, making it faster to build, test, and deploy your code.
We will use GitHub Actions to run the DVC pipeline when committing the changes to GitHub.
Start by creating a file called `run_pipeline.yaml` under the `.github/workflows` directory:
.github
└── workflows
└── run_pipeline.yaml
Here is what the `run_pipeline.yaml` file looks like:
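The original workflow file is not reproduced here; below is a minimal sketch consistent with the breakdown that follows. The action versions, Python version, and dependency-installation step are assumptions:

```yaml
name: Run code
on:
  push:
    branches:
      - dvc-pipeline
    paths:
      - configs/**
      - src/**
      - data/**
jobs:
  run_code:
    runs-on: ubuntu-latest
    steps:
      - name: Check out the repository
        uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'        # assumed Python version
      - name: Install dependencies      # assumed requirements file
        run: pip install -r requirements.txt
      - name: Pull data from remote storage
        run: dvc pull
      - name: Run the pipeline
        run: dvc repro
```

With the `paths` filter, pushes that touch only unrelated files (for example, the README) do not trigger the workflow at all.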
Let’s understand the file above by breaking it down.
The first part of the file specifies the events that cause a workflow to run. Here, we tell GitHub Actions that the `Run code` workflow is triggered when:
- A commit is pushed to the `dvc-pipeline` branch
- The push includes changes to files in the `configs`, `src`, or `data` directories
A workflow run is made up of one or more jobs. A job includes a set of steps that are executed in order. The second part of the file specifies the steps inside the `run_code` job.
After finishing writing the workflow, push the file to GitHub.
Now let’s try out the workflow by changing the file `src/segment.py` and pushing it to GitHub.
git add .
git commit -m 'edit segment.py'
git push origin dvc-pipeline
When you click the Actions tab in your repository, you should see a new workflow run called edit segment.py.
Click the run to see more details about which step is running.
Congratulations! We have just succeeded in using GitHub Actions and DVC to:
- Run the workflow when changes are pushed to GitHub
- Rerun only stages with modified dependencies
If you are a data practitioner looking for a faster way to iterate on your data science project, I encourage you to give this a try. With a little bit of initial setup, you will save your team a lot of time in the long run.
DVC + GitHub Actions: Automatically Rerun Modified Components of a Pipeline Republished from Source https://towardsdatascience.com/dvc-github-actions-automatically-rerun-modified-components-of-a-pipeline-a3632519dc42?source=rss—-7f60cf5620c9—4 via https://towardsdatascience.com/feed