February 6, 2025

A Perfect Combo to Quickly Iterate on Your DS Project

Imagine your data pipeline looks similar to the graph below.

Image by Author

The pink box represents a stage, which is an individual data process. Dependencies are the files that a stage depends on, such as parameters, Python scripts, or input data.

Now imagine Dependencies 2 changes. The standard approach is to rerun the entire pipeline.

Image by Author

This approach works but is inefficient. Wouldn’t it be better to run only the stages whose dependencies changed?

Image by Author

That is when the combination of DVC and GitHub Actions comes in handy. In this article, you will learn how to:

  • Use GitHub Actions to run a workflow when you push a commit
  • Use DVC to run stages with modified dependencies

Ultimately, combining these two tools will help reduce the friction and the amount of time needed to experiment with different parameters, code, or data.

Image by Author

The code used in this article can be found here:

DVC is a system for data version control. It is essentially Git, but for data.

DVC pipelines allow you to specify the individual data processes (called stages) that produce a final result.

Pipeline Stages

Let’s create a DVC pipeline by defining two stages in the file dvc.yaml. In summary:

  • The process_data stage processes raw data
  • The train stage trains a model on the processed data
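A minimal sketch of such a dvc.yaml. Only src/segment.py is named in the text; the other script names, parameter sections, and output paths are hypothetical:

```yaml
stages:
  process_data:
    cmd: python src/process_data.py   # assumed script name
    deps:
      - data/raw                      # assumed raw-data directory
      - src/process_data.py
    params:
      - process                       # assumed section in params.yaml
    outs:
      - data/intermediate             # assumed output directory
  train:
    cmd: python src/segment.py
    deps:
      - data/intermediate
      - src/segment.py
    params:
      - segment                       # assumed section in params.yaml
    outs:
      - model                         # assumed model output directory
    plots:
      - image                         # assumed plots directory
```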

Details of the code above:

  • cmd specifies the command to run the stage
  • deps specifies the files the stage depends on
  • params specifies a special kind of dependency: parameter
  • outs specifies the directory for the outputs of the stage
  • plots specifies a special kind of output: plot
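Entries under params refer to values stored in a params.yaml file at the project root. A minimal sketch, with hypothetical section and parameter names:

```yaml
process:
  drop_columns: true   # hypothetical parameter
segment:
  n_clusters: 5        # hypothetical parameter
```

When a value in a referenced section changes, DVC treats the corresponding stage as modified and reruns it.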

Reproduce

To run the pipeline in dvc.yaml, type:

dvc repro

Output:

The first time you run this command, DVC:

  • Runs every stage in the pipeline
  • Caches the run’s results
  • Creates the dvc.lock file. This file describes what data to use and which commands to run to generate the pipeline results.
Image by Author
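For illustration, a dvc.lock generated from this pipeline might look roughly as follows; the paths are assumptions and the hashes are placeholders:

```yaml
schema: '2.0'
stages:
  process_data:
    cmd: python src/process_data.py
    deps:
    - path: data/raw          # assumed path
      md5: 1a2b3c...          # placeholder hash
    outs:
    - path: data/intermediate # assumed path
      md5: 4d5e6f...          # placeholder hash
```

DVC compares these hashes against the current files to decide which stages need to rerun.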

Now let’s say we change the src/segment.py file, which is a dependency of the train stage. When you run dvc repro again, you will see the following:

From the output, we can see that DVC runs only the train stage because it:

  • Detects changes in the train stage’s dependencies
  • Doesn’t detect changes in the process_data stage’s dependencies
Image by Author

This prevents us from wasting time on unnecessary reruns.

To track the changes in the pipeline with Git, run:

git add dvc.lock

To send the updates to the remote storage, type:

dvc push
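dvc push requires a DVC remote to be configured first. A sketch, assuming a hypothetical S3 bucket; any supported storage (GCS, Azure, SSH, or a local directory) works the same way:

```shell
# Add a default remote named "storage" (bucket name is hypothetical)
dvc remote add -d storage s3://my-bucket/dvc-store

# Commit the remote configuration, then upload cached outputs
git add .dvc/config
git commit -m "configure DVC remote"
dvc push
```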

GitHub Actions allows you to automate your workflows, making it faster to build, test, and deploy your code.

We will use GitHub Actions to run the DVC pipeline whenever changes are pushed to GitHub.

Start by creating a file called run_pipeline.yaml under the .github/workflows directory:

.github
└── workflows
    └── run_pipeline.yaml

This is what the run_pipeline.yaml file looks like:
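A sketch of what run_pipeline.yaml could contain, based on the description that follows. The Python version, the requirements.txt file, and the secret names for remote-storage credentials are assumptions:

```yaml
name: Run code
on:
  push:
    branches:
      - dvc-pipeline
    paths:
      - configs/**
      - src/**
      - data/**

jobs:
  run_code:
    runs-on: ubuntu-latest
    steps:
      - name: Check out the repository
        uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.9'             # assumed version

      - name: Install dependencies
        run: pip install -r requirements.txt  # assumed requirements file

      - name: Pull data, run the pipeline, and push results
        run: |
          dvc pull
          dvc repro
          dvc push
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}         # assumed secret names
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
```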

Let’s understand the file above by breaking it down.

The first part of the file specifies the events that cause a workflow to run. Here, we tell GitHub Actions that the Run code workflow is triggered when:

  • A commit is pushed to the dvc-pipeline branch
  • The push includes changes to files in the configs, src, or data directories

A workflow run is made up of one or more jobs. A job includes a set of steps that are executed in order. The second part of the file specifies the steps inside the run_code job.

After finishing writing the workflow, push the file to GitHub.

Now let’s try out the workflow by changing the file src/segment.py and pushing it to GitHub.

git add .
git commit -m 'edit segment.py'
git push origin dvc-pipeline

When you click the Actions tab in your repository, you should see a new workflow run called edit segment.py.

Image by Author

Click the run to see more details about which step is running.

Image by Author

Congratulations! We have just succeeded in using GitHub Actions and DVC to:

  • Run the workflow when changes are pushed to GitHub
  • Rerun only stages with modified dependencies

If you are a data practitioner looking for a faster way to iterate on your data science project, I encourage you to give this a try. With a little bit of initial setup, you will save your team a lot of time in the long run.

DVC + GitHub Actions: Automatically Rerun Modified Components of a Pipeline. Republished from https://towardsdatascience.com/dvc-github-actions-automatically-rerun-modified-components-of-a-pipeline-a3632519dc42?source=rss—-7f60cf5620c9—4 via https://towardsdatascience.com/feed

