July 17, 2024


Data Science Cloud Infrastructure


In our previous post, we saw how to configure AWS Batch and tested our infrastructure by executing a task that spun up a container, waited for 3 seconds, and shut down.

In this post, we’ll leverage the existing infrastructure, but this time, we’ll execute a more interesting example.

We’ll ship our code to AWS by building a container and storing it in Amazon ECR, a service that allows us to store Docker images.

If you want to keep up to date with my Data Science content, follow me on Medium or Twitter. Thanks for reading!

We’ll be using the aws CLI again to configure the infrastructure, so ensure you’re authenticated and have enough permissions:
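A quick way to confirm the CLI is authenticated is to print the active identity (this prints your account ID and the ARN of the identity you're using):

```shell
# prints the account and identity the CLI is authenticated as;
# errors out if no valid credentials are configured
aws sts get-caller-identity
```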

We’ll be using Docker for this part, so ensure it’s up and running:
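A simple way to check is to ask the Docker daemon for its status (this fails with an error if Docker isn't running):

```shell
# prints server details if the Docker daemon is up and reachable
docker info
```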


We first create a repository, which will host our Docker images:

Create ECR repository:
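A minimal version of the command looks like this (the repository name `ds-platform` is just an example):

```shell
# create the repository and print only its URI
aws ecr create-repository --repository-name ds-platform \
    --query repository.repositoryUri --output text
```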


The command above will print the repository URI; assign it to the following variable, since we’ll need it later:
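For example (the URI below is a placeholder; use the one printed for your account):

```shell
# hypothetical URI -- replace with the one printed by the previous command
REPOSITORY=123456789012.dkr.ecr.us-east-1.amazonaws.com/ds-platform
echo "$REPOSITORY"
```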

We’ll now use two open source tools (Ploomber and Soopervisor) to write our computational task, generate a Docker image, push it to ECR, and schedule a job in AWS Batch.

Let’s install the packages:
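Assuming pip, the install is:

```shell
pip install ploomber soopervisor
```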

Note: We recommend installing them in a virtual environment.

Let’s get an example. This example trains and evaluates a Machine Learning model:
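Ploomber ships a command for downloading starter projects; a basic ML pipeline can be fetched like this (the template name is an assumption on my part, any Ploomber example works the same way):

```shell
# download the example into a folder named 'example' and enter it
ploomber examples -n templates/ml-basic -o example
cd example
```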


Let’s see the files:


The structure is a typical Ploomber project. Ploomber allows you to easily organize computational workflows as functions, scripts, or notebooks and execute them locally. To learn more, check out Ploomber’s documentation.

On the other hand, Soopervisor allows you to export a Ploomber project and execute it in the cloud.

The next command will tell Soopervisor to create the necessary files so we can export to AWS Batch:
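The command adds a target environment (here named `aws-env`, matching the output below) backed by AWS Batch:

```shell
# create a soopervisor.yaml entry and a Dockerfile for the aws-env target
soopervisor add aws-env --backend aws-batch
```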


===================Loading DAG===================
No pipeline.aws-env.yaml found, looking for pipeline.yaml instead
Found /Users/Edu/dev/ploomber.io/raw/ds-platform-part-ii/example/pipeline.yaml.
Loading... Adding /Users/Edu/dev/ploomber.io/raw/ds-platform-part-ii/example/aws-env/Dockerfile...
Fill in the configuration in the 'aws-env' section in soopervisor.yaml then submit to AWS Batch with: soopervisor export aws-env
Environment added, to export it: $ soopervisor export aws-env
To force execution of all tasks: $ soopervisor export aws-env --mode force

soopervisor add will create a soopervisor.yaml file and an aws-env folder.

The aws-env folder contains a Dockerfile (which we need to create the Docker image):


The soopervisor.yaml file contains configuration parameters:
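The generated file looks roughly like this (the values and resource sizes below are placeholders, not what the tool emits verbatim):

```yaml
aws-env:
  backend: aws-batch
  repository: your-repository-uri
  job_queue: your-job-queue
  region: your-aws-region
  container_properties:
    memory: 16384
    vcpus: 8
```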


There are a few parameters we have to configure here (we created a small script to generate the configuration file):

  • job_queue: the name of your job queue
  • aws_region: the region where your AWS Batch infrastructure is located
  • repository: the ECR repository URI

Here are the values for my infrastructure (replace them with yours):
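For instance, stored as shell variables (all three values below are hypothetical):

```shell
# hypothetical values -- replace with the ones from your infrastructure
JOB_QUEUE=ds-platform-queue
AWS_REGION=us-east-1
REPOSITORY=123456789012.dkr.ecr.us-east-1.amazonaws.com/ds-platform
```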

Note: If you don’t have the job queue name, you can get it from the AWS console (ensure you’re in the right region).

Let’s download a utility script to facilitate creating the configuration files:


Create the soopervisor.yaml configuration file:


This is what the file looks like:


Let’s now use soopervisor export to execute our pipeline in AWS Batch. This command will do a few things for us:

  • Build the Docker container
  • Push it to the Amazon ECR repository
  • Submit the jobs to AWS Batch

We need to install boto3, since it’s a dependency for submitting jobs to AWS Batch:
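```shell
pip install boto3
```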

Authenticate with Amazon ECR so we can push images:
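The standard login flow pipes a temporary password from the CLI into docker login (the region and registry below are examples; use your own account's registry):

```shell
# obtain a temporary password and log Docker into the ECR registry
aws ecr get-login-password --region us-east-1 | \
  docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com
```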


Let’s now export the project. Bear in mind that this command will take a few minutes:
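As the soopervisor add output suggested, exporting is a single command:

```shell
# build the image, push it to ECR, and submit the job to AWS Batch
soopervisor export aws-env
```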

If all goes well, you’ll see something like this:

If you encounter issues with the soopervisor export command, or are unable to push to ECR, join our community and we’ll help you!

Once the command finishes execution, the job will be submitted to AWS Batch. Let’s use the aws CLI to list the jobs submitted to the queue:
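One thing to keep in mind: list-jobs only shows RUNNING jobs by default, so pass a status explicitly (the queue name below is hypothetical):

```shell
# query finished jobs; swap SUCCEEDED for RUNNABLE/STARTING/RUNNING as needed
aws batch list-jobs --job-queue ds-platform-queue --job-status SUCCEEDED
```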


After a minute, you’ll see that the task shows as SUCCEEDED (it’ll appear as RUNNABLE, STARTING, or RUNNING if it hasn’t finished).

However, there is a catch: AWS Batch ran our code, but it shut down the EC2 instance shortly after, so we no longer have access to the output.

To fix that, we’ll add an S3 client to our project, so all outputs are stored.

Let’s first create a bucket in S3. S3 bucket names must be globally unique, so you can run the following snippet in your terminal, or choose a unique name and assign it to the BUCKET_NAME variable:
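One simple approach (a sketch; the prefix is arbitrary) is to append a timestamp to reduce the chance of a collision:

```shell
# append a Unix timestamp so the bucket name is (very likely) unique
BUCKET_NAME="ds-platform-$(date +%s)"
echo "$BUCKET_NAME"
```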


Create bucket:
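With the variable set, creating the bucket is one command:

```shell
# mb = "make bucket"
aws s3 mb "s3://$BUCKET_NAME"
```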


Ploomber allows us to specify an S3 bucket and it’ll take care of uploading all outputs for us. We only have to create a short file. The generate.py script can create one for us:
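Based on Ploomber's documented S3Client, the file is roughly this (the bucket name is a placeholder, and treat the exact keyword arguments as an assumption; check Ploomber's docs for your version):

```python
# clients.py -- sketch of the file generate.py creates for us
from ploomber.clients import S3Client

def get_client():
    # all pipeline outputs get uploaded under the 'outputs/' prefix
    return S3Client(bucket_name='YOUR-BUCKET-NAME', parent='outputs')
```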


We need to configure our pipeline.yaml file so it uploads artifacts to S3. Let’s use the generate.py script to do it for us:
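The relevant addition to pipeline.yaml is a clients section pointing File products at our function (syntax per Ploomber's documentation; the function name assumes the clients.py sketched above):

```yaml
clients:
  File: clients.get_client
```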

Furthermore, let’s add boto3 to our dependencies since we’ll be calling it to upload artifacts to S3:
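Assuming the example tracks its dependencies in a requirements.txt (an assumption about this template), appending the dependency looks like:

```shell
# record boto3 as a dependency so it's installed inside the container
echo "boto3" >> requirements.txt
```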

Let’s add S3 permissions to our AWS Batch tasks. Generate a policy:
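A minimal policy scoped to the bucket might look like this (a sketch; the bucket name is a placeholder, and you may want to narrow the actions further):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::YOUR-BUCKET-NAME",
        "arn:aws:s3:::YOUR-BUCKET-NAME/*"
      ]
    }
  ]
}
```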


Apply it:
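Attaching it as an inline policy might look like this (the role name is a placeholder; use the role your AWS Batch jobs run under, and assume the policy was saved as policy.json):

```shell
# attach the S3 policy inline to the role used by Batch jobs
aws iam put-role-policy --role-name YOUR-BATCH-JOB-ROLE \
  --policy-name s3-access --policy-document file://policy.json
```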

We’re now ready to execute our task on AWS Batch!

Let’s ensure we can push to ECR:


Submit the task again:
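Since nothing upstream changed, we force execution of all tasks (the flag comes straight from the soopervisor add output earlier):

```shell
# --mode force re-runs every task even if outputs are cached
soopervisor export aws-env --mode force
```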

Note that this time, the soopervisor export command is a lot faster, since it cached our Docker image!

Let’s check the status of the task:


After a minute, you should see it as SUCCEEDED.

If we check the contents of our bucket, we’ll see the task output (a .parquet file):
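```shell
# recursively list everything the pipeline uploaded
aws s3 ls "s3://$BUCKET_NAME" --recursive
```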


In this post, we learned how to upload our code and execute it in AWS Batch via a Docker image. We also configured AWS Batch to read and write an S3 bucket. With this configuration, we can start running Data Science experiments in a scalable way without worrying about maintaining infrastructure!

In the next (and final) post of this series, we’ll see how to easily generate hundreds of experiments and retrieve the results.

If you want to be the first to know when the final part comes out, follow us on Twitter or LinkedIn, or subscribe to our newsletter!

If you wish to delete the infrastructure we created in this post, here are the commands.

Delete ECR repository:
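```shell
# --force deletes the repository even if it still contains images
aws ecr delete-repository --repository-name ds-platform --force
```

(The repository name matches the example used when creating it; use your own.)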


Delete S3 bucket:
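```shell
# rb = "remove bucket"; --force empties the bucket before removing it
aws s3 rb "s3://$BUCKET_NAME" --force
```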


Deploying a Data Science Platform on AWS: Running containerized experiments (Part II). Republished from Towards Data Science: https://towardsdatascience.com/deploying-a-data-science-platform-on-aws-running-containerized-experiments-part-ii-bef0e22bd8ae



