July 18, 2024

Parallelization does not have to be difficult

Many beginner and intermediate Python developers are afraid of parallelization. To them, parallel code means difficult code. Processes, threads, greenlets, coroutines… Instead of producing performant code, attempts to parallelize often end in headaches and frustration.

In this article, I want to show that this does not have to be the case. In simple scenarios, code can be simple to parallelize. Python is known to be an easy-to-understand programming language, and parallel code can also be easy to read and implement.

This article is not an introduction to parallelization. It’s not comprehensive. Instead, I want to show you how simple it can be to parallelize code in simple situations. This should give you the necessary background to apply parallelization to more complex scenarios. Or at least, it’s always better to start with the simple than the hard.

Simple examples do not have to mean academic examples. As Python developers or data scientists, you will frequently find yourself in situations in which you need to speed up your application. The first thing you should do is profile the code. Neglecting to do this is likely the most common mistake in attempting to optimize it. Profiling can show you bottlenecks in your application, places where you can achieve some gains. If you do not profile your code, you may spend a day optimizing a section of your application that is actually the most efficient.
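As a rough illustration of the profiling step, here is a minimal sketch that uses cProfile and pstats from the standard library. The functions fast_step(), slow_step() and app() are hypothetical stand-ins, not part of the application discussed in this article; the point is only that the report makes the bottleneck visible.

```python
import cProfile
import io
import pstats
import time


def fast_step(n: int) -> int:
    # hypothetical cheap step
    return n + 1


def slow_step(n: int) -> int:
    # hypothetical bottleneck, mocked with a short sleep
    time.sleep(0.01)
    return n * 2


def app(items: range) -> list:
    # the "application": runs both steps for every item
    return [slow_step(fast_step(i)) for i in items]


def profile_app() -> str:
    """Profile app() and return the text report."""
    profiler = cProfile.Profile()
    profiler.enable()
    app(range(20))
    profiler.disable()
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # sort by cumulative time so the most expensive calls come first
    stats.sort_stats("cumulative").print_stats(10)
    return stream.getvalue()


if __name__ == "__main__":
    print(profile_app())
```

In the report, slow_step() should account for nearly all of the cumulative time, which makes it the natural candidate for parallelization.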

After profiling, you may find, for example, that the application spends 99% of its time applying one computation-heavy function to each element of an iterable. You may think, then, that if you run this function in parallel, the application should be much quicker, and you could be right. Such scenarios are what I mean by simple situations.

This article aims to show how to simply parallelize Python code in such situations. You may learn one more thing, though: how to design and prototype code. I will not write an actual application, as such code could be too distracting to us. Instead, I will write a simple prototype of such an application. The prototype will work, but will not do anything valuable; it will just do something that resembles what such an application could do in real life.

Imagine we aim to write an application that processes texts. This processing can be anything you want to do with a text. You can, for instance, search for fingerprints in books, as did Ben Blatt in his fantastic book on Nabokov (Blatt 2017).

In our application, for each text, the following pipeline will be run:

  • read a text file
  • preprocess the file; this means cleaning and checking
  • process the file
  • return the results

An elegant way to do this is to create a generator pipeline. I showed how to do this in another article. We will use this approach.
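To make the term concrete, here is a minimal sketch of one common form of a generator pipeline, in which every stage is a generator that consumes the previous one. This is only an illustration under my own assumptions: the stage names read_files(), preprocess_texts() and process_texts() are hypothetical, and the prototype in this article organizes the pipeline differently, with one pipeline() function applied to each path.

```python
import pathlib
from typing import Dict, Iterable, Iterator


def read_files(paths: Iterable[pathlib.Path]) -> Iterator[str]:
    # mock: the "text" is simply the file's name
    for path in paths:
        yield path.name


def preprocess_texts(texts: Iterable[str]) -> Iterator[str]:
    # mock preprocessing: convert to upper case
    for text in texts:
        yield text.upper()


def process_texts(texts: Iterable[str]) -> Iterator[Dict[str, int]]:
    # mock processing: count selected letters
    for text in texts:
        yield {letter: text.count(letter) for letter in ("A", "B")}


paths = [pathlib.Path("lolita.txt"), pathlib.Path("book_about_c.txt")]

# nothing is computed here; the stages are only chained together
stages = process_texts(preprocess_texts(read_files(paths)))

# the pipeline is evaluated only when we consume it
results = list(stages)
print(results)
```

Each item travels through all three stages before the next item is read, so the whole collection never has to sit in memory at once.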

Below, you will find a function that creates the above pipeline, and three functions that the pipeline uses. The pipeline function will be responsible for running all of these functions, and then returning the results.

I will use simple mocks of these functions, because I do not want you to focus on code that is not related to parallelization. I will also add some sleep calls, so that each function takes time roughly proportional to what it could take in reality.

import time
import pathlib
from typing import Dict, Generator, Iterable

# type aliases
ResultsType = Dict[str, int]

def read_file(path: pathlib.Path) -> str:
    # a file is read to a string and returned
    # here, it's mocked by the file's name as a string
    # this step should be rather quick
    time.sleep(.05)
    text = path.name
    return text

def preprocess(text: str) -> str:
    # preprocessing is done here
    # we make upper case of text
    # longer than reading but quite shorter than processing
    time.sleep(.25)
    text = text.upper()
    return text

def process(text: str) -> ResultsType:
    # the main process is run here
    # we return the number of "A" and "B" letters in text
    # this is the longest process among all
    time.sleep(1.)
    search = ("A", "B", )
    results = {letter: text.count(letter) for letter in search}
    return results

def pipeline(path: pathlib.Path) -> ResultsType:
    # run all three steps for one path and return the results
    text = read_file(path)
    preprocessed = preprocess(text)
    processed = process(preprocessed)
    return processed

Now, we need to have an iterable of paths. A real read_file() function would read a file, but our mock does not; instead, it makes a text from the path’s name. I made it that way to make things as simple as possible. So, this will be our iterable:

file_paths = (
    pathlib.Path(p)
    for p in (
        "book_about_python.txt",
        "book_about_java.txt",
        "book_about_c.txt",
        "science_fiction_book.txt",
        "lolita.txt",
        "go_there_and_return.txt",
        "statistics_for_dummies.txt",
        "data_science_part_1.txt",
        "data_science_part_2.txt",
        "data_science_part_3.txt",
    )
)

We have all we need: a file_paths generator that produces the paths to the files as pathlib.Path objects. We can now run the whole pipeline using this generator. To do so, we iterate over the file_paths generator, and during each iteration, a path is processed via the pipeline() function. That way, we will use a generator pipeline, as promised.

Let’s perform the evaluation of the generator in the name-main block:

if __name__ == "__main__":
    start = time.perf_counter()
    results = {path: pipeline(path) for path in file_paths}
    print(f"Elapsed time: {round(time.perf_counter() - start, 2)}")
    print(results)

This works. I also added code that measures the elapsed time. Since we have 10 files to process, and one file is processed in about 1.3 sec, we should expect the whole pipeline to run in about 13 sec plus some overhead.

And indeed, on my machine, the whole pipeline needed 13.02 sec to process.

Above, we evaluated the pipeline using a dictionary comprehension. We could also use a generator expression or the map() function:

if __name__ == "__main__":
    start = time.perf_counter()
    pipeline_gen = map(lambda p: (p, pipeline(p)), file_paths)
    # or we can use a generator expression:
    # pipeline_gen = ((path, pipeline(path)) for path in file_paths)
    results = dict(pipeline_gen)
    print(f"Elapsed time: {round(time.perf_counter() - start, 2)}")
    print(results)

The call to map() looks more difficult than the above dict comprehension. This is because we want to return not only the results of pipeline(), but also the path being processed. We frequently need to be able to link a result to its input. Here we do so by returning a tuple of two items, (path, pipeline(path)), and based on these tuples, we can create a dictionary with path: pipeline(path) key-value pairs. Using lambdas with map() is a frequent approach, but not necessarily the clearest one.

Note that if we did not need the path, the code would be much simpler, as the only thing we would need is the results of pipeline(path):

pipeline_gen = map(pipeline, file_paths)

As we need to add the path name, however, we used the lambda function. The resulting code, the one with lambda, does not look too attractive, does it? Fortunately, there’s a nice simple trick to simplify calls to map() that use lambdas. We can use a wrapper function instead:

def evaluate_pipeline(path):
    return path, pipeline(path)

if __name__ == "__main__":
    start = time.perf_counter()
    pipeline_gen = map(evaluate_pipeline, file_paths)
    results = dict(pipeline_gen)
    print(f"Elapsed time: {round(time.perf_counter() - start, 2)}")
    print(results)

The map object above is a lazy iterator (it behaves like a generator) that yields two-element tuples (path, pipeline(path)). We can use dict(pipeline_gen) to

  • evaluate it, and
  • convert the results to a dictionary consisting of path: pipeline(path) key-value pairs.

Don’t you agree that this code is more readable and much easier to understand than the version with lambda?

By the way, if you’re wondering where the generator pipeline is, it’s here: map(evaluate_pipeline, file_paths). We evaluate it using the dict() function.
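If the laziness of map() is new to you, the small sketch below may help. It uses a hypothetical shout() function that records each call, to show that map() does no work until something consumes it, and that a map object can be consumed only once.

```python
calls = []  # records every argument shout() was called with


def shout(text: str) -> str:
    # hypothetical function, used only for this illustration
    calls.append(text)
    return text.upper()


lazy = map(shout, ["a", "b", "c"])

# nothing has been computed yet: shout() was not called at all
called_before = list(calls)

# consuming the map object triggers the calls
results = list(lazy)

# the map object is now exhausted, so a second pass yields nothing
second_pass = list(lazy)

print(called_before, results, second_pass)
```

The same holds for the file_paths generator used in this article: once it has been iterated over, it is empty, so you would need to create it again before running another version of the name-main block in the same session.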

Why are we talking that much about map()?!

You may be wondering why I am focusing that much on the map() function. What’s going on? Shouldn’t we be discussing parallelization?

We should — and we do. In order to parallelize your code, you need first to understand how map() works. This is because quite often you will use a very similar function to map(); in fact, quite often this function will even be called map. Therefore, whenever you think of parallelizing your code, instead of a for loop, generator expression, list comprehension, dictionary comprehension, set comprehension, or the like — consider using the map() function.

As I wrote in my Towards Data Science article about map() (Kozak 2022), this function is used by many parallelization modules and techniques, and their versions of map() work very similarly to the built-in map() function. Therefore, if you plan to parallelize code, it’s often good to use map() from the beginning. It’s not a must; just a suggestion. That way, you can save some time later, when parallelizing the code.

So, maybe it’s better to start off with a map() version without even considering whether you will parallelize the code? On the other hand, map() can sometimes be less readable than the corresponding generator expression; but then you can use a wrapper function, and the map()-based code should become quite readable.

I am not going to tell you which design is the best. Sometimes, map() will be the perfect solution — especially when you plan to parallelize the code. The important point is, always be careful. Code readability is important and not always straightforward. There is not one single recipe. Use your skill and experience to decide.

But if you decide to parallelize the code with one of the most popular parallelization packages, more often than not you will end up using a function similar to map(). So, do not ignore the built-in map() function. Tomorrow, it may become a closer friend of yours than you expect today.
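One example of such a map-like function from outside multiprocessing is Executor.map() from the standard-library concurrent.futures module. The sketch below is only an illustration: slow_upper() is a hypothetical mock, and I use a thread pool because the mocked work is just sleeping. ProcessPoolExecutor offers the same map() method for process-based work.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Iterable, List


def slow_upper(text: str) -> str:
    # hypothetical mock of some slow operation
    time.sleep(0.05)
    return text.upper()


def run_sequential(texts: Iterable[str]) -> List[str]:
    # the built-in map()
    return list(map(slow_upper, texts))


def run_with_executor(texts: Iterable[str], workers: int = 4) -> List[str]:
    # the same call shape, but executor.map() instead of map()
    with ThreadPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(slow_upper, texts))


if __name__ == "__main__":
    texts = ["lolita", "dune", "ulysses", "emma"]
    print(run_sequential(texts))
    print(run_with_executor(texts))
```

Note how little differs between the two functions: the function being mapped and the iterable stay the same, and the results come back in the same order as the inputs.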

Now that we used map(), our code is ready to be parallelized. To this end, let’s use multiprocessing, the standard-library module for parallelization:

import multiprocessing as mp

In most situations, multiprocessing is a good choice. We need to remember, however, that other modules can offer better performance. I hope to write about them in other articles.

Our name-main block now becomes:

if __name__ == "__main__":
    start = time.perf_counter()
    with mp.Pool(4) as p:
        results = dict(p.map(evaluate_pipeline, file_paths))
    print(f"Elapsed time: {round(time.perf_counter() - start, 2)}")
    print(results)

Four cores, and only 4.0 sec! (With 8 workers, it took 2.65 sec.) Nice; we see that multiprocessing works as expected.
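These timings agree with a simple back-of-the-envelope estimate. The sketch below assumes that every task takes the same time (about 1.3 sec here), that tasks are handed out to the workers one at a time, and that overhead is negligible; estimated_seconds() is a hypothetical helper written only for this illustration.

```python
import math


def estimated_seconds(n_tasks: int, workers: int, seconds_per_task: float) -> float:
    # with equal tasks, the workers finish in "rounds":
    # each round handles up to `workers` tasks at once
    rounds = math.ceil(n_tasks / workers)
    return rounds * seconds_per_task


for workers in (1, 4, 8):
    estimate = estimated_seconds(10, workers, 1.3)
    print(f"{workers} worker(s): about {estimate:.1f} sec")
```

For 10 files, this gives about 13.0 sec for one worker, 3.9 sec for four workers (three rounds) and 2.6 sec for eight workers (two rounds), close to the measured 13.02, 4.0 and 2.65 sec. It also shows why doubling the number of workers did not halve the time: 10 tasks do not divide evenly among 4 or 8 workers.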

Note that we did not need to change a lot in the code, with only two changes:

  • with mp.Pool(4) as p: It’s a good rule to use a context manager for mp.Pool, as you do not need to remember to close the pool. I chose 4 workers, which means the work will be run in four processes. If you do not want to hard-code the number of workers, you can compute it, as in the snippet that follows this paragraph. I used mp.cpu_count() - 1 so that one core is left free for everything else; mp.cpu_count() returns the number of logical cores in the system. I have 4 physical and 8 logical cores, and mp.cpu_count() returns 8.

    workers = mp.cpu_count() - 1
    with mp.Pool(workers) as p:

  • p.map(evaluate_pipeline, file_paths): The only change we made is changing map() to p.map(). Remember how I told you that using map() will make it easier to parallelize code? Here it is: this one small change. Do not forget, however, that p.map() does not evaluate lazily like map() but does so eagerly (immediately). Therefore, it does not return a generator; instead, it returns a list.
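If you do need lazy evaluation, the pool also offers imap(), which returns an iterator instead of a list. The following is a minimal sketch with a hypothetical square() function and a hypothetical helper compare_map_and_imap(); it is not part of the application above.

```python
import multiprocessing as mp
from typing import Iterable, List, Tuple


def square(x: int) -> int:
    # hypothetical function, used only for this illustration
    return x * x


def compare_map_and_imap(
    values: Iterable[int], workers: int = 2
) -> Tuple[str, List[int], List[int]]:
    values = list(values)
    with mp.Pool(workers) as p:
        # p.map() blocks until all results are ready and returns a list
        eager = p.map(square, values)
        # p.imap() returns an iterator; results arrive as we consume it
        lazy = p.imap(square, values)
        # consume the iterator while the pool is still alive
        lazy_results = list(lazy)
    return type(eager).__name__, eager, lazy_results


if __name__ == "__main__":
    print(compare_map_and_imap(range(5)))
```

Note that the iterator returned by imap() is consumed inside the with block, because the pool is shut down as soon as the block ends.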

Let’s look at an even simpler, complete example:

import multiprocessing as mp
import random
import time

def some_calculations(x: float) -> float:
    time.sleep(.5)
    # some calculations are done
    return x**2

if __name__ == "__main__":
    x_list = [random.random() for _ in range(20)]
    with mp.Pool(4) as p:
        x_recalculated = p.map(some_calculations, x_list)

And that’s it: this script uses parallel computing to run some_calculations() for each element of x_list. It’s quite a simple piece of Python code, isn’t it? You can return to it whenever you want to use parallel computing in a simple situation like this.

A good thing is that if you want to run more parallel computations, you can use another such context manager for mp.Pool.

I promised to show you that parallelization in Python can be simple. The first example was a prototype of an application, and it was a little complex, just like real-life applications are. But note that the parallel part of the code was very short and rather simple. The second example reinforces this. The complete code has only a dozen or so lines — and simple code, for that matter.

Do not be afraid to use parallelization even in simple situations. Of course, use it only when you need it. It makes no sense to use it when you do not need to make your code faster. Do remember that when the function to be called in parallel is very quick, you may make your application slower, since parallelization does not come without cost; this time, the cost is the overhead from creating the parallel backend.
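You can check this cost on your own machine. The sketch below times the built-in map() against a pool’s map() for a hypothetical, very quick function tiny(). I do not show any numbers, as they depend heavily on the machine and the operating system; treat it as a starting point for your own measurements, not as a benchmark.

```python
import multiprocessing as mp
import time
from typing import Iterable


def tiny(x: int) -> int:
    # hypothetical function that is far too quick to be worth parallelizing
    return x + 1


def time_builtin_map(values: Iterable[int]) -> float:
    start = time.perf_counter()
    list(map(tiny, values))
    return time.perf_counter() - start


def time_pool_map(values: Iterable[int], workers: int = 4) -> float:
    # the measured time includes creating and shutting down the pool
    start = time.perf_counter()
    with mp.Pool(workers) as p:
        p.map(tiny, values)
    return time.perf_counter() - start


if __name__ == "__main__":
    values = list(range(1000))
    print(f"built-in map(): {time_builtin_map(values):.4f} sec")
    print(f"mp.Pool.map(): {time_pool_map(values):.4f} sec")
```

If the pool version turns out slower for your function, parallelizing it this way is probably not worth it.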

I showed that parallelization in Python can be simple, but you can do much more with it. I will discuss this in my next articles. If you do not want to limit your knowledge to such basics, Micha Gorelick and Ian Ozsvald’s book (2020) is a very good source — not a simple one, though, as the authors discuss quite advanced topics.

multiprocessing is not always the best approach. Sometimes it’s better to use other tools. There is pathos, there is ray, and there are other tools. Sometimes your environment induces the choice of a parallelization backend. For example, in AWS SageMaker, you will need to use ray. This package offers much more than multiprocessing does, and I will try to write an article about that, too.

That’s it for today. Thanks for reading. If you felt intimidated by parallelization in Python, I hope you do not feel that way anymore. Try it yourself, and note how easy basic parallelization is. Of course, there is much more to it, but let’s learn it step by step. Today, you learned that there are no reasons to be afraid of it, and the basics are… basic and simple.

I did not go into technical details, such as other ways of parallelizing the code instead of a pool, naming, and other things like that. I just wanted to show you how to do it, and how simple it can be. I hope I succeeded in this.

Thanks for reading this article, and please share your thoughts about parallelization in the comments. I will be happy to read your first-attempt stories: What were your first attempts with parallelization? What did you parallelize? Did you have any problems? Did it work as you expected? Were you satisfied by how it worked?

Parallelization in Python: The Easy Way Republished from Source https://towardsdatascience.com/parallelization-in-python-the-easy-way-aa03ed04c209?source=rss—-7f60cf5620c9—4 via https://towardsdatascience.com/feed
