3 Quick Tips Every Pythonista Should Know to Speed Up Code on the Fly

Tired of getting 10 coffees waiting for your code to finish?
I was, so I decided to use my first Medium post to share an overview of three easy Python performance tips that I wish I had known earlier, to reduce the C++ bullying a bit.

SPOILER ALERT: In my next post I will show how to apply these tips in the context of a Multi-Objective Genetic Algorithm, so stay tuned.

So shall we?

TIP 1 Local Parallelization using ProcessPoolExecutor

ProcessPoolExecutor comes with the standard library (concurrent.futures) and gives you user-friendly multiprocessing, provided you have more than one core. Simplifying a bit: since processes do not share memory, they can run in parallel (one on each core).

Below is an example of how easy it is to benefit from ProcessPoolExecutor.
The function verifies whether each number in a list is prime (returns True) or not (returns False).

Please run it in a Python script, not in Jupyter (otherwise it will raise a BrokenProcessPool error).

run_base() gives us a baseline timing, and run_process_pool() the multiprocessing case (note that my notebook only has 2 cores).
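The original code is embedded as a gist on Medium, so here is a minimal sketch of what it looks like; the exact list of numbers tested is my assumption (large odd values keep is_prime busy).

```python
import time
from concurrent.futures import ProcessPoolExecutor

# Candidate numbers to test: an assumption, chosen so is_prime has real work to do.
NUMBERS = [112272535095293, 112582705942171, 115280095190773,
           115797848077099, 1099726899285419] * 7

def is_prime(n):
    """Return True if n is prime, False otherwise (simple trial division)."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    i = 3
    while i * i <= n:
        if n % i == 0:
            return False
        i += 2
    return True

def run_base():
    """Baseline: check every number sequentially."""
    start = time.perf_counter()
    results = [is_prime(n) for n in NUMBERS]
    print("Time Base Case", time.perf_counter() - start)
    return results

def run_process_pool():
    """Same work, spread across worker processes."""
    start = time.perf_counter()
    with ProcessPoolExecutor() as executor:
        results = list(executor.map(is_prime, NUMBERS))
    print("Time with ProcessPoolExecutor", time.perf_counter() - start)
    return results

if __name__ == "__main__":  # required so worker processes can safely import this module
    run_base()
    run_process_pool()
```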

On my machine, even with this small example, there was a difference of around 4 seconds.

Time Base Case 14.7976806
Time with ProcessPoolExecutor 10.7335183

Did your code perform similarly, better, or worse?

Note, however, that ProcessPoolExecutor is restricted to picklable objects (please take a look at the docs).

If you want to learn more about local concurrency using the standard Python library, I recommend this amazing tutorial on multiprocessing.

TIP 2 Compiling Python code with Numba

My second tip is to use the numba library, which compiles Python code on the fly to machine code. Yes, that's correct, and we can also save a loooooot of time here. We just need to understand a few things first.

Numba Summary

  1. Numba works especially well with numpy and multiple loops.
  2. You’ll probably not be able to compile your whole code at once.
  3. Test numba on small, low-performance functions that contain numpy and multiple loops (spoiler: to understand how to prioritize, wait for Tip 3).
  4. Before compiling, check the list of supported Python and NumPy functions.
  5. Do not mix data types: numba won't help you the way Python does by automatically converting, for instance, int64 to int32.

Now, as always, let's test. First let's install the package: open your Anaconda/pip terminal and execute conda install numba (or pip install numba).

Next, let's test our previous example using numba.

So first, copy the is_prime function under a different name and add the numba decorator @jit(nopython=True, fastmath=True) on top.

Thaaat's it! Now we just have to add the call to the numba function.
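Again, the real code lives in the gist; a minimal sketch, reusing NUMBERS from the previous snippet (the name is_prime_numba is my own), could look like this:

```python
import time
from numba import jit

@jit(nopython=True, fastmath=True)
def is_prime_numba(n):
    # Same trial-division check as is_prime, but compiled to machine code by numba.
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    i = 3
    while i * i <= n:
        if n % i == 0:
            return False
        i += 2
    return True

def run_numba():
    # Time the compiled version over the same list of numbers.
    start = time.perf_counter()
    results = [is_prime_numba(n) for n in NUMBERS]
    print("Time Numba", time.perf_counter() - start)
    return results
```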

Now to the results:

Time Base Case 14.205312000000001
Time with ProcessPoolExecutor 9.930437499999998
Numba time with compiling
Time Numba 1.3300917000000005
Numba time after compiling
Time Numba 0.9918250000000022

Indeed, my friends, compiled code is marvelous: we saved around 13 s per call. I still love Python though, especially now with numba, given how straightforward it is to add numba decorators to existing code without much modification.

Note that the first time we run the numba code it takes longer, because it is compiling; on subsequent runs the code is faster because it has already been compiled.

Now, if ProcessPoolExecutor benefits from local parallelization and numba benefits from compiled code, what if we mix them together?

Simple, we just have to test. All we have to do is copy our run_process_pool function again and modify the function being called.
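Roughly, the combined version (again a sketch, reusing is_prime_numba and NUMBERS from the snippets above) would be:

```python
import time
from concurrent.futures import ProcessPoolExecutor

def run_process_pool_numba():
    # Same as run_process_pool, but each worker process calls the numba-compiled function.
    start = time.perf_counter()
    with ProcessPoolExecutor() as executor:
        results = list(executor.map(is_prime_numba, NUMBERS))
    print("Time with ProcessPoolExecutor with numba", time.perf_counter() - start)
    return results
```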

Time Base Case 14.7415336
Time with ProcessPoolExecutor 10.2182718
Numba time with compiling
Time Numba 1.4119825000000006
Numba time after compiling
Time Numba 0.9829408000000015
Time with ProcessPoolExecutor with numba 2.9859480000000005

Well, that is disappointing: using ProcessPoolExecutor and numba together is slower than numba alone, but it is understandable. Our problem might be too small to overcome the overhead of the process pool, and my machine only has 2 cores, so the result on your machine can be really different.

So now let's test that hypothesis by increasing the number of values to check. We will add 100,000 more numbers, using the function below:

https://gist.github.com/DTKx/1807527329a5e22719d798a9ba862bc0
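The gist above has the actual code; the rough idea is a helper along these lines (the name and value range are my assumptions):

```python
import random

def add_more_numbers(numbers, extra=100000, low=10**6, high=10**9):
    # Hypothetical helper: extend the test list with `extra` random integers.
    return list(numbers) + [random.randint(low, high) for _ in range(extra)]
```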

Time Base Case 15.024390200000001
Time with ProcessPoolExecutor 10.4288567
Numba time with compiling
Time Numba 1.4126227
Numba time after compiling
Time Numba 1.2121131999999974
Time with ProcessPoolExecutor with numba 2.2224521000000017
Add more 100000 numbers
Time Base Case 20.258951299999996
Time with ProcessPoolExecutor 122.43605180000002
Time Numba 1.3219542999999874
Time with ProcessPoolExecutor with numba 227.59259859999997

Uoooou, that is a lot of overhead with a larger quantity of numbers. This could be related to the cost of our prime function versus the overhead of ProcessPoolExecutor, or the code could be sharing resources and requiring locks in this example. For more complicated functions it might still be worth it, and this leads us to the final tip.

TIP 3 Benchmark your code effectively with cProfile

As you saw, when talking about performance, different cases, no matter the framework used, can lead to different results. So one of the key aspects of performance work is knowing how to effectively benchmark your code and understand where the highest costs are, so you know where to focus your precious efforts.
For this we have cProfile from the standard Python library; let's see it in practice by profiling our run_process_pool_numba and run_numba with 100035 numbers:
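Profiling can be done from the command line (python -m cProfile my_script.py) or from code. A minimal sketch of the in-code approach, writing the stats to a text file (the file name and the top-10 cut are my choices), might look like this:

```python
import cProfile
import pstats

# Profile both candidates and dump readable stats to a text file.
with open("profile_results.txt", "w") as output:
    for func in (run_process_pool_numba, run_numba):
        profiler = cProfile.Profile()
        profiler.enable()
        func()                     # run the function we want to measure
        profiler.disable()
        stats = pstats.Stats(profiler, stream=output)
        stats.sort_stats("cumulative").print_stats(10)  # top 10 costliest calls
```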

The result, written to the output txt, was:

  • Profile run_process_pool_numba()
  • Profile run_numba()

* Our code using run_process_pool_numba spends most of its time (more than 90%) waiting for a lock to be released; please check the official documentation on Lock Objects.

This indicates that we can either test other structures or confidently select our pure numba approach.

I hope this overview provided new insights on how to speed up your code on the fly and saved you at least your 5 minutes of reading time :).

I really enjoyed the experience of writing my first post, so I would loove to hear from you. If you have any doubts, tips, comments (how did your code perform?), or improvements, or if this helped you, please comment below. You can also check the full code on my GitHub.

Next time we will use these tips in a Multi-Objective Genetic Algorithm, so if you are interested, please stay tuned.

Thank you, see ya!

Master's student in Computer Science, passionate about quantifying complex trade-offs.