Numpy Vectorization

As you may already know, I have been taking several AI / ML related courses. I am a firm believer in continuous learning. Some time ago I read a report about reading habits in the USA. The statistic that caught my attention was: 42% of college graduates never read another book after college. That seems quite disturbing to me. Another statistic is: 57% of new books are not read to completion. To me this indicates that readers are not committed to learning and / or that books are getting worse.

When I purchase a book, I write my name and the year of purchase on the first page. While reading, I underline and make notes. This is why I never borrow a book. I purchase the books I read. After I pick up a book, I might not be interested in every single chapter or section, but I do read them in their entirety.

In the past several months, I decided to sign up for online classes. The advantage is that you have to take graded quizzes and assignments, which seems to me like a good way to verify that one understands the subject matter. In many cases what I learn might not apply directly to what I am doing at work today, but you never know what the future might bring. In my Google email I have the following phrase:  “Luck is what happens when preparation meets opportunity”. Taking courses and reading (technical books and papers) falls under preparation. There is less that one can do regarding opportunity.

In this post I am going to talk about vectorization, a set of operations one can use to eliminate the need to write explicit loops in Python when using the Numpy library. As you might already know, Python is one of the most popular languages used by data scientists. Numpy is one of the most common libraries used for numerical analysis in Python.

What does vectorization do for you? It parallelizes operations. Most people have heard of multiple cores and multiple CPUs in a computer. In general, each core can perform a single instruction per cycle. By having multiple cores and multiple CPUs, one can execute programs that do a few things at the same time. For example, one can run a word processor and a spreadsheet at once. This does not imply that you can only run as many applications on your computer as the number of cores. You can have many more applications open, but at any given instant at most one application can be running on each core. Of course, the operating system (OS) tends to do a great job of giving us the perception of running dozens of applications at once when the system only has a single CPU with four cores.

When you are developing ML programs, you will run into large amounts of data that need to be processed as quickly and efficiently as possible. For example, if we wish to process an array of 1,000,000 entries, it would be nice to spawn a set of threads and assign each a portion of the array. This would work if the threads are guaranteed not to interfere with each other. If a thread needs to wait for others to be done, that requires a lot more coordination and coding. It turns out that there are some operations that can actually be performed in parallel without the developer having to write large amounts of complicated code.

According to Andrew Ng, vectorization is basically the art of getting rid of explicit for loops in your code. Andrew is the main instructor for the Coursera course I am currently taking: Neural Networks and Deep Learning.
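As a small illustration of what getting rid of an explicit for loop looks like, the sketch below computes w * x + b for every element of an array in two ways. The function names and the sample values are hypothetical and were chosen only for this illustration; they do not come from the course.

```python
import numpy as np

def scale_and_shift_loop(x, w, b):
    """Compute w * x[i] + b one element at a time with an explicit for loop."""
    out = np.zeros(len(x))
    for i in range(len(x)):
        out[i] = w * x[i] + b
    return out

def scale_and_shift_vectorized(x, w, b):
    """Same computation written as a single Numpy expression, with no Python loop."""
    return w * x + b

# Hypothetical sample data for the illustration.
x = np.array([1.0, 2.0, 3.0, 4.0])
looped = scale_and_shift_loop(x, 2.0, 0.5)
vectorized = scale_and_shift_vectorized(x, 2.0, 0.5)

print(looped)
print(vectorized)
print(np.allclose(looped, vectorized))
```

Both functions should return the same values; the vectorized one leaves the looping to Numpy.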

For an example of vectorization we could open a Jupyter Notebook. Before opening the notebook, I will open a new Chrome web browser window (you may open a window with your favorite browser). If you do not, Jupyter will open one for you or use an existing window. Since I always keep multiple windows open (but silent) with Gmail and Google Calendar, among others, I prefer to have my notebook in a dedicated window. In my case I am using Windows, so from a command prompt:

C:\> jupyter notebook
[I 08:34:59.204 NotebookApp] [nb_conda_kernels] enabled, 3 kernels found
[I 08:35:16.027 NotebookApp] [nb_anacondacloud] enabled
[I 08:35:16.085 NotebookApp] [nb_conda] enabled
[I 08:35:17.437 NotebookApp] \u2713 nbpresent HTML export ENABLED
[W 08:35:17.438 NotebookApp] \u2717 nbpresent PDF export DISABLED: No module named 'nbbrowserpdf'
[I 08:35:17.440 NotebookApp] Serving notebooks from local directory: C:\
[I 08:35:17.440 NotebookApp] The Jupyter Notebook is running at:
[I 08:35:17.440 NotebookApp] http://localhost:8888/?token=4933a79cb0922ca3e54e4601f817ee77f5e2551e9e745613
[I 08:35:17.441 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C 08:35:17.450 NotebookApp]

Once the notebook opens, click on New->Python 3 to start a new notebook.

Let’s import a couple libraries:

import numpy as np
import time

Now we will create a couple arrays with 1,000,000 random entries each. We will use them to perform the same calculation using two different approaches.

a = np.random.rand(1000000)
b = np.random.rand(1000000)

We will start by performing a dot product between arrays a and b using a for loop. We will time the operation. In addition, we will display the value to later compare the results:

c = 0
tic = time.time()
for i in range(1000000):
    c += a[i] * b[i]
toc = time.time()
print("c: " + str(c))
print("c.dtype: " + str(c.dtype))
print("not vectorized: " + str(1000 * (toc - tic)) + " ms")

The results follow:

c: 250072.8834083619
c.dtype: float64
not vectorized: 862.0200157165527 ms

Now the same operation but using vectorization:

c = 0
tic = time.time()
c = np.dot(a,b)
toc = time.time()
print("c: " + str(c))
print("c.dtype: " + str(c.dtype))
print("vectorized: " + str(1000 * (toc - tic)) + " ms")

The results follow:

c: 250072.88340835375
c.dtype: float64
vectorized: 3.999948501586914 ms

The times are approximate in the sense that you may run the same computations several times and the times will be comparable but not the same. That is the result of the operating system and the load on your computer.
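One way to reduce the effect of that variation is to repeat the measurement a few times and keep the fastest run. The following is only a sketch: the helper name best_time_ms and the choice of five repetitions are assumptions of mine, and it uses time.perf_counter() instead of the time.time() call used above because it is meant for measuring short intervals.

```python
import time
import numpy as np

def best_time_ms(func, repeats=5):
    """Run func several times and return the fastest wall-clock time in ms."""
    best = float("inf")
    for _ in range(repeats):
        tic = time.perf_counter()
        func()
        toc = time.perf_counter()
        best = min(best, 1000 * (toc - tic))
    return best

a = np.random.rand(1000000)
b = np.random.rand(1000000)

# Time only the vectorized dot product; the loop version could be timed the same way.
elapsed = best_time_ms(lambda: np.dot(a, b))
print("vectorized (best of 5): " + str(elapsed) + " ms")
```

The number you get will still depend on your machine, but it should move around less from run to run than a single measurement.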

Also note that the results are quite similar but not identical. This is due to the use of floating point numbers. I will not cover this topic in this post.
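Without going into the details, a common way to deal with such small differences is to compare floating point values with a tolerance instead of testing for exact equality. The sketch below uses the two values printed earlier; the variable names are mine. A likely reason for the difference is that the two approaches add the products in a different order, and floating point addition is sensitive to ordering.

```python
import numpy as np

c_loop = 250072.8834083619          # value printed by the for loop version
c_vectorized = 250072.88340835375   # value printed by the np.dot version

# Exact comparison fails because the last few digits differ.
print(c_loop == c_vectorized)

# np.isclose compares within a relative and absolute tolerance.
print(np.isclose(c_loop, c_vectorized))

# Size of the actual difference between the two results.
print(abs(c_loop - c_vectorized))
```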

You may have heard of GPUs as opposed to CPUs. A GPU is a Graphics Processing Unit. Nvidia is one of the largest manufacturers of GPUs in the world. I happen to own a Tesla card from Nvidia with about 1,000 cores in one of my Linux computers.

SIMD stands for Single Instruction Multiple Data: a single instruction is applied to several data elements at once. Vectorized calls such as the dot product in our example can take advantage of SIMD instructions on the CPU. Numpy does not require a GPU for this.

For a list of Numpy functions you may refer here. To see the actual list of functions, click on <index>. An alphabetically ordered list will display. I guess it would take an individual working exclusively with Python and Numpy years to master so many powerful calls.

In conclusion, if you want to write fast Python code, you should try to eliminate most for loops and replace them with vectorized Numpy calls.
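To give an idea of what such replacements look like beyond the dot product, here is a short sketch with a matrix-vector product and a few element-wise calls. The arrays and variable names are made up for the illustration; np.dot, np.exp, np.abs and np.maximum are existing Numpy functions.

```python
import numpy as np

# Hypothetical sample data.
v = np.array([1.0, -2.0, 3.0])
M = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, -1.0]])

# Loop version of the matrix-vector product of M and v.
u_loop = np.zeros(M.shape[0])
for i in range(M.shape[0]):
    for j in range(M.shape[1]):
        u_loop[i] += M[i][j] * v[j]

# Vectorized version of the same product.
u_vec = np.dot(M, v)

# Element-wise calls that replace a loop over the entries of v.
exp_v = np.exp(v)         # e raised to each entry
abs_v = np.abs(v)         # absolute value of each entry
max_v = np.maximum(v, 0)  # each entry, or 0 if the entry is negative

print(u_loop)
print(u_vec)
print(abs_v)
print(max_v)
```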

If you have comments or questions regarding this or any other post in this blog, please do not hesitate to leave me a note.

Keep on learning;


Follow me on Twitter:  @john_canessa
