# Numpy Vectorization – Revisited It is the last Sunday in January 2019 and is relatively cold in the Twin Cities of Minneapolis and St. Paul. The computerized mercury scale indicates -12F not taking into account wind. As usual, get up before 05:00 AM and get in my first 2 hour block of Deep Work. I am in the process of reviewing the last course I took on neural networks and deep learning.

Before I get to my notes on vectorization, yesterday, I was supposed to make some pizza from scratch for lunch. My wife and I came back from a walk at the Mall of America (MOA), prepared some flavored (mango) green tea, turned on the fireplace and sat down to sip tea and chat. Suddenly it was close to noon and no lunch in the horizon. It takes me about two hours to get from high gluten flour and several other ingredients to make two hot pizzas ready to be devoured. The dough takes about an hour to rise.

OK, decided to go shrimp pasta. The pasta came from a bag. Homemade pasta takes about an hour to prepare. My wife chopped some onions, put them in a pan with some olive oil, chopped some garlic and joined the onions with some spices and salt. When the onions started to caramelize, in went a bag of frozen shrimp and about a cup of white wine.

Meanwhile, on separate burner water, oil and salt was starting to boil in a pan. As soon as it started to boil, in went the pasta. In five minutes we were ready to eat our lunch. Besides getting the ingredients from the pantry and freezer, my wife did all the cooking. My help came in the form of an appetizer. I pulled some salami and sliced it. Got a bag with three different types of cheese, some crackers, and poured some white wine in a couple glasses. We had in the fridge an open bottle left from last weekend.

While we were fixing lunch and enjoying our appetizer, I noticed that the crackers came from England. The set of cheese came from Spain. The salami and the first bottle of wine were products of Italy. It is amazing how the economy works. It seems that most countries in the world import and export different products so we can all share them. During the lunch preparation we had to open a second bottle. I could not find the same type of wine so picked up a French one that I placed in the fridge last week as soon as we open the one that we started with.

OK, it is about 07:00 AM today, I have not finished this post and I am getting hungry for breakfast. For the last couple years, my wife and I just indulge on the same breakfast every day and a relatively late lunch. We completely skip dinner. Earlier this morning, I did some experimentation with the Jupyter notebook so I have all things ready to finish this post. Will break for breakfast and after a warm shower will get in a second block. When done my wife and I will go to the MOA for a walk. I believe my son made some lasagna and will stop for lunch after our walk… … I am back. While my wife gets ready I will try to finish this post. This post contains a newer post based on this one. I am trying to improve on the presentation of the code for my blog. The associated Jupyter notebook may be found in Github under:  JohnCanessa/Numpy-Vectorization

At the core is the feature called SIMD which stands for Single Instruction Multiple Data. There are implementations for CPUs (Central Processing Unit) and GPUs (Graphics Processing Unit). Most modern computers support SIMD.

An example of a SIMD function is the dot product implementation in Numpy. Following is a modified example using the Jupyter Notebook. A few months ago I would invoke the notebook directly from the command line. After the last Anaconda update, a set of tools that include the notebook have been collected and presented under Anaconda Navigator. I opened a Jupyter Notebook and named it vectorization.

The first cell follows:

```# **** import libraries we will need ****
import numpy as np
import time

# **** create, initialize with specific values, and print the contents of a numpy array ****
a = np.array([1, 2, 3, 4])
print(a)

# **** create, initialize with random values, and print the contents of a numpy array ****
b = np.random.rand(4)
print(b)
```

We need to import the Numpy and time libraries. Numpy will be used to select and execute a SIMD instruction and time to time it and later compare it against the same operation using a loop.  An array is also defined. This is done just to see how to define and populate a small array using Numpy. The results cell follows:

```[1 2 3 4]
[0.81080152 0.43937166 0.73893081 0.3828071 ]
```

Now let’s declare a couple numpy arrays. They will hold 1000000 entries each and will be initialized with random numbers. We will then display the shape of the arrays.

```# **** make the results repeat ****
np.random.seed(123)

# **** declare 2 large arrays holding random numbers ****
a = np.random.rand(1000000)
b = np.random.rand(1000000)

# **** display their shape ****
print("a.shape: " + str(a.shape))
print("b.shape: " + str(b.shape))
print()

# ***** print the first 5 entries of array a ****
for i in range(5):
print("a: " + str(a[i]))
print()

# **** print the first 5 entries of array b ****
for i in range (5):
print("b: " + str(b[i]))
```

The output follows:

```a.shape: (1000000,)
b.shape: (1000000,)

a: 0.6964691855978616
a: 0.28613933495037946
a: 0.2268514535642031
a: 0.5513147690828912
a: 0.7194689697855631

b: 0.7726979824010617
b: 0.46711505893558836
b: 0.019948827345841025
b: 0.9234407522220117
b: 0.179140987510242
```

The results illustrate that we have been successful at creating the two arrays. Each array contains a set of 1,000,000 random numbers in the range [0 : 1]

Let’s first try and time multiplying the two arrays in a loop. The Python code follows:

```# **** initialize c ****
c = 0;

# **** populate c with the element multiplication of a and b ****
tic = time.time()
for i in range(1000000):
c += a[i] * b[i]
toc = time.time()

# **** display the result of the element multiplication and the time the operation took ****
print("c: " + str(c))
print("non-vectorized time: " + str(1000 * (toc - tic)) + " ms")
timeNV = tic - toc
```

We write a loop which cycles between 0 and 999999. For each pass we just multiply the associated entries and add them to our accumulator c. Note that we initialize c to zero. This is done to make sure that the answers match. The result and the time the operation took are displayed.

```c: 250123.2279705272
non-vectorized time: 1063.934564590454 ms
```

In this pass, the output took about a second (1000 ms) to execute. Not bad for multiplying and adding 100000 floating point numbers. Please note that if you rerun the cell, the results will not change. This is based on the fact that the random numbers in arrays a and b are the same in each pass. This is due to the fact that we seeded the random number generator. Seeding it is a common approach to be able to reproduce the same results on each pass. You may want to comment the seed line in order to experiment with different values. The time it takes for the code to execute will vary even if you keep the same numbers. This is due to the fact that the operating system is running different unrelated tasks in the background. In general, the rule of thumb is to run an operation five times, discard the fastest and slowest values and average the residual three times.

Now let’s rewrite the code using a single numpy instruction.

```# **** clear c ****
c = 0

# **** populate c with the element multiplication of a and b (using SIMD) ****
tic = time.time()
c = np.dot(a,b)
toc = time.time()

# **** display the result of the element multiplication and the time the operation took ****
print("c: " + str(c))
print("vectorized time: " + str(1000 * (toc - tic)) + " ms")
timeV = tic - toc
print()
print("ratio: " + str(timeNV / timeV))
```

In this example we replaced the loop with a SIMD to calculate c.

```c: 250123.22797053587
vectorized time: 46.99587821960449 ms

ratio: 22.63889100271415
```

It seems like the vectorized implementation is faster than the non-vectorized. On my machine it seems that in average is about 21 times faster.

Now we will declare a two-dimensional array A of 2,000 by 3,000 and will populate it with random values. Next we will declare a 1-dimensional array or vector v with 3,000 values and will initialize it with random variables. We display the shapes of the two arrays. The reason for this is to pay attention to the dimensions when operating with arrays. Mistakes in code can easily be corrected when all the dimensions are as expected.

```# **** make the results repeat ****
np.random.seed(123)

# **** declare a 2-dimensional array and display its shape ****
A = np.random.rand(2000, 3000)
print("A.shape: " + str(A.shape))
print()

# **** declare a vector and display its shape ****
v = np.random.rand(3000)
print("v.shape: " + str(v.shape))
```

The associated results follow:

```A.shape: (2000, 3000)

v.shape: (3000,)
```

You must multiply A[2000, 3000] * v[3000, 1] = u[2000, 1]. Multiplying v[3000, 1] * A[2000, 3000] if incorrect because the inner dimensions do not match. In this case, we must make sure the proper indices are used for the different arrays.

```# **** declare a vector of 2,000 entries and initialize it with 0s ****
u = np.zeros((2000))

# **** using 2 loops multiply them (time operation) ****
tic = time.time()
for i in range(2000):
for j in range(3000):
u[i] += A[i][j] * v[j]

# **** sum all the values in u ****
d = 0;
for i in range (2000):
d += u[i];
toc = time.time()

# **** display the shape of u ****
print("u.shape: " + str(u.shape))

# **** display the result d and the time the operation took ****
print("d: " + str(d) + " time: " + str(1000 * (toc - tic)) + " ms")
timeNV = tic - toc
```

The results follows:

```u.shape: (2000,)
d: 1476056.2967874217 time: 10245.437622070312 ms
```

In the previous cell we used a total of 3 loops to get the desired result. Now let’s perform the same operations using SIMD.

```# **** initialize u with 0s ****
u = np.zeros((2000))

# **** multiply the vector (time the operation) ****
tic = time.time()
u = np.dot(A, v)

# **** sum all the values in u ****
d = np.sum(u)
toc = time.time()

# **** display the shape of u ****
print("u.shape: " + str(u.shape))

# **** display the result d and the time the operation took ****
print("d: " + str(d) + " time: " + str(1000 * (toc - tic)) + " ms")
timeV = tic - toc
print()
print("ratio: " + str(timeNV / timeV))
```

Note that the 3 loops are gone.

```u.shape: (2000,)
d: 1476056.2967874205 time: 77.99482345581055 ms

ratio: 131.36048224886437
```

The code is easier to follow. Most important the second set of operations ran 91 times faster. When writing code in Python for machine learning, it is important to make it as efficient as possible. That way you will be able to cycle much faster from idea, implementation and results. Remember that we are dealing with big data. Learning algorithms tend to perform better when you use large data sets. If code using loops takes an hour to execute, the SIMD version will take less than a minute. Keep in mind that we are using CPUs, not GPUs. If you ran your Python code assisted by GPUs it will run even faster.

Now we will declare more arrays and will perform some more operations on them.

```# **** x ****
n_x = (64 ** 2) * 3
m = 100;
x = np.random.rand(n_x, m)
print("x.shape: " + str(x.shape))
```

In this case we assume we will process a single RBG (Red Green Blue) image. Each image contains 64 x 64 pixels. Given that a pixel has three values (RGB), an image can be represented with a vector of 64 * 64 * 3 = 12,288 values. In this example we will assume we have 100 images that we will process.

Each image sample of 12,288 values will be in a column and all 100 columns will be stacked horizontally one after the other as illustrated by the following diagram:

```    |&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;  |

X = | x1, x2, ...xn|

|&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;  |
```
```w.shape: (12288, 1)
```

The results displayed on the last cell confirm that the representation in the code matches our intentions.

Lets now declare a vector w which will be populated with random numbers.

As we discussed previously if we wish to multiply A * w we have an issue because the inner dimensions do not match w[12288, 1] * X[12288, 100]. To address this we could use the transpose of w[1, 12288] * X[12288, 100] and obtain a result z[1, 100].

```# **** product of w x ****
z = np.dot(w.T, x)
print("z.shape: " + str(z.shape))
print("z: " + str(z))
```

The results for the operation follows:

```z.shape: (1, 100)
z: [[3085.87233026 3046.39114632 3069.22529654 3037.68336421 3058.33295367
3048.61512356 3055.40568348 3067.77208362 3071.0288812  3040.64311368
3015.85687    3060.64383783 3051.68352211 3060.84942322 3070.33082398
3055.49115851 3051.13846139 3071.20439841 3057.68624992 3059.55023719
3098.15648599 3070.49008807 3077.06024315 3075.02658533 3050.35050974
3049.53968612 3097.61192622 3061.58056677 3037.14022029 3068.56109233
3054.7273441  3084.97632227 3048.57225448 3072.07366812 3033.4120418
3030.43838818 3093.95456072 3060.19723373 3063.7943765  3045.72939758
3048.2201513  3073.42435714 3062.0780248  3065.79595085 3067.03146764
3052.12090204 3048.51231941 3083.62884382 3081.75290352 3088.28332183
3064.12340584 3076.68007496 3053.58572801 3040.38150758 3054.20743184
3089.91710773 3032.23661951 3029.67286539 3047.58913599 3071.77737076
3066.56517946 3094.31748563 3045.95454068 3069.01497062 3077.03057345
3088.0165695  3065.47594911 3072.5629795  3045.42227672 3037.85169285
3101.0988355  3053.80059894 3059.63561919 3035.76101002 3041.47629427
3019.10585638 3041.1160453  3081.60127841 3041.2010014  3069.25565361
3087.3403284  3050.12226209 3057.99023889 3062.84120567 3072.84679536
3043.83350924 3043.44208909 3080.75478528 3051.54419506 3081.11957024
3051.50677247 3057.38595603 3056.20378222 3059.67877424 3077.85082191
3090.82622145 3061.30740209 3075.42885329 3044.04411145 3046.31623274]]
```

As you can see, the shape of the results is as expected.

For a final operation we will declare a random value b and will add it to each element in the z array. This operation can be performed as follows:

```# **** add b ****
b = np.random.rand()
print("b: " + str(b))
z += b
print("z: " + str(z))
```

The results follow:

```b: 0.3964574632024054
z: [[3086.26878772 3046.78760378 3069.62175401 3038.07982168 3058.72941113
3049.01158103 3055.80214094 3068.16854108 3071.42533867 3041.03957114
3016.25332746 3061.0402953  3052.07997957 3061.24588068 3070.72728144
3055.88761598 3051.53491886 3071.60085588 3058.08270739 3059.94669465
3098.55294345 3070.88654553 3077.45670061 3075.42304279 3050.7469672
3049.93614359 3098.00838368 3061.97702424 3037.53667775 3068.95754979
3055.12380156 3085.37277974 3048.96871194 3072.47012558 3033.80849927
3030.83484565 3094.35101818 3060.59369119 3064.19083396 3046.12585504
3048.61660876 3073.8208146  3062.47448227 3066.19240832 3067.4279251
3052.5173595  3048.90877687 3084.02530128 3082.14936099 3088.67977929
3064.5198633  3077.07653242 3053.98218547 3040.77796504 3054.6038893
3090.31356519 3032.63307697 3030.06932285 3047.98559345 3072.17382823
3066.96163692 3094.71394309 3046.35099815 3069.41142809 3077.42703091
3088.41302696 3065.87240658 3072.95943696 3045.81873418 3038.24815031
3101.49529296 3054.19705641 3060.03207666 3036.15746748 3041.87275174
3019.50231384 3041.51250277 3081.99773588 3041.59745887 3069.65211107
3087.73678586 3050.51871955 3058.38669635 3063.23766314 3073.24325283
3044.2299667  3043.83854655 3081.15124274 3051.94065252 3081.5160277
3051.90322993 3057.78241349 3056.60023969 3060.07523171 3078.24727937
3091.22267891 3061.70385956 3075.82531076 3044.44056891 3046.7126902 ]]
```

It seems we added b to each value in z but we did not use a vector to represent b. What happened is that Python implies that what you want is to add b to each element in z and “broadcasts” b so each element in z gets updated z += b. To learn more about broadcasting you may read about the topic here.

The last example computed the following pseudo code:

Z = transpose(w) * X + b

The last line is not written in Python; it is just pseudo code. This represents the forward propagation for logistic regression. There is a backward propagation pass which I will cover in a future post. Logistic regression, once properly trained, allows us to determine if an image contains (1) or does not contain (0) an object. In the course I took, the idea is to determine if a random picture contains or does not contain a cat.

You can find the Jupyter notebook in Github:  JohnCanessa/Numpy-Vectorization

My suggestion is to experiment with the code and make sure you understand what is happening. The ideas seem to be simple, but the implementation is open to encounter issues.

If you are interested in ML, I would recommend the Coursera course “Neural Networks and Deep Learning” by Andrew Ng.

Keep on learning;

John