### Beyond vectorization: writing efficient numpy code

Numpy is a great library for expressing computations over vectors, matrices, and multi-dimensional arrays in python due to it’s simplicity and efficiency. In a previous blog post, I discussed fairly well-known techniques for speeding up (and cleaning up) numpy code by avoiding loops and exploiting problem structure. I showed that when you use the suggestions in that post, you can speed up your numpy code by orders of magnitude. In this blog post, I’ll show you how to get even more speed out of your numpy code on top of the improvements you get from vectorization. As a simple example to illustrate an inefficiency of numpy, consider computations of the form z = 0.2*x + 0.8*y where x and y are large numpy arrays. This computation will form intermediate arrays 0.2*x and 0.8*y, so the memory overhead can be problematic and slow down the computation. On my laptop, it takes about 1 second for arrays of size 100 million:
import numpy as np

def foo(x,y):
return 0.2*x + 0.8*y

In [1]: x = np.random.rand(10**8)
In [2]: y = np.random.rand(10**8)
In [3]: %timeit z = foo(x,y)
1.03 s ± 107 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Of course, with a for loop we can avoid these intermediate arrays but the computation will be even slower because the work is not offloaded to C. Unfortunately, numpy doesn’t provide a nice solution to this problem. One pure numpy approach is to write the following code:
def bar(x,y):
x *= 0.2
y *= 0.8
x += y
return x

In [1]: x = np.random.rand(10**8)
In [2]: y = np.random.rand(10**8)
In [3]: %timeit z = bar(x,y)
288 ms ± 4.94 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The above code doesn’t create any intermediate arrays, so it is faster than before, but it still requires 3 passes over the data. Furthermore, x and y are modified in place which could cause problems downstream. Another idea is to write a for loop in cython, which avoids both of these problem, or even better — use the numexpr library, which was specifically designed to solve these types of problems:
import numexpr as ne

def baz(x, y):
return ne.evaluate(‘0.2*x + 0.8*y’)

In [1]: x = np.random.rand(10**8)
In [2]: y = np.random.rand(10**8)
In [3]: %timeit z = baz(x,y)
287 ms ± 8.01 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

This gives a 4X speedup over the original implementation and doesn’t modify anything in place. If we pre-allocate the output vector we can speed it up by another factor of 2, as follows:
import numexpr as ne

def baz(x, y, out=None):
return ne.evaluate(‘0.2*x + 0.8*y’, out=out)

In [1]: x = np.random.rand(10**8)
In [2]: y = np.random.rand(10**8)
In [3]: z = np.empty(10**8)
In [4]: %timeit z = baz(x,y, out=z)
156 ms ± 944 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

At the end of the day, we can speed up our numpy code by a factor of 8 just by doing in place operations and avoiding intermediate arrays. It would be nice if numpy natively supported this type of functionality without relying on an external library, but for performance-critical applications you should definitely consider using numexpr.