My favorites | Sign in
Project Logo
                
Search
for
Updated Dec 22, 2009 by faltet
Labels: Featured
Overview  

Description

The numexpr package supplies routines for the fast evaluation of array expressions elementwise by using a vector-based virtual machine. It's comparable to SciPy's weave package, but doesn't require a separate compile step of C or C++ code.

Using it is simple:

>>> import numpy as np
>>> import numexpr as ne
>>> a = np.arange(10)
>>> b = np.arange(0, 20, 2)
>>> c = ne.evaluate("2*a+3*b")
>>> c
array([ 0,  8, 16, 24, 32, 40, 48, 56, 64, 72])

Building

Numexpr requires Python 2.4 or greater, and NumPy 1.2 or greater. It's built in the standard Python way:

$ python setup.py build
$ python setup.py install

You can test numexpr with:

$ python -c "import numexpr; numexpr.test()"

Enabling Intel's VML support

Starting from release 1.2 on, numexpr includes support for Intel's VML library. This allows for better performance on Intel architectures, mainly when evaluating transcendental functions (trigonometrical, exponential...). It also enables numexpr using several CPU cores.

If you have Intel's MKL (the library that embeds VML), just copy the site.cfg.example that comes in the distribution to site.cfg and edit the latter giving proper directions on how to find your MKL libraries in your system. After doing this, you can proceed with the usual building instructions listed above. Pay attention to the messages during the building process in order to know whether MKL has been detected or not. Finally, you can check the speed-ups on your machine by running the bench/vml_timing.py script (you can play with different parameters to the set_vml_accuracy_mode() and set_vml_num_threads() functions in the script so as to see how it would affect performance).

Usage Notes

Numexpr's principal routine is:

evaluate(ex, local_dict=None, global_dict=None, **kwargs)

where ex is a string forming an expression, like "2*a+3*b". The values for a and b will by default be taken from the calling function's frame (through the use of sys._getframe()). Alternatively, they can be specified using the local_dict or global_dict arguments, or passed as keyword arguments.

Expressions are cached, so reuse is fast. Arrays or scalars are allowed for the variables, which must be of type 8-bit boolean (bool), 32-bit signed integer (int), 64-bit signed integer (long), double-precision floating point number (float), 2x64-bit, double-precision complex number (complex) or raw string of bytes (str). If they are not in the previous set of types, they will be properly upcasted for internal use (the result will be affected as well). The arrays must all be the same size.

Datatypes supported internally

Numexpr operates internally only with the following types:

If the arrays in the expression does not match any of these types, they will be upcasted to one of the above types (following the usual type inference rules, see below). Have this in mind when doing estimations about the memory consumption during the computation of your expressions.

Also, the types in Numexpr conditions are somewhat stricter than those of Python. For instance, the only valid constants for booleans are True and False, and they are never automatically cast to integers.

Casting rules

Casting rules in Numexpr follow closely those of NumPy. However, for implementation reasons, there are some known exceptions to this rule, namely:

Supported Operators

Numexpr supports the set of operators listed below:

Supported Functions

The next are the current supported set:

Notes

More functions can be added if you need them.

General Routines

Intel's VML Specific Support Routines

When compiled with Intel's VML (Vector Math Library), you will be able to use some additional functions for controlling its use. These are:

The mode parameter can take the values:
- 'low': Equivalent to VML_LA - low accuracy VML functions are called
- 'high': Equivalent to VML_HA - high accuracy VML functions are called
- 'fast': Equivalent to VML_EP - enhanced performance VML functions are called
It returns the previous mode.
This call is equivalent to the vmlSetMode() in the VML library. See:
http://www.intel.com/software/products/mkl/docs/webhelp/vml/vml_DataTypesAccuracyModes.html
for more info on the accuracy modes.
  • set_vml_num_threads(nthreads): Suggests a maximum number of
  • threads to be used in VML operations.
This function is equivalent to the call mkl_domain_set_num_threads(nthreads, MKL_VML) in the MKL library. See:
http://www.intel.com/software/products/mkl/docs/webhelp/support/functn_mkl_domain_set_num_threads.html
for more info about it.
  • get_vml_version(): Get the VML/MKL library version.

How It Works

The string passed to evaluate is compiled into an object representing the expression and types of the arrays used by the function numexpr.

The expression is first compiled using Python's compile function (this means that the expressions have to be valid Python expressions). From this, the variable names can be taken. The expression is then evaluated using instances of a special object that keep track of what is being done to them, and which builds up the parse tree of the expression.

This parse tree is then compiled to a bytecode program, which describes how to perform the operation element-wise. The virtual machine uses "vector registers": each register is many elements wide (by default, the first pass uses 256 elements). The key to numexpr's speed is handling chunks of elements at a time.

There are two extremes to evaluating an expression elementwise. You can do each operation as arrays, returning temporary arrays. This is what you do when you use NumPy: 2*a+3*b uses three temporary arrays as large as a or b. This strategy wastes memory (a problem if your arrays are large), and also is not a good use of cache memory: for large arrays, the results of 2*a and 3*b won't be in cache when you do the add.

The other extreme is to loop over each element, as in

for i in xrange(len(a)):
    c[i] = 2*a[i] + 3*b[i]

This doesn't consume extra memory, and is good for the cache, but, if the expression is not compiled to machine code, you will have a big case statement (or a bunch of if's) inside the loop, which adds a large overhead for each element, and will hurt the branch-prediction used on the CPU.

numexpr uses a in-between approach. Arrays are handled as chunks (the first pass uses 256 elements) at a time, using a register machine. As Python code, it looks something like this:

for i in xrange(0, len(a), 256):
   r0 = a[i:i+128]
   r1 = b[i:i+128]
   multiply(r0, 2, r2)
   multiply(r1, 3, r3)
   add(r2, r3, r2)
   c[i:i+128] = r2

(remember that the 3-arg form stores the result in the third argument, instead of allocating a new array). This achieves a good balance between cache and branch-prediction. And the virtual machine is written entirely in C, which makes it faster than the Python above.

There is some more information and history at http://isobaric.blogspot.com/2006/03/numerical-expression-evaluator-now-in.html.

Expected Performance

The range of speed-ups for numexpr respect to NumPy can vary from 0.95x and 20x, being 2x, 3x or 4x typical values, depending on the complexity of the expression and the internal optimization of the operators used. The strided and unaligned case has been optimized too, so if the expression contains such arrays, the speed-up can increase significantly. Of course, you will need to operate with large arrays (typically larger than the cache size of your CPU) to see these improvements in performance.

Here there are some real timings. For the contiguous case:

In [1]: import numpy as np

In [2]: import numexpr as ne

In [3]: a = np.random.rand(1e6)

In [4]: b = np.random.rand(1e6)

In [5]: timeit 2*a + 3*b
10 loops, best of 3: 18.9 ms per loop

In [6]: timeit ne.evaluate("2*a + 3*b")
100 loops, best of 3: 5.83 ms per loop   # 3.2x: medium speed-up (simple expr)

In [7]: timeit 2*a + b**10
10 loops, best of 3: 158 ms per loop

In [8]: timeit ne.evaluate("2*a + b**10")
100 loops, best of 3: 7.59 ms per loop   # 20x: large speed-up due to optimised pow()

For unaligned arrays, the speed-ups can be even larger:

In [9]: a = np.empty(1e6, dtype="b1,f8")['f1']

In [10]: b = np.empty(1e6, dtype="b1,f8")['f1']

In [11]: a.flags.aligned, b.flags.aligned
Out[11]: (False, False)

In [12]: a[:] = np.random.rand(len(a))

In [13]: b[:] = np.random.rand(len(b))

In [14]: timeit 2*a + 3*b
10 loops, best of 3: 29.5 ms per loop

In [15]: timeit ne.evaluate("2*a + 3*b")
100 loops, best of 3: 7.46 ms per loop   # ~ 4x speed-up

Credits

Numexpr was initially written by David Cooke, and extended to more types by Tim Hochberg. Francesc Alted contributed support for boolean types, single precision floating point types and for efficient strided and unaligned array operations. Ivan Vilata contributed support for strings. Gregor Thalhammer implemented the support for Intel VML (Vector Math Library).

License

Numexpr is distributed under the MIT license.


Sign in to add a comment
Hosted by Google Code