Benchmarks  
Description of the benchmarks used for Unladen Swallow.
Updated Jan 29, 2010 by collinw

Unladen Swallow Benchmarks

The Unladen Swallow benchmark suite is kept in the tests/ directory (note that this is in the /svn/tests/ tree, parallel to trunk). tests/perf.py is the main interface to the tests, with the individual benchmarks stored under the tests/performance/ directory.

To check out the latest version of the benchmarks:

svn checkout http://unladen-swallow.googlecode.com/svn/tests unladen-bmarks

Example perf.py command:

python2.5 unladen-bmarks/perf.py -r --benchmarks=2to3,django control/python experiment/python

This will run the 2to3 and Django template benchmarks in rigorous mode (lots of iterations), taking control/python as the baseline and experiment/python as the binary you've been mucking around with. perf.py will take care of comparing the performance and running statistics on the result to determine statistical significance.
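
This page does not describe the statistics perf.py runs internally. As a rough illustration of what "statistical significance" means when comparing two sets of timings, here is a minimal sketch based on Welch's t statistic. The function names and the 2.0 cutoff are assumptions made for this example, not perf.py's code, and the sketch uses the statistics module from modern Python 3 rather than the Python 2.5 shown above.

```python
import math
import statistics


def welch_t(control, experiment):
    """Welch's t statistic for two independent samples of run times.

    Each sample needs at least two timings so a variance can be computed.
    """
    mean_diff = statistics.mean(control) - statistics.mean(experiment)
    std_err = math.sqrt(
        statistics.variance(control) / len(control)
        + statistics.variance(experiment) / len(experiment)
    )
    if std_err == 0.0:
        # No spread in either sample: avoid dividing by zero.
        return 0.0 if mean_diff == 0.0 else math.copysign(math.inf, mean_diff)
    return mean_diff / std_err


def compare(control, experiment, critical_t=2.0):
    """Return (speed ratio, significant?) for control vs. experiment timings.

    critical_t=2.0 is a rough cutoff assumed for illustration; a real test
    would look up the critical value for the samples' degrees of freedom.
    """
    ratio = statistics.mean(control) / statistics.mean(experiment)
    significant = abs(welch_t(control, experiment)) > critical_t
    return ratio, significant
```

A ratio above 1.0 means the experiment binary ran faster than the control. The second value says whether the difference is large compared with the run-to-run noise, which is why rigorous mode's extra iterations matter.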

Quick-Start Guide

Not all benchmarks are created equal: some of the benchmarks listed below are more useful than others. If you're interested in overall system performance, the best guide is this:

python unladen-bmarks/perf.py -r -b default control/python experiment/python

That will run the benchmarks we consider the most important headline indicators of performance.

There's an additional collection of whole-app benchmarks that are important, but take longer to run:

python unladen-bmarks/perf.py -r -b apps control/python experiment/python

Benchmarks

  • 2to3 - have the 2to3 tool translate itself.
  • calls - collection of function and method call microbenchmarks:
    • call_simple - positional arguments-only function calls.
    • call_method - positional arguments-only method calls.
    • call_method_slots - method calls on classes that use __slots__.
    • call_method_unknown - method calls where the receiver cannot be predicted.
  • django - use the Django template system to build a 150x150-cell HTML table.
  • float - artificial, floating point-heavy benchmark originally used by Factor.
  • html5lib - parse the HTML 5 spec using html5lib.
  • html5lib_warmup - like html5lib, but gives the JIT a chance to warm up by doing the iterations in the same process.
  • nbody - the N-body Shootout benchmark. Microbenchmark for floating point operations.
  • nqueens - small solver for the N-Queens problem.
  • pickle - use the cPickle module to pickle a variety of datasets.
  • pickle_dict - microbenchmark; use the cPickle module to pickle a lot of dicts.
  • pickle_list - microbenchmark; use the cPickle module to pickle a lot of lists.
  • pybench - run the standard Python PyBench benchmark suite. This is considered an unreliable, unrepresentative benchmark; do not base decisions on it. It is included only for completeness.
  • regex - collection of regex benchmarks:
    • regex_compile - stress the performance of Python's regex compiler, rather than the regex execution speed.
    • regex_effbot - some of the original benchmarks used to tune mainline Python's current regex engine.
    • regex_v8 - Python port of V8's regex benchmark.
  • richards - the classic Richards benchmark.
  • rietveld - macrobenchmark for Django using the Rietveld code review app.
  • slowpickle - use the pure-Python pickle module to pickle a variety of datasets.
  • slowspitfire - use the Spitfire template system to build a 1000x1000-cell HTML table. Unlike the spitfire benchmark listed below, slowspitfire does not use Psyco.
  • slowunpickle - use the pure-Python pickle module to unpickle a variety of datasets.
  • spitfire - use the Spitfire template system to build a 1000x1000-cell HTML table, taking advantage of Psyco for acceleration.
  • spambayes - run a canned mailbox through a SpamBayes ham/spam classifier.
  • startup - collection of microbenchmarks focused on Python interpreter start-up time:
    • bzr_startup - get Bazaar's help screen.
    • hg_startup - get Mercurial's help screen.
    • normal_startup - start Python, then exit immediately.
    • startup_nosite - start Python with the -S option, then exit immediately.
  • threading - collection of microbenchmarks for Python's threading support. These benchmarks come in pairs: an iterative version (iterative_foo), and a multithreaded version (threaded_foo).
    • threaded_count, iterative_count - spin in a while loop, counting down from a large number.
  • unpack_sequence - microbenchmark for unpacking lists and tuples.
  • unpickle - use the cPickle module to unpickle a variety of datasets.

Benchmark Groups

We have grouped the above benchmarks into a number of categories. These categories are called "benchmark groups" in perf.py, and are runnable just like the individual benchmarks; running a benchmark group will run all benchmarks in that group.

Groups:

  • apps: 2to3, html5lib, rietveld, spambayes
  • calls: call_simple, call_method, call_method_slots, call_method_unknown
  • cpickle: pickle, unpickle
  • default: 2to3, django, nbody, slowspitfire, slowpickle, slowunpickle, spambayes
  • math: float, nbody
  • regex: regex_compile, regex_effbot, regex_v8
  • startup: bzr_startup, hg_startup, normal_startup, startup_nosite
  • threading: iterative_count, threaded_count

The default benchmark group is the main group we use to assess pure-Python application performance. Other groups are more specialized. Use the group most appropriate to your optimization, but always check for an impact on the default group.
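
As a minimal sketch of how a group name can stand in for its members, the mapping below copies the groups listed above into a dictionary and expands a mixed list of group and benchmark names. The names BENCH_GROUPS and expand_benchmarks are hypothetical and chosen for this example; perf.py's own implementation may differ.

```python
# Group membership copied from the list above.
BENCH_GROUPS = {
    "apps": ["2to3", "html5lib", "rietveld", "spambayes"],
    "calls": ["call_simple", "call_method", "call_method_slots",
              "call_method_unknown"],
    "cpickle": ["pickle", "unpickle"],
    "default": ["2to3", "django", "nbody", "slowspitfire", "slowpickle",
                "slowunpickle", "spambayes"],
    "math": ["float", "nbody"],
    "regex": ["regex_compile", "regex_effbot", "regex_v8"],
    "startup": ["bzr_startup", "hg_startup", "normal_startup",
                "startup_nosite"],
    "threading": ["iterative_count", "threaded_count"],
}


def expand_benchmarks(names):
    """Expand group names into individual benchmarks.

    Names that are not groups are kept as-is. Order is preserved and
    duplicates are dropped, so a benchmark runs only once.
    """
    expanded = []
    for name in names:
        for bench in BENCH_GROUPS.get(name, [name]):
            if bench not in expanded:
                expanded.append(bench)
    return expanded
```

Under this sketch, asking for the math group together with nbody and django yields float, nbody and django, with nbody listed only once.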

Memory benchmarking

perf.py supports a --track_memory option that continuously samples the benchmark process's memory usage throughout the process's lifetime. It then compares the maximum memory usage for the control and experiment Python binaries, and gives the user a link to a graph of memory usage over time.

In that graph, the Y axis is memory usage in kilobytes and the X axis corresponds to time.
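
The peak comparison described above can be sketched as follows, assuming the memory samples have already been collected as lists of kilobyte readings. The function name and return shape are hypothetical, and the sampling itself is platform-specific and not shown.

```python
def compare_peak_memory(control_samples_kb, experiment_samples_kb):
    """Compare the maximum memory usage seen in two sample series.

    Returns (control peak, experiment peak, relative change), where a
    relative change of 0.25 means the experiment's peak was 25% higher.
    """
    control_peak = max(control_samples_kb)
    experiment_peak = max(experiment_samples_kb)
    change = (experiment_peak - control_peak) / control_peak
    return control_peak, experiment_peak, change
```

Only the peaks are compared here, so a binary that uses more memory on average but has the same maximum would look unchanged. The graph over time covers that case.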

Benchmarks we don't use

We do not include PyBench, PyStone or Richards in our default benchmark suite. PyStone and Richards are synthetic benchmarks whose results may or may not translate into improved performance for real-world applications. We would like to avoid basing decisions on PyStone or Richards, only to find out that a real application sees no benefit or, worse, is slowed down. Both benchmarks have a long history and have gone through many translations. PyStone was originally written in Ada, then translated to C, then to Python; it does not represent idiomatic Python code or its performance hot spots. Richards was originally written in BCPL, then translated to Smalltalk, then to C++, then to Java and finally to Python. It does a little better at testing OO performance, but it doesn't involve string processing at all, something that many Python applications rely on heavily, and it is not idiomatic Python code either.

While PyBench may be an acceptable collection of microbenchmarks, it is not a reliable or precise benchmark. We have observed swings of up to 10% between runs on unloaded machines using the same version of Python; we would like to detect performance differences of 1% accurately. For us, the final nail in PyBench's coffin came while experimenting with gcc's feedback-directed optimization tools: we were able to produce a universal 15% performance increase across our macrobenchmarks, yet with the same training workload PyBench got 10% slower. For this reason, we do not factor PyBench results into our decision-making.

Beyond these benchmarks, there are also a variety of workloads we're explicitly not interested in benchmarking. Unladen Swallow is focused on improving the performance of pure Python code, so the performance of extension modules like numpy is uninteresting, since numpy's core routines are implemented in C. Similarly, workloads that involve a lot of IO, like GUIs, databases or socket-heavy apps, would, we feel, be inappropriate. That said, there's certainly room to improve the performance of C-language extension modules in the standard library; we've done this for cPickle and will do this for re. The performance of non-standard extension modules, though, is less interesting.

Comment by showel...@yahoo.com, Jan 29, 2010

It might be good to say "/path/to/control/python" and "/path/to/experimental/python" in the examples, to make it a little more clear that those parameters need to be accurate. I got bitten on this gotcha, and the error you get when you do this is kind of a red herring. Once I realized it, I felt kind of stupid, but I saw on the mailing list that I'm not the only one who made this mistake.

Once I got over that, it was nothing but glorious goodness. What a great tool, and I like how it automatically posts the graphs.

Comment by antoine....@gmail.com, Sep 6, 2010

It seems the SVN instructions should be removed and replaced with:

hg clone http://hg.python.org/benchmarks

no?

