Benchmarks
Description of the benchmarks used for Unladen Swallow.
Unladen Swallow Benchmarks

The Unladen Swallow benchmark suite is kept in the tests/ directory (note that this is in the /svn/tests/ tree, parallel to trunk). tests/perf.py is the main interface to the tests, with the individual benchmarks stored under the tests/performance/ directory.

To check out the latest version of the benchmarks:

    svn checkout http://unladen-swallow.googlecode.com/svn/tests unladen-bmarks

Example perf.py command:

    python2.5 unladen-bmarks/perf.py -r --benchmarks=2to3,django control/python experiment/python

This will run the 2to3 and Django template benchmarks in rigorous mode (lots of iterations), taking control/python as the baseline and experiment/python as the binary you've been mucking around with. perf.py will take care of comparing the performance and running statistics on the result to determine statistical significance.

Quick-Start Guide

Not all benchmarks are created equal: some of the benchmarks listed below are more useful than others. If you're interested in overall system performance, the best guide is this:

    python unladen-bmarks/perf.py -r -b default control/python experiment/python

That will run the benchmarks we consider the most important headline indicators of performance. There's an additional collection of whole-app benchmarks that are important, but take longer to run:

    python unladen-bmarks/perf.py -r -b apps control/python experiment/python

Benchmarks
Benchmark Groups

We have grouped the above benchmarks into a number of categories. These categories are called "benchmark groups" in perf.py, and are runnable just like the individual benchmarks; running a benchmark group will run all benchmarks in that group. Groups:
The default benchmark group is the main group we use to assess pure-Python application performance. Other groups are more specialized. Use the group most appropriate to your optimization, but always check for an impact on the default group.

Memory benchmarking

perf.py supports a --track_memory option that will continuously sample the benchmark process's memory usage throughout the process's lifetime. It will then compare the maximum memory usage for the control and experiment Python binaries, and will give the user a link to follow to see memory usage over time. Example graph:
The Y axis is memory usage in kilobytes; the X axis corresponds to time.

Benchmarks we don't use

We do not include PyBench, PyStone or Richards in our default benchmark suite.

PyStone and Richards are synthetic benchmarks that may or may not translate into improved performance for real-world applications. We would like to avoid basing decisions on PyStone or Richards, only to find out that a real application sees no benefit or, worse, is slowed down. Both benchmarks have a long history and have gone through many translations. PyStone was originally written in Ada, then translated to C, then translated to Python; it does not represent idiomatic Python code or its performance hot spots. Richards was originally written in BCPL, then translated to Smalltalk, then to C++, then to Java and finally to Python; it does a little better at testing OO performance, but doesn't involve string processing at all, something that many Python applications rely on heavily. It is also not idiomatic Python code.

While PyBench may be an acceptable collection of microbenchmarks, it is not a reliable or precise benchmark. We have observed swings of up to 10% between runs on unloaded machines using the same version of Python; we would like to detect performance differences of 1% accurately. For us, the final nail in PyBench's coffin came while experimenting with gcc's feedback-directed optimization tools: we were able to produce a universal 15% performance increase across our macrobenchmarks, but with the same training workload, PyBench got 10% slower. For this reason, we do not factor PyBench results into our decision-making.

Beyond these benchmarks, there are also a variety of workloads we're explicitly not interested in benchmarking. Unladen Swallow is focused on improving the performance of pure Python code, so the performance of extension modules like numpy is uninteresting, since numpy's core routines are implemented in C.
Similarly, workloads that involve a lot of IO, like GUIs, databases or socket-heavy apps, would, we feel, be inappropriate. That said, there's certainly room to improve the performance of C-language extension modules in the standard library; we've done this for cPickle and will do this for re. The performance of non-standard extension modules, though, is less interesting.
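The point made in the PyBench discussion above, that 10% run-to-run swings make a 1% difference hard to detect, can be illustrated with a small sketch. This is not perf.py's actual code: the function names, the sample timings and the rough critical value of 2.0 are all assumptions made for illustration, and Welch's t statistic is used here simply as one common way to compare two sets of timings.

```python
import math
import statistics


def welch_t(control, experiment):
    """Return (t statistic, relative change) for two lists of timings.

    Hypothetical helper for illustration; not taken from perf.py.
    """
    mean_c = statistics.mean(control)
    mean_e = statistics.mean(experiment)
    # Standard error of the difference between the two sample means.
    std_err = math.sqrt(
        statistics.variance(control) / len(control)
        + statistics.variance(experiment) / len(experiment)
    )
    if std_err == 0:
        # No spread at all: any difference in means is unambiguous.
        t = 0.0 if mean_c == mean_e else math.inf
    else:
        t = (mean_c - mean_e) / std_err
    relative_change = (mean_e - mean_c) / mean_c
    return t, relative_change


def looks_significant(control, experiment, critical_t=2.0):
    """Crude check: is |t| above an assumed critical value?

    2.0 is only a rough stand-in for a 95% threshold; a real tool
    would look up the value for the actual degrees of freedom.
    """
    t, _ = welch_t(control, experiment)
    return abs(t) > critical_t


# Made-up timings (seconds). Both pairs differ by 1% on average,
# but the second pair swings by about 5% from run to run.
quiet_control = [1.000, 1.001, 0.999, 1.000, 1.001, 0.999]
quiet_experiment = [0.990, 0.991, 0.989, 0.990, 0.991, 0.989]
noisy_control = [1.00, 1.05, 0.95, 1.03, 0.97, 1.00]
noisy_experiment = [0.99, 1.04, 0.94, 1.02, 0.96, 0.99]

if __name__ == "__main__":
    print(looks_significant(quiet_control, quiet_experiment))
    print(looks_significant(noisy_control, noisy_experiment))
```

With the quiet timings the 1% improvement stands out clearly from the noise; with the noisy timings the same 1% improvement is indistinguishable from run-to-run variation, which is why a benchmark with large swings is of little use for detecting small changes.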
It might be good to say "/path/to/control/python" and "/path/to/experimental/python" in the examples, to make it a little more clear that those parameters need to be accurate. I got bitten on this gotcha, and the error you get when you do this is kind of a red herring. Once I realized it, I felt kind of stupid, but I saw on the mailing list that I'm not the only one who made this mistake.
Once I got over that, it was nothing but glorious goodness. What a great tool, and I like how it automatically posts the graphs.
It seems the SVN instructions should be removed and replaced with:
no?