|
ProjectPlan
Plans for optimizing Python
NB: all cited papers are linked on RelevantPapers. A version of this document is also available in Chinese, though we can't vouch for the translation. GoalsWe want to make Python faster, but we also want to make it easy for large, well-established applications to switch to Unladen Swallow.
OverviewIn order to achieve our combination of performance and compatibility goals, we opt to modify CPython, rather than start our own implementation from scratch. In particular, we opt to start working on CPython 2.6.1: Python 2.6 nestles nicely between 2.4/2.5 (which most interesting applications are using) and 3.x (which is the eventual future). Starting from a CPython release allows us to avoid reimplementing a wealth of built-in functions, objects and standard library modules, and allows us to reuse the existing, well-used CPython C extension API. Starting from a 2.x CPython release allows us to more easily migrate existing applications; if we were to start with 3.x, and ask large application maintainers to first port their application, we feel this would be a non-starter for our intended audience. The majority of our work will focus on speeding the execution of Python code, while spending comparatively little time on the Python runtime library. Our long-term proposal is to supplement CPython's custom virtual machine with a JIT built on top of LLVM, while leaving the rest of the Python runtime relatively intact. We have observed that Python applications spend a large portion of their time in the main eval loop. In particular, even relatively minor changes to VM components such as opcode dispatch have a significant effect on Python application performance. We believe that compiling Python to machine code via LLVM's JIT engine will deliver large performance benefits. Some of the obvious benefits:
With the infrastructure to generate machine code comes the possibility of compiling Python into a much more efficient implementation than what would be possible in the current bytecode-based representation. For example, take the snippet for i in range(3): foo(i) This currently desugars to something like $x = range(3)
while True:
try:
$y = $x.next()
except StopIteration:
break
i = $y
foo(i)Once we have a mechanism to know that range() means the range() builtin function, we can turn this into something more akin to for (i = 0; i < 3; i++) foo(i) in C, possibly using unboxed types for the math. We can then unroll the loop to yield foo(0) foo(1) foo(2) We intend to structure Unladen Swallow's internals to assume that multiple cores are available for our use. Servers are only going to acquire more and more cores, and we want to exploit that to do more and more work in parallel. For example, we would like to have a concurrent code optimizer that applies increasingly-expensive (and -beneficial!) optimizations in parallel with code execution, using another core to do the work. We are also considering a concurrent garbage collector that would, again, utilize additional cores to offload work units. Since most production server machines are shipping with between 4 and 32 cores, we believe this avenue of optimization is potentially lucrative. However, we will have to be sensitive to the needs of highly-parallel applications and not consume extra cores blindly. Note that many of the areas we will need to address have been considered and developed by the other dynamic language implementations like MacRuby, JRuby, Rubinius and Parrot, and in particular other Python implementations like Jython, PyPy, and IronPython. In particular, we're looking at these other implementations for ideas on debug information, regex performance ideas, and generally useful performance ideas for dynamic languages. This is all fairly well-trodden ground, and we want to avoid reinventing the wheel as much as possible. MilestonesUnladen Swallow will be released every three months, with bugfix releases in between as necessary. 2009 Q1Q1 will be spent making relatively minor tweaks to the existing CPython implementation. We aim for a 25-35% performance improvement over our baseline. Our goals for this quarter are conservative, and are aimed at delivering tangible performance benefits to client applications as soon as possible, that is, without waiting until the completion of the project. Ideas for achieving this goal:
The 2009Q1 release can be found in the release-2009Q1-maint branch. See Release2009Q1 for our performance relative to CPython 2.6.1. 2009 Q2Q2 focused on supplementing the Python VM with a functionally-equivalent implementation in terms of LLVM. We anticipated some performance improvement, but that was not the primary focus of the 2009Q2 release; the focus was just on getting something working on top of LLVM. Making it faster will come in subsequent quarters. Goals:
The 2009Q2 release can be found in the release-2009Q2-maint branch. See Release2009Q2 for our performance relative to 2009Q1. 2009 Q3 and BeyondThe plan for Q3 onwards is to simply iterate over the literature. We aspire to do no original work, instead using as much of the last 30 years of research as possible. See RelevantPapers for a partial (by no means complete) list of the papers we plan to implement or draw upon. We plan to address performance considerations in the regular expression engine, as well as any other extension modules found to be bottlenecks. However, regular expressions are already known to be a good target for our work and will be considered first for optimization. In addition, we intend to remove the GIL and fix the state of multithreading in Python. We believe this is possible through the implementation of a more sophisticated GC system, something like IBM's Recycler (Bacon et al, 2001). Our long-term goal is to make Python fast enough to start moving performance-important types and functions from C back to Python. The exact performance targets for the 2009Q3 release will be finalized some time in Q2. Detailed PlansJIT CompilationWe plan to start with a simple, easy-to-implement JIT compiler, then add complexity and sophistication as warranted. We will start by implementing a simple, function-at-a-time compiler that takes the CPython bytecode and converts it to machine code via LLVM's internal intermediate representation (IR). The initial implementation of this bytecode-to-machine code compiler will be done by translating the code in CPython's interpreter loop to calls to LLVM's IRBuilder. We will apply only a small subset of LLVM's available optimization passes (current passes), since it's not clear that, for unoptimized Python IR, the extra compilation time spent in the optimizers will be compensated with increased execution performance. We've experimented with using LLVM's fast instruction selector (FastISel), but it fails over to the default, slower instruction selection DAG with our current IR; we may revisit this in the future. We've chosen a function-at-a-time JIT instead of a tracing JIT (Gal et al, 2007) because we believe whole-function compilation is easier to implement, given the existing structure of CPython bytecode. That is not to say we are opposed to a tracing JIT; on the contrary, implementing a whole-function compiler will provide a large amount of the necessary infrastructure for a tracing JIT, and a whole-function JIT will serve as a valuable baseline for comparing the performance of any proposed tracing implementations. LLVM's analysis libraries already have some support for specially optimizing hot traces (in Trace.cpp) that we may be able to take advantage of if we pursue a tracing JIT. We can get some of the benefits from having the instrumentation for the planned feedback-directed optimization system track taken/not-taken branches and pruning the not-taken branches from the generated machine code. Since we only wish to spend time optimizing code that will benefit the most -- the program's hot spots -- we will need a way to model hotness. Like in the rest of the JIT compiler, we plan to start simple and add complexity and sophistication as the benchmark results warrant. Our initial model will be very simple: if a function is called 10000 times, it is considered hot and is sent to LLVM for compilation to machine code. This model is obviously deficient, but will serve as a baseline for improvement. Enhancements we're considering: using interpreter loop ticks instead of function calls; aggregating the hotness level of leaf functions up the call tree. In the case of long-running loops, we probably won't try to replace those loops mid-execution, though it should be possible to do this if we desire. Even at its most naive, the generated machine code differs from its analogue in the interpreter loop in important ways:
In the initial draft of the JIT compiler, we will block execution while compiling hot functions. This is expensive. We will shift compilation to a separate worker thread using FIFO work queues. This will include instrumentation for work unit throughput, execution pause times, and temporal clustering so that we can measure the impact on execution time. Rubinius developed this technique independently, has seen success using it to reduce pause times in execution. We believe this strategy will allow us to perform potentially more expensive optimizations than if we had to always block execution on compilation/optimization (ala Self). This work is being tracked in issue 40. Feedback-Directed OptimizationWe wish to make use of the wealth of information available at runtime to optimize Python programs. Sophisticated implementations of Self, Java and JavaScript have demonstrated the real-world applicability of these techniques, and Psyco has demonstrated some applicability to Python. Accordingly, we believe it will be profitable to make use of runtime information when optimizing Unladen Swallow's native code generation. BackgroundUnladen Swallow compiles Python code to a bytecode representation that is amenable to reasonably-performant execution in a custom virtual machine. Once a piece of code has been deemed hot, we compile the bytecode to native code using LLVM. It is this native code that we seek to optimize. Optimizing the execution of the generated bytecode is less interesting since the system should be selecting most performance-critical functions for compilation to native code. However, modifications to the bytecode to enable easier compilation/profiling are fair game. Points
The initial round of data-gathering infrastructure (for types and branches) was added in r778. Places we would like to optimize (non-exhaustive)
Regular ExpressionsWhile regexes aren't a traditional performance hotspot, we've found that most regex users expect them to be faster than they are, resulting in surprising performance degradation. For this reason, we'd like to invest some resources in speeding up CPython's regex engine. CPython's current regex engine is a stack-based bytecode interpreter. It does not take advantage of any form of modern techniques to improve opcode dispatch performance (Bell 1973; Ertl & Gregg, 2003; Berndl et al, 2005) and is in other respects a traditional, straightforward virtual machine. We believe that many of the techniques being applied to speed up pure-Python performance are equally applicable to regex performance, starting at improved opcode dispatch all the way through JIT-compiling regexes down to machine code. Recent work in the Javascript community has confirmed our belief. Google's V8 engine now includes Irregexp, a JIT regex compiler, and the new SquirrelFish Extreme includes a new regex engine based on the same principle: trade JIT compilation time for execution time. Both of these show impressive gains on the regex section of the various Javascript benchmarks. We would like to replicate these results for CPython. We also considered using Thompson NFAs for very simple regexes, as advocated by Russ Cox. This would create a multi-engine regex system that could choose the fastest way of implementation any given pattern. The V8 team also considered such a hybrid system when working on Irregexp but rejected it, saying The problem we ran into is that not only backreferences but also basic operators like | and * are defined in terms of backtracking. To get the right behavior you may need backtracking even for seemingly simple regexps. Based on the data we have for how regexps are used on the web and considering the optimizations we already had in place we decided that the subset of regexps that would benefit from this was too small. One problem that needs to be overcome before any work on the CPython regex engine begins is that Python lacks a regex benchmark suite. We might be able to reuse the regexp.js component of the V8 benchmarks, but we would first need to verify that these are representative of the kind of regular expressions written in Python programs. We have no reason to believe that regexes used in Python programs differ significantly from those written in Javascript, Ruby, Perl, etc programs, but we would still need to be sure. Start-up TimeIn talking to a number of heavy Python users, we've gotten a lot of interest in improving Python's start-up time. This comes from both very large-scale websites (who want faster server restart times) and from authors of command-line tools (where Python start time might dwarf the actual work done). Start-up time is currently dominated by imports, especially for large applications like Bazaar. Python offers a lot of flexibility by deferring imports to runtime and providing a lot of hooks for configuring exactly how imports will work and where modules can be imported from. The price for that flexibility is slower imports. For large users that don't take advantage of that flexibility -- in particular servers, where imports shouldn't change between restarts -- we might provide a way to opt in to stricter, faster import semantics. One idea is to ship all required modules in a single, self-contained "binary". This would both a) avoid multiple filesystem calls for each import, and b) open up the possibility of Python-level link-time optimization, resulting in faster code via inter-module inlining and similar optimizations. Self-contained images like this would be especially attractive for large Python users in the server application space, where hermetic builds and deployments are already considered essential. A less invasive optimization would be to speed up Python's marshal module, which is used for .pyc and .pyo files. Based on Unladen Swallow's work speeding up cPickle, similarly low-hanging fruit probably exists in marshal. We already have benchmarks tracking start-up time in a number of configurations. We will probably also add microbenchmarks focusing specifically on imports, since imports currently dominate CPython start time. Testing and MeasurementPerformanceUnladen Swallow maintains a directory of interesting performance tests under the tests directory. perf.py is the main interface to the benchmarks we care about, and will take care of priming runs, clearing *.py[co] files and running interesting statistics over the results. Unladen Swallow's benchmark suite is focused on the hot spots in major Python applications, in particular web applications. The major web applications we have surveyed have indicated that they bottleneck primarily on template systems, and hence our initial benchmark suite focuses on them:
There are also a number of microbenchmarks, for example, an N-Queens solver, an alphametics solver and several start-up time benchmarks. Apart from these, our benchmark suite includes several crap benchmarks like Richards, PyStone and PyBench; these are only included for completeness and comparison with other Python implementations, which have tended to use them. Unladen Swallow does not consider these benchmarks to be representative of real Python applications or Python implementation performance, and does not run them by default or make decisions based on them. For charting the long-term performance trend of the project, Unladen Swallow makes use of Google's standard internal performance measurement framework. Project members will post regular performance updates to the mailing lists. For testing individual changes, however, using perf.py as described on the Benchmarks page is sufficient. CorrectnessIn order to ensure correctness of the implementation, Unladen Swallow uses both the standard Python test suite, plus a number of third-party libraries that are known-good on Python 2.6. In particular, we test third-party C extension modules, since these are the easiest to break via unwitting changes at the C level. As work on the JIT implementation moves forward, we will incorporate a fuzzer into our regular test run. We plan to reuse Victor Stinner's Fusil Python fuzzer as much as possible, since it a) exists, and b) has been demonstrated to find real bugs in Python. Unladen Swallow will come with a --jit option that can be used to control when the JIT kicks in. For example, --jit=never would disable the JIT entirely, while --jit=always would skip the warm-up interpreted executions and jump straight into native code generation; --jit=once will disable recompilation. These options will be used to test the various execution strategies in isolation. Our goal is to avoid JIT bugs that are never encountered because the buggy function isn't hot enough, as have been observed in the JVM (likewise for the interpreted mode). Unladen Swallow maintains a BuildBot instance that runs the above tests against every commit to trunk. ComplexityOne of CPython's virtues is its simplicity: modifying CPython's VM and compiler is relatively simple and straight-forward. Our work with LLVM will inevitably introduce more complexity into CPython's implementation. In order to measure the productivity trade-offs that may result from this extra machinery, the Unladen Swallow team will periodically take ideas from the python-dev and python-ideas mailing lists and implement them. If implementation is significantly more difficult that the corresponding change to CPython, that's obviously something that we'll need to address before merger. We may also get non-team members to do the implementations so that we get a less biased perspective. Risks
Lessons LearnedThis section attempts to list the ways that our plans have changed as work has progressed, as we've read more papers and talked to more people. This list is incomplete and will only grow.
CommunicationAll communication about Unladen Swallow should take place on the Unladen Swallow list. This is where design issues will be discussed, as well as notifications of continuous build results, performance numbers, code reviews, and all the other details of an open-source project. If you add a comment below, on this page, we'll probably miss it. Sorry. Please mail our list instead. |
Sign in to add a comment
Why not apply this work to Python 3? If you are successful it may encourage the adoption of Python 3 more quickly (and you obviously won't have to port this work to Python 3). --Nick
chinbillybilbo: the applications we are trying to speed up all use Python 2.x. If we required all these applications to first port to Python 3 in order to get any performance benefit, we feel that would be a non-starter for the applications we're focusing on.
Very interesting.
W.r.t. the GC: Apple released their GC under the Apache license (http://www.opensource.apple.com/darwinsource/10.5.5/autozone-77.1/). Its finalizer semantics seem slightly incompatibel with that of Python, in that a finalizer may not revive an object, but it might be possible to work around that.
Reading through all of this got me really excited about this project, and then I read the killer: "Will probably kill Python Windows support (for now).". For integration reasons, my company is currently stuck on Windows!
Anyways, I'm still excited about this project! Making Python faster can only be good for the community!
the creator of ruby is twitting this http://twitter.com/yukihiro_matz
Today's ars technica article http://arstechnica.com/open-source/news/2009/03/google-launches-project-to-boost-python-performance-by-5x.ars calls this a Google project. But I see no such claim, just the fact that you have access to Google's performance measurement framework. And of course, anyone can start a code.google.com project. Can you clarify for those curious?
I think a mainline branch of Python will need to continue to provide an interpreter, at least as a compile-time option. You can't beat plain old C for portability, and LLVM has a non-trivial memory footprint (if you load up all the optimization modules) which might not work so well for embedded systems.
Maintaining the C API compatibility will also be an interesting problem. Removing the GIL, creating an advanced garbage collector, and moving away from reference counting will probably go hand-in-hand, and the refcount-oriented single-threaded C API is not going to make things easy. You may want to leave open the possibility of a designing new C API (sans refcounting) while supporting the legacy API using proxy objects or object handles -- you can't have a copying/compacting GC if C code has direct pointers to Python objects.
Is it possible to have a big of background on who is financing the project, history of the persons doing it ?
The plans looks very real but it sounds strange to see a "let's make python 5x faster in less than one year", without even Guido participating.
Here at Red Hat we use Python for a lot of things. What we've observed is that execution performance is not the main issue (although it improving it would be greatly appreciated), rather it's the memory footprint which is the problem we most often encounter. If anything can be done to reduce the massive amount of memory Python uses it would be a huge win. I would encourage you to consider memory usage as just as important a goal as execution speed if you're going to tackle optimizing CPython.
This project is Google-financed, but not Google-owned. The two engineers working on this are full-time Google engineers in the compiler optimization team, working on this as their main project. We have other Googlers contributing patches in their 20% time.
Despite that, this is not Google's property. We are pushing changes upstream as quickly as we can, some of which are already in CPython mainline. Google cares a great deal about performance, and because we realize that other people do too, we want to contribute our work back to the open-source world so that everyone can reap the benefit.
This is wonderful! I was hoping Google will start such a project, since Python is one of Google's official languages. I'll be following this during the next months. A big thanks for sharing your work!
Awesome project. LLVM is great for VM's and Python need this. Apple use LLVM for their OS architecture and graphics with OpenGL. It could be that Python will replace Java in the future.
There was another "Python3k" project to use a 64-bit architecture on top of the Apache Portable Runtime for the VM called Prothon. Check out this comp.lang.python thread:
http://groups.google.com/group/comp.lang.python.announce/browse_thread/thread/1e6ebaa7b2c98994/acb0e1edb2ca449a?lnk=st&q=comp.lang.python+prothon#acb0e1edb2ca449a
(or google? search: comp.lang.python prothon hahn collins)
...also check out this follow up which includes suggestions that should have made it into python3k but didn't:
http://coding.derkeiler.com/Archive/Python/comp.lang.python/2004-03/4822.html (search: comp.lang.python prothon hahn zipher python3k)
frikker: I think I wasn't sarcastic, but upset, because the project seemed to completely ignore PyPy?. But I wish them luck, too. Would love to see more collaboration between projects, actually.
I strongly second the need for a smaller memory footprint. This is especially problematic on x86_64 hosts where python typically needs 2x the RAM of the ix86 interpreter to run an application. In contrast, C applications seem to take 1.25x the RAM on x86_64.
It's good to see Python optimization efforts that focus on compatibility, specifically with CPython extensions. The language forks such as PyPy?, Stackless, Jython, IronPython?, PyVM, etc I'm sure are useful to some people, but a large class of Python users relies on CPython extensions.
Shameless plug: there will be an Unladen Swallow talk at PyCon? Italy: http://www.pycon.it/conference/talks/improving-air-speed-velocity-python
PyCon? Italy will see 400 people attending and real-time ita->eng translations for the main track (so speaking Italian is not a showstopper to attend!) http://www.pycon.it/pycon3/non-italians/
The whole website is in English too. Early bird is still open.
Which exactly benefits do you get of using LLVM instead of libJIT? libJIT has already been tested in Portable.NET JIT, it reaches and beats Mono JIT performance. Is Python that different from .NET and Common Intermediate Language?
Regarding concerns about mem usage, current and future: Are there thoughts about splitting the runtime into a lightweight (handheld) client and a heavyweight (Google) backend?
chphilli: I don't see where they are going to 'kill Python Windows support for now'. Page changed? That would be a bummer because as of May 09, most kids have windows still, and python is a great language for learning. It'd be a shame to lose that demographic. Amongst the massive list of technical challenges, navigation of competing demands, and admirable goals, saying "oh we don't even have windows machines" sticks out as kind of a bubble-ish 1337 cop-out. I call you on it out of love.
hrfeels: don't worry! i'm convinced the official python implementation will continue to support all the major platforms including windows. what they meant is that their currently developed optimized implementation of python is not windows compatible(or at least not tested)