Updated Oct 20, 2009 by collinw
Labels: Featured
GettingStarted  
How to build Unladen Swallow

The basics

Setting up Unladen Swallow uses the same procedure as setting up CPython:

> svn checkout http://unladen-swallow.googlecode.com/svn/branches/release-2009Q3-maint unladen
...
> cd unladen
> ./configure
...
> make
...
> ./python.exe
Python 2.6.1 (r261:311:312M, Oct 14 2009, 23:24:25) 
[GCC 4.0.1 (Apple Inc. build 5490)] on darwin
[Unladen Swallow 2009Q3]
Type "help", "copyright", "credits" or "license" for more information.
>>>

This checks out and builds our 2009Q3 release. Note that our tests/ top-level directory uses Subversion 1.5-style relative svn:externals properties, so you'll need SVN 1.5 or higher.

Other interesting checkout targets:

Active development is being done in trunk/. We try to keep trunk stable and correct at all times, but there may be bugs that have yet to be addressed. Caveat downloader.

If you're building the 2009Q2 release on a 32/64-bit hybrid system (say, a 64-bit kernel but a 32-bit userspace), you'll need to run a different ./configure command. In the case of a 32-bit userspace, something like this should work:

CFLAGS=-m32 CXXFLAGS=-m32 ./configure --build=i386-unknown-linux-gnu

Working on Unladen Swallow

We maintain a list of good volunteer projects in our issue tracker under the label StarterProject. Look over these and let us know if any strike your fancy.

There's also a category called Beer. The Beer tag indicates tasks that aren't exactly sexy, but need to get done. As a thank-you for taking on one of these tasks, the Googlers on the team will buy you a round at a conference. Seriously.

Any patches should follow our style guide and be put on http://codereview.appspot.com and sent to unladen-swallow@googlegroups.com for pre-commit review.

To upload a patch, download upload.py and go to your checkout directory. Pick some project members as reviewers, and invoke upload.py like so:

upload.py -e EMAIL@gmail.com -r REVIEWERS --cc=unladen-swallow@googlegroups.com --send_mail

Improving generated code

The first step to improving the code we generate is to look at it. In Unladen Swallow, every function has four representations. First, the Python code:

def sum(x):
  result = 0
  for i in x:
    result += i
  return result

Second, this is compiled into CPython bytecode, which you can inspect with the dis module:

>>> import dis
>>> dis.dis(sum)
  2           0 LOAD_CONST               1 (0)
              3 STORE_FAST               1 (result)

  3           6 SETUP_LOOP              24 (to 33)
              9 LOAD_FAST                0 (x)
             12 GET_ITER            
        >>   13 FOR_ITER                16 (to 32)
             16 STORE_FAST               2 (i)

  4          19 LOAD_FAST                1 (result)
             22 LOAD_FAST                2 (i)
             25 INPLACE_ADD         
             26 STORE_FAST               1 (result)
             29 JUMP_ABSOLUTE           13
        >>   32 POP_BLOCK           

  5     >>   33 LOAD_FAST                1 (result)
             36 RETURN_VALUE        
>>> 

Doc/library/dis.rst documents what the opcodes mean.
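
When comparing bytecode before and after a change, it can be handy to have the opcode names as a list rather than as printed text. The following is a minimal sketch, not part of Unladen Swallow: the helper names disassembly_text and opcode_names are made up for illustration, and the function is called total here only to avoid shadowing the builtin sum. It captures what dis.dis() prints and keeps the first token on each line that is a known opcode name. The exact opcodes you get depend on the interpreter version.

```python
import dis
import sys

try:
    from StringIO import StringIO  # Python 2
except ImportError:
    from io import StringIO  # Python 3


def total(x):
    result = 0
    for i in x:
        result += i
    return result


def disassembly_text(func):
    """Return what dis.dis(func) would print, as a string (hypothetical helper)."""
    saved = sys.stdout
    sys.stdout = buf = StringIO()
    try:
        dis.dis(func)
    finally:
        sys.stdout = saved
    return buf.getvalue()


def opcode_names(func):
    """Return the opcode names in func's bytecode, in order (hypothetical helper)."""
    names = []
    for line in disassembly_text(func).splitlines():
        for token in line.split():
            if token in dis.opmap:  # dis.opmap maps opcode names to numbers
                names.append(token)
                break
    return names


if __name__ == "__main__":
    print(opcode_names(total))
```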

Third, when a function is hot, the bytecode gets compiled to LLVM IR. You can force this compilation by setting func.__code__.co_optimization to an integer between -1 and 2 (which determines how much to optimize the code). Then print the IR with func.__code__.co_llvm:

>>> sum.__code__.co_optimization=1
>>> print sum.__code__.co_llvm

define %struct._object* @"#u#sum"(%struct._frame* %frame) {
entry:
	%exc_info = alloca %struct.PyExcInfo, align 4		; <%struct.PyExcInfo*> [#uses=4]
	%stack_pointer_addr = alloca %struct._object**, align 4		; <%struct._object***> [#uses=50]
	%call.i = call %struct._ts* @PyThreadState_Get() nounwind		; <%struct._ts*> [#uses=13]
	%use_tracing = getelementptr %struct._ts* %call.i, i32 0, i32 5		; <i32*> [#uses=1]
	%use_tracing1 = load i32* %use_tracing		; <i32> [#uses=1]
	%0 = icmp eq i32 %use_tracing1, 0		; <i1> [#uses=1]
	br i1 %0, label %continue_entry, label %trace_enter_function

... # Lots of IR

call_trace38:		; preds = %_PyLlvm_WrapXDecref.exit192
	%f_lasti39 = getelementptr %struct._frame* %frame, i32 0, i32 17		; <i32*> [#uses=1]
	store i32 13, i32* %f_lasti39
	%132 = call i32 @_PyLlvm_CallLineTrace(%struct._ts* %call.i, %struct._frame* %frame, %struct._object*** %stack_pointer_addr)		; <i32> [#uses=2]
	switch i32 %132, label %goto_line [
		i32 -2, label %propagate_exception
		i32 -1, label %JUMP_ABSOLUTE_target
	]
}

>>> 
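
If you do this often, the two steps can be wrapped in a small helper. This is only a sketch: llvm_ir is a hypothetical name, and it assumes that str() of co_llvm yields the same text that print shows above. The co_optimization and co_llvm attributes exist only in Unladen Swallow, so the helper returns None on a stock CPython instead of failing.

```python
def llvm_ir(func, level=1):
    """Force LLVM compilation of func and return its IR as text.

    Hypothetical helper. Returns None when the interpreter's code objects
    have no co_llvm attribute, i.e. when not running Unladen Swallow.
    """
    code = func.__code__
    if not hasattr(code, "co_llvm"):
        return None
    # Setting co_optimization forces compilation at the given level
    # (an integer between -1 and 2, as described above).
    code.co_optimization = level
    # Assumption: str() of co_llvm gives the same text that print shows.
    return str(code.co_llvm)


def total(x):
    result = 0
    for i in x:
        result += i
    return result


if __name__ == "__main__":
    ir = llvm_ir(total)
    if ir is None:
        print("co_llvm is not available on this interpreter")
    else:
        print(ir)
```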

Fourth, this code is JIT-compiled to native machine code. Unfortunately, there's no easy way to display this machine code. The easiest approach involves setting PYTHONLLVMFLAGS=-debug-only=jit before starting Python and running Python inside gdb with a breakpoint in _PyLlvmFunction_Eval() just before the call to native(frame). When _PyLlvmFunction_Eval() calls ExecutionEngine::getPointerToFunction(), the JIT will dump a lot of information including the location and size of the machine code:

$ PYTHONLLVMFLAGS=-debug-only=jit gdb ./python.exe 
...
(gdb) b _llvmfunctionobject.cc:69
Breakpoint 1 at 0xa72a0: file ../src/Objects/_llvmfunctionobject.cc, line 69.
(gdb) run
...
>>> def sum(x):
...   result = 0
...   for i in x:
...     result += i
...   return result
... 
>>> sum.__code__.__use_llvm__=True
>>> sum.__code__.co_optimization=1
>>> sum([1,2,3])
JIT: Starting CodeGen of Function #u#sum
...
JIT: Finished CodeGen of [0x2080020] Function: #u#sum: 2763 bytes of text, 214 relocations
JIT: Binary code:
JIT: 00000000: 56575355 e83cec83 fe0e2771 00147883 
...
JIT: 00000ac0: 8950244c 2fe9240c fffffc

Breakpoint 1, _PyLlvmFunction_Eval (function_obj=0x14a3208, frame=0x1552ad8) at ../src/Objects/_llvmfunctionobject.cc:69
69	    return native(frame);
(gdb) disassemble 0x2080020 (0x2080020 + 2763)
Dump of assembler code from 0x2080020 to 0x2080aeb:
0x02080020:	push   %ebp
0x02080021:	push   %ebx
0x02080022:	push   %edi
0x02080023:	push   %esi
0x02080024:	sub    $0x3c,%esp
...
0x02080adf:	mov    0x50(%esp),%ecx
0x02080ae3:	mov    %ecx,(%esp)
0x02080ae6:	jmp    0x208071a
End of assembler dump.
Current language:  auto; currently c++
(gdb) 

And there's the machine code for this function. If you link LLVM with libudis86, it'll disassemble this for you in the JIT debug output, but getting that link to work is non-trivial.
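
The address range passed to disassemble comes straight from the "Finished CodeGen" line: the start address in brackets, plus the number of bytes of text. As a rough sketch, a script could pull those two numbers out of the debug output and build the gdb command for you. The regular expression below is written against the one log line shown above and is an assumption about the format in general; disassemble_command is a hypothetical helper name.

```python
import re

# Pattern modelled on the single "Finished CodeGen" line shown above.
# The JIT's debug output format is not guaranteed to stay the same.
FINISHED = re.compile(
    r"JIT: Finished CodeGen of \[(0x[0-9a-fA-F]+)\] "
    r"Function: (\S+): (\d+) bytes of text"
)


def disassemble_command(line):
    """Build a gdb disassemble command from a 'Finished CodeGen' line.

    Returns None if the line does not match the expected format.
    """
    match = FINISHED.search(line)
    if match is None:
        return None
    start = int(match.group(1), 16)  # address of the machine code
    size = int(match.group(3))       # size of the machine code in bytes
    return "disassemble 0x%x 0x%x" % (start, start + size)


if __name__ == "__main__":
    log_line = ("JIT: Finished CodeGen of [0x2080020] Function: #u#sum: "
                "2763 bytes of text, 214 relocations")
    print(disassemble_command(log_line))
```

For the log line in the session above, the end address works out to 0x2080aeb, which matches the range gdb reports in its "Dump of assembler code" line.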

Reducing build times

By default, running make clean will clean both Python and the LLVM tree in Util/llvm. Rebuilding LLVM takes approximately forever (compared to the rest of Python), so there's a script to save you the need to rebuild LLVM over and over:

$ cd ~/unladen-swallow/trunk/Util/llvm
$ ./install-llvm.sh release --prefix=/tmp/llvm
# Configures LLVM correctly, then runs make && make install
$ cd ../..  # Back up to ~/unladen-swallow/trunk
$ ./configure --with-llvm=/tmp/llvm && make

This will configure, build and install LLVM into /tmp/llvm, then reuse that directory when building Unladen Swallow. The LLVM installation in /tmp/llvm can be reused and shared among different Unladen Swallow object directories, saving you considerable build time. See install-llvm.sh for more details.

On OS X, Python comes with a suite of Carbon toolkit modules that we generally don't care about when working on Unladen Swallow. You can pass --disable-toolbox-glue to ./configure to avoid wasting cycles building modules you won't use. This brings build times down to what they are on Linux.

Performance analysis

Let's say you have a change you'd like to make to Python, and you'd like to see if it impacts performance. The main tool for this is the benchmarks available via perf.py (see Benchmarks for checkout instructions).

The following command compares the performance of two Python binaries, a control binary and an experiment binary, on a benchmark based on Django template rendering:

$ ./perf.py -r -b django control/python experiment/python

perf.py -r will run the benchmarks in a more rigorous mode. In practice, this usually means increasing the number of iterations. When making judgements about the performance improvement/degradation caused by your change, you should always use -r.

perf.py will run some basic stats on the results for you, yielding the minimum running time, the arithmetic mean running time, the standard deviation and a two-tailed T-test to determine significance. If perf.py tells you that the performance change is insignificant or the printed t value is low (the absolute value is less than, say, five), it's probably right. The larger the t value, the more confident we are in the result.
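
To make those statistics concrete, here is a small worked sketch of how a t value can be computed from two lists of timings. This is not perf.py's code, and perf.py may use a different variant of the test; the sketch uses Welch's form of the two-sample t statistic, and the function names and timings are invented for illustration. The threshold of five is the rule of thumb from the paragraph above.

```python
import math


def mean(xs):
    return sum(xs) / float(len(xs))


def sample_stddev(xs):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))


def t_score(control, experiment):
    """Welch's two-sample t statistic.

    Positive when the control timings are larger on average,
    i.e. when the experiment binary is faster.
    """
    var_c = sample_stddev(control) ** 2
    var_e = sample_stddev(experiment) ** 2
    std_err = math.sqrt(var_c / len(control) + var_e / len(experiment))
    return (mean(control) - mean(experiment)) / std_err


def looks_significant(t, threshold=5.0):
    """Rule of thumb from the text: treat |t| below about five as low."""
    return abs(t) >= threshold


if __name__ == "__main__":
    # Made-up timings in seconds, purely for illustration.
    control = [1.0, 1.1, 0.9, 1.0]
    experiment = [0.8, 0.9, 0.7, 0.8]
    t = t_score(control, experiment)
    print("min: %.2f -> %.2f" % (min(control), min(experiment)))
    print("mean: %.2f -> %.2f" % (mean(control), mean(experiment)))
    print("t = %.2f, significant: %s" % (t, looks_significant(t)))
```

With only four noisy samples per binary, even a 20% difference in the mean yields a t value below five, which is one reason to run with -r and collect more iterations.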

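If you want to pass arguments to the control or experiment binaries, use perf.py --args. Arguments before the comma go to the first binary and arguments after it go to the second. The following compares the performance of Unladen Swallow's -O2 and -O3 flags on the Django templates benchmark; both paths name the same binary, so only the flags differ: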

$ ./perf.py -r -b django --args "-O2,-O3" control/python control/python

Improving startup performance

Python startup time is heavily dependent on the number of modules imported. If you can find a way to eliminate or delay an import (in either case, getting it out of the critical path for startup), that will usually improve startup time.

To see which modules are imported even when Python does no work at all:

$ ./python.exe -v -c '' 2>&1 | grep ^import
import zipimport # builtin
import site # precompiled from /Users/collinwinter/src/us/trunk3/Lib/site.pyc
import os # precompiled from /Users/collinwinter/src/us/trunk3/Lib/os.pyc
import errno # builtin
import posix # builtin
import posixpath # precompiled from /Users/collinwinter/src/us/trunk3/Lib/posixpath.pyc
import stat # precompiled from /Users/collinwinter/src/us/trunk3/Lib/stat.pyc
import genericpath # precompiled from /Users/collinwinter/src/us/trunk3/Lib/genericpath.pyc
import copy_reg # precompiled from /Users/collinwinter/src/us/trunk3/Lib/copy_reg.pyc
import encodings # directory /Users/collinwinter/src/us/trunk3/Lib/encodings
import encodings # precompiled from /Users/collinwinter/src/us/trunk3/Lib/encodings/__init__.pyc
import codecs # precompiled from /Users/collinwinter/src/us/trunk3/Lib/codecs.pyc
import _codecs # builtin
import encodings.aliases # precompiled from /Users/collinwinter/src/us/trunk3/Lib/encodings/aliases.pyc
import encodings.utf_8 # precompiled from /Users/collinwinter/src/us/trunk3/Lib/encodings/utf_8.pyc
$
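
The same listing can be collected from a script, which makes it easier to compare the import count before and after a change. The following is a sketch under a few assumptions: startup_imports is a hypothetical helper, it runs whichever interpreter you pass in (by default the one running the script), and it relies on -v writing lines that start with "import ", as in the output above.

```python
import subprocess
import sys


def startup_imports(python=sys.executable, extra_args=()):
    """Return the 'import ...' lines that `python -v -c ''` prints.

    Hypothetical helper. extra_args are passed to the interpreter
    before -c, for example ["-S"] to skip importing site.py.
    """
    cmd = [python, "-v"] + list(extra_args) + ["-c", ""]
    # -v writes to stderr, so merge stderr into stdout (like 2>&1 above).
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    output = proc.communicate()[0].decode("utf-8", "replace")
    return [line for line in output.splitlines()
            if line.startswith("import ")]


if __name__ == "__main__":
    normal = startup_imports()
    nosite = startup_imports(extra_args=["-S"])
    print("imports at normal startup: %d" % len(normal))
    print("imports with -S: %d" % len(nosite))
```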

perf.py includes benchmarks for both normal startup and startup with the -S option (don't import site.py). These benchmarks are -b normal_startup and -b startup_nosite respectively, or use -b startup to run both.


Comment by tom.den...@amd.com, Dec 07, 2009

Does the default ./configure, make build the most optimized version of 2009Q3? I ask because some benchmarks were considerably slower compared to CPython 2.6.4

Comment by chaow...@google.com, Dec 11, 2009

"./install-llvm.sh release --prefix=/tmp/llvm" rather than "./install-llvm release --prefix=/tmp/llvm". :)

