I've seen some cases where 2009Q2 uses 10x as much memory as 2009Q1. We need to track this down and eliminate as much of this bloat as possible. There will be some increase necessary since we're emitting machine code, but that doesn't account for a 10x blowup.
Things to look into:
- Memory blowup with ./python -j whenhot
- Memory blowup with ./python -j always
- Memory blowup just from linking in LLVM
- How much memory goes into the machine code vs the LLVM IR? The optimization passes?
perf.py's --track_memory option can help get rough estimates and keep track of macro-level progress.
Comment #1
Posted on Jul 17, 2009 by Helpful Elephant
The current theory is that this comes from keeping the LLVM IR around after we've compiled the function to native code. Some preliminary measurements indicate that loading Django under -j always consumes over 65 MB in llvm::Functions and their child objects.
Comment #2
Posted on Jul 19, 2009 by Swift Camel
How difficult would it be to rig up a C++ heap checker? I'm not familiar with any tools like this, but I think we need to do a couple of things: make sure we have no leaks, and figure out which structures are eating all of our memory. Are there any tools out there for doing that?
Comment #3
Posted on Jul 19, 2009 by Helpful Cat
On OSX, I've used the MallocDebug tool to track down large memory users. Massif (http://valgrind.org/docs/manual/ms-manual.html) looks even better for Linux. I got a recommendation for KCacheGrind a while ago, but it looks more like a speed tool than a memory tool.
Comment #4
Posted on Jul 20, 2009 by Helpful Ox
Purify (now IBM Rational Purify, http://www-01.ibm.com/software/awdtools/purify/) used to be an excellent tool for finding issues in memory management. It didn't need source code instrumentation; in fact, it didn't need source code at all. It rewrote the assembly, injecting code around memory access instructions to trap and monitor memory accesses, so it could be used on third-party code as well.
Comment #5
Posted on Jul 21, 2009 by Helpful Elephant
(No comment was entered for this change.)
Comment #6
Posted on Jul 24, 2009 by Helpful Elephant
The theory about LLVM is correct: r775 makes -j always compile all executed functions, rather than all functions present in the system. This has a significant effect on memory usage. Measured with time -l on Darwin, max RSS for bm_django.py -n 1 with -j always goes from 165MB to 70MB.
Analysis using a prototype _llvm.dump_stats():
Old -j always (at definition):
LLVM functions: 1833
LLVM basic blocks: 229867
LLVM instructions: 1321540
LLVM data size: 63159048 bytes

New -j always (at execution):
LLVM functions: 342
LLVM basic blocks: 39870
LLVM instructions: 267272
LLVM data size: 12479360 bytes
There's still plenty of headroom here, and I have a CL in progress to further reduce memory usage. I need to rerun perf.py --track_memory to see the full effect on all benchmarks.
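
To make the "at definition" vs. "at execution" distinction concrete, here is a minimal C++ sketch of the idea. All names here (CodeObject, CompileToNative, and so on) are illustrative stand-ins, not the real Unladen Swallow interfaces.

// Illustrative sketch only; these names do not match the real Unladen Swallow code.
#include <cstdio>

struct CodeObject {
    const char *name = "";
    bool compiled = false;        // has native code been generated yet?
    void *native_entry = nullptr;
};

// Stand-in for "lower to LLVM IR, run the optimization passes, JIT to native
// code". In the real system this is where the IR and machine-code memory goes.
static void *CompileToNative(CodeObject *co) {
    std::printf("compiling %s\n", co->name);
    return static_cast<void *>(co);  // dummy entry point
}

// Pre-r775 -j always ("at definition"): compile as soon as the function is
// defined, even if it is never called.
static void OnDefinition(CodeObject *co) {
    co->native_entry = CompileToNative(co);
    co->compiled = true;
}

// r775 -j always ("at execution"): compile lazily, on first call, so code that
// is merely imported but never run costs no IR and no machine code.
static void OnFirstCall(CodeObject *co) {
    if (!co->compiled) {
        co->native_entry = CompileToNative(co);
        co->compiled = true;
    }
}

int main() {
    // Old model: a function that is defined but never run still gets compiled.
    CodeObject never_run;
    never_run.name = "defined_but_never_run";
    OnDefinition(&never_run);

    // New model: compilation is paid only when the function actually executes.
    CodeObject actually_run;
    actually_run.name = "actually_executed";
    OnFirstCall(&actually_run);
    return 0;
}

Under the lazy model, IR and machine-code cost scales with how much of the loaded code actually runs, which is consistent with the sharp drop in the function/block/instruction counts above.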
Comment #7
Posted on Jul 24, 2009 by Helpful Elephant
Benchmarking the memory usage of r774 vs r775 with --track_memory on Linux 2.6.18.5:
perf.py command: ./perf.py --track_memory --args "-j always" -b django,slowspitfire,slowpickle,slowunpickle,2to3 ../a/python ../b/python
Note that in the graphs, "../a/python" is the control binary (r774), "../b/python" is the experiment binary (r775). This confirms that this is a significant reduction in memory usage for -j always.
2to3:
59.246994 -> 57.129315: 3.71% faster
Mem max: 141136.000 -> 85224.000: 65.61% smaller Usage over time: http://tinyurl.com/nlkf6g

django:
Min: 1.122529 -> 1.075815: 4.34% faster
Avg: 1.128422 -> 1.077450: 4.73% faster
Significant (t=45.483821, a=0.95)
Stddev: 0.00721 -> 0.00330: 118.44% smaller
Mem max: 158816.000 -> 67560.000: 135.07% smaller Usage over time: http://tinyurl.com/m7urd2

slowpickle:
Min: 0.706008 -> 0.702213: 0.54% faster
Avg: 0.708631 -> 0.703975: 0.66% faster
Significant (t=5.179222, a=0.95)
Stddev: 0.00575 -> 0.00271: 112.51% smaller
Mem max: 96592.000 -> 68360.000: 41.30% smaller Usage over time: http://tinyurl.com/l4lk7s

slowspitfire:
Min: 0.966542 -> 0.796440: 21.36% faster
Avg: 0.981962 -> 0.802562: 22.35% faster
Significant (t=119.996555, a=0.95)
Stddev: 0.00879 -> 0.00588: 49.42% smaller
Mem max: 212400.000 -> 132880.000: 59.84% smaller Usage over time: http://tinyurl.com/lncr8a

slowunpickle:
Min: 0.309612 -> 0.308172: 0.47% faster
Avg: 0.310206 -> 0.308482: 0.56% faster
Significant (t=29.988479, a=0.95)
Stddev: 0.00036 -> 0.00018: 98.98% smaller
Mem max: 96592.000 -> 68360.000: 41.30% smaller Usage over time: http://tinyurl.com/npo6rs
Note too that the performance improvement for slowspitfire doesn't hold when rerun without -j always and with --rigorous.
Comment #8
Posted on Aug 4, 2009 by Helpful Elephant
Mailed out a change for review that will throw away the LLVM IR after native code generation. This looks like a promising reduction for -j always.
Comment #9
Posted on Oct 12, 2009 by Helpful Elephant
With the new hotness model introduced in r862, memory usage has decreased significantly vs the Q2 release with -j always:
$ ./perf.py -r -b slowspitfire --args "-j always," --track_memory ../q2/python ../q3/python
Mem max: 212344.000 -> 96884.000: 119.17% smaller Usage over time: http://tinyurl.com/yfy3w3p

$ ./perf.py -r -b ai --args "-j always," --track_memory ../q2/python ../q3/python
Mem max: 95012.000 -> 14020.000: 577.69% smaller Usage over time: http://tinyurl.com/yz7v4xj

$ ./perf.py -r -b slowpickle --args "-j always," --track_memory ../q2/python ../q3/python
Mem max: 96876.000 -> 18996.000: 409.98% smaller Usage over time: http://tinyurl.com/yf4a3sj

$ ./perf.py -r -b slowunpickle --args "-j always," --track_memory ../q2/python ../q3/python
Mem max: 96876.000 -> 14076.000: 588.24% smaller Usage over time: http://tinyurl.com/yfzv2mn

$ ./perf.py -r -b django --args "-j always," --track_memory ../q2/python ../q3/python
Mem max: 159064.000 -> 27160.000: 485.66% smaller Usage over time: http://tinyurl.com/ykdmdml

$ ./perf.py -r -b rietveld --args "-j always," --track_memory ../q2/python ../q3/python
Mem max: 575116.000 -> 55952.000: 927.87% smaller Usage over time: http://tinyurl.com/yf3rcbb
We still have work to do to get closer to the 2.6 baseline (example):
$ ./perf.py -r -b rietveld --track_memory ../26/python ../q3/python
Mem max: 14948.000 -> 55952.000: 73.28% larger Usage over time: http://tinyurl.com/ykcsbcs
That's not great, but it's way better than it used to be. Pushing future work on this to Q4.
Comment #10
Posted on Oct 26, 2009 by Helpful Elephant
r876 deletes LLVM IR after compiling to native code. This is a ~5% drop in memory for most benchmarks, but Rietveld sees a 37% drop:
Before:
>>> _llvm.dump_stats()
LLVM functions: 279
LLVM basic blocks: 25079
LLVM instructions: 136274

After:
>>> _llvm.dump_stats()
LLVM functions: 279
LLVM basic blocks: 45
LLVM instructions: 209
Mem max: 55936.000 -> 40804.000: 37.08% smaller Usage over time: http://tinyurl.com/yj4esk2
See http://codereview.appspot.com/99044 for more stats.
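
For illustration, the general shape of the change looks roughly like the C++ sketch below, written against the old LLVM JIT interface of that era (ExecutionEngine::getPointerToFunction). This is not the r876 patch itself, and header paths and the JIT API differ in newer LLVM releases.

// Rough sketch of the technique, not the actual r876 patch. Uses the old LLVM
// JIT interface; header paths and the JIT API have changed in later releases.
#include "llvm/ExecutionEngine/ExecutionEngine.h"
#include "llvm/Function.h"

// JIT-compile a function, then drop its IR body. The machine code is owned by
// the JIT's memory manager and stays alive; the basic blocks and instructions
// (the bulk of the dump_stats() numbers above) are freed.
void *CompileAndDiscardIR(llvm::ExecutionEngine *engine, llvm::Function *fn) {
    void *native = engine->getPointerToFunction(fn);  // forces code generation
    fn->deleteBody();  // turns fn into a bare declaration, releasing its IR
    return native;
}

The dump_stats() output above shows exactly this effect: the function count stays at 279, but nearly all of the basic blocks and instructions go away.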
Rietveld vs Python 2.6.1 (./perf.py -r -b rietveld --track_memory ../26/python ../obj/python):
Mem max: 14884.000 -> 40040.000: 62.83% larger Usage over time: http://tinyurl.com/yj2nm3u
That's heading in the right direction.
Comment #11
Posted on Jan 6, 2010 by Helpful Elephant
(No comment was entered for this change.)
Comment #12
Posted on Jan 11, 2010 by Helpful Elephant
Most recent memory usage benchmarking (Q3 release vs trunk):
Biggest improvements:
slowpickle
Mem max: 20048.000 -> 17732.000: 1.1306x smaller Usage over time: http://tinyurl.com/y8cbek4
spambayes
Mem max: 52200.000 -> 46632.000: 1.1194x smaller Usage over time: http://tinyurl.com/ybu9jo8
Total:
2to3
Mem max: 46776.000 -> 45940.000: 1.0182x smaller Usage over time: http://tinyurl.com/yavvxtx
django
Mem max: 26324.000 -> 26496.000: 1.0065x larger Usage over time: http://tinyurl.com/ye7wua8
html5lib
Mem max: 178044.000 -> 173952.000: 1.0235x smaller Usage over time: http://tinyurl.com/yaw2qs3
nbody
Mem max: 15852.000 -> 17372.000: 1.0959x larger Usage over time: http://tinyurl.com/ybgurwh
slowpickle
Mem max: 20048.000 -> 17732.000: 1.1306x smaller Usage over time: http://tinyurl.com/y8cbek4
slowspitfire
Mem max: 96972.000 -> 97652.000: 1.0070x larger Usage over time: http://tinyurl.com/ydx4q8z
slowunpickle
Mem max: 15144.000 -> 14844.000: 1.0202x smaller Usage over time: http://tinyurl.com/y8doccw
spambayes
Mem max: 52200.000 -> 46632.000: 1.1194x smaller Usage over time: http://tinyurl.com/ybu9jo8
Comment #13
Posted on Jan 15, 2010 by Helpful Elephant
TODO: finish http://codereview.appspot.com/97120/show, commit it.
Comment #14
Posted on Jan 25, 2010 by Helpful Elephant
r1034 lazily initializes the runtime feedback DenseMaps, which turns out to be a nice win on larger codebases:
$ ./perf.py -f -m ../unladen-trunk/python ../unladen-swallow/python
Report on Linux muikyl 2.6.31-17-generic #54-Ubuntu SMP Thu Dec 10 17:01:44 UTC 2009 x86_64 Total CPU cores: 8
2to3
Mem max: 35596.000 -> 33584.000: 1.0599x smaller Usage over time: http://tinyurl.com/ycn2n3u
django
Mem max: 36844.000 -> 31936.000: 1.1537x smaller Usage over time: http://tinyurl.com/yfl4x5g
nbody
Mem max: 22336.000 -> 21516.000: 1.0381x smaller Usage over time: http://tinyurl.com/yeghupn
slowpickle
Mem max: 22060.000 -> 20920.000: 1.0545x smaller Usage over time: http://tinyurl.com/y9cltou
slowspitfire
Mem max: 148188.000 -> 145324.000: 1.0197x smaller Usage over time: http://tinyurl.com/ygzvjjm
slowunpickle
Mem max: 16048.000 -> 14992.000: 1.0704x smaller Usage over time: http://tinyurl.com/ybrfyta
spambayes
Mem max: 39160.000 -> 35464.000: 1.1042x smaller Usage over time: http://tinyurl.com/yeuj4ga
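
The pattern behind the r1034 change is roughly the following sketch (illustrative types, not the actual code): keep a pointer to the feedback DenseMap and allocate it only when the first piece of feedback is recorded, so cold code objects never pay DenseMap's up-front bucket allocation.

// Illustrative sketch, not the actual r1034 change.
#include "llvm/ADT/DenseMap.h"

struct FeedbackRecord { int count = 0; };

struct CodeFeedback {
    // nullptr until the first feedback event; cold code objects never allocate.
    llvm::DenseMap<unsigned, FeedbackRecord> *map = nullptr;

    FeedbackRecord &RecordAt(unsigned opcode_index) {
        if (!map)  // lazily create the map (and its bucket array) on first use
            map = new llvm::DenseMap<unsigned, FeedbackRecord>();
        return (*map)[opcode_index];
    }

    ~CodeFeedback() { delete map; }
};

int main() {
    CodeFeedback cold, hot;    // only `hot` ever pays for a DenseMap
    hot.RecordAt(3).count++;
    (void)cold;
    return 0;
}

Larger programs create proportionally more code objects that never record anything, which fits the observation above that the win is biggest on larger codebases.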
Comment #15
Posted on Jan 27, 2010 by Helpful Elephant
r1044 shaved ~350 KB off memory usage across the board. Another low-hanging sliver down; more to go.
Comment #16
Posted on Jan 27, 2010 by Helpful Elephant
Idea from Joerg: introduce a two-phase hotness model, where we only collect feedback after code objects have passed a certain minimum hotness threshold. This will take some careful tuning, since if the feedback threshold is too low, we may introduce more bail sites and degrade performance.
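
A rough C++ sketch of the two-phase idea follows. The threshold constants are illustrative placeholders (comment #17 below experiments with co_hotness > 50), and the names do not match the real implementation.

// Sketch only; threshold values are placeholders, not tuned numbers.
#include <cstdio>

constexpr long kFeedbackThreshold = 50;      // phase 2: start gathering feedback
constexpr long kCompileThreshold  = 100000;  // phase 3: hand off to the JIT

struct CodeObject {
    long co_hotness = 0;  // bumped on calls and loop backedges, as today
};

// Stand-in for touching the feedback DenseMaps; the point is that cold code
// never reaches this call, so it never allocates feedback structures.
static void RecordFeedback(CodeObject *co) {
    std::printf("feedback recorded at hotness %ld\n", co->co_hotness);
}

static void OnCallOrBackedge(CodeObject *co) {
    ++co->co_hotness;
    if (co->co_hotness <= kFeedbackThreshold)
        return;                  // phase 1: just count, allocate nothing
    RecordFeedback(co);          // phase 2: the code has proven mildly hot
    if (co->co_hotness > kCompileThreshold) {
        // phase 3: compile. The gap between the two thresholds controls how
        // much feedback the compiler gets to see, which is the tuning knob
        // (and the bail-site risk) mentioned above.
    }
}

int main() {
    CodeObject co;
    for (int i = 0; i < 55; ++i)
        OnCallOrBackedge(&co);   // feedback only starts after the 50th event
    return 0;
}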
Comment #17
Posted on Feb 4, 2010 by Helpful Elephant
I've been experimenting with a two-phase hotness model, where we only gather feedback if co_hotness > 50 (or 100, or some other low number). I've added instrumentation to count the number of feedback maps created:
$ ./perf.py -b nqueens --diff_instrumentation
Feedback maps created:
../a-inst/python N: 149
../b-inst/python N: 49

$ ./perf.py -b django --diff_instrumentation
Feedback maps created:
../a-inst/python N: 526
../b-inst/python N: 110
So we're definitely reducing the number of feedback maps created. But it doesn't matter:
$ ./perf.py -b nqueens --track_memory ../a/python ../b/python
nqueens
Mem max: 14224.000 -> 14100.000: 1.0088x smaller Usage over time: http://tinyurl.com/ykzofp2
$ ./perf.py -b slowspitfire --track_memory ../a/python ../b/python
slowspitfire
Mem max: 95540.000 -> 94464.000: 1.0114x smaller Usage over time: http://tinyurl.com/ydzt7t8
$ ./perf.py -b django --track_memory ../a/python ../b/python
django
Mem max: 23752.000 -> 23576.000: 1.0075x smaller Usage over time: http://tinyurl.com/yeqgk5n
Reducing the amount of feedback data also results in worse branch prediction and other optimization degradations.
$ pprof --text ./python unladen-heap.0006.heap | head
Total: 6.0 MB
3.0  49.9%  49.9%   3.0  49.9%  new_arena (inline)
1.6  26.3%  76.2%   1.6  26.3%  llvm::DenseMap::init (inline)
We use DenseMap for feedback, but we're not the only client: LLVM uses it all over the place. It may be that some optimization passes are keeping around data longer than we need/want them to. I'm not sure our feedback maps are the culprit here.
Comment #18
Posted on Feb 5, 2010 by Helpful Elephant
Follow-up from pprof about feedback maps:
$ pprof --text ./python /tmp/control-nqueens.hprof.0003.heap | grep FeedbackMap_New
0.0   0.1%  99.6%   0.0   4.6%  PyFeedbackMap_New
$ pprof --text ./python /tmp/control-django.hprof.0002.heap | grep FeedbackMap_New
0.0   0.1%  99.3%   0.2   4.9%  PyFeedbackMap_New
$ pprof --text ./python /tmp/control-2to3.hprof.0003.heap | grep FeedbackMap_New
0.0   0.0%  99.7%   0.2   2.4%  PyFeedbackMap_New
At 200k max, I don't think that's the problem. I'm mining pprof/tcmalloc data for more leads.
Be sure to run Unladen with -g; that changes the profile.
Comment #19
Posted on May 21, 2010 by Helpful Elephant
(No comment was entered for this change.)
Status: Started
Labels:
Type-Defect
Priority-High
Release-Merger