I've seen some cases where 2009Q2 uses 10x as much memory as 2009Q1. We need to track this down and eliminate as much of this bloat as possible. There will be some increase necessary since we're emitting machine code, but that doesn't account for a 10x blowup.
Things to look into:
- Memory blowup with ./python -j whenhot
- Memory blowup with ./python -j always
- Memory blowup just from linking in LLVM
- How much memory goes into the machine code vs the LLVM IR? The optimization passes?
perf.py's --track_memory option can help get rough estimates and keep track of macro-level progress.
Comment #1
Posted on Jul 17, 2009 by Helpful Elephant
The current theory is that this comes from keeping the LLVM IR around after we've compiled the function to native code. Some preliminary measurements indicate that loading Django under -j always consumes over 65 MB in llvm::Functions and their child objects.
Comment #2
Posted on Jul 19, 2009 by Swift Camel
How difficult would it be to rig up a C++ heap checker? I'm not familiar with any tools like this, but I think we need to do a couple of things: make sure we have no leaks, and figure out which structures are eating all of our memory. Are there any tools out there for doing that?
Comment #3
Posted on Jul 19, 2009 by Helpful Cat
On OSX, I've used the MallocDebug tool to track down large memory users. Massif (http://valgrind.org/docs/manual/ms-manual.html) looks even better for Linux. I got a recommendation for KCacheGrind a while ago, but it looks more like a speed tool than a memory tool.
Comment #4
Posted on Jul 20, 2009 by Helpful Ox
Purify (now IBM Rational Purify, http://www-01.ibm.com/software/awdtools/purify/) used to be an excellent tool for finding issues in memory management. It didn't need source code instrumentation; in fact, it didn't need source code at all. It rewrote the assembly, injecting code around memory access instructions to trap and monitor memory accesses, so it could be used on third-party code as well.
Comment #5
Posted on Jul 21, 2009 by Helpful Elephant
(No comment was entered for this change.)
Comment #6
Posted on Jul 24, 2009 by Helpful Elephant
The theory about LLVM is correct: r775 makes -j always compile all executed functions, rather than all functions present in the system. This has a significant effect on memory usage. Measured with time -l on Darwin, max RSS for bm_django.py -n 1 with -j always goes from 165MB to 70MB.
Analysis using a prototype _llvm.dump_stats():
Old -j always (at definition):
LLVM functions: 1833
LLVM basic blocks: 229867
LLVM instructions: 1321540
LLVM data size: 63159048 bytes

New -j always (at execution):
LLVM functions: 342
LLVM basic blocks: 39870
LLVM instructions: 267272
LLVM data size: 12479360 bytes
There's still plenty of headroom here, and I have a CL in progress to further reduce memory usage. I need to rerun perf.py --track_memory to see the full effect on all benchmarks.
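
To make the "at definition" vs. "at execution" distinction concrete, here is a minimal C++ sketch of the idea. All names here (CodeObject, CompileToNative, and so on) are illustrative stand-ins, not the real Unladen Swallow interfaces.

// Illustrative sketch only; these names do not match the real Unladen Swallow code.
#include <cstdio>

struct CodeObject {
    const char *name = "";
    bool compiled = false;        // has native code been generated yet?
    void *native_entry = nullptr;
};

// Stand-in for "lower to LLVM IR, run the optimization passes, JIT to native
// code". In the real system this is where the IR and machine-code memory goes.
static void *CompileToNative(CodeObject *co) {
    std::printf("compiling %s\n", co->name);
    return static_cast<void *>(co);  // dummy entry point
}

// Pre-r775 -j always ("at definition"): compile as soon as the function is
// defined, even if it is never called.
static void OnDefinition(CodeObject *co) {
    co->native_entry = CompileToNative(co);
    co->compiled = true;
}

// r775 -j always ("at execution"): compile lazily, on first call, so code that
// is merely imported but never run costs no IR and no machine code.
static void OnFirstCall(CodeObject *co) {
    if (!co->compiled) {
        co->native_entry = CompileToNative(co);
        co->compiled = true;
    }
}

int main() {
    // Old model: a function that is defined but never run still gets compiled.
    CodeObject never_run;
    never_run.name = "defined_but_never_run";
    OnDefinition(&never_run);

    // New model: compilation is paid only when the function actually executes.
    CodeObject actually_run;
    actually_run.name = "actually_executed";
    OnFirstCall(&actually_run);
    return 0;
}

Under the lazy model, IR and machine-code cost scales with how much of the loaded code actually runs, which is consistent with the sharp drop in the function/block/instruction counts above.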
Comment #7
Posted on Jul 24, 2009 by Helpful Elephant
Benchmarking the memory usage of r774 vs r775 with --track_memory on Linux 2.6.18.5:
perf.py command: ./perf.py --track_memory --args "-j always" -b django,slowspitfire,slowpickle,slowunpickle,2to3 ../a/python ../b/python
Note that in the graphs, "../a/python" is the control binary (r774), "../b/python" is the experiment binary (r775). This confirms that this is a significant reduction in memory usage for -j always.
2to3:
59.246994 -> 57.129315: 3.71% faster
Mem max: 141136.000 -> 85224.000: 65.61% smaller Usage over time: http://tinyurl.com/nlkf6g

django:
Min: 1.122529 -> 1.075815: 4.34% faster
Avg: 1.128422 -> 1.077450: 4.73% faster
Significant (t=45.483821, a=0.95)
Stddev: 0.00721 -> 0.00330: 118.44% smaller
Mem max: 158816.000 -> 67560.000: 135.07% smaller Usage over time: http://tinyurl.com/m7urd2

slowpickle:
Min: 0.706008 -> 0.702213: 0.54% faster
Avg: 0.708631 -> 0.703975: 0.66% faster
Significant (t=5.179222, a=0.95)
Stddev: 0.00575 -> 0.00271: 112.51% smaller
Mem max: 96592.000 -> 68360.000: 41.30% smaller Usage over time: http://tinyurl.com/l4lk7s

slowspitfire:
Min: 0.966542 -> 0.796440: 21.36% faster
Avg: 0.981962 -> 0.802562: 22.35% faster
Significant (t=119.996555, a=0.95)
Stddev: 0.00879 -> 0.00588: 49.42% smaller
Mem max: 212400.000 -> 132880.000: 59.84% smaller Usage over time: http://tinyurl.com/lncr8a

slowunpickle:
Min: 0.309612 -> 0.308172: 0.47% faster
Avg: 0.310206 -> 0.308482: 0.56% faster
Significant (t=29.988479, a=0.95)
Stddev: 0.00036 -> 0.00018: 98.98% smaller
Mem max: 96592.000 -> 68360.000: 41.30% smaller Usage over time: http://tinyurl.com/npo6rs
Note too that the performance improvement for slowspitfire doesn't hold when rerun without -j always and with --rigorous.
Comment #8
Posted on Aug 4, 2009 by Helpful Elephant
Mailed out a change for review that will throw away the LLVM IR after native code generation. This looks like a promising reduction for -j always.
Comment #9
Posted on Oct 12, 2009 by Helpful Elephant
With the new hotness model introduced in r862, memory usage has decreased significantly vs the Q2 release with -j always:
$ ./perf.py -r -b slowspitfire --args "-j always," --track_memory ../q2/python ../q3/python
Mem max: 212344.000 -> 96884.000: 119.17% smaller Usage over time: http://tinyurl.com/yfy3w3p

$ ./perf.py -r -b ai --args "-j always," --track_memory ../q2/python ../q3/python
Mem max: 95012.000 -> 14020.000: 577.69% smaller Usage over time: http://tinyurl.com/yz7v4xj

$ ./perf.py -r -b slowpickle --args "-j always," --track_memory ../q2/python ../q3/python
Mem max: 96876.000 -> 18996.000: 409.98% smaller Usage over time: http://tinyurl.com/yf4a3sj

$ ./perf.py -r -b slowunpickle --args "-j always," --track_memory ../q2/python ../q3/python
Mem max: 96876.000 -> 14076.000: 588.24% smaller Usage over time: http://tinyurl.com/yfzv2mn

$ ./perf.py -r -b django --args "-j always," --track_memory ../q2/python ../q3/python
Mem max: 159064.000 -> 27160.000: 485.66% smaller Usage over time: http://tinyurl.com/ykdmdml

$ ./perf.py -r -b rietveld --args "-j always," --track_memory ../q2/python ../q3/python
Mem max: 575116.000 -> 55952.000: 927.87% smaller Usage over time: http://tinyurl.com/yf3rcbb
We still have work to do to get closer to the 2.6 baseline (example):
$ ./perf.py -r -b rietveld --track_memory ../26/python ../q3/python
Mem max: 14948.000 -> 55952.000: 73.28% larger Usage over time: http://tinyurl.com/ykcsbcs
That's not great, but it's way better than it used to be. Pushing future work on this to Q4.
Comment #10
Posted on Oct 26, 2009 by Helpful Elephant
r876 deletes LLVM IR after compiling to native code. This is a ~5% drop in memory for most benchmarks, but Rietveld sees a 37% drop:
Before:
>>> _llvm.dump_stats()
LLVM functions: 279
LLVM basic blocks: 25079
LLVM instructions: 136274

After:
>>> _llvm.dump_stats()
LLVM functions: 279
LLVM basic blocks: 45
LLVM instructions: 209
Mem max: 55936.000 -> 40804.000: 37.08% smaller Usage over time: http://tinyurl.com/yj4esk2
See http://codereview.appspot.com/99044 for more stats.
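
For illustration, the general shape of the change looks roughly like the C++ sketch below, written against the old LLVM JIT interface of that era (ExecutionEngine::getPointerToFunction). This is not the r876 patch itself, and header paths and the JIT API differ in newer LLVM releases.

// Rough sketch of the technique, not the actual r876 patch. Uses the old LLVM
// JIT interface; header paths and the JIT API have changed in later releases.
#include "llvm/ExecutionEngine/ExecutionEngine.h"
#include "llvm/Function.h"

// JIT-compile a function, then drop its IR body. The machine code is owned by
// the JIT's memory manager and stays alive; the basic blocks and instructions
// (the bulk of the dump_stats() numbers above) are freed.
void *CompileAndDiscardIR(llvm::ExecutionEngine *engine, llvm::Function *fn) {
    void *native = engine->getPointerToFunction(fn);  // forces code generation
    fn->deleteBody();  // turns fn into a bare declaration, releasing its IR
    return native;
}

The dump_stats() output above shows exactly this effect: the function count stays at 279, but nearly all of the basic blocks and instructions go away.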
Rietveld vs Python 2.6.1 (./perf.py -r -b rietveld --track_memory ../26/python ../obj/python):
Mem max: 14884.000 -> 40040.000: 62.83% larger Usage over time: http://tinyurl.com/yj2nm3u
That's heading in the right direction.
Comment #11
Posted on Jan 6, 2010 by Helpful Elephant
(No comment was entered for this change.)
Comment #12
Posted on Jan 11, 2010 by Helpful Elephant
Most recent memory usage benchmarking (Q3 release vs trunk):
Biggest improvements:
slowpickle
Mem max: 20048.000 -> 17732.000: 1.1306x smaller Usage over time: http://tinyurl.com/y8cbek4
spambayes
Mem max: 52200.000 -> 46632.000: 1.1194x smaller Usage over time: http://tinyurl.com/ybu9jo8
Total:
2to3
Mem max: 46776.000 -> 45940.000: 1.0182x smaller Usage over time: http://tinyurl.com/yavvxtx
django
Mem max: 26324.000 -> 26496.000: 1.0065x larger Usage over time: http://tinyurl.com/ye7wua8
html5lib
Mem max: 178044.000 -> 173952.000: 1.0235x smaller Usage over time: http://tinyurl.com/yaw2qs3
nbody
Mem max: 15852.000 -> 17372.000: 1.0959x larger Usage over time: http://tinyurl.com/ybgurwh
slowpickle
Mem max: 20048.000 -> 17732.000: 1.1306x smaller Usage over time: http://tinyurl.com/y8cbek4
slowspitfire
Mem max: 96972.000 -> 97652.000: 1.0070x larger Usage over time: http://tinyurl.com/ydx4q8z
slowunpickle
Mem max: 15144.000 -> 14844.000: 1.0202x smaller Usage over time: http://tinyurl.com/y8doccw
spambayes
Mem max: 52200.000 -> 46632.000: 1.1194x smaller Usage over time: http://tinyurl.com/ybu9jo8
Comment #13
Posted on Jan 15, 2010 by Helpful Elephant
TODO: finish http://codereview.appspot.com/97120/show, commit it.
Comment #14
Posted on Jan 25, 2010 by Helpful Elephant
r1034 lazily initializes the runtime feedback DenseMaps, which turns out to be a nice win on larger codebases:
$ ./perf.py -f -m ../unladen-trunk/python ../unladen-swallow/python
Report on Linux muikyl 2.6.31-17-generic #54-Ubuntu SMP Thu Dec 10 17:01:44 UTC 2009 x86_64 Total CPU cores: 8
2to3
Mem max: 35596.000 -> 33584.000: 1.0599x smaller Usage over time: http://tinyurl.com/ycn2n3u
django
Mem max: 36844.000 -> 31936.000: 1.1537x smaller Usage over time: http://tinyurl.com/yfl4x5g
nbody
Mem max: 22336.000 -> 21516.000: 1.0381x smaller Usage over time: http://tinyurl.com/yeghupn
slowpickle
Mem max: 22060.000 -> 20920.000: 1.0545x smaller Usage over time: http://tinyurl.com/y9cltou
slowspitfire
Mem max: 148188.000 -> 145324.000: 1.0197x smaller Usage over time: http://tinyurl.com/ygzvjjm
slowunpickle
Mem max: 16048.000 -> 14992.000: 1.0704x smaller Usage over time: http://tinyurl.com/ybrfyta
spambayes
Mem max: 39160.000 -> 35464.000: 1.1042x smaller Usage over time: http://tinyurl.com/yeuj4ga
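
The pattern behind the r1034 change is roughly the following sketch (illustrative types, not the actual code): keep a pointer to the feedback DenseMap and allocate it only when the first piece of feedback is recorded, so cold code objects never pay DenseMap's up-front bucket allocation.

// Illustrative sketch, not the actual r1034 change.
#include "llvm/ADT/DenseMap.h"

struct FeedbackRecord { int count = 0; };

struct CodeFeedback {
    // nullptr until the first feedback event; cold code objects never allocate.
    llvm::DenseMap<unsigned, FeedbackRecord> *map = nullptr;

    FeedbackRecord &RecordAt(unsigned opcode_index) {
        if (!map)  // lazily create the map (and its bucket array) on first use
            map = new llvm::DenseMap<unsigned, FeedbackRecord>();
        return (*map)[opcode_index];
    }

    ~CodeFeedback() { delete map; }
};

int main() {
    CodeFeedback cold, hot;    // only `hot` ever pays for a DenseMap
    hot.RecordAt(3).count++;
    (void)cold;
    return 0;
}

Larger programs create proportionally more code objects that never record anything, which fits the observation above that the win is biggest on larger codebases.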
Comment #15
Posted on Jan 27, 2010 by Helpful Elephant
r1044 shaved ~350 KB off memory usage across the board. Another low-hanging sliver down; more to go.
Comment #16
Posted on Jan 27, 2010 by Helpful Elephant
Idea from Joerg: introduce a two-phase hotness model, where we only collect feedback after code objects have passed a certain minimum hotness threshold. This will take some careful tuning, since if the feedback threshold is too low, we may introduce more bail sites and degrade performance.
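
A rough C++ sketch of the two-phase idea follows. The threshold constants are illustrative placeholders (comment #17 below experiments with co_hotness > 50), and the names do not match the real implementation.

// Sketch only; threshold values are placeholders, not tuned numbers.
#include <cstdio>

constexpr long kFeedbackThreshold = 50;      // phase 2: start gathering feedback
constexpr long kCompileThreshold  = 100000;  // phase 3: hand off to the JIT

struct CodeObject {
    long co_hotness = 0;  // bumped on calls and loop backedges, as today
};

// Stand-in for touching the feedback DenseMaps; the point is that cold code
// never reaches this call, so it never allocates feedback structures.
static void RecordFeedback(CodeObject *co) {
    std::printf("feedback recorded at hotness %ld\n", co->co_hotness);
}

static void OnCallOrBackedge(CodeObject *co) {
    ++co->co_hotness;
    if (co->co_hotness <= kFeedbackThreshold)
        return;                  // phase 1: just count, allocate nothing
    RecordFeedback(co);          // phase 2: the code has proven mildly hot
    if (co->co_hotness > kCompileThreshold) {
        // phase 3: compile. The gap between the two thresholds controls how
        // much feedback the compiler gets to see, which is the tuning knob
        // (and the bail-site risk) mentioned above.
    }
}

int main() {
    CodeObject co;
    for (int i = 0; i < 55; ++i)
        OnCallOrBackedge(&co);   // feedback only starts after the 50th event
    return 0;
}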
Comment #17
Posted on Feb 4, 2010 by Helpful Elephant
I've been experimenting with a two-phase hotness model, where we only gather feedback if co_hotness > 50 (or 100, or some other low number). I've added instrumentation to count the number of feedback maps created:
$ ./perf.py -b nqueens --diff_instrumentation
Feedback maps created:
../a-inst/python N: 149
../b-inst/python N: 49

$ ./perf.py -b django --diff_instrumentation
Feedback maps created:
../a-inst/python N: 526
../b-inst/python N: 110
So we're definitely reducing the number of feedback maps created. But it doesn't matter:
$ ./perf.py -b nqueens --track_memory ../a/python ../b/python
nqueens
Mem max: 14224.000 -> 14100.000: 1.0088x smaller Usage over time: http://tinyurl.com/ykzofp2
$ ./perf.py -b slowspitfire --track_memory ../a/python ../b/python
slowspitfire
Mem max: 95540.000 -> 94464.000: 1.0114x smaller Usage over time: http://tinyurl.com/ydzt7t8
$ ./perf.py -b django --track_memory ../a/python ../b/python
django
Mem max: 23752.000 -> 23576.000: 1.0075x smaller Usage over time: http://tinyurl.com/yeqgk5n
Reducing the amount of feedback data also results in worse branch prediction and other optimization degradations.
$ pprof --text ./python unladen-heap.0006.heap | head
Total: 6.0 MB
3.0  49.9%  49.9%   3.0  49.9%  new_arena (inline)
1.6  26.3%  76.2%   1.6  26.3%  llvm::DenseMap::init (inline)
We use DenseMap for feedback, but we're not the only client: LLVM uses it all over the place. It may be that some optimization passes are keeping around data longer than we need/want them to. I'm not sure our feedback maps are the culprit here.
Comment #18
Posted on Feb 5, 2010 by Helpful Elephant
Follow-up from pprof about feedback maps:
$ pprof --text ./python /tmp/control-nqueens.hprof.0003.heap | grep FeedbackMap_New
0.0   0.1%  99.6%   0.0   4.6%  PyFeedbackMap_New
$ pprof --text ./python /tmp/control-django.hprof.0002.heap | grep FeedbackMap_New
0.0   0.1%  99.3%   0.2   4.9%  PyFeedbackMap_New
$ pprof --text ./python /tmp/control-2to3.hprof.0003.heap | grep FeedbackMap_New
0.0   0.0%  99.7%   0.2   2.4%  PyFeedbackMap_New
At 200k max, I don't think that's the problem. I'm mining pprof/tcmalloc data for more leads.
Be sure to run Unladen with -g; that changes the profile.
Comment #19
Posted on May 21, 2010 by Helpful Elephant
(No comment was entered for this change.)
Status: Started
Labels:
Type-Defect
Priority-High
Release-Merger