
improve performance of pointer scan for leak reachability analysis #151

Open
derekbruening opened this issue Nov 28, 2014 · 6 comments

@derekbruening (Contributor)

From derek.br...@gmail.com on December 10, 2010 17:58:04

PR 475518

My initial implementation of scanning for pointers for reachability analysis
produces a noticeable one-second delay at app exit (and at nudge time for
PR 428709). This case covers trying to speed that up. It's not identifying
the defined regions that's slow: it's the pointer scan itself.
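
For context, a minimal sketch of what the scan loop does. The helper names
are borrowed from the profile further down this thread (heap_region_bounds)
or invented (mark_reachable); the signatures are assumptions, not the
actual drmemory code:

#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/* assumed helpers, standing in for the real leak-check internals */
bool heap_region_bounds(uintptr_t val, uintptr_t *base, size_t *size);
void mark_reachable(uintptr_t base, size_t size);

/* scan one defined region, a pointer-sized word at a time, for
 * values that land inside a live heap allocation */
static void
scan_region(const uintptr_t *start, const uintptr_t *end)
{
    for (const uintptr_t *pc = start; pc < end; pc++) {
        uintptr_t base;
        size_t size;
        if (heap_region_bounds(*pc, &base, &size))
            mark_reachable(base, size);
    }
}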

Adding to this case a FIXME from my code that may improve perf noticeably:
skipping regions that have been read-only since being loaded (.text,
.rodata, etc.). We'll need extra tracking, since today we can't tell
whether a region was made writable and then changed back to read-only.
We could also skip writable regions that have not been written to since
being loaded, since they can't point into the heap. Technically the app
could write a heap pointer to a file and then mmap it in, but we'll
ignore that.

My initial implementation is going to skip r-x regions, but I'm not going
to skip all non-writable regions. This case covers adding
modified-since-load info, both to avoid false negatives with r-x regions
and to skip more non-writable ones.
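
A minimal sketch of the modified-since-load tracking this case asks for,
assuming a hook on protection-change syscalls (mprotect here; all names
are hypothetical, not the actual implementation). The key point is that
the flag is sticky: a region that was made writable and then changed back
to read-only still gets scanned:

#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>

typedef struct _region_info_t {
    void *base;
    size_t size;
    bool ever_writable;  /* made writable at any point since load */
} region_info_t;

/* called from a post-mprotect syscall hook (hypothetical) */
void
note_protection_change(region_info_t *info, int new_prot)
{
    if (new_prot & PROT_WRITE)
        info->ever_writable = true;
    /* deliberately never cleared: once writable, the region may hold
     * heap pointers even if it is read-only again at scan time */
}

/* the scan can skip regions that were never writable since load */
bool
scan_can_skip(const region_info_t *info)
{
    return !info->ever_writable;
}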

The scan does take substantial time on large apps. We should probably fix
up -no_count_leaks and document it, and perhaps make it the default, so
that nudges can be used just for updating the error summary without
waiting 8 minutes. Just turning it off today leads to bugs: maybe
somebody else needs op_record_allocs in alloc.c. We still record for
pre_us.

For PR 485354 I have the leak scan on Windows skipping read-only image
regions; once this case adds history tracking, that can be done for *nix
too.

Server: perforce-panda.eng.vmware.com:1985

PR 520916: leak-check-only mode without losing accuracy

  • added -leaks_only option that does not do any shadowing
  • added -shadowing option as an internal debugging method for
    disabling everything except malloc instrumentation.
    it could be more thorough (e.g., don't need known_table, etc.) but
    should prove useful. this is what my original -leaks_only experiment
    used so I kept the code under -no_shadowing when I added stack zeroing.
  • split os_shared_{pre,post}syscall() from os_shadow{pre,post}_syscall()
    for more than just options.shadowing: also for PR 536878 where
    drheapstat will need fork-following
  • to rule out beyond-TOS, needs xsp in each thread at exit time: for that
    I needed PR 536058 in DR (diff sent earlier, included in this tree since
    needed to test)
  • to rule out stale stack entries, added zeroing on stack allocation by
    adapting drmem's adjust_esp shadow code
  • for locating live mallocs inside an unmapped heap arena, which is done
    using shadow info for regular drmem, I'm brute-force looking up every
    8 bytes (see the sketch after this list), but only on Windows, since
    I'm assuming glibc malloc never unmaps an arena that contains live
    mallocs. xref PR 535568's attempt to replace the malloc table w/ an
    interval tree: too expensive!
  • added a -leaks_only test
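
A sketch of the brute-force arena walk referenced above
(malloc_table_lookup and report_live_malloc are hypothetical stand-ins
for the malloc table interface, not the actual functions):

#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

bool malloc_table_lookup(uintptr_t addr, size_t *size_out); /* assumed: exact-base lookup */
void report_live_malloc(uintptr_t addr, size_t size);       /* assumed */

static void
walk_unmapped_arena(uintptr_t arena_start, uintptr_t arena_end)
{
    /* malloc starts are assumed 8-aligned, so probe every 8 bytes */
    for (uintptr_t addr = arena_start; addr < arena_end; addr += 8) {
        size_t size;
        if (malloc_table_lookup(addr, &size))
            report_live_malloc(addr, size);
    }
}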

PR 485354: # possible leaks nondet on nudge test on Windows
My leaks-only test hit PR 485354 so I investigated:

  • it's coming from .pdata RVA entries
  • I have Windows scans skipping read-only image sections now:
    PR 475518 covers doing that properly and on both platforms by
    monitoring history

The beyond-TOS + zero-on-stack-alloc seems to be working in practice as far
as I can tell (apps aren't exactly deterministic but results
seem to match pretty well w/ full shadowing). Zeroing is not 100%
transparent but it's close enough for me. We'll need more data on more
apps but I think it's going to work out, and the overhead should be a
strict subset of my original shadow-writes proposal. Unfortunately the
overhead is still significant: 2.2x on crafty (vs 1.6x for DrHeapstat).

xref PR 536878: add leak checking feature to Dr. Heapstat a la Dr. Memory
-leaks_only

future work under PR 539395: improve accuracy of -leaks_only:

  • issues w/ ignoring beyond-TOS w/ alternate stacks:
    • if currently on alt sigstack will look beyond TOS on main stack
    • if currently on main stack will look at whole alt stack
      could intercept signal + sigreturn.
  • are there uninit fields in kernel stack allocs for signal/Ki?
  • use an interval tree of mallocs (despite perf cost: see PR 535568), or
    some other way to improve accuracy and efficiency of locating live
    mallocs in an unmapped arena (see the sketch after this list)
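
To make that last bullet concrete, a sketch of the contrast between the
two lookup schemes (hypothetical signatures, not the actual APIs): the
hashtable keyed by malloc base answers only exact-base queries, which is
what forces the 8-byte probing sketched earlier, while an interval tree
can answer containment queries for any interior address.

#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/* hashtable keyed by malloc base: answers "is addr the start of a
 * live malloc?" only (assumed interface) */
bool malloc_table_lookup(uintptr_t base, size_t *size_out);

/* interval tree over [base, base+size): answers "does addr fall
 * inside a live malloc?" for any interior address (assumed interface) */
bool malloc_tree_overlap(uintptr_t addr, uintptr_t *base_out, size_t *size_out);

With the tree, locating live mallocs in an unmapped arena becomes a range
query per arena rather than a probe every 8 bytes, which is the
accuracy/efficiency trade-off this bullet weighs against PR 535568's
measured cost.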

Original issue: http://code.google.com/p/drmemory/issues/detail?id=151

@derekbruening (Contributor, Author)

From bruen...@google.com on January 12, 2012 11:37:45

wow looks like I pasted too much stuff into the initial entry text above

xref issue #568 (parallelize leak scan)
xref issue #711 (improve callstack walk)

** INFO breakdown of -leaks_only -no_zero_stack on ui_tests

this is NPAPITesterBase.NPObjectProxy

Process Name               64-bit   Timer samples (%)
ntoskrnl.exe               1        45.35
dynamorio.dll                       22.47
drmemorylib.dll                     12.31
dbghelp.dll                         10.52
ntdll.dll                           3.04
wow64.dll                  1        2.89
wow64cpu.dll               1        2.26
Parity.sys                 1        0.4
unknown module pid (6804)  1        0.27
9 modules, Total: 19377 samples, 99.51% of shown samples

ntoskrnl.exe:
CS:EIP       Symbol + Offset                  Timer samples (%)
0x2903a74    PsReturnProcessPagedPoolQuota    50.04
0x289b228    PsIsProtectedProcess             24.45
0x28cb490    KeSynchronizeExecution           10.79
0x2baa350    ObReferenceObjectByHandle        8.04
0x2916c24    KeQueryPriorityThread            1.12
0x28c65f0    memmove                          0.88

dynamorio.dll:
CS:EIP       Symbol + Offset                  Timer samples (%)
0x6fe9b780   mutex_unlock                     15.34
0x6fe9b740   mutex_trylock                    14.65
0x6fee2470   dr_mutex_lock                    12.55
0x6fee24d0   dr_mutex_unlock                  11.06
0x6fe9d460   mutex_lock                       10.17
0x6fe611f4   dynamorio_syscall_wow64          6.17
0x6fee7290   get_thread_private_dcontext      2.9
0x6fef70b0   query_virtual_memory             2.67
0x6fea24c0   common_heap_alloc                1.51
0x6fee7b70   query_memory_ex                  1.28

drmemorylib.dll:
CS:EIP       Symbol + Offset                  Timer samples (%)
0x6ffe7340   rb_in_node                       59.78
0x6ffe5160   check_reachability_pointer       7.55
0x6ffee6c0   hashtable_lookup                 7.05
0x6ffee4b0   hash_key                         6.05
0x6ffd90c0   heap_region_bounds               6.01
0x6ffe5310   check_reachability_helper        3.75
0x6ffd6cf0   get_decoded_ptr                  2.21
0x6ffee00a   dr_mutex_unlock                  1.67
0x6ffee010   dr_mutex_lock                    1.63

dbghelp.dll:
CS:EIP       Symbol + Offset                  Timer samples (%)
0x54d730     MiniDumpReadDumpStream           99.51

analysis:

  • rb_in_node does highlight tree lookup as a potential issue
  • however, rb_in_node is 60% of 12% => 7% of total time, while the
    kernel is 45%. I'm not sure the symbols in the kernel breakdown are
    useful b/c we don't have private syms. TLB misses are handled in
    hardware on x86, so we'd only be in the kernel on a page fault: I'm
    wondering if most of this is page handling during the scan.

Parallelizing seems most promising for large enough apps: no synch is
needed on the malloc lookup, since the table is read-only at that point.
Synch is needed when updating the leak data structs, but the common case
during the scan is a non-heap pointer.
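
A sketch of that division of labor, using pthreads purely for
illustration (helper names assumed, as in the earlier sketches): workers
scan disjoint chunks, look up the read-only malloc table without a lock,
and take a mutex only on the rare heap-pointer hit.

#include <pthread.h>
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

bool heap_region_bounds(uintptr_t val, uintptr_t *base, size_t *size); /* assumed; read-only, lock-free */
void mark_reachable(uintptr_t base, size_t size);                      /* assumed; mutates leak data */

static pthread_mutex_t leak_lock = PTHREAD_MUTEX_INITIALIZER;

typedef struct {
    const uintptr_t *start;
    const uintptr_t *end;
} scan_chunk_t;

static void *
scan_worker(void *arg)
{
    scan_chunk_t *chunk = (scan_chunk_t *) arg;
    for (const uintptr_t *pc = chunk->start; pc < chunk->end; pc++) {
        uintptr_t base;
        size_t size;
        /* common case: not a heap pointer => no synch at all */
        if (heap_region_bounds(*pc, &base, &size)) {
            pthread_mutex_lock(&leak_lock);
            mark_reachable(base, size);
            pthread_mutex_unlock(&leak_lock);
        }
    }
    return NULL;
}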

Owner: bruen...@google.com
Labels: -Priority-Low -Type-Defect Priority-Medium Type-Enhancement Component-LeakCheck Performance

@derekbruening (Contributor, Author)

From bruen...@google.com on January 13, 2012 07:25:18

running entire unit_tests suite (minus tools/valgrind exclusions):
RUN: -no_check_uninitialized
[==========] 3554 tests from 577 test cases ran. (1636274 ms total)
[==========] 3554 tests from 577 test cases ran. (1479432 ms total)
[==========] 3554 tests from 577 test cases ran. (1518962 ms total)
RUN: -no_check_uninitialized -no_leak_scan
[==========] 3554 tests from 577 test cases ran. (1370891 ms total)
[==========] 3554 tests from 577 test cases ran. (1409782 ms total)
[==========] 3554 tests from 577 test cases ran. (1382568 ms total)
RUN: -no_check_uninitialized -no_count_leaks
[==========] 3554 tests from 577 test cases ran. (1198297 ms total)
[==========] 3554 tests from 577 test cases ran. (1209558 ms total)
[==========] 3554 tests from 577 test cases ran. (1260151 ms total)

getting the average in mins:

25.7482 -no_check_uninitialized
23.1291 -no_check_uninitialized -no_leak_scan
20.3778 -no_check_uninitialized -no_count_leaks
=>
scan: 25.75 - 23.13 = 2.6 mins
rest of leak handling: 23.13 - 20.38 = 2.8 mins
leak total: 5.4 mins on the 20.4-min -no_count_leaks base = 26% perf hit

I wonder how many sub-processes there are, for the scan to take that
long; it bears investigation whether that time is all in the final scan.

The suite log dir shows 9 sets of logdirs, which makes sense given the 3
configs x 3 runs. Each run has 10 unit_tests.exe logdirs, so it's 10
scans that are taking the 2.6 mins. They're probably not all equal, but
that's an average of 16s each, which seems more reasonable.

@derekbruening (Contributor, Author)

From bruen...@google.com on January 13, 2012 07:33:39

            leak overhead   %scan   %rest
ui_tests    40%             78%     22%
unit_tests  26%             48%     52%

@derekbruening (Contributor, Author)

From bruen...@google.com on January 17, 2012 08:25:38

More runs over the weekend on windesk; for some reason
-no_check_uninitialized is slower this time:
RUN: -no_check_uninitialized
[==========] 3554 tests from 577 test cases ran. (1815211 ms total)
[==========] 3554 tests from 577 test cases ran. (1891468 ms total)
[==========] 3554 tests from 577 test cases ran. (1552920 ms total)
RUN: -no_check_uninitialized -no_leak_scan
[==========] 3554 tests from 577 test cases ran. (1450673 ms total)
[==========] 3554 tests from 577 test cases ran. (1442520 ms total)
[==========] 3554 tests from 577 test cases ran. (1438121 ms total)
RUN: -no_check_uninitialized -no_count_leaks
[==========] 3554 tests from 577 test cases ran. (1249905 ms total)
[==========] 3554 tests from 577 test cases ran. (1222006 ms total)
RUN: -no_check_uninitialized -no_count_leaks -no_use_symcache
[==========] 3554 tests from 577 test cases ran. (3377504 ms total)
[==========] 3554 tests from 577 test cases ran. (3350393 ms total)
[==========] 3554 tests from 577 test cases ran. (3332369 ms total)
RUN: -leaks_only
[==========] 3554 tests from 577 test cases ran. (1090167 ms total)
[==========] 3554 tests from 577 test cases ran. (1099449 ms total)
[==========] 3554 tests from 577 test cases ran. (1095470 ms total)
RUN: -leaks_only -no_zero_stack
[==========] 3554 tests from 577 test cases ran. (1072919 ms total)
[==========] 3554 tests from 577 test cases ran. (1077422 ms total)
[==========] 3554 tests from 577 test cases ran. (1064477 ms total)
RUN: -leaks_only -no_count_leaks
[==========] 3554 tests from 577 test cases ran. (865069 ms total)
[==========] 3554 tests from 577 test cases ran. (864243 ms total)
[==========] 3554 tests from 577 test cases ran. (860628 ms total)
RUN: -leaks_only -no_count_leaks -no_track_allocs
[==========] 3554 tests from 577 test cases ran. (744878 ms total)
[==========] 3554 tests from 577 test cases ran. (769345 ms total)
[==========] 3554 tests from 577 test cases ran. (748245 ms total)
RUN: native
[==========] 3554 tests from 577 test cases ran. (704779 ms total)
[==========] 3554 tests from 577 test cases ran. (580533 ms total)
[==========] 3554 tests from 577 test cases ran. (572552 ms total)
% perl -e 'while(<>){if(/^RUN: (.*)/){$nr=$1;if($n>0){printf"%5.2f %-30s\n",$t/(60*1000*$n),$run;}$run=$nr;$n=0;$t=0;}elsif(/(\d+) ms/){$t+=$1;$n++;}}if($n>0){printf"%5.2f %-30s\n",$t/(60*1000*$n),$run;}'
29.22 -no_check_uninitialized
24.06 -no_check_uninitialized -no_leak_scan
20.60 -no_check_uninitialized -no_count_leaks
55.89 -no_check_uninitialized -no_count_leaks -no_use_symcache
18.25 -leaks_only
17.86 -leaks_only -no_zero_stack
14.39 -leaks_only -no_count_leaks
12.57 -leaks_only -no_count_leaks -no_track_allocs
10.32 native

@derekbruening (Contributor, Author)

From bruen...@google.com on January 17, 2012 08:30:15

I ran like so:

for j in "-no_check_uninitialized" "-no_check_uninitialized -no_leak_scan" "-no_check_uninitialized -no_count_leaks" "-no_check_uninitialized -no_count_leaks -no_use_symcache" "-leaks_only" "-leaks_only -no_zero_stack" "-leaks_only -no_count_leaks" "-leaks_only -no_count_leaks -no_track_allocs"; do
echo "RUN: $j" | tee -a OUT
for ((i=0; i<3; i++)); do
~/drmemory/git/build_drmem_rel/bin/drmemory.exe $j -suppress 'e:\src\chromium\src\tools\valgrind\DrMemory\suppressions.txt' -quiet -batch -no_results_to_stderr -dr e:/src/dr/git/exports -- ./unit_tests.exe --gtest_filter="-P2PTransportImplTest.FAILS_SendDataTcp:ServiceProcessStateTest.FAILS_ForceShutdown:FirefoxImporterTest.FirefoxNSS3Decryptor:DownloadManagerTest.FLAKY_StartDownload:P2PTransportImplTest.Create:DownloadManagerTest.StartDownload:DownloadManagerTest.FAILS_StartDownload:P2PTransportImplTest.FLAKY_SendDataUdp:VisitedLinkRelayTest.Basics:ConnectionTesterTest.FLAKY_RunAllTests:MultiProcessLockTest.FAILS_RecursiveLock:SignedSettingsTest.FLAKY_StorePolicyNoPolicyData:ConnectionTesterTest.FAILS_RunAllTests:PredictorTest.FAILS_MassiveConcurrentLookupTest:P2PTransportImplTest.FAILS_ConnectTcp:RenderViewTest.ImeComposition:VisitedLinkEventsTest.FLAKY_Coalescense:VideoCaptureHostTest.:P2PTransportImplTest.FLAKY_SendDataTcp:P2PTransportImplTest.FAILS_ConnectUdp:ProcessWatcherTest.FLAKY_ImmediateTermination:P2PTransportImplTest.FAILS_Create:ProcessWatcherTest.ImmediateTermination:AudioRendererHostTest.:P2PTransportImplTest.ConnectUdp:RenderViewTest.FLAKY_ImeComposition:P2PTransportImplTest.FLAKY_ConnectUdp:RenderViewTest.FAILS_ImeComposition:P2PTransportImplTest.ConnectTcp:SignedSettingsTest.StorePolicyNoPolicyData:VisitedLinkRelayTest.FAILS_Basics:PredictorTest.FLAKY_MassiveConcurrentLookupTest:VisitedLinkEventsTest.FAILS_Coalescense:FirefoxImporterTest.FLAKY_FirefoxNSS3Decryptor:VisitedLinkEventsTest.Coalescense:P2PTransportImplTest.FLAKY_ConnectTcp:P2PTransportImplTest.SendDataTcp:P2PTransportImplTest.FLAKY_Create:P2PTransportImplTest.FAILS_SendDataUdp:ServiceProcessStateTest.FLAKY_ForceShutdown:PredictorTest.MassiveConcurrentLookupTest:MultiProcessLockTest.FLAKY_RecursiveLock:VisitedLinkRelayTest.FLAKY_Basics:ConnectionTesterTest.RunAllTests:ServiceProcessStateTest.ForceShutdown:P2PTransportImplTest.SendDataUdp:MultiProcessLockTest.RecursiveLock:ProcessWatcherTest.FAILS_ImmediateTermination:SignedSettingsTest.FAILS_StorePolicyNoPolicyData:FirefoxImporterTest.FAILS_Firefox*NSS3Decryptor" >> OUT
done
done
grep -E '^RUN:|test cases? ran' OUT

@derekbruening (Contributor, Author)

From bruen...@google.com on December 12, 2012 09:30:57

xref issue #1096
