Issue 177218: deadlocking in child process on fork() because nVidia's driver uses pthread_atfork with malloc
Project Member Reported by mnissler@chromium.org, Feb 20, 2013
After running browser tests, I see processes left behind that never exit: the zygote, the sandbox IPC handle process, and a GPU process that is stuck in fork(). The stack trace of the latter is as follows:

#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:39
#1  0x000000000210a485 in base::internal::SpinLockDelay (w=0x4e2fd60 <tcmalloc::Static::pageheap_lock_>, value=2, loop=69949)
    at ../../third_party/tcmalloc/chromium/src/base/spinlock_linux-inl.h:97
#2  0x000000000210a27b in SpinLock::SlowLock (this=0x4e2fd60 <tcmalloc::Static::pageheap_lock_>)
    at ../../third_party/tcmalloc/chromium/src/base/spinlock.cc:132
#3  0x00000000020f5ecb in SpinLock::Lock (this=0x4e2fd60 <tcmalloc::Static::pageheap_lock_>)
    at ../../third_party/tcmalloc/chromium/src/base/spinlock.h:75
#4  0x00000000020f5f99 in SpinLockHolder::SpinLockHolder (this=0x7fdeba5d1310, l=0x4e2fd60 <tcmalloc::Static::pageheap_lock_>)
    at ../../third_party/tcmalloc/chromium/src/base/spinlock.h:141
#5  0x00000000021019f5 in (anonymous namespace)::do_malloc_pages (heap=0x7fdec0d30480, size=237568)
    at ../../third_party/tcmalloc/chromium/src/tcmalloc.cc:1073
#6  0x0000000002101b85 in (anonymous namespace)::do_malloc (size=236675) at ../../third_party/tcmalloc/chromium/src/tcmalloc.cc:1114
#7  0x000000000210635b in MallocBlock::Allocate (size=236623, type=-271733872) at ../../third_party/tcmalloc/chromium/src/debugallocation.cc:522
#8  0x00000000021038ef in DebugAllocate (size=236623, type=-271733872) at ../../third_party/tcmalloc/chromium/src/debugallocation.cc:999
#9  0x00000000021073f0 in do_debug_malloc_or_debug_cpp_alloc (size=236623) at ../../third_party/tcmalloc/chromium/src/debugallocation.cc:1166
#10 0x0000000003322547 in tc_calloc (count=1, size=236623) at ../../third_party/tcmalloc/chromium/src/debugallocation.cc:1187
#11 0x00007fdeb8b399d3 in ?? () from /usr/lib/nvidia-current/libGL.so.1
#12 0x00007fdeb75dcbaf in ?? () from /usr/lib/nvidia-current/libnvidia-glcore.so.295.40
#13 0x00007fdeb8b131cd in ?? () from /usr/lib/nvidia-current/libGL.so.1
#14 0x00007fdeb8b19d7f in ?? () from /usr/lib/nvidia-current/libGL.so.1
#15 0x00007fdeb8b19e78 in ?? () from /usr/lib/nvidia-current/libGL.so.1
#16 0x00007fdeb8b1a60e in ?? () from /usr/lib/nvidia-current/libGL.so.1
#17 0x00007fdec4b9fa46 in __libc_fork () at ../nptl/sysdeps/unix/sysv/linux/x86_64/../fork.c:189
#18 0x00007fded2484eef in base::LaunchProcess (argv=std::vector of length 12, capacity 16 = {...}, options=..., process_handle=0x7fdeba5d1c44)
    at ../../base/process_util_posix.cc:592
#19 0x00007fded24856ee in base::LaunchProcess (cmdline=..., options=..., process_handle=0x7fdeba5d1c44) at ../../base/process_util_posix.cc:777
#20 0x00007fdec85ee6ca in content::ChildProcessLauncher::Context::LaunchInternal (this_object=..., client_thread_id=content::BrowserThread::IO, 
    child_process_id=2, use_zygote=false, env=std::vector of length 0, capacity 0, ipcfd=39, cmd_line=0x14631be38020)
    at ../../content/browser/child_process_launcher.cc:246
#21 0x00007fdec85f0f97 in base::internal::RunnableAdapter<void (*)(scoped_refptr<content::ChildProcessLauncher::Context>, content::BrowserThread::ID, int, bool, std::vector<std::pair<std::string, std::string>, std::allocator<std::pair<std::string, std::string> > > const&, int, CommandLine*)>::Run (this=0x7fdeba5d1d50, a1=..., a2=@0x14631c003940: content::BrowserThread::IO, a3=@0x14631c003944: 2, a4=@0x14631c003948: false, 
    a5=std::vector of length 0, capacity 0, a6=@0x14631c003968: 39, a7=@0x14631c003970: 0x14631be38020) at ../../base/bind_internal.h:584
#22 0x00007fdec85f099d in base::internal::InvokeHelper<false, void, base::internal::RunnableAdapter<void (*)(scoped_refptr<content::ChildProcessLauncher::Context>, content::BrowserThread::ID, int, bool, std::vector<std::pair<std::string, std::string>, std::allocator<std::pair<std::string, std::string> > > const&, int, CommandLine*)>, void (content::ChildProcessLauncher::Context*, content::BrowserThread::ID const&, int const&, bool const&, std::vector<std::pair<std::string, std::string>, std::allocator<std::pair<std::string, std::string> > > const&, int const&, CommandLine* const&)>::MakeItSo(base::internal::RunnableAdapter<void (*)(scoped_refptr<content::ChildProcessLauncher::Context>, content::BrowserThread::ID, int, bool, std::vector<std::pair<std::string, std::string>, std::allocator<std::pair<std::string, std::string> > > const&, int, CommandLine*)>, content::ChildProcessLauncher::Context*, content::BrowserThread::ID const&, int const&, bool const&, std::vector<std::pair<std::string, std::string>, std::allocator<std::pair<std::string, std::string> > > const&, int const&, CommandLine* const&) (runnable=..., a1=0x14631bd74200, 
    a2=@0x14631c003940: content::BrowserThread::IO, a3=@0x14631c003944: 2, a4=@0x14631c003948: false, a5=std::vector of length 0, capacity 0, 
    a6=@0x14631c003968: 39, a7=@0x14631c003970: 0x14631be38020) at ../../base/bind_internal.h:1068
#23 0x00007fdec85eff31 in base::internal::Invoker<7, base::internal::BindState<base::internal::RunnableAdapter<void (*)(scoped_refptr<content::ChildProcessLauncher::Context>, content::BrowserThread::ID, int, bool, std::vector<std::pair<std::string, std::string>, std::allocator<std::pair<std::string, std::string> > > const&, int, CommandLine*)>, void (scoped_refptr<content::ChildProcessLauncher::Context>, content::BrowserThread::ID, int, bool, std::vector<std::pair<std::string, std::string>, std::allocator<std::pair<std::string, std::string> > > const&, int, CommandLine*), void (scoped_refptr<content::ChildProcessLauncher::Context>, content::BrowserThread::ID, int, bool, std::vector<std::pair<std::string, std::string>, std::allocator<std::pair<std::string, std::string> > >, int, CommandLine*)>, void (scoped_refptr<content::ChildProcessLauncher::Context>, content::BrowserThread::ID, int, bool, std::vector<std::pair<std::string, std::string>, std::allocator<std::pair<std::string, std::string> > > const&, int, CommandLine*)>::Run(base::internal::BindStateBase*) (base=0x14631c003920) at ../../base/bind_internal.h:2518
#24 0x00007fded2410fa1 in base::Callback<void ()>::Run() const (this=0x7fdeba5d2138) at ../../base/callback.h:396
#25 0x00007fded2453343 in MessageLoop::RunTask (this=0x7fdeba5d2af0, pending_task=...) at ../../base/message_loop.cc:476
#26 0x00007fded245345a in MessageLoop::DeferOrRunPendingTask (this=0x7fdeba5d2af0, pending_task=...) at ../../base/message_loop.cc:488
#27 0x00007fded2453cc1 in MessageLoop::DoWork (this=0x7fdeba5d2af0) at ../../base/message_loop.cc:671
#28 0x00007fded245bef6 in base::MessagePumpDefault::Run (this=0x14631be00ec0, delegate=0x7fdeba5d2af0) at ../../base/message_pump_default.cc:29
#29 0x00007fded2452f4b in MessageLoop::RunInternal (this=0x7fdeba5d2af0) at ../../base/message_loop.cc:433
#30 0x00007fded2452e06 in MessageLoop::RunHandler (this=0x7fdeba5d2af0) at ../../base/message_loop.cc:406
#31 0x00007fded24893e4 in base::RunLoop::Run (this=0x7fdeba5d2580) at ../../base/run_loop.cc:45
#32 0x00007fded245273e in MessageLoop::Run (this=0x7fdeba5d2af0) at ../../base/message_loop.cc:313
#33 0x00007fded24c6820 in base::Thread::Run (this=0x14631bd4f520, message_loop=0x7fdeba5d2af0) at ../../base/threading/thread.cc:152
#34 0x00007fdec85e348f in content::BrowserThreadImpl::ProcessLauncherThreadRun (this=0x14631bd4f520, message_loop=0x7fdeba5d2af0)
    at ../../content/browser/browser_thread_impl.cc:142
#35 0x00007fdec85e3753 in content::BrowserThreadImpl::Run (this=0x14631bd4f520, message_loop=0x7fdeba5d2af0)
    at ../../content/browser/browser_thread_impl.cc:178
#36 0x00007fded24c69ab in base::Thread::ThreadMain (this=0x14631bd4f520) at ../../base/threading/thread.cc:197
#37 0x00007fded24b9ddb in base::(anonymous namespace)::ThreadFunc (params=0x14631be82930) at ../../base/threading/platform_thread_posix.cc:68
#38 0x00007fdec5dc9e9a in start_thread (arg=0x7fdeba5d3700) at pthread_create.c:308
#39 0x00007fdec4bd3cbd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#40 0x0000000000000000 in ?? ()

Attempt at an explanation: It looks like nvidia's libGL.so.1 registers a handler via __register_atfork() that tries to allocate memory. We got unlucky: another thread held the tcmalloc pageheap lock at the moment of the fork, and that thread does not exist in the forked process, so the lock is never released. Presto, deadlock.
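
To make the mechanism concrete, here is a minimal sketch. It is not Chromium or tcmalloc code: `g_heap_lock` and `ChildInheritsHeldLock` are hypothetical names, and an atomic flag stands in for tcmalloc's pageheap spinlock. A helper thread holds the "lock" while the main thread calls fork(), and the child reports whether it still sees the lock as held.

```cpp
#include <atomic>
#include <thread>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical stand-in for an allocator's internal spinlock.
static std::atomic<bool> g_heap_lock{false};

// Returns true if the forked child found the lock still held.
bool ChildInheritsHeldLock() {
  std::atomic<bool> release{false};

  // The helper thread takes the lock and keeps it across the fork().
  std::thread holder([&release] {
    g_heap_lock.store(true);
    while (!release.load()) std::this_thread::yield();
    g_heap_lock.store(false);
  });
  while (!g_heap_lock.load()) std::this_thread::yield();

  pid_t pid = fork();
  if (pid == 0) {
    // Child: only the forking thread was copied, so the holder thread
    // is gone and nothing here will ever clear the flag.
    _exit(g_heap_lock.load() ? 1 : 0);
  }

  // Parent: let the holder thread finish, then collect the child.
  release.store(true);
  holder.join();
  if (pid < 0) return false;
  int status = 0;
  if (waitpid(pid, &status, 0) != pid) return false;
  return WIFEXITED(status) && WEXITSTATUS(status) == 1;
}
```

The child only inspects the flag and exits. Code that instead spins on the lock in the child, as the calloc call in the trace above does, would wait forever.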

There's some additional background on fork() vs. threading here: http://www.linuxprogrammingblog.com/threads-and-fork-think-twice-before-using-them

The TL;DR is that mixing threads and fork() is not a good idea. The mitigation we have in base::LaunchProcess (namely calling execvp right away, although the comment there gives different reasons for it) doesn't help, because the atfork handlers run in the child process before fork() even returns. It seems that fork()ing in the browser process (or any multi-threaded process) is dangerous and shouldn't be done in the first place.
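
The ordering can be checked with a small sketch. The names `ChildHandler` and `ChildHandlerRunsBeforeForkReturns` are hypothetical; the only real API assumed is POSIX `pthread_atfork`. The child handler sets a flag, and the very first statement the child executes after fork() returns checks it.

```cpp
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static volatile int g_child_handler_ran = 0;

// Runs in the child, inside fork(), before fork() returns there.
static void ChildHandler() { g_child_handler_ran = 1; }

// Returns true if the child saw the flag set immediately after fork().
bool ChildHandlerRunsBeforeForkReturns() {
  static bool registered = false;
  if (!registered) {
    // Register only a child-side handler; no prepare or parent handler.
    if (pthread_atfork(nullptr, nullptr, ChildHandler) != 0) return false;
    registered = true;
  }
  g_child_handler_ran = 0;

  pid_t pid = fork();
  if (pid < 0) return false;
  if (pid == 0) {
    // Nothing the caller does after fork() can get in ahead of the handler.
    _exit(g_child_handler_ran ? 1 : 0);
  }
  int status = 0;
  if (waitpid(pid, &status, 0) != pid) return false;
  return WIFEXITED(status) && WEXITSTATUS(status) == 1;
}
```

This is why an immediate exec in the child cannot protect against a handler that blocks: the handler has already run by the time the caller's code resumes.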

I'm not sure whether this also occurs in the production browser, but I don't see why it couldn't. Adding a few people and a few labels to raise attention.

Feb 25, 2013
#1 jamesr@chromium.org
This looks familiar
Cc: ccameron@chromium.org
Feb 25, 2013
#2 ccameron@chromium.org
Yes, this came up in chromium-dev a while back
https://groups.google.com/a/chromium.org/forum/?fromgroups=#!topic/chromium-dev/uLf5l669dCk

NVIDIA's driver calls malloc inside the child process's atfork handler, which isn't POSIX-compliant ("the child process may only execute async-signal-safe operations until such time as one of the exec functions is called"). 

The issue was filed with NVIDIA as incident #121217-000242.
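
For reference, a launch path that follows the quoted rule might look like the sketch below. It is not the code in base::LaunchProcess; `SpawnAndWait` is a hypothetical helper. The arguments are prepared before fork(), and the child calls only `execv` and `_exit`, both of which POSIX lists as async-signal-safe.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Runs `path` with `argv` and returns its exit status, or -1 on failure.
int SpawnAndWait(const char* path, char* const argv[]) {
  pid_t pid = fork();
  if (pid < 0) return -1;
  if (pid == 0) {
    // Child: no allocation, no locks, no stdio. Only exec or _exit.
    execv(path, argv);
    _exit(127);  // reached only if execv failed
  }
  int status = 0;
  if (waitpid(pid, &status, 0) != pid) return -1;
  return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Assuming `/bin/sh` exists, calling this with the arguments `sh -c "exit 7"` should return 7. Note that following the rule in your own code is not sufficient here: an atfork handler registered by a third-party library still runs inside fork() and can break the rule on your behalf.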
Feb 26, 2013
#3 mnissler@chromium.org
Anyhow, should we really be forking from within the browser process? Shouldn't forking be handled by the zygote for that reason?
Feb 27, 2013
#4 phajdan.jr@chromium.org
It does indeed happen in the "production" browser.
Summary: ChildProcessLauncher deadlocking in child process on fork() because nVidia's driver uses pthread_atfork with malloc (was: ChildProcessLauncher deadlocking in child process on fork())
Cc: phajdan.jr@chromium.org
Labels: TaskForce-GreenTree
Feb 28, 2013
#5 gli...@chromium.org
(No comment was entered for this change.)
Cc: kcc@chromium.org
Mar 9, 2013
#6 bugdro...@chromium.org
(No comment was entered for this change.)
Labels: -Area-Internals -Feature-GPU -Internals-Core Cr-Internals-GPU Cr-Internals Cr-Internals-Core
Mar 28, 2013
#7 phajdan.jr@chromium.org
This is not specific to ChildProcessLauncher. Any fork() is prone to this.
Summary: deadlocking in child process on fork() because nVidia's driver uses pthread_atfork with malloc (was: ChildProcessLauncher deadlocking in child process on fork() because nVidia's driver uses pthread_atfork with malloc)
Cc: zheng...@chromium.org
Mar 29, 2013
#8 bugdro...@chromium.org
------------------------------------------------------------------------
r191469 | phajdan.jr@chromium.org | 2013-03-29T23:35:03.167792Z

Changed paths:
   M http://src.chromium.org/viewvc/chrome/trunk/tools/build/masters/master.chromium.gpu/master.cfg?r1=191469&r2=191468&pathrev=191469

Disable tcmalloc on nvidia GPU bots to work around hangs.

BUG=188501, 177218
Review URL: https://codereview.chromium.org/13342002
------------------------------------------------------------------------
Apr 8, 2013
#9 phajdan.jr@chromium.org
Issue 123583 has been merged into this issue.
Apr 8, 2013
#10 phajdan.jr@chromium.org
Issue 86948 has been merged into this issue.
Apr 8, 2013
#11 phajdan.jr@chromium.org
Restricting comment adding to avoid noise (sorry).

This bug is about a known issue with nvidia drivers. For intel or other drivers, please file separate bugs, make it clear which driver each bug is for, and try to obtain a meaningful stack trace of the hung processes.
Labels: Restrict-AddIssueComment-EditIssue
Apr 10, 2013
#12 mnissler@chromium.org
Excuse my ignorance, but fixing the driver is not the right course of action here IMHO. fork()ing multi-threaded processes is generally a bad idea and the tcmalloc deadlock is probably only the tip of the iceberg. We should really handle all forks via a single-threaded helper, such as the zygote (which I thought was the rule before I found this code).
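
The idea of a single-threaded fork helper can be sketched as follows. This is not the Chromium zygote: `ForkHelper`, `StartForkHelper`, `ForkViaHelper` and `StopForkHelper` are hypothetical names, and the one-byte protocol is invented for illustration. The helper is forked early, while the process still has one thread, and every later fork() happens inside it.

```cpp
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical minimal fork helper; all names here are illustrative.
struct ForkHelper {
  pid_t pid;  // pid of the helper process, or -1 on failure
  int fd;     // requester's end of the socket pair, or -1 on failure
};

// Must be called while the calling process is still single-threaded.
ForkHelper StartForkHelper() {
  int fds[2];
  if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) return {-1, -1};
  pid_t pid = fork();
  if (pid < 0) {
    close(fds[0]);
    close(fds[1]);
    return {-1, -1};
  }
  if (pid == 0) {
    close(fds[0]);
    unsigned char code;
    // Each request is one byte: the exit code the new child should use.
    while (read(fds[1], &code, 1) == 1) {
      unsigned char reply = 255;  // 255 reports failure to the requester
      pid_t child = fork();       // the helper has only one thread
      if (child == 0) _exit(code);
      int status = 0;
      if (child > 0 && waitpid(child, &status, 0) == child &&
          WIFEXITED(status)) {
        reply = static_cast<unsigned char>(WEXITSTATUS(status));
      }
      if (write(fds[1], &reply, 1) != 1) break;
    }
    _exit(0);
  }
  close(fds[1]);
  return {pid, fds[0]};
}

// Asks the helper to fork a child that exits with `code`. Returns the
// exit status the helper reports back, or -1 on I/O failure.
int ForkViaHelper(const ForkHelper& helper, unsigned char code) {
  unsigned char reply;
  if (write(helper.fd, &code, 1) != 1) return -1;
  if (read(helper.fd, &reply, 1) != 1) return -1;
  return reply;
}

// Closing the socket makes the helper's read() return 0, ending its loop.
void StopForkHelper(const ForkHelper& helper) {
  close(helper.fd);
  waitpid(helper.pid, nullptr, 0);
}
```

A real helper would exec or run the requested work rather than just exit with a code. The point of the sketch is only that the multi-threaded process sends a message instead of calling fork() itself, so no lock held by another thread can be inherited by the new child.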
May 2, 2013
#13 bugdro...@chromium.org
------------------------------------------------------------------------
r198045 | tonyg@chromium.org | 2013-05-03T03:37:08.190459Z

Changed paths:
   M http://src.chromium.org/viewvc/chrome/trunk/src/chrome/test/ui/ui_test.cc?r1=198045&r2=198044&pathrev=198045
   M http://src.chromium.org/viewvc/chrome/trunk/src/chrome/test/automation/proxy_launcher.cc?r1=198045&r2=198044&pathrev=198045
   M http://src.chromium.org/viewvc/chrome/trunk/src/chrome/test/base/chrome_process_util.cc?r1=198045&r2=198044&pathrev=198045
   M http://src.chromium.org/viewvc/chrome/trunk/src/chrome/test/webdriver/webdriver_automation.cc?r1=198045&r2=198044&pathrev=198045
   M http://src.chromium.org/viewvc/chrome/trunk/src/chrome/test/base/chrome_process_util.h?r1=198045&r2=198044&pathrev=198045

Kill all chrome processes on linux when shutting down the ProxyLauncher.

The startup performance_ui_tests have been hanging on the perf bots since they
were upgraded to Precise. The hang is due to the performance_ui_tests binary not
killing all of its child chrome processes which is due to bug 177218. When that
happens, the buildbot step hangs until timeout waiting for the process group to
end.

The suggested workaround for the bug is to disable tcmalloc, which works, but is
undesirable to do on the perf bots. So this patch works around the subprocess
hang by always kill()ing all chrome processes on linux.

BUG=235893
TEST=performance_ui_tests --gtest_filter=ShutdownTest.*

Review URL: https://chromiumcodereview.appspot.com/14707006
------------------------------------------------------------------------
Nov 20, 2013
#14 phajdan.jr@chromium.org
nvidia-drivers-331.20 claims to fix this issue:

"Fixed a bug that could cause a deadlock when forking from OpenGL programs which use some malloc implementations, such as TCMalloc."

I'd appreciate some tests with that - please report results.
Labels: -Restrict-AddIssueComment-EditIssue
Nov 20, 2013
#15 dvpdiner2
Based on a few days with the new nVidia drivers, I haven't seen any Chrome_ProcessL processes recently. :)