
tenfourfox - issue #191
Convert NSPR and Chromium mutexes to kernel atomic operations
Between 10 and 17, certain sites regressed mightily in performance on MP systems. The most egregious was ifixit, which renders fine on a 1GHz iMac G4 and horribly slowly on a quad G5. Shark shows almost 47% of runtime spent waiting for semaphore locks. Mozilla changed JS from its own lock to NSPR locks in this timeframe; JS used to use libkern ATOMIC locks.
Stubbing the NSPR locks in JS to call libkern ATOMIC locks again, plus adding a small customized atomic set routine (ATOMIC in 10.4 does not have an atomic set, just an atomic compare-and-swap), restores almost all of the performance. This is being shipped in 17.0.
We should simply replace NSPR locks with libkern locks in the whole system, not just JavaScript, which should overcome this bottleneck particularly in NSS which uses them heavily. In particular, we should replace PR_ATOMIC_INCREMENT, _DECREMENT, _ADD and _SET with OSAtomicIncrement32Barrier, OSAtomicDecrement32Barrier, OSAtomicAdd32Barrier and TenFourFoxAtomicSet, and move TenFourFoxAtomicSet out of JS and into NSPR. This should land on 17.0.1 and 18. It will probably marginally help uniprocessor systems, but it will definitely help multiprocessor machines.
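The custom atomic set can be sketched as a compare-and-swap loop. The snippet below uses the GCC __sync builtins as a portable stand-in for OSAtomicCompareAndSwap32Barrier; the function name is illustrative, not the shipped TenFourFoxAtomicSet:

```c
#include <stdint.h>

/* Sketch only: an atomic "set" (exchange) built from compare-and-swap,
 * the technique behind TenFourFoxAtomicSet, since 10.4's libkern has
 * CAS but no atomic exchange. The GCC __sync builtins stand in for
 * OSAtomicCompareAndSwap32Barrier so this compiles anywhere. */
static int32_t atomic_set_sketch(volatile int32_t *ptr, int32_t newval)
{
    int32_t oldval;
    do {
        oldval = *ptr;
        /* Retry if another thread changed *ptr between the read and
         * the swap; on PPC this is a lwarx/stwcx. loop. */
    } while (!__sync_bool_compare_and_swap(ptr, oldval, newval));
    return oldval;  /* like PR_ATOMIC_SET, return the previous value */
}
```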
Comment #1
Posted on Nov 16, 2012 by Massive Rhino
Mozilla is trying to purge locks in JS, period, in Fx19: https://bugzilla.mozilla.org/show_bug.cgi?id=772722
Still, we want this for the rest of the browser.
Comment #2
Posted on Nov 16, 2012 by Massive Rhino
And it looks like bug 675078 was to blame, which didn't reland until earlier this year. https://bugzilla.mozilla.org/show_bug.cgi?id=675078
Wonder why we didn't notice this in Fx12. Anyway.
Comment #3
Posted on Nov 17, 2012 by Massive Rhino
Undoubtedly some of this is also due to the poor performance of PR_Lock, which is implemented in terms of pthread_mutex, a known poor performer on OS X. So let's limit this to NSPR atomic operations for now.
Comment #4
Posted on Nov 17, 2012 by Massive Rhino
And, for that matter, the Atomics are also internally implemented in terms of pthread_mutex_lock, so simply removing Atomics from the rest of the source should improve things. There's a lot of PR_Lock in hashtables, though.
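For illustration, NSPR's generic fallback routes every "atomic" operation through a global lock, roughly like this (a simplified sketch, not the verbatim NSPR source):

```c
#include <stdint.h>
#include <pthread.h>

/* Simplified sketch of a lock-based atomic fallback: every "atomic"
 * increment serializes all threads through one mutex, which is exactly
 * the kind of semaphore traffic Shark is showing. Not verbatim NSPR. */
static pthread_mutex_t atomic_lock = PTHREAD_MUTEX_INITIALIZER;

static int32_t slow_atomic_increment(int32_t *val)
{
    int32_t result;
    pthread_mutex_lock(&atomic_lock);   /* contended on MP systems */
    result = ++*val;
    pthread_mutex_unlock(&atomic_lock);
    return result;
}
```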
Comment #5
Posted on Nov 17, 2012 by Massive Rhino
Hmm. In pratom.h,
#elif ((__GNUC__ > 4) || (__GNUC__ == 4 && __GNUC_MINOR__ >= 1)) && \
So it's only because we're still on gcc 4.0.1. This should "just work" on gcc 4.6+, so this only needs to be done on 17.
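For reference, on gcc >= 4.1 that pratom.h branch maps the PR_ATOMIC_* macros straight onto the compiler's __sync intrinsics, roughly like this (a sketch of the mapping, not the exact header):

```c
#include <stdint.h>

/* Sketch of the gcc >= 4.1 path in pratom.h: the PR_ATOMIC_* macros
 * become compiler intrinsics instead of calls into NSPR. The macro
 * names are real NSPR names; the surrounding header is paraphrased. */
#define PR_ATOMIC_INCREMENT(val)    __sync_add_and_fetch(val, 1)
#define PR_ATOMIC_DECREMENT(val)    __sync_sub_and_fetch(val, 1)
#define PR_ATOMIC_ADD(ptr, v)       __sync_add_and_fetch(ptr, v)
#define PR_ATOMIC_SET(val, newval)  __sync_lock_test_and_set(val, newval)
```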
Comment #6
Posted on Nov 18, 2012 by Massive Rhino
Here's another site where threads crash into each other: www.kfiam640.com
Comment #7
Posted on Nov 19, 2012 by Massive Rhino
After patching this out, overhead drops a bit, but we're still spending ~40% of our time in semaphore lock on the affected sites. Shark looks like this:
0.0% 95.2% libSystem.B.dylib _pthread_body
0.0% 90.5% libnspr4.dylib PR_Select
0.0% 61.9% XUL XRE_AddStaticComponent
0.0% 57.1% XUL 0x3eef784 [176B]
0.0% 57.1% XUL XRE_AddStaticComponent
0.0% 47.6% XUL XRE_AddStaticComponent
0.0% 42.9% libnspr4.dylib PR_Wait
0.0% 42.9% libnspr4.dylib PR_WaitCondVar
0.0% 42.9% libSystem.B.dylib pthread_cond_wait
42.9% 42.9% libSystem.B.dylib semaphore_wait_signal_trap
0.0% 4.8% libnspr4.dylib PR_WaitCondVar
0.0% 4.8% XUL NS_CycleCollectorSuspect_P
0.0% 4.8% XUL 0x299ec58 [788B]
0.0% 4.8% libnspr4.dylib PR_WaitCondVar
0.0% 9.5% XUL 0x29b2afc [884B]
0.0% 4.8% XUL XRE_StartupTimelineRecord
0.0% 4.8% XUL js_GetScriptLineExtent(JSScript*)
0.0% 4.8% XUL js::IterateChunks(JSRuntime*, void*, void (*)(JSRuntime*, void*, js::gc::Chunk*))
0.0% 4.8% XUL js::IndirectWrapper::toWrapper()
0.0% 4.8% XUL std::deque<FilePath, std::allocator<FilePath> >::_M_push_back_aux(FilePath const&)
0.0% 4.8% firefox start
Comment #8
Posted on Nov 20, 2012 by Massive Rhino
After review, we actually do want to keep using OS X internal atomics even after the compiler switch, because OS X self-optimizes the atomics on uniprocessor Macs and the gcc intrinsics don't. We will also have G3/G4 and G5 versions of the atomic set (G3/G4 using eieio/isync, G5 using lwsync/isync).
Comment #9
Posted on Nov 20, 2012 by Massive Rhino
The fully barriered AtomicSet reduces the V8 benchmark score by nearly 10%. That's bad, and JavaScript is something we can actually test. So let's create a non-barrier AtomicSet and use that, give JavaScript non-barriered OS X atomics as well, and keep the memory-barrier version for the rest of the browser.
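The split can be sketched with the GCC builtins standing in for the hand-written PPC assembler (the barriered version would bracket the lwarx/stwcx. loop with lwsync/isync on a G5 and eieio/isync on G3/G4); the function names below are illustrative:

```c
#include <stdint.h>

/* Sketch of the two AtomicSet flavors: the same exchange, with and
 * without a full memory fence. GCC builtins stand in for the PPC
 * assembler versions; not the shipped TenFourFox code. */
static int32_t atomic_set_nobarrier(volatile int32_t *p, int32_t v)
{
    return __sync_lock_test_and_set(p, v);  /* JavaScript gets this */
}

static int32_t atomic_set_barrier(volatile int32_t *p, int32_t v)
{
    __sync_synchronize();                   /* fence before the swap */
    int32_t old = __sync_lock_test_and_set(p, v);
    __sync_synchronize();                   /* and after it */
    return old;                             /* rest of the browser */
}
```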
Comment #10
Posted on Nov 24, 2012 by Massive Rhino
Initial version done for 17.0.1pre
Comment #11
Posted on Nov 26, 2012 by Happy Rabbit
Did you notice that the existing but (currently) unused implementations of the NSPR atomics are non-barrier versions? (See nsprpub/pr/src/md/unix/os_Darwin_ppc.s.) There is also an implementation wrapped around the OSAtomic functions, likewise without barriers, but it's ifdef'ed to be used on ARM CPUs only. That one is actually in use - only ppc (32-bit only), i386 and x86_64 use the gcc intrinsics instead of the existing machine-dependent implementations.
I really can't see (in the Aurora 19 sources) how, even with gcc 4.0.1 on 10.4, NSPR would build the PR_Lock code path. Something must be going wrong in nsprpub/pr/include/md/_darwin.h, where _PR_HAVE_ATOMIC_OPS gets defined, which should cause the atomics in nsprpub/pr/src/md/unix/os_Darwin_ppc.s to be used.
The attached patch makes NSPR use the OSAtomic functions on all OS X versions from 10.4 on, independent of the architecture, the way it was already done for ARM CPUs. Furthermore, it provides implementations of PR_StackPush and PR_StackPop copied and adapted from their AIX counterparts, getting rid of a nasty dependency on PR_Lock. I wonder, though, whether the non-barrier version is really the appropriate one. I made PR_StackPush and PR_StackPop use the barrier version because the AIX version explicitly uses the barrier versions even though non-barrier versions are also available there.
So far the browser works as always and performance is good. PR_StackPush and PR_StackPop are used quite frequently from what I see when setting breakpoints in them.
- OSAtomic.patch 6.33KB
Comment #12
Posted on Nov 26, 2012 by Massive Rhino
I'll definitely take the PR_Stack* ops, anything to reduce the use of PR_Lock. I'll have to study them later to see how they work, but I'm all for anything that gets us out of mutex hell.
I'm not as happy with the _MD_ATOMIC_SET: I'd rather use our assembler ones because we know exactly what is getting emitted. It doesn't get a lot simpler than lwarx/stwcx. in a loop. But if the browser seems okay with a non-barrier atomic_set, we could just use that everywhere. (It seems to be fine, but there's a lot of atomic usage in NSS and I really don't want to get that wrong -- maybe the best solution is to try to get NSPR and NSS testing working at the same time.)
Comment #13
Posted on Nov 26, 2012 by Happy Rabbit
The advantage of _MD_ATOMIC_SET as it is implemented upstream is, however, that the kernel does its multi-/uniprocessor optimization. I knew you wouldn't be too happy with it. The optimizations to AtomicSet should definitely go into nsprpub/pr/src/md/unix/os_Darwin_ppc.s:__PR_DarwinPPC_AtomicSet. My patch also gets the NSPR tests working here: just "(g)make runtests" in {build root}/nsprpub/pr/tests.
Comment #14
Posted on Nov 26, 2012 by Massive Rhino
That's true (about the optimizations), but if we can get away with no barriers everywhere, then we don't need it because we'll already be working barrier-free.
I know about os_Darwin_ppc.s, but for some reason the build system was not picking up my changes, maybe a bug, which is why I implemented it in os_Darwin.s instead. I'll have to mess with that.
Looks like the PR_Stack* usage comes from NSPR's file descriptor cache, so that makes sense they get used a lot.
Comment #15
Posted on Nov 26, 2012 by Happy Rabbit
Since we're already dealing with functions in the so-called comm page: the Objective-C/C++ direct-dispatch flag isn't fully implemented in gcc 4.6 and later. It generates calls to objc_msgSend_Fast and objc_assign_ivar_Fast, which are just stubs that Apple gcc replaces with local branches to 0xFFFEFF00 or 0xFFFEFEC0. I already looked into that, and it seems it can be implemented as builtins.
Comment #16
Posted on Nov 26, 2012 by Massive Rhino
Is that needed to compile? I'm still ripping out the 4.0.1 stuff from the 17 changesets, so I'm at least a week or two away from even trying to build 19.
Comment #17
Posted on Nov 26, 2012 by Happy Rabbit
No, it's not needed as long as you don't pass the flag -fobjc-direct-dispatch to gcc. Direct dispatching was only used in the 32-bit Objective-C runtime, so that might be the reason it wasn't implemented in current gcc. It's just a matter of performance once again.
Comment #18
Posted on Nov 26, 2012 by Massive Rhino
Ah, ok. We should spin that off into a separate issue.
Comment #19
Posted on Nov 26, 2012 by Happy Rabbit
I'll open an issue once I know more - maybe I'll first try to get the WebKit JavaScript interpreter working... The browser is working really well. It almost feels full speed, although the machine is building Aurora in the background. Haven't tested yet without the build running.
Comment #20
Posted on Nov 27, 2012 by Massive Rhino
Failing lots of tests, though I don't know what the state was prior to this.
Comment #21
Posted on Nov 27, 2012 by Massive Rhino
(so far failing bigfile3, dlltest {even with your patch}, getai, libfilename, mbcs, and hangs up on nameshm1)
Comment #22
Posted on Nov 27, 2012 by Massive Rhino
Semaphores are failing bigtime; they just hang. I need to back this out and see if it passed to begin with.
Comment #23
Posted on Nov 27, 2012 by Massive Rhino
Nope, they don't. In fact, they just hang until they time out. This might have been the whole problem!
Hacking pr/src/pthreads/ptsynch.c, it seems that sem_otime never gets set to a non-zero value (but sem_ctime does). From bug 469744 https://bugzilla.mozilla.org/show_bug.cgi?id=469744, it doesn't look like semaphores were ever properly tested on 10.4. Changing it to sem_ctime causes it to succeed, and all the semaphore tests now pass.
https://bugzilla.mozilla.org/show_bug.cgi?id=370420 implies that failure of some tests is a normal phenomenon, but that really sucks.
Comment #24
Posted on Nov 27, 2012 by Massive Rhino
Apparently I'm not going insane: http://lists.apple.com/archives/darwin-dev/2005/Mar/msg00147.html
Comment #25
Posted on Nov 27, 2012 by Massive Rhino
After thinking it over, the ctime hack is going to lead to a race condition sooner or later. Since we have the choice of POSIX or SysV semaphores on 10.4, I changed md/_darwin.h to POSIX semaphores, and the semaphore tests all pass. I'm going to leave the rest alone for now. I restored the PR_Stack* ops and we'll see how the browser behaves when it's done building.
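POSIX semaphores have no sem_otime bookkeeping at all, which is why the switch sidesteps the xnu bug. A minimal round trip with the POSIX API looks like this (a sketch of the API using an unnamed semaphore, not the TenFourFox patch, which goes through PR_OpenSemaphore):

```c
#include <semaphore.h>

/* Minimal POSIX semaphore round trip. Unlike SysV semaphores, there is
 * no sem_otime field for NSPR to (mis)read, so the 10.4 xnu bug never
 * comes into play. Sketch only. */
static int posix_sem_roundtrip(void)
{
    sem_t sem;
    if (sem_init(&sem, 0 /* not shared between processes */, 1) != 0)
        return -1;
    sem_wait(&sem);     /* take it: count 1 -> 0 */
    sem_post(&sem);     /* release: count 0 -> 1 */
    sem_destroy(&sem);
    return 0;
}
```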
Comment #26
Posted on Nov 27, 2012 by Happy Rabbit
I forgot to mention the failing tests - I didn't mind the failures because the browser is obviously working well. And I have to "make clean" after doing any change to the NSPR sources, since dependency tracking doesn't seem to work there.
Comment #27
Posted on Nov 27, 2012 by Massive Rhino
Initial tests look good (with the change to PR_OpenSemaphore and your PR_Stack* ops, and removing all memory barriers). Obviously a lot less hanging in semaphores too. KFI is now bearable; I need to throw it through Shark again to see what's left to shake out, but this does look better. I'm going to do a more thorough test on it after work.
Comment #28
Posted on Nov 27, 2012 by Happy Rabbit
Comment deleted
Comment #29
Posted on Nov 27, 2012 by Happy Rabbit
Looking into the xnu sources shows that the sem_otime bug was fixed in 10.5 (http://fxr.watson.org/fxr/source/bsd/kern/sysv_sem.c?v=xnu-792.6.70 vs. http://fxr.watson.org/fxr/source/bsd/kern/sysv_sem.c?v=xnu-1228). There is a difference between POSIX and SysV semaphores in the test results, though: with SysV semaphores, ranfile waits infinitely in pthread_wait_cond, and with POSIX semaphores, nonblock fails.
Comment #30
Posted on Nov 27, 2012 by Happy Rabbit
No, those two tests sometimes fail and sometimes pass, independently of SysV or POSIX semaphores. So both do work on 10.5.
Comment #31
Posted on Nov 27, 2012 by Massive Rhino
If you use POSIX semaphores, does the browser still work on 10.5, or do you get stalls? It's going to be really hard to ship a browser that's dynamically selectable.
Comment #32
Posted on Nov 27, 2012 by Massive Rhino
(nvm, you just answered my question)
Comment #33
Posted on Nov 27, 2012 by Happy Rabbit
When you search the whole source code for OSAtomic, you will find quite a few uses of those functions - and all of them (except the upcoming, unused jemalloc 3) use the barrier versions. And that said, on an architecture with out-of-order instruction scheduling, a lock is a true lock only when it deals with that issue. For locking purposes you have to pay the price of using multiple CPUs and do the syncing and memory barriers.
I also found out that Chromium IPC uses OSAtomic only on Intel OS X - I changed that to apply to OS X systems independent of the CPU architecture. I was surprised to find out that cairo apparently uses pthread mutexes: there is code to use system atomics, but the necessary defines aren't done anywhere. Is cairo used for anything more than an Azure backend fallback?
Comment #34
Posted on Nov 28, 2012 by Massive Rhino
I'll fix the Chromium code. Cairo is used heavily for graphics still -- we don't have Azure content yet. Maybe we need to add atomic support there too.
However, assuming this quad G5 is the worst case for CPU on Power Mac, I haven't had any problems banging on it this afternoon. Web workers and multiple tabs loading are all rock solid. However, PR_Wait is still showing up a lot on kfi, so we haven't fixed its core issue though we've made a lot of other things a lot better.
Comment #35
Posted on Nov 28, 2012 by Massive Rhino
Actually, let's use barrier for Chromium IPC. If anything would break with a non-barrier sequence, it would be IPC.
Comment #36
Posted on Nov 28, 2012 by Happy Rabbit
Well, here is what I did to the Chromium IPC.
- chromium.patch 19.47KB
Comment #37
Posted on Nov 28, 2012 by Massive Rhino
Why does the x86.h file need to be deleted?
Comment #38
Posted on Nov 28, 2012 by Happy Rabbit
No, I renamed it (just removed x86 from its name because it's nonsense), but that's not obvious in diffs.
Comment #39
Posted on Nov 28, 2012 by Happy Rabbit
I finally found out that in gfx/cairo/cairo/src/cairo-platform.h, CAIRO_NO_MUTEX is defined, which short-circuits all mutexes in cairo. So there shouldn't be any problem in cairo at all.
Comment #40
Posted on Nov 28, 2012 by Massive Rhino
Too bad about Cairo. Did you make any changes to the x86 file? I might just simplify it for testing by just including it, and we can tidy it up later (not that Chromium changes a lot, but just to reduce merge burden).
Comment #41
Posted on Nov 28, 2012 by Happy Rabbit
No changes; just including the x86 header on ppc machines will work in that case.
Comment #42
Posted on Nov 28, 2012 by Happy Rabbit
I reimplemented PR_StackPush/Pop using OSSpinLock after having examined the kernel code for it - it was made exactly for this purpose, and it applies memory barriers and does syncing on SMP systems.
#include <libkern/OSAtomic.h>

static OSSpinLock stackLock = OS_SPINLOCK_INIT;

PR_IMPLEMENT(void) PR_StackPush(PRStack *stack, PRStackElem *stack_elem)
{
    OSSpinLockLock(&stackLock);
    stack_elem->prstk_elem_next = stack->prstk_head.prstk_elem_next;
    stack->prstk_head.prstk_elem_next = stack_elem;
    OSSpinLockUnlock(&stackLock);
    return;
}

PR_IMPLEMENT(PRStackElem *) PR_StackPop(PRStack *stack)
{
    PRStackElem *element;

    OSSpinLockLock(&stackLock);
    element = stack->prstk_head.prstk_elem_next;
    if (element != NULL) {
        stack->prstk_head.prstk_elem_next = element->prstk_elem_next;
        element->prstk_elem_next = NULL; /* debugging aid */
    }
    OSSpinLockUnlock(&stackLock);
    return element;
}
Comment #43
Posted on Nov 28, 2012 by Massive Rhino
Landed the spinlock change and Chromium atomics. Looks good so far.
Comment #44
Posted on Nov 28, 2012 by Massive Rhino
Okay, I'm pretty satisfied with how it's acting. QTE works fine, SSL and SPDY look good. Flipping over to build 7450. If that's good too, we'll build 7400 and G3 and make this a second beta.
Comment #45
Posted on Nov 30, 2012 by Happy Lion
I don't know if it's caused by this bug, but https://productforums.google.com/forum/#!categories/websearch loads veeeeery slowly (ca. 50 s) in TFF 17.0 on my Mac 7450.
Comment #46
Posted on Nov 30, 2012 by Massive Rhino
No. The new Google Groups (which includes the product forums) is just extremely JavaScript-heavy. In fact, the changes above do improve its performance, though it is still slow.
Comment #47
Posted on Nov 30, 2012 by Happy Rabbit
No problems with that site here on Aurora 19 on a G4-7450@1.5GHz, even when it's building mozilla in the background.
Comment #48
Posted on Nov 30, 2012 by Happy Rabbit
Interesting, this ancient Mozilla bug: https://bugzilla.mozilla.org/show_bug.cgi?id=194339 Our solution is far more portable, though - and NSPR itself is rather poorly maintained.
Comment #49
Posted on Nov 30, 2012 by Massive Rhino
(No comment was entered for this change.)
Comment #50
Posted on Nov 30, 2012 by Massive Rhino
Yeah, their code is actually a very poor performer because it has a sync right in the middle of it. Yours is much higher performance. Anyway, I'm going to close this since we're shipping the code, and look in more detail at the semaphore wait lag in issue 193.
Status: Verified
Labels:
Type-Defect
Priority-High
Milestone-NextStable