Export to GitHub

tenfourfox - issue #191

Convert NSPR and Chromium mutexes to kernel atomic operations


Posted on Nov 16, 2012 by Massive Rhino

Between 10 and 17, certain sites regressed mightily in performance on MP systems. The most egregious was ifixit, which renders fine on a 1GHz iMac G4 and horribly slowly on a quad G5. Shark shows almost 47% of runtime spent waiting for semaphore locks. Mozilla changed JS from its own lock to NSPR locks in this timeframe; JS used to use libkern ATOMIC locks.

Stubbing the NSPR locks in JS to call libkern ATOMIC locks again, plus adding a small customized atomic set routine (ATOMIC in 10.4 does not have an atomic set, just an atomic compare-and-swap), restores almost all of the performance. This is being shipped in 17.0.
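
For illustration, a minimal sketch in C of what such a set routine can look like when built from the compare-and-swap that 10.4 does provide; the shipped TenFourFoxAtomicSet may differ in detail:

#include <libkern/OSAtomic.h>

/* Hedged sketch of an atomic set built from 10.4's compare-and-swap;
 * the shipped TenFourFoxAtomicSet may differ. Returns the previous
 * value, NSPR-style. */
static int32_t
AtomicSetSketch(volatile int32_t *ptr, int32_t newValue)
{
    int32_t oldValue;
    do {
        oldValue = *ptr;
    } while (!OSAtomicCompareAndSwap32Barrier(oldValue, newValue, ptr));
    return oldValue;
}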

We should simply replace NSPR locks with libkern locks in the whole system, not just JavaScript, which should overcome this bottleneck particularly in NSS which uses them heavily. In particular, we should replace PR_ATOMIC_INCREMENT, _DECREMENT, _ADD and _SET with OSAtomicIncrement32Barrier, OSAtomicDecrement32Barrier, OSAtomicAdd32Barrier and TenFourFoxAtomicSet, and move TenFourFoxAtomicSet out of JS and into NSPR. This should land on 17.0.1 and 18. It will probably marginally help uniprocessor systems, but it will definitely help multiprocessor machines.
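
A sketch of the proposed substitution, assuming NSPR's standard macro names; the actual patch may differ:

#include <libkern/OSAtomic.h>

/* Hedged sketch of the macro substitution described above; the actual
 * TenFourFox patch may differ. TenFourFoxAtomicSet is the custom set
 * routine mentioned earlier. */
#define PR_ATOMIC_INCREMENT(val)   OSAtomicIncrement32Barrier((int32_t *)(val))
#define PR_ATOMIC_DECREMENT(val)   OSAtomicDecrement32Barrier((int32_t *)(val))
#define PR_ATOMIC_ADD(ptr, val)    OSAtomicAdd32Barrier((int32_t)(val), (int32_t *)(ptr))
#define PR_ATOMIC_SET(val, newval) TenFourFoxAtomicSet((int32_t *)(val), (int32_t)(newval))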

Comment #1

Posted on Nov 16, 2012 by Massive Rhino

Mozilla is trying to purge locks in JS, period, in Fx19: https://bugzilla.mozilla.org/show_bug.cgi?id=772722

Still, we want this for the rest of the browser.

Comment #2

Posted on Nov 16, 2012 by Massive Rhino

And it looks like bug 675078 was to blame; it didn't reland until earlier this year. https://bugzilla.mozilla.org/show_bug.cgi?id=675078

Wonder why we didn't notice this in Fx12. Anyway.

Comment #3

Posted on Nov 17, 2012 by Massive Rhino

Undoubtedly some of this is also due to the poor performance of PR_Lock, which is implemented in terms of pthread_mutex, a known poor performer on OS X. So let's limit this to NSPR atomic operations for now.

Comment #4

Posted on Nov 17, 2012 by Massive Rhino

And, for that matter, the NSPR atomics are themselves internally implemented in terms of pthread_mutex_lock, so simply replacing them throughout the rest of the source should improve things. There's a lot of PR_Lock in hashtables, though.
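
For reference, this is roughly what the lock-based fallback costs; a simplified sketch, not NSPR's exact pratom.c code:

#include "prlock.h"

/* Hedged sketch of the emulation NSPR falls back to when
 * _PR_HAVE_ATOMIC_OPS is not defined: every "atomic" operation takes
 * and releases a PR_Lock, i.e. a pthread mutex on this platform.
 * Simplified; the real pratom.c may stripe across several locks. */
static PRLock *atomic_lock;    /* assume created at startup with PR_NewLock() */

static PRInt32
EmulatedAtomicIncrement(PRInt32 *val)
{
    PRInt32 rv;
    PR_Lock(atomic_lock);
    rv = ++(*val);
    PR_Unlock(atomic_lock);
    return rv;
}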

Comment #5

Posted on Nov 17, 2012 by Massive Rhino

Hmm. In pratom.h,

#elif ((__GNUC__ > 4) || (__GNUC__ == 4 && __GNUC_MINOR__ >= 1)) && \

So it's only because we're still on gcc 4.0.1. This should "just work" on gcc 4.6+. Thus this only needs to be done for 17.

Comment #6

Posted on Nov 18, 2012 by Massive Rhino

Here's another site that threads crash into each other on: www.kfiam640.com

Comment #7

Posted on Nov 19, 2012 by Massive Rhino

After patching this out, overhead drops a bit, but we're still spending ~40% of our time in semaphore lock on the affected sites. Shark looks like this:

0.0%    95.2%   libSystem.B.dylib   _pthread_body   
0.0%    90.5%   libnspr4.dylib   PR_Select  
0.0%    61.9%   XUL   XRE_AddStaticComponent    
0.0%    57.1%   XUL    0x3eef784 [176B] 
0.0%    57.1%   XUL     XRE_AddStaticComponent  
0.0%    47.6%   XUL      XRE_AddStaticComponent 
0.0%    42.9%   libnspr4.dylib        PR_Wait   
0.0%    42.9%   libnspr4.dylib         PR_WaitCondVar   
0.0%    42.9%   libSystem.B.dylib           pthread_cond_wait   
42.9%   42.9%   libSystem.B.dylib            semaphore_wait_signal_trap 
0.0%    4.8%    libnspr4.dylib        PR_WaitCondVar    
0.0%    4.8%    XUL      NS_CycleCollectorSuspect_P 
0.0%    4.8%    XUL      0x299ec58 [788B]   
0.0%    4.8%    libnspr4.dylib     PR_WaitCondVar   
0.0%    9.5%    XUL   0x29b2afc [884B]  
0.0%    4.8%    XUL   XRE_StartupTimelineRecord 
0.0%    4.8%    XUL   js_GetScriptLineExtent(JSScript*) 
0.0%    4.8%    XUL   js::IterateChunks(JSRuntime*, void*, void (*)(JSRuntime*, void*, js::gc::Chunk*)) 
0.0%    4.8%    XUL   js::IndirectWrapper::toWrapper()  
0.0%    4.8%    XUL  std::deque<FilePath, std::allocator<FilePath> >::_M_push_back_aux(FilePath const&) 
0.0%    4.8%    firefox start   

Comment #8

Posted on Nov 20, 2012 by Massive Rhino

After review, we actually do want to keep using OS X internal atomics even after the compiler switch because OS X self-optimizes the atomics on uniprocessor macs and the gcc intrinsics don't. We will also have G3/G4 and G5 versions of the atomic set (G3/G4 using eieio/isync, G5 using lwsync/isync).
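
A hedged sketch of the two barrier flavors as gcc inline assembly; the shipped TenFourFox assembly lives in .s files and may differ, and TFF_G5 is a hypothetical build guard (there is no standard predefined macro for the 970):

/* Hedged sketch of a barriered PowerPC atomic set. TFF_G5 is a
 * hypothetical guard for the G5 build; the shipped code may differ. */
static int32_t
BarrieredAtomicSetSketch(volatile int32_t *ptr, int32_t newValue)
{
    int32_t oldValue;
    __asm__ __volatile__(
#ifdef TFF_G5
        "lwsync\n\t"              /* G5: lightweight sync */
#else
        "eieio\n\t"               /* G3/G4: order prior storage accesses */
#endif
        "1: lwarx %0,0,%2\n\t"    /* load old value and take a reservation */
        "stwcx. %1,0,%2\n\t"      /* store new value if reservation held */
        "bne- 1b\n\t"             /* lost the reservation: retry */
        "isync"                   /* discard speculatively fetched insns */
        : "=&r"(oldValue)
        : "r"(newValue), "r"(ptr)
        : "cr0", "memory");
    return oldValue;
}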

Comment #9

Posted on Nov 20, 2012 by Massive Rhino

The fully barriered AtomicSet reduces the V8 benchmark score by nearly 10%. That's bad, and JavaScript is something we can actually test. So let's create a non-barrier AtomicSet and use that for JS, give JavaScript non-barriered OS X atomics as well, and keep the memory-barrier version for the rest of the browser.

Comment #10

Posted on Nov 24, 2012 by Massive Rhino

Initial version done for 17.0.1pre

Comment #11

Posted on Nov 26, 2012 by Happy Rabbit

Did you notice that the existing but (currently) unused implementations of the NSPR atomics are non-barrier versions? (See nsprpub/pr/src/md/unix/os_Darwin_ppc.s.) And there is also an implementation wrapped around the OSAtomic functions, likewise without barriers. It's ifdef'ed to be used on ARM CPUs only. That one is actually in use - only ppc (32-bit only), i386, and x86_64 use the gcc intrinsics instead of the existing machine-dependent implementations.

I really can't see (in the Aurora 19 sources) how, even with gcc 4.0.1 on 10.4, NSPR would build the PR_Lock code path. There must be something going wrong in nsprpub/pr/include/md/_darwin.h, where _PR_HAVE_ATOMIC_OPS gets defined, which should cause the atomics in nsprpub/pr/src/md/unix/os_Darwin_ppc.s to be used.

The attached patch makes NSPR use the OSAtomic functions on all OS X versions from 10.4 on, independent of the architecture, the way it was already implemented for ARM CPUs (sketched below). Furthermore it provides implementations of PR_StackPush and PR_StackPop, copied and adapted from their AIX counterparts, getting rid of a nasty dependency on PR_Lock. I wonder, though, whether the non-barrier version is really the appropriate one. I made PR_StackPush and PR_StackPop use the barrier versions because the AIX version explicitly uses them even though non-barrier versions are also available there.
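
For readers without the attachment, the shape of the _darwin.h change is roughly this; the macro names are NSPR's, but the exact patch may differ:

#include <libkern/OSAtomic.h>

/* Hedged sketch: enable the OSAtomic-based path (previously ARM-only)
 * on every OS X architecture. The actual attached patch may differ. */
#define _PR_HAVE_ATOMIC_OPS
#define _MD_INIT_ATOMIC()
#define _MD_ATOMIC_INCREMENT(val)  OSAtomicIncrement32(val)
#define _MD_ATOMIC_DECREMENT(val)  OSAtomicDecrement32(val)
#define _MD_ATOMIC_ADD(ptr, val)   OSAtomicAdd32(val, ptr)
/* _MD_ATOMIC_SET has no direct OSAtomic equivalent; a compare-and-swap
 * loop like the one sketched in the issue description fills the gap. */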

So far the browser works as always and performance is good. PR_StackPush and PR_StackPop are used quite frequently from what I see when setting breakpoints in them.

Attachments

Comment #12

Posted on Nov 26, 2012 by Massive Rhino

I'll definitely take the PR_Stack* ops, anything to reduce the use of PR_Lock. I'll have to study them later to see how they work, but I'm all for anything that gets us out of mutex hell.

I'm not as happy with the _MD_ATOMIC_SET: I'd rather use our assembler ones because we know exactly what is getting emitted. It doesn't get a lot simpler than lwarx/stwcx. in a loop. But if the browser seems okay with a non-barrier atomic_set, we could just use that everywhere. (It seems to be fine, but there's a lot of atomic usage in NSS and I really don't want to get that wrong -- maybe the best solution is to try to get NSPR and NSS testing working at the same time.)

Comment #13

Posted on Nov 26, 2012 by Happy Rabbit

The advantage of _MD_ATOMIC_SET as implemented upstream, however, is that the kernel does its multi-/uniprocessor optimization. I knew you wouldn't be too happy with it. The optimizations to AtomicSet should definitely go into nsprpub/pr/src/md/unix/os_Darwin_ppc.s:__PR_DarwinPPC_AtomicSet. My patch also gets the NSPR tests working here. Just "(g)make runtests" in {build root}/nsprpub/pr/tests.

Comment #14

Posted on Nov 26, 2012 by Massive Rhino

That's true (about the optimizations) but if we can get away with no barriers everywhere, then we don't need it because we'll already be working barrier-free.

I know about os_Darwin_ppc.s, but for some reason the build system was not picking up my changes, maybe a bug, which is why I implemented it in os_Darwin.s instead. I'll have to mess with that.

Looks like the PR_Stack* usage comes from NSPR's file descriptor cache, so it makes sense that they get used a lot.

Comment #15

Posted on Nov 26, 2012 by Happy Rabbit

Since we're already dealing with functions in the so-called comm page: the Objective-C/C++ direct-dispatch flag isn't fully implemented in gcc 4.6 and later. It generates calls to objc_msgSend_Fast and objc_assign_ivar_Fast, which are just stubs that Apple gcc replaces with local branches to 0xFFFEFF00 or 0xFFFEFEC0. I already looked into that, and it seems this can be implemented as builtins.

Comment #16

Posted on Nov 26, 2012 by Massive Rhino

Is that needed to compile? I'm still ripping out the 4.0.1 stuff from the 17 changesets, so I'm at least a week or two away from even trying to build 19.

Comment #17

Posted on Nov 26, 2012 by Happy Rabbit

No, it's not needed as long as you don't pass the flag -fobjc-direct-dispatch to gcc. Direct dispatching was only used in the 32-bit Objective-C runtime, so that might be the reason it wasn't implemented in current gcc. It's just a matter of performance once again.

Comment #18

Posted on Nov 26, 2012 by Massive Rhino

Ah, ok. We should spin that off into a separate issue.

Comment #19

Posted on Nov 26, 2012 by Happy Rabbit

I'll open an issue once I know more - maybe I'll first try to get the WebKit JavaScript interpreter working... The browser is working really well. It almost feels full speed, although the machine is building Aurora in the background. I haven't tested yet without the build running.

Comment #20

Posted on Nov 27, 2012 by Massive Rhino

Failing lots of tests, though I don't know what the state was prior to this.

Comment #21

Posted on Nov 27, 2012 by Massive Rhino

(So far it's failing bigfile3, dlltest {even with your patch}, getai, libfilename, and mbcs, and hanging on nameshm1.)

Comment #22

Posted on Nov 27, 2012 by Massive Rhino

Semaphores are failing bigtime; they just hang. I need to back this out and see if it passed to begin with.

Comment #23

Posted on Nov 27, 2012 by Massive Rhino

Nope, they don't. In fact, they just hang until they time out. This might have been the whole problem!

Hacking pr/src/pthreads/ptsynch.c, it seems that sem_otime never gets set to a non-zero value (but sem_ctime does). From bug 469744 https://bugzilla.mozilla.org/show_bug.cgi?id=469744, it doesn't look like semaphores were ever properly tested on 10.4. Changing it to sem_ctime causes it to succeed, and all the semaphore tests now pass.
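
For context, a self-contained sketch (not NSPR's exact ptsynch.c code) of the kind of wait that breaks: the opener polls IPC_STAT until the creator has performed a semop, which the kernel is supposed to signal via a non-zero sem_otime.

#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <unistd.h>

/* Hedged sketch of waiting for the creator of a SysV semaphore to
 * initialize it. NSPR keys off sem_otime, which 10.4's xnu never sets
 * non-zero; substituting sem_ctime makes the test pass but is racy. */
static int
wait_for_creator(int semid)
{
    struct semid_ds ds;
    union semun arg;    /* Darwin declares union semun in <sys/sem.h> */
    arg.buf = &ds;
    for (;;) {
        if (semctl(semid, 0, IPC_STAT, arg) == -1)
            return -1;
        if (ds.sem_otime != 0)    /* stays 0 forever on 10.4 */
            return 0;
        usleep(1000);             /* back off and poll again */
    }
}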

https://bugzilla.mozilla.org/show_bug.cgi?id=370420 implies that failure of some tests is a normal phenomenon, but that really sucks.

Comment #24

Posted on Nov 27, 2012 by Massive Rhino

Apparently I'm not going insane: http://lists.apple.com/archives/darwin-dev/2005/Mar/msg00147.html

Comment #25

Posted on Nov 27, 2012 by Massive Rhino

After thinking it over, the ctime hack is going to lead to a race condition sooner or later. Since we have the choice of POSIX or SysV semaphores on 10.4, I changed md/_darwin.h to POSIX semaphores, and the semaphore tests all pass. I'm going to leave the rest alone for now. I restored the PR_Stack* ops and we'll see how the browser behaves when it's done building.
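
A minimal sketch of the md/_darwin.h switch, assuming NSPR's standard semaphore feature macros; the actual change may differ in placement:

/* Hedged sketch: select POSIX semaphores instead of SysV ones. */
#undef  _PR_HAVE_SYSV_SEMAPHORES
#define _PR_HAVE_POSIX_SEMAPHORES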

Comment #26

Posted on Nov 27, 2012 by Happy Rabbit

I forgot to mention the failing tests - I didn't mind the failures because the browser is obviously working well. And I have to "make clean" after any change to the NSPR sources, since dependency tracking doesn't seem to work there.

Comment #27

Posted on Nov 27, 2012 by Massive Rhino

Initial tests look good (with the change to PR_OpenSemaphore and your PR_Stack* ops, and removing all memory barriers). Obviously a lot less hanging in semaphores too. KFI is now bearable; I need to throw it through Shark again to see what's left to shake out, but this does look better. I'm going to do a more thorough test on it after work.

Comment #28

Posted on Nov 27, 2012 by Happy Rabbit

Comment deleted

Comment #29

Posted on Nov 27, 2012 by Happy Rabbit

Looking into the xnu sources shows that the sem_otime bug was fixed in 10.5 (http://fxr.watson.org/fxr/source/bsd/kern/sysv_sem.c?v=xnu-792.6.70 vs. http://fxr.watson.org/fxr/source/bsd/kern/sysv_sem.c?v=xnu-1228). There is a difference in the test results between SysV and POSIX semaphores, though: with SysV semaphores ranfile waits forever in pthread_cond_wait, and with POSIX semaphores nonblock fails.

Comment #30

Posted on Nov 27, 2012 by Happy Rabbit

No, those two tests sometimes fail and sometimes pass, independent of SysV or POSIX semaphores. So both do work on 10.5.

Comment #31

Posted on Nov 27, 2012 by Massive Rhino

If you use POSIX semaphores, does the browser still work on 10.5, or do you get stalls? It's going to be really hard to ship a browser that selects the semaphore type dynamically.

Comment #32

Posted on Nov 27, 2012 by Massive Rhino

(nvm, you just answered my question)

Comment #33

Posted on Nov 27, 2012 by Happy Rabbit

When you search the whole source code for OSAtomic you will find quite a few uses of those functions - and all of them (except the upcoming, unused jemalloc 3) use the barrier versions. And that said, on an architecture with out-of-order instruction scheduling, a lock is a true lock only when it deals with that reordering. For locking purposes you have to pay the price of using multiple CPUs and do the syncing and memory barriers.

I also found out that Chromium IPC uses OSAtomic only on Intel OS X - I changed that to apply to all OS X systems independent of the CPU architecture (sketched below). I was surprised to find out that cairo apparently uses pthread mutexes: there is code to use system atomics, but the necessary defines aren't made anywhere. Is cairo used for anything more than an Azure backend fallback?
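
The shape of the Chromium change is roughly this; the guard names are approximate, and the actual patch is attached to a later comment (the header was subsequently renamed to drop "x86"):

/* Hedged sketch of the ipc/chromium atomicops guard change: drop the
 * x86-only condition so OS X always takes the OSAtomic-backed path. */
#if defined(OS_MACOSX)
/* previously: #if defined(OS_MACOSX) && defined(ARCH_CPU_X86_FAMILY) */
#include "base/atomicops_internals_x86_macosx.h"
#endif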

Comment #34

Posted on Nov 28, 2012 by Massive Rhino

I'll fix the Chromium code. Cairo is used heavily for graphics still -- we don't have Azure content yet. Maybe we need to add atomic support there too.

However, assuming this quad G5 is the worst case for CPU on Power Mac, I haven't had any problems banging on it this afternoon. Web workers and multiple tabs loading are all rock solid. That said, PR_Wait is still showing up a lot on kfi, so while we've made a lot of other things a lot better, we haven't fixed its core issue.

Comment #35

Posted on Nov 28, 2012 by Massive Rhino

Actually, let's use barrier for Chromium IPC. If anything would break with a non-barrier sequence, it would be IPC.

Comment #36

Posted on Nov 28, 2012 by Happy Rabbit

Well, here is what I did to the Chromium IPC.

Attachments

Comment #37

Posted on Nov 28, 2012 by Massive Rhino

Why does the x86.h file need to be deleted?

Comment #38

Posted on Nov 28, 2012 by Happy Rabbit

No, I renamed it (just removed "x86" from its name, because it's nonsense there), but that's not obvious in diffs.

Comment #39

Posted on Nov 28, 2012 by Happy Rabbit

I finally found out that gfx/cairo/cairo/src/cairo-platform.h defines CAIRO_NO_MUTEX, which short-circuits all mutexes in cairo. So there shouldn't be any problem in cairo at all.
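
For reference, the define in question (exact form may vary by tree version):

/* In gfx/cairo/cairo/src/cairo-platform.h: short-circuits all of
 * cairo's mutex usage. */
#define CAIRO_NO_MUTEX 1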

Comment #40

Posted on Nov 28, 2012 by Massive Rhino

Too bad about Cairo. Did you make any changes to the x86 file? For testing I might just simplify things by including it as-is, and we can tidy it up later (not that Chromium changes a lot, but just to reduce the merge burden).

Comment #41

Posted on Nov 28, 2012 by Happy Rabbit

No changes; just including the x86 header on ppc machines will work in that case.

Comment #42

Posted on Nov 28, 2012 by Happy Rabbit

I reimplemented PR_StackPush/Pop using OSSpinLock after examining the kernel code for it - it was made exactly for this purpose, and it applies memory barriers and does syncing on SMP systems.

#include <libkern/OSAtomic.h>

static OSSpinLock stackLock = OS_SPINLOCK_INIT;

PR_IMPLEMENT(void) PR_StackPush(PRStack *stack, PRStackElem *stack_elem)
{
    OSSpinLockLock(&stackLock);
    stack_elem->prstk_elem_next = stack->prstk_head.prstk_elem_next;
    stack->prstk_head.prstk_elem_next = stack_elem;
    OSSpinLockUnlock(&stackLock);
    return;
}

PR_IMPLEMENT(PRStackElem *) PR_StackPop(PRStack *stack)
{
    PRStackElem *element;

    OSSpinLockLock(&stackLock);
    element = stack->prstk_head.prstk_elem_next;
    if (element != NULL) {
        stack->prstk_head.prstk_elem_next = element->prstk_elem_next;
        element->prstk_elem_next = NULL;        /* debugging aid */
    }
    OSSpinLockUnlock(&stackLock);
    return element;
}

Comment #43

Posted on Nov 28, 2012 by Massive Rhino

Landed spinlock change and Chromium atomics. Looks good so far.

Comment #44

Posted on Nov 28, 2012 by Massive Rhino

Okay, I'm pretty satisfied with how it's acting. QTE works fine, SSL and SPDY look good. Flipping over to build 7450. If that's good too, we'll build 7400 and G3 and make this a second beta.

Comment #45

Posted on Nov 30, 2012 by Happy Lion

I don't know if it's caused by this bug, but https://productforums.google.com/forum/#!categories/websearch loads veeeeery slowly (ca. 50 s.) in TFF 17.0 on my Mac 7450.

Comment #46

Posted on Nov 30, 2012 by Massive Rhino

No. The new Google Groups (which includes the product forums) is just extremely JavaScript heavy. In fact, the changes above do improve its performance, though it is still slow.

Comment #47

Posted on Nov 30, 2012 by Happy Rabbit

No problems with that site here on Aurora 19 on a G4-7450@1.5GHz, even when it's building Mozilla in the background.

Comment #48

Posted on Nov 30, 2012 by Happy Rabbit

This ancient Mozilla bug is interesting: https://bugzilla.mozilla.org/show_bug.cgi?id=194339 Our solution is far more portable, though - and NSPR itself is rather poorly maintained.

Comment #49

Posted on Nov 30, 2012 by Massive Rhino

(No comment was entered for this change.)

Comment #50

Posted on Nov 30, 2012 by Massive Rhino

Yeah, their code is actually a very poor performer because it has a sync right in the middle of it. Yours is much higher performance. Anyway, I'm going to close this since we're shipping the code, and look in more detail at the semaphore wait lag in issue 193.

Status: Verified

Labels:
Type-Defect Priority-High Milestone-NextStable