
tenfourfox - issue #23

G5-specific nanojit profiling


Posted on Jan 18, 2011 by Massive Rhino

Find those operations that are faster on G5 with the nanojit. Dromaeo sans SunSpider is a win even for G5, so we know they exist. Spun off issue 20.

Comment #1

Posted on Jan 24, 2011 by Massive Rhino

SunSpider profile. Tests whose traces perform the same or better are marked; unmarked tests perform worse. Next: see if there is a LIR commonality between the bad traces.

============================================

RESULTS (means and 95% confidence intervals)

Total: 5242.2ms +/- 0.3%

3d: 990.2ms +/- 0.5%
  cube: 174.5ms +/- 1.1%
  morph: 634.0ms +/- 0.7%
  raytrace: 181.7ms +/- 0.9%

access: 1121.7ms +/- 1.0%
  binary-trees: 53.8ms +/- 1.9% < same
  fannkuch: 702.0ms +/- 1.4%
  nbody: 62.2ms +/- 1.8% < better
  nsieve: 303.7ms +/- 1.4%

bitops: 812.2ms +/- 1.1%
  3bit-bits-in-byte: 56.9ms +/- 1.6% < better
  bits-in-byte: 216.8ms +/- 2.0%
  bitwise-and: 259.4ms +/- 1.0%
  nsieve-bits: 279.1ms +/- 2.1%

controlflow: 64.1ms +/- 0.8%
  recursive: 64.1ms +/- 0.8% < same

crypto: 264.3ms +/- 1.1%
  aes: 180.4ms +/- 1.4%
  md5: 35.4ms +/- 1.7% < better
  sha1: 48.5ms +/- 1.0% < better

date: 210.4ms +/- 1.6%
  format-tofte: 166.5ms +/- 2.0%
  format-xparb: 43.9ms +/- 2.2% < better

math: 732.5ms +/- 0.7%
  cordic: 387.9ms +/- 1.1%
  partial-sums: 117.1ms +/- 0.8% < same
  spectral-norm: 227.5ms +/- 0.5%

regexp: 568.2ms +/- 0.3%
  dna: 568.2ms +/- 0.3% < same

string: 478.6ms +/- 0.5%
  base64: 92.6ms +/- 1.3%
  fasta: 111.3ms +/- 0.9%
  tagcloud: 114.7ms +/- 1.2% < same
  unpack-code: 100.8ms +/- 0.7% < same
  validate-input: 59.2ms +/- 1.1% < same

Comment #2

Posted on Jan 24, 2011 by Massive Rhino

For comparison,

============================================

RESULTS (means and 95% confidence intervals)

Total: 3377.9ms +/- 0.4%

3d: 425.0ms +/- 0.7%
  cube: 157.1ms +/- 0.5%
  morph: 144.7ms +/- 1.3%
  raytrace: 123.2ms +/- 1.0%

access: 592.4ms +/- 0.3%
  binary-trees: 53.3ms +/- 1.7%
  fannkuch: 311.1ms +/- 0.3%
  nbody: 136.5ms +/- 0.8%
  nsieve: 91.5ms +/- 0.7%

bitops: 493.0ms +/- 0.7%
  3bit-bits-in-byte: 102.7ms +/- 0.9%
  bits-in-byte: 130.3ms +/- 0.7%
  bitwise-and: 87.6ms +/- 1.3%
  nsieve-bits: 172.4ms +/- 1.5%

controlflow: 64.2ms +/- 0.9%
  recursive: 64.2ms +/- 0.9%

crypto: 215.8ms +/- 0.4%
  aes: 90.7ms +/- 0.6%
  md5: 60.0ms +/- 0.6%
  sha1: 65.1ms +/- 1.0%

date: 147.1ms +/- 1.5%
  format-tofte: 87.6ms +/- 0.6%
  format-xparb: 59.5ms +/- 3.5%

math: 441.3ms +/- 2.5%
  cordic: 211.7ms +/- 1.1%
  partial-sums: 121.9ms +/- 8.4%
  spectral-norm: 107.7ms +/- 0.7%

regexp: 567.1ms +/- 0.2%
  dna: 567.1ms +/- 0.2%

string: 432.0ms +/- 0.6%
  base64: 63.4ms +/- 1.8%
  fasta: 101.6ms +/- 0.8%
  tagcloud: 112.1ms +/- 1.7%
  unpack-code: 97.8ms +/- 1.0%
  validate-input: 57.1ms +/- 1.1%
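The same/better marks in comment 1 can be reproduced mechanically from the two runs. Here is a minimal Python sketch that compares each test's time in the first run against the comparison run. The 5% tolerance for "same" is an assumption chosen for illustration, not a criterion stated in this issue, and the names `TIMES`, `TOLERANCE`, `classify` and `classify_all` are hypothetical.

```python
# Per-test SunSpider times in ms, copied from the two runs above:
# (first run from comment 1, comparison run from comment 2).
TIMES = {
    "cube": (174.5, 157.1), "morph": (634.0, 144.7), "raytrace": (181.7, 123.2),
    "binary-trees": (53.8, 53.3), "fannkuch": (702.0, 311.1),
    "nbody": (62.2, 136.5), "nsieve": (303.7, 91.5),
    "3bit-bits-in-byte": (56.9, 102.7), "bits-in-byte": (216.8, 130.3),
    "bitwise-and": (259.4, 87.6), "nsieve-bits": (279.1, 172.4),
    "recursive": (64.1, 64.2),
    "aes": (180.4, 90.7), "md5": (35.4, 60.0), "sha1": (48.5, 65.1),
    "format-tofte": (166.5, 87.6), "format-xparb": (43.9, 59.5),
    "cordic": (387.9, 211.7), "partial-sums": (117.1, 121.9),
    "spectral-norm": (227.5, 107.7),
    "dna": (568.2, 567.1),
    "base64": (92.6, 63.4), "fasta": (111.3, 101.6), "tagcloud": (114.7, 112.1),
    "unpack-code": (100.8, 97.8), "validate-input": (59.2, 57.1),
}

# Assumption: a test within 5% of the comparison run counts as "same".
TOLERANCE = 0.05


def classify(first_ms, comparison_ms, tolerance=TOLERANCE):
    """Label the first run relative to the comparison run (lower time is better)."""
    ratio = first_ms / comparison_ms
    if ratio < 1 - tolerance:
        return "better"
    if ratio > 1 + tolerance:
        return "worse"
    return "same"


def classify_all(times=TIMES):
    """Return {test name: "better" | "same" | "worse"} for every test."""
    return {name: classify(a, b) for name, (a, b) in times.items()}


if __name__ == "__main__":
    for name, verdict in classify_all().items():
        print(f"{name:18s} {verdict}")
```

With this tolerance the "better" and "same" labels line up with the marks in comment 1; a tighter or looser tolerance would move borderline tests such as partial-sums and validate-input.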

Comment #3

Posted on Jan 26, 2011 by Massive Rhino

Analysis of JSOPs that were not used in the same or better tests:

used: ursh used:

not used: ne used: ifeq used: moreiter used: le used: not used: dup2 used: string used: double used: trace used: bindgname used: bitxor used: setprop not used: lineno not used: uint24 used: eq used: neg used: bitor used: ifne used: setarg not used: top used: one used: getelem used: callarg used: and used: ge used: int8 not used: lambda used: callgname not used: gnameinc used: true not used: getfcslot used: rop used: callglobal used: forlocal used: bitnot used: zero used: enditer used: getglobal used: notrace not used: localdec not used: prop used: ng used: length used: regexp used: getthisprop used: gt used: initelem used: pop not used: deflocalfun used: mod used: getlocal used: bitand used: false used: newarray used: imtop used: or used: incgname used: setlocal used: getgname used: new not used: calllocal used: this used: iter used: getarg used: lsh used: null used: localinc used: lt used: push used: nullblockchain used: uint16 used: div used: rsh used: callprop not used: nop used: add used: callname used: getlocalprop used: mul used: call used: goto not used: eval used: setgname used: stop used: getprop used: setelem used: return used: sub used: endinit used: inclocal

Comment #4

Posted on Jan 26, 2011 by Massive Rhino

A sample build with JSOP_GETFCSLOT, JSOP_LAMBDA, JSOP_DEFLOCALFUN, and JSOP_CALLLOCAL reduced to ARECORD_ABORTED in jstracer.cpp showed dramatically faster JS across the board. Time to figure out the actual offender of the four -- or it could be all of them. However, we now have TraceMonkey benching better than interpreter for the first time on G5!!! Let's do this for beta 11!

Comment #5

Posted on Jan 26, 2011 by Massive Rhino

Unfortunately the speedup was only in debug mode. Actual browser performance did improve, but only from 5200ms to around 4700ms. To get a significant win, we need to be under 3000ms.

JSOPs audit: the slow ones appear to be JSOP_LINENO (???), _UINT24, _CALLLOCAL and _GNAMEDEC/INC (LINENO is uncertain because I don't have good testing coverage for it). The other ops made little difference whether on or off, and some got worse.

The next steps are: 1) Look at the instructions used by the faster ones only, and abort tracing for the other ops. This may not be possible. 2) These ones seem to have stack issues. Perhaps the stack is the problem, but I'm not sure yet.

Comment #6

Posted on Jan 30, 2011 by Massive Rhino

Current set of blacklisted JSOPs: NEG, anything calling setElem, CALLNAME, LINENO, UINT24, CALLLOCAL, GNAMEDEC, GNAMEINC. This gets us to 3700ms in SunSpider and wins on both Dromaeo and V8, so this is good enough to ship.
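As an illustration of what blacklisting means here: when the recorder meets a blacklisted op it gives up on the trace, and that code keeps running in the interpreter. The Python sketch below is a toy model only. It is not the jstracer.cpp code, the names `BLACKLIST` and `record_trace` are hypothetical, and "anything calling setElem" is left out because it is not a single opcode name.

```python
# Toy model of a JSOP blacklist. Op names are the named entries from the
# list above, lowercased; "anything calling setElem" is omitted because it
# is not an opcode name.
BLACKLIST = {"neg", "callname", "lineno", "uint24",
             "calllocal", "gnamedec", "gnameinc"}


def record_trace(ops, blacklist=BLACKLIST):
    """Walk a sequence of op names as a recorder would.

    Returns ("aborted", op) at the first blacklisted op, meaning the code
    stays in the interpreter, or ("recorded", ops) if the whole sequence
    is acceptable for tracing.
    """
    for op in ops:
        if op in blacklist:
            return ("aborted", op)
    return ("recorded", list(ops))


if __name__ == "__main__":
    print(record_trace(["getlocal", "add", "setlocal"]))
    print(record_trace(["getlocal", "neg", "setlocal"]))
```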

Comment #7

Posted on Jan 31, 2011 by Massive Rhino

changing flags

Comment #8

Posted on Jan 31, 2011 by Massive Rhino

On our internal pull, RealClearPolitics has trouble with clicking on links. This does work in b9 with the nanojit on. Not sure if it's our blacklist or the interpreter, so making a note to recheck this after our next pull.

Comment #9

Posted on Feb 3, 2011 by Massive Rhino

Fixed by pull, so conclude Mozilla bug.

Comment #10

Posted on Mar 22, 2011 by Massive Rhino

Dropping priority as we appear to have reached a maximum for G5.

Comment #11

Posted on Apr 5, 2011 by Massive Rhino

Here is something interesting, from glibc:

/* long int[r3] __lrint (double x[fp1]) */
ENTRY (__lrint)
    stwu r1,-16(r1)
    fctiw fp13,fp1
    stfd fp13,8(r1)
    nop /* Insure the following load is in a different dispatch group */
    nop /* to avoid pipe stall on POWER4&5. */
    nop
    lwz r3,12(r1)
    addi r1,r1,16
    blr
    END (__lrint)

This might be useful for ::asm_d2i -- we could insert some nop()s there.

Comment #12

Posted on Apr 5, 2011 by Massive Rhino

Other interesting optimizations:

http://sourceware.org/ml/libc-ports/2005-12/msg00004.html

mtctr rTMP /* Power4 wants mtctr 1st in dispatch group */

And they do use the same trick for fctid:

+ENTRY (__llrintf)
+ CALL_MCOUNT
+ fctid fp13,fp1
+ stfd fp13,-8(r1)
+ nop /* Insure the following load is in a different dispatch group */
+ nop /* to avoid pipe stall on POWER4&5. */
+ nop
+ lwz r3,-8(r1)
+ lwz r4,-4(r1)
+ blr
+ END (__llrintf)

Comment #13

Posted on Apr 5, 2011 by Massive Rhino

And we also need to get MCRXR out of the nanojit; it is NOT native on G5! Argh! No wonder the G4 runs rings around it! We should replace it with equivalent mtxer and mfxer (i.e., mfspr rT,1 and mtspr 1,rT) for G5. Something like

  • mfxer Rx
  • mtcrf 0, Rx

and, if we need the XER cleared (we probably should),

  • rlwinm Rx,Rx,0,0,28
  • mtxer Rx

should work ...
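To check the idea on paper, here is a small Python model of the register effects, treating CR and XER as 32-bit values. It is a sketch under stated assumptions, not the code that went into the nanojit: it reads "mtcrf 0" as "CR field 0" (field mask 0x80), and it clears XER bits 0-3 in IBM numbering (the top four bits), which is what mcrxr itself clears, so the rlwinm operands in a real implementation would have to be chosen to produce that mask. All function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF
XER_TOP4 = 0xF0000000  # IBM bits 0-3 of XER: SO, OV, CA and one reserved bit


def mcrxr(cr, xer, field=0):
    """Model mcrxr: copy XER[0:3] into the given CR field, then clear XER[0:3]."""
    shift = 28 - 4 * field
    nibble = (xer >> 28) & 0xF
    cr = (cr & ~(0xF << shift) & MASK32) | (nibble << shift)
    xer &= ~XER_TOP4 & MASK32
    return cr, xer


def mtcrf(cr, fxm, rs):
    """Model mtcrf: for each bit set in the 8-bit field mask, copy that
    4-bit field of rs into the same field of CR (mask bit 0x80 is CR0)."""
    for i in range(8):
        if fxm & (0x80 >> i):
            field_mask = 0xF << (28 - 4 * i)
            cr = (cr & ~field_mask & MASK32) | (rs & field_mask)
    return cr


def g5_sequence(cr, xer):
    """Model the replacement: mfxer, mtcrf into CR0, mask, mtxer."""
    rx = xer                   # mfxer Rx
    cr = mtcrf(cr, 0x80, rx)   # assumption: field mask 0x80 selects CR0
    rx &= ~XER_TOP4 & MASK32   # assumption: clear the bits mcrxr would clear
    xer = rx                   # mtxer Rx
    return cr, xer


if __name__ == "__main__":
    for xer in (0x00000000, 0x20000000, 0xC0000005, 0xE000007F):
        assert mcrxr(0x12345678, xer) == g5_sequence(0x12345678, xer)
    print("sequence matches mcrxr 0 on the sampled values")
```

Under these assumptions the four-instruction sequence leaves CR and XER in the same state as `mcrxr 0` for every sampled input; the low XER bits (the string byte count) pass through untouched in both.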

http://www.macintouch.com/tiger20.html and from Common Lisp,

(in-package "CCL")

(defppclapfunction do-mcrxr ((n arg_z))
  loop
  (cmpwi :cr1 arg_z '1)
  (mcrxr 0)
  (subi arg_z arg_z '1)
  (bge :cr1 loop)
  (blr))

(defppclapfunction do-mtxer ((n arg_z))
  loop
  (cmpwi :cr1 arg_z '1)
  (mtxer rzero)
  (subi arg_z arg_z '1)
  (bge :cr1 loop)
  (blr))

;;; (time (do-mcrxr 100000000))
;;; (time (do-mtxer 100000000))

Comment #14

Posted on Apr 5, 2011 by Massive Rhino

Apple code implies that mtcrf is okay for individual CR fields, iff it is one bitfield. From http://www.opensource.apple.com/source/xnu/xnu-1456.1.26/osfmk/ppc/bcopy.s

shortcopy: cmplw r12,r5 ; must move reverse if (dest-source)

Although the modified code should work for G4/G3, we will keep mcrxr on those systems to reduce icache pressure.

Comment #15

Posted on Apr 8, 2011 by Massive Rhino

BASE
Richards: 144
DeltaBlue: 210
Crypto: 107
RayTrace: 407
EarleyBoyer: 521
Score: 233

SunSpider now 1760

MTCRF (swapon)
Richards: 1551
DeltaBlue: 479
Crypto: 915
RayTrace: 351
EarleyBoyer: 478
Score: 648

MTCRF (swapoff)
Richards: 1560
DeltaBlue: 480
Crypto: 910
RayTrace: 352
EarleyBoyer: 479
Score: 649

MCRXR (swapon)
Richards: 679
DeltaBlue: 318
Crypto: 22.3
RayTrace: 344
EarleyBoyer: 461
Score: 238

We keep the swap. We lose the mcrxr for G5. Everybody wins.
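For anyone cross-checking the Score lines: the V8 benchmark suite reports the geometric mean of the individual test scores, which is why the single very low Crypto result drags the MCRXR run down so far. A short Python sketch that recomputes the four scores from the numbers in this comment follows; the names `RUNS` and `geometric_mean` are only illustrative.

```python
import math

# Individual V8 scores from this comment, in the order
# Richards, DeltaBlue, Crypto, RayTrace, EarleyBoyer.
RUNS = {
    "BASE": [144, 210, 107, 407, 521],
    "MTCRF (swapon)": [1551, 479, 915, 351, 478],
    "MTCRF (swapoff)": [1560, 480, 910, 352, 479],
    "MCRXR (swapon)": [679, 318, 22.3, 344, 461],
}


def geometric_mean(values):
    """Geometric mean computed via logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


if __name__ == "__main__":
    for name, scores in RUNS.items():
        print(f"{name}: {geometric_mean(scores):.1f}")
```

The recomputed values agree with the reported scores of 233, 648, 649 and 238 to within rounding.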

VERIFIED

Status: Verified

Labels:
Type-Defect Priority-Medium