nbody - Unrolling AELEM loops to AELEMFAST
------------------------------------------

In the [first
part](http://blogs.perl.org/users/rurban/2012/09/optimizing-compiler-benchmarks-part-1.html)
I showed some problems and possibilities of the B::C compiler and the
B::CC optimizing compiler with a regexp example that was hard to
optimize.

In the [second
part](http://blogs.perl.org/users/rurban/2012/10/optimizing-compiler-benchmarks-part-2.html)
I got two times faster run-times with the B::CC compiler on the
[nbody](http://shootout.alioth.debian.org/u32/performance.php?test=nbody) benchmark, which does a lot of arithmetic.

Two open problems were detected: slow function calls and slow array accesses.

First I inlined the most frequently called function, `sub advance`,
which is called N times, N being 5,000, 50,000 or 50,000,000.

    for (1..$n) {
        advance(0.01);
    }
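
Inlining here simply means pasting the body of `advance` into the loop, so the
ENTERSUB/LEAVESUB ops for the call disappear. Schematically (the actual body is
in the linked benchmark source):

    for (1..$n) {
        # body of sub advance pasted here verbatim, with the $dt
        # parameter replaced by the constant 0.01; no ENTERSUB/LEAVESUB
        # is executed per iteration
    }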

The runtime with N=50,000,000 went from 22m13.754s down to 21m48.015s,
25s less. This is not what I wanted.
php and jruby are at 12m and 9m. So it is not slow function calls,
it is slow array access.
Inspecting the opcodes shows that a lot of AELEM ops are used, for
reading and writing arrays.

AELEM checks for lvalue invocation and several more flags which are
already known at compile-time. A fast version, AELEMFAST, already
exists, but it only operates on literal constant indices known at
compile-time; the index is then stored at compile-time in the
op->private field.
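
To check which op a given access compiles to, you can dump the optree with
B::Concise (the exact output differs between perl versions): the constant
index below should show up as an `aelemfast` op (or an `aelemfast_lex`
variant on newer perls), the lexical index as a plain `aelem`:

    $ perl -MO=Concise,-exec -e 'my @a = (1..5); my $x = $a[3]'
    $ perl -MO=Concise,-exec -e 'my @a = (1..5); my $i = 3; my $x = $a[$i]'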

So instead of

    for (my $j = $i + 1; $j < $last + 1; $j++) {
        # inner-loop $j..4
        $dx = $xs[$i] - $xs[$j];
        $dy = $ys[$i] - $ys[$j];
        $dz = $zs[$i] - $zs[$j];
        ...

one could generate a macro-like string which just evals the array indices,
and generate the final function from this string.

Array accesses: `$a[const]` is optimized to AELEMFAST, `$a[$lexical]` is not.
So unroll the loop in macro-like fashion.

    $energy = '
    sub energy
    {
      my $e = 0.0;
      my ($dx, $dy, $dz, $distance);';
    for my $i (0 .. $last) {
      $energy .= "
      # outer-loop $i..4
      \$e += 0.5 * \$mass[$i] *
        (\$vxs[$i] * \$vxs[$i] + \$vys[$i] * \$vys[$i] + \$vzs[$i] * \$vzs[$i]);
      ";
      for (my $j = $i + 1; $j < $last + 1; $j++) {
        $energy .= "
        # inner-loop $j..4
        \$dx = \$xs[$i] - \$xs[$j];
        \$dy = \$ys[$i] - \$ys[$j];
        \$dz = \$zs[$i] - \$zs[$j];
        \$distance = sqrt(\$dx * \$dx + \$dy * \$dy + \$dz * \$dz);
        \$e -= (\$mass[$i] * \$mass[$j]) / \$distance;";
      }
    }
    $energy .= '
      return $e;
    }';
    eval $energy; die if $@;

Every `$i` and `$j` got expanded into a literal, 0 .. 4.
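
For example, with $i=0 and $j=1 expanded, the first generated chunk is
essentially:

    # outer-loop 0..4
    $e += 0.5 * $mass[0] *
      ($vxs[0] * $vxs[0] + $vys[0] * $vys[0] + $vzs[0] * $vzs[0]);
    # inner-loop 1..4
    $dx = $xs[0] - $xs[1];
    $dy = $ys[0] - $ys[1];
    $dz = $zs[0] - $zs[1];
    $distance = sqrt($dx * $dx + $dy * $dy + $dz * $dz);
    $e -= ($mass[0] * $mass[1]) / $distance;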

I did this loop unrolling for the three functions, and the results
were impressive. It is a nice little macro trick which you can also use
in normal uncompiled perl code. With compiled code the
loop unrolling should happen automatically.

Full code here: [nbody.perl-2.perl](https://github.com/rurban/shootout/commit/62b216756320e8c224eef2c933326924ab73c18a)

Original:

    $ perlcc --time -r -O -S -O1 --Wb=-fno-destruct,-Uwarnings,-UB,-UCarp ../shootout/bench/nbody/nbody.perl 50000
    script/perlcc: c time: 0.380353
    script/perlcc: cc time: 0.967525
    -0.169075164
    -0.169078071
    script/perlcc: r time: 2.214327

Unrolled:

    $ perlcc --time -r -O -S -O1 --Wb=-fno-destruct,-Uwarnings,-UB,-UCarp ../shootout/bench/nbody/nbody.perl-2.perl 50000
    script/perlcc: c time: 0.448817
    script/perlcc: cc time: 2.167499
    -0.169075164
    -0.169078071
    script/perlcc: r time: 1.341283

Another **2x faster!**

For comparison, the same effect uncompiled:

    $ time perl ../shootout/bench/nbody/nbody.perl 50000
    -0.169075164
    -0.169078071

    real    0m3.650s
    user    0m3.644s
    sys     0m0.000s

Unrolled:

    $ time perl ../shootout/bench/nbody/nbody.perl-2.perl 50000
    -0.169075164
    -0.169078071

    real    0m2.399s
    user    0m2.388s
    sys     0m0.004s

So we went from **3.6s** down to **2.4s**, and compiled to **1.3s**.

With N=50,000,000 we got **14m12.653s** uncompiled and **7m11.3597s**
compiled. Close to jruby, even though array accesses still go
through the `av_fetch` function, magic is checked, and undefined indices
are autovivified.


Generalization
--------------

The above macro code looks pretty unreadable, similar to lisp
macros, with its mix of quoted and unquoted variables. The compiler
needs to detect unrollable loop code which will lead to more
constants and AELEMFAST ops. And we had better define a helper function
for easier generation of such unrolled loops.

    # unquote local vars
    sub qv {
        my ($s, $env) = @_;
        # expand our local loop vars
        $s =~ s/(\$\w+?)\b/exists($env->{$1}) ? $env->{$1} : $1/sge;
        $s
    }
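
For instance, as a tiny standalone illustration of the helper (not part of
the benchmark):

    print qv('$e -= $mass[$i] * $mass[$j];', {'$i'=>0, '$j'=>1});
    # prints: $e -= $mass[0] * $mass[1];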

    $energy = '
    sub energy
    {
      my $e = 0.0;
      my ($dx, $dy, $dz, $distance);';
    for my $i (0 .. $last) {
      my $env = {'$i'=>$i, '$last'=>$last};
      $energy .= qv('
      # outer-loop $i..4
      $e += 0.5 * $mass[$i] *
        ($vxs[$i] * $vxs[$i] + $vys[$i] * $vys[$i] + $vzs[$i] * $vzs[$i]);', $env);
      for (my $j = $i + 1; $j < $last + 1; $j++) {
        $env->{'$j'} = $j;
        $energy .= qv('
        # inner-loop $j..4
        $dx = $xs[$i] - $xs[$j];
        $dy = $ys[$i] - $ys[$j];
        $dz = $zs[$i] - $zs[$j];
        $distance = sqrt($dx * $dx + $dy * $dy + $dz * $dz);
        $e -= ($mass[$i] * $mass[$j]) / $distance;', $env);
      }
    }
    $energy .= '
      return $e;
    }';
    eval $energy; die if $@;

This now looks much better, and done in a BEGIN block it incurs only a negligible
run-time penalty.
Full code here: [nbody.perl-2a.perl](https://github.com/rurban/shootout/commit/c35bb85ed84941157eb01b7ca844d3b4472e0df3)

I also tried a generic `unroll_loop()` function, but it was a bit too
unstable at finding the end of the loop blocks on the source level, and
`qv()` looked good enough. The compiler can use the optree to find the
optimization.


Types and autovivification
--------------------------

A naive optimization would check the index ranges beforehand and access
the array values directly, something the type optimizer for arrays would
do:

    my (num @xs[4], num @ys[4], num @zs[4]);
    my (num @vxs[4], num @vys[4], num @vzs[4]);
    my num @mass[4];

And `$xs[0] * $xs[1]`, which compiles to
AELEMFASTs, is currently inlined by B::CC to:

    { AV* av = MUTABLE_AV(PL_curpad[6]);
      SV** const svp = av_fetch(av, 0, 0);
      SV *sv = (svp ? *svp : &PL_sv_undef);
      if (SvRMAGICAL(av) && SvGMAGICAL(sv)) mg_get(sv);
      PUSHs(sv);
    }
    { AV* av = MUTABLE_AV(PL_curpad[6]);
      SV** const svp = av_fetch(av, 1, 0);
      SV *sv = (svp ? *svp : &PL_sv_undef);
      if (SvRMAGICAL(av) && SvGMAGICAL(sv)) mg_get(sv);
      PUSHs(sv);
    }
    rnv0 = POPn; lnv0 = POPn; /* multiply */
    d30_tmp = lnv0 * rnv0;

It should instead compile to:

    d30_tmp = (double)AvARRAY(PL_curpad[6])[0] *
              (double)AvARRAY(PL_curpad[6])[1];

With the size declaration you can omit the `av_fetch()` call and the undef
check ("autovivification"); with the type `num` you do not need to go
through the `SvNV` of the array element, since the value is stored directly;
and the type also guarantees that there is no magic to be checked. So
`AvARRAY(PL_curpad[6])[0]` would return a double.

And the stack handling (PUSH, PUSH, POP, POP) can also be optimized
away, since the ops are already inlined. That would get us close to
an optimizing compiler as with Haskell, Lua, PyPy or LISP. Not close
to Go or Java, as those languages are stricter.

I tried a simple B::CC AELEMFAST optimization together with "no autovivification",
which does not yet eliminate superfluous PUSH/POP pairs but could be applied
to typed arrays, and it leads to another 2x win.
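
The pragma here is the lexical `autovivification` module from CPAN; a minimal
sketch of how a script would opt in, assuming the module is installed:

    # at the top of the script; lexically disables autovivification,
    # which lets the compiler skip the vivification path for array accesses
    no autovivification;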

2.80s down to 1.67s on a slower PC with N=50,000.

Compiled to *(perlcc /2a)*:

    PUSHs(AvARRAY(PL_curpad[6])[0]);
    PUSHs(AvARRAY(PL_curpad[6])[1]);
    rnv0 = POPn; lnv0 = POPn; /* multiply */
    d30_tmp = rnv0 * lnv0;

Without the superfluous PUSH/POP pairs I suspect another 2x win. But this
is not implemented yet. With typed arrays maybe another 50% win, and we would not
need the no-autovivification overhead.

It should look like *(perlcc /2b)*:

    rnv0 = SvNV(AvARRAY(PL_curpad[6])[0]);
    lnv0 = SvNV(AvARRAY(PL_curpad[6])[1]);
    d30_tmp = rnv0 * lnv0; /* multiply */

I am currently implementing the check for the 'no autovivification' pragma and
the stack optimizations.

Summary
-------

[u64q nbody](http://shootout.alioth.debian.org/u64q/performance.php?test=nbody)

Original numbers with N=50,000,000:

* Fortran 14.09s
* C 20.72s
* Go 32.11s
* SBCL 42.75s
* JRuby 8m
* PHP 11m
* Python 3 16m
* Perl 23m
* Ruby 1.9 26m

My numbers with N=50,000,000:

* Perl 22m14s
* Perl 1 21m48s (inlined sub advance, no ENTERSUB/LEAVESUB)
* perlcc 9m52s
* Perl 2 14m13s (unrolled loop + AELEM => AELEMFAST)
* perlcc 2 7m11s
* perlcc 2a 4m52s (no autovivification, 4.5x faster)
* perlcc 2b ? (~2m30s) (no autovivification + stack opt)
* perlcc 2c ? (~1m25s) (typed arrays + stack opt)

Continued in [part 4](http://blogs.perl.org/users/rurban/2012/10/optimizing-compiler-benchmarks-part-4.html)
with stack and nextstate optimizations. But I have only reached 3m30s, not 2m30s, so far.