In the [first part](http://blogs.perl.org/users/rurban/2012/09/optimizing-compiler-benchmarks-part-1.html) I showed some problems and possibilities of the B::C compiler and the B::CC optimizing compiler with an example which was very hard to optimize, and promised an improvement for the next day regarding "stack smashing": avoiding the copy overhead between the compiler stacks and the global perl data.
But in the next days I went to Austin to meet with the [perl11.org](http://perl11.org/) group, which has an optimizing compiler for perl5 as one of its goals, and wants to be able to replace all three main parts of perl at will: the parser, the compiler/optimizer and the vm (the runtime). You can do most of this already, esp. replace the runloop, but the 3 parts are too intermingled and undocumented.
So I discussed the "stack smashing" problem with Will and my ideas for a solution.
1. The "stack smashing" problem
B::CC keeps two internal stacks to be able to optimize arithmetic and boolean operations on numbers: integers (IV) and doubles (NV).
The first stack, called "stack", keeps the perl operand stack, the arguments for each push/pop runtime pp\_* function.
The perl AST, called the optree, is a bit dysfunctional, as it is not able to generate variants optimized for the operand types. There are integer-optimized versions, used with "use integer", e.g. the i\_add variant of the add operator, which are used if both operands are known at compile-time to be integers or integer constants. There are no variants for strict NV and, most importantly, there are no variants for degenerated arguments, i.e. arguments with magic. Because you can add magic at run-time, which the compiler (op.c) does not know about (*which is super lame*), all argument types need to be checked at run-time, and you always have to take the slow general path.
This B::CC stack, implemented in B::Stackobj, keeps track of the types during their lifetime and can optimize and pessimize the used type of each stack variable. It can especially optimize on bool, int and double, needs to exchange values on stringification, and pessimizes on magic, esp. tie (on a variable) and overload (on an op).
There is no nice picture in [illguts](http://search.cpan.org/dist/illguts/) describing operand stacks.
The second stack holds lexicals. For all perl lexical variables the same is done: the B::Stackobj::Padsv lexical stack mimics every function's PADLIST. Contrary to the operand stack, access to this list is already optimized at compile-time by the compiler (op.c) via the PL_curpad array, since lexical variables are fully known at compile-time. So each op knows the exact index into the padlist, stored in the fields "targ" or "padix". The types are as dynamic as on the stack, but there is an additional list, the comppadnames, which holds optional type information, essentially a pointer to the package name for each typed variable. It is just not used yet.
See illguts for a [picture](http://cpansearch.perl.org/src/RURBAN/illguts-0.42/index.html#stack).
B::Stackobj::Padsv can use the perl type information for the packages int and double, but not yet for number and bool, nor for CTypes or Moose types. So `my int $i` already specifies an optimized integer, as an IV within the scope of `use integer`.
Specifying types via perl attributes, like `my $i :int;`, would also be nice, but is technically impossible, since there is no compile-time hook to check attributes; they are only checked at run-time.
`B::CC` keeps a list of good ops (pp\_ functions), for which the types, or at least the number of arguments and return values, are known at compile time. The same information is also available in the [Opcodes](http://search.cpan.org/dist/Opcodes/) module on CPAN.
Defined are op lists like no\_stack, skip\_stack, skip\_lexicals, skip\_invalidate and need\_curcop, plus lists for various other predefined op properties needed by other optimizers or the Jit module. Does the op branch, i.e. will it always return op->next or not?
Does it need to call PERL\_ASYNC\_CHECK?
On all unknown ops, and on ops which need to access lexicals, the current internal B::Stackobj::Padsv lexical stack needs to be refreshed, i.e. written back from the internal compiler stack to the actual values on the heap, which is `PL_curpad[ix]` (sub write\_back\_lexicals). The same must be done for all stack variables which need to be accessed by the next op (sub write\_back\_stack); just not for ops which do not access stack variables.
So there is a lot of theoretical copying - "stack smashing" - going on.
But B::CC is cleverly keeping track of the stack areas which need to be written back, so in practice only the really needed values are handled.
In practice only numeric and boolean optimizations operate on private c variables on the C stack, rather than on referenced heap values, either on the perl stack or in the curpad.
So only for fully typed numbers do we need to copy the values back and forth, before and after, as B::CC inlines most of these ops.
I'll take a benchmark which is very slow in perl compared to other scripting languages, and which does a lot of arithmetic, because I expect the B::CC type optimizer to kick in, unboxing all the numbers and inlining the arithmetic ops.
So we get a **2x faster run-time**, with slightly different results and a lot of interesting command line options.
**--time** prints the B::CC time as 'c time', the gcc and ld time as 'cc time', and the run-time as 'r time': 0.625202s vs. 1.305s in pure perl. Even gcc plus ld with -OS is faster than perl. And B::CC's optimizer is also really fast here in this simple example.
**-r** runs the compiled program with the rest of the perlcc arguments
**-O** compiles with B::CC, not with the non-optimizing B::C
**-S** keeps the C source, to be able to inspect the generated optimized C code.
**-O1** adds some minor B::CC optimizations. -O2 is pretty unstable yet, and B::CC proper (-O0) already adds all B::C -O3 optimizations.
**--Wb** defines further B::CC options
**-fno-destruct** is a B::C option to skip optree destruction at the very end. It still does thread, IO and object destruction in proper order, and of course does object destruction during run-time, but we do not care about memory leaks with normal executables. Process termination handles them better and faster than perl. Even daemons are safe to compile with -fno-destruct, just not shared libraries.
**-U** marks packages as **unused**. warnings, B and Carp are notorious compiler packages, which are innocently pulled in even if you do not use or call them.
B is used by the compiler itself, and since the B maintainer does a terrible job helping the B compiler modules, we have to manually force B out of our way. warnings and Carp are also magically pulled in by some dependent core modules and cause a lot of startup and memory overhead. These 3 packages are easily skipped for simple programs or benchmarks; in the real world you have to live with multi-megabyte compiled programs. This reflects the reality of the memory perl uses during run-time.
E.g. without -U the numbers are:
```
Prescan 1 packages for unused subs in main::
Saving unused subs in main::
old unused: 1, new: 1
no %SIG in BEGIN block
Total number of OPs processed: 193
NULLOP count: 0
bootstrapping DynaLoader added to xs_init
no dl_init for B, not marked
script/perlcc: c time: 0.192175
script/perlcc: cc time: 1.3049
script/perlcc: r time: 0.642252
```
**-DOscpSqlm** are some debugging options, which add interesting information to the generated C code. B::CC adds debugging output as comments into the C code, to be able to inspect the optimizer results; B::C prints its debugging output to STDOUT.
Let's have a look into the generated C code.
```perl
# The Computer Language Shootout
# contributed by Christoph Bauer
# converted into Perl by Márton Papp
# fixed and cleaned up by Danny Sauer
# optimized by Jesse Millikan
use constant PI => 3.141592653589793;
use constant SOLAR_MASS => (4 * PI * PI);
use constant DAYS_PER_YEAR => 365.24;
# Globals for arrays... Oh well.
# Almost every iteration is a range, so I keep the last index rather than a count.
my (@xs, @ys, @zs, @vxs, @vys, @vzs, @mass, $last);

my ($dt) = @_;
my ($mm, $mm2, $j, $dx, $dy, $dz, $distance, $mag);
```