|
DesignGuideIntro
blacc Design Guide
Phase-Design IntroductionThere is fairly little excuse for writing a parser generator in this day and age. For real production work, yacc and lex (and their free, open-source descendants) have been "good enough" for decades. Throw in some support for non-C/C++ languages, and surely a handful of parser generators can reasonably satisfy 99.9% of the world's needs. Yet a quick Google search shows an astounding number of extant parser generators of all shapes, sizes, and colors. Even the corpses of many long-dead generators have been revived enough to be posted on the Internet. One wonders if there may not be more authors of parser generators than there are actual users. So the world surely does not need one more. And yet, here we are, so what little excuse there is, I must try to make. Motivation/HistoryI first wrote my own LL(1) parser generator some decades ago. Like many before me, I thought the sensible thing was to start with an LL(1) parser and then tweak to make up for its inherent annoyances (awful expression grammars, lack of left recursion, etc.). I wasn't terribly happy with the result (nor did it contain anything that could be viewed as an innovation). But it gave me a basis for thinking about the problem on and off again over the years. I started a new run at the problem by first just imagining how I would like to specify operators in a grammar without adding too much cruft to the basic yacc syntax. That eventually led me to adopt a syntax stolen from my dog-eared copy of the wonderfully concise Standard C quick reference guide, by P.J. Plauger & Jim Brodie. The syntax I envisioned let me have all operator precedence and associativity spelled out concisely in one place (even in the face of operator homonyms, such as unary versus binary '-'), at the modest of expense of a percentage of duplication in the actual grammar (though no duplication of precedence/associativity information). The second problem was the desire to specify expressions naturally in the grammar, which necessarily involves both ambiguity and left-recursion, a problem for the LL(1) algorithm. Initially, I reached for (and implemented!) operator precedence parsing to handle just the expression part of the grammar (a very old solution, though not one that survives in most LL(1) parser generators today). However, as I explored just how much I could push the basic operator precedence algorithm and still hope for correctness in the generated code, I began to tinker with using LR parsing to solve the problem that operator precedence only identifies a handle, not a specific production. Along about this time, Intel CPUs stopped getting faster, and instead began turning to increasing the number of available cores. They had already left memory speeds so far behind that cache misses tended to wreck any hope that single-threaded software would run noticeably faster on a new machine than an old one. The more things change, the more they remain the same, and I began to appreciate that designing code to be tiny could come back into vogue -- if your code fits completely into L1 cache, it can run really fast. And even though Intel CPU clock speeds have topped out, staying within L1 cache is also a great advantage for multi-CPU code, as the hardware has to do horrible (and slow) things to synchronize memory between multiple processors. These three things (operator precedence, LR parsing, generating tiny parsers) came together in my mind to make me see that it might be possible to greatly reduce the space required for parsing expressions by treating that part of the grammar as LR(0), and using operator precedence tables to resolve ambiguities. In terms of conciseness, this has the fairly pleasing effect of allowing one to handle a very large expression syntax (such as seen in C/C++) with a tiny grammar like this: /*...*/
%left X++ X-- X[X] X.X X->X
%right ++X --X &X *X +X -X ~X !X
%left X*X X/X X%X
%left X+X X-X
/* and so on */
%%
/*...*/
E >: E @ E
| @ E
| E @
| '(' E ')'
/* and so on */
;The "@" is a reference to "any operator that fits here from your previous declaration of precedence/associativity". In this style, the user provides, for example, a single action that handles all binary operators, using the token identified with the operator itself to discriminate the cases. This seems to me to be a nearly minimal syntax for specifying typical operator portions of a grammar so as to combine precedence/associativity information with the standard BNF syntax. By constructing multiple operator precedence tables (two sets of f()/g() would suffice for most grammars) and combining with the LR(0) construction, operator homonyms (e.g., unary '-' versus binary '-') can be automatically handled without any further tweaking by the user, apart from the original operator precedence/associativity declarations. Of course, generating code for three different parsing algorithms is at odds with the goal of minimizing the generated code size. This last unpleasantness was removed when I realized that I could generate a parser based on a byte-code interpreter rather than table lookup/search. This is another old solution from the days when code size mattered; see the venerable Coco project, as well documented in A Compiler Generator for Microcomputers, by Rechenberg and Mössenböck. By generating byte-codes for an interpreter, supporting multiple parsing algorithms became a matter of slightly expanding the number of opcodes. This also appears to be an easier structure (for me) to understand and inspect. A look through the generated code of a yacc parser shows you the ugliness that tends to crop up once the basic parser algorithm has to be tweaked to handle a half-dozen real-life issues (error recovery, talking to the lexer, table compression schemes, etc.). If the generated parser plus user actions can fit in L1 cache, the loss of speed due to using an interpreter instead of table lookup might be more than made up for. YMMV, etc. One additional matter was that of left recursion. There are ancient and standard algorithms for removing left recursion from grammars to make them LL(1). However, the problem still annoyed me. You can remove the left recursion. You can make a grammar that reduces the recursive items either forward or backward. But you can't get rid of the fact that you've ended up with a right-recursive grammar that needlessly chews up parser stack space. The classic gotcha is array initialization. It is not at all unheard of for people to write programs to generate enormous static array initializations in languages such as C. For example, one might write a program that reads a bitmap and then generates its data in the form of a C array. In this case, a right-recursive parse either overflows the parsing stack, or (if the parsing stack grows dynamically) uses up a huge amount of parsing stack for no good reason. I eventually realized (duh!) that the issue is simply one of optimizing out tail recursion, replacing a CALL with a GOTO, effectively. This is an optimization that can be applied to any right-recursive production if there is no action following the right-most symbol. In right-recursive situations resulting from transforming out left recursion, that will always be the case. Modularizing Code GenerationOne reason for the proliferation of home-grown parser generators is simply a desire to control the code generated for the parser. Often these tools are using quite standard algorithms (LL, LR, LALR, etc.), but primarily differ in how they generate code. For example, they may support a different output language, provide different error recovery, support a different interface to the lexer, etc. A parser generator should really modularize the code generation phase so that modifying (or completely changing) the generated code requires much less effort than writing your own parser generator from scratch. Degeneralizing ToolsOne peripheral motivation for tinkering with blacc came from my experience with periodically (re)writing my own make tools. The original Unix make was a highly general dependency processor, inherently independent of any particular tool (compiler, linker, etc.). This was a good choice at the start of a tool's evolution, but practice showed that 99% of make's usage was with compilers, linkers, and librarians, and mostly C. So, over time, the attractiveness of keeping make's design a safe modular distance from being tied to particular tools became less attractive. For example, from a conceptual viewpoint, it was better to insist that C compilers should all offer the ability to generate header file dependency lists for use with make, since they are in the best position to correctly identify the dependencies. In practice, however, programmers wanted make to "just work" in the overwhelmingly common case of building C programs, and now most modern build utilities simply scan C source code themselves to automatically generate dependencies rather than leaving the user at the mercy of each particular C toolchain. Likewise, for parser generators yacc's LALR(1) algorithm is fairly general and efficient. But a whole lot of parsers that get generated are for C-like languages that have many things in common (rich expression syntaxes, for example), and yacc does little to accomodate such common cases (e.g., the advice for expressions requires at least some understanding of LR algorithms to follow, even though most expression syntaxes are highly similar). Just as build utilities have evolved towards less general features that address very common usage situations, I think there is some room for parser generators to do the same. In particular, since most languages I am interested in implementing will have a C-like expression syntax, I want it to be dead simple and compact to specify that part of the grammar. SummaryThus, blacc fills an unmet need for tiny parsers. That's my story and I'm sticking to it. In any case, these ideas gave me a pleasing enough design that I was willing to invest the time to try to build it. I would claim that some of these ideas are novel, but that is always a dangerous claim to make in the parser generator field. So many people have written so many of them that there's no way to ascertain that someone else hasn't had an identical idea (no matter how outlandish it is) before you. | |