What's new? | Help | Directory | Sign in
Google
textareaplus
A browser compatible JavaScript based text editor
  
  
  
  
    
Search
for
Updated Mar 20, 2007 by drSlump80
Labels: Phase-Design
Lexer  
Documentation about the lexer

Introduction

As a mean to support advanced features like syntax highlighting, code completion, brace matching... there must be a way to analyze the text and split the contents into smaller units which offer a bit more of information. To perform this action we use a simple lexer which is able to scan the text and according to a set of rules divide it into smaller chunks known as lexems or tokens.

A lexer is theoretically composed of two parts: a scanner and a evaluator. However in this implementation we just use the scanner so leaving the evaluator part to be implemented in a higher level, for example, in a parser.

Due to the interpreted nature of Javascript and given that the lexer must be as fast as possible to be used in real time (interactively) we couldn't base it on traditional designs, which take a char at a time and see if it matches certain rules, implementing in fact a Finite State Machine. Instead we take advantage of Regular Expression which are a native Javascript object thus very fast compared to interpreted scanning algorithms.

A Regular Expression is a sequence of characters, control codes and mathematical operators which allow to describe a set of strings. The regular expression once interpreted by its engine is transformed into a Finite State Machine, with most modern engines (Perl compatible) able to do backtracking, allowing to describe pretty complex string constructs easily. We have explained above that a lexer scanner is typically implemented as a Finite State Machine, so we can handle a regular expression in a specific way to take advantage of that. Given that they are translated into machine code (compiled) they are the faster alternative for this job while at the same time are flexible enough for most cases.

How it works

Most regular expression engines support grouping in some way or another. The ones based on the Perl syntax are specially strong in this feature. The engines used in browsers support a feature known as capturing parenthesis which allow to describe a pattern inside an expression and return the actual match of that pattern. This is interesting because by abusing this feature we can build a Finite State Machine over the regular expression itself, thus making them much more flexible and powerful while retaining its speed as text scanners.

So lets put an example to clarify this technique. Lets say we want to scan a simple language like the one commonly used in .ini files, where we can have the following lexems: Ident, Value, Comment. This syntax has the following constraints:

Now that we have explained and understood the syntax we can create a regular expression for each one of the lexems.

If we used each one of those regular expressions against the text they would give up the results. So we need to build a system which makes all the checks at the same time and tells us which pattern matched and the actual text it matched. To accomplish this we use capturing parenthesis and the alternation operator '|'.

/([A-Za-z_]+)|(\=.*?$)|(^#.*?$)/

if we execute that expression against the text the regular expressions engine will return us an array with the results of the first match it founds. That array will contain the actual text matched in the 0 position. Moreover, since we have used capturing parenthesis it will have allocated the result of each one of them in that array starting at the index 1.

So having the following text:

foo = bar
#x = y

Applying the expression repeatedly we get the following results:

0 1 2 3
foo foo
= bar = bar
#x = y #x = y

For each match (row in the above table) we can associate it with a lexem. Take a look at the following code to see how it works.

// define the text to match against
src = 'foo = bar\n\
#x = y';

// define the regular expression being used (multi-line and global modifiers)
re = /([A-Za-z_]+)|(\=*?$)|(^#.*?$)/gm;

// define the lexems identifiers in the order their patterns are used
lexems = [ 'ident', 'value', 'comment' ];

// find all the matches
while ( match = re.exec(src) )
{
  // loop thru the defined patterns...
  for (i=0; i<lexems.length; i++)
  {
    //...to find the one which actually matched (the match array is 1 based)
    if (match[ 1+i ].length)
    {
      alert('Lexem "' + lexems[i] + '" found with text "' + match[ 1+i ] + '"');
    }
  }
}

That is the basic algorithm used to scan the text in the search of lexems. The actual code is a bit more complicated since we have the need for states, which are nothing more than to switch the set of patterns or rules when an specific lexem is found. This allows, for example, to parse properly PHP's double-quote strings which can have embedded variables in them.


Sign in to add a comment