Issue 17: Language tagging
Status:  Fixed
Owner:  mikesamuel
Closed:  Jul 2008
Type-Defect
Priority-Medium


Reported by mikesamuel, Aug 14, 2007
Creating this issue to spec out language tagging requirements.

The feature would allow some way of specifying the language of the block so
that it can be tokenized and highlighted appropriately.


CURRENTLY
=========
There are 2 lexers: one for C-style languages, and one for markup languages.

The C-style lexer does a decent job on some of the most commonly used
languages (incl. Python and Bash, but excl. Lisps and BASICs), and the
markup one handles XML, HTML, and various HTML-like templating languages.

The current lexing scheme allows descent into tokens with a different lexer.


REASONS FOR
===========
We do not handle other languages, notably VB, Perl, and OCaml, and cannot
without significant work.  Determining the language of a snippet is hard,
and guessing wrong would make the library less reliable and less useful
for the languages it currently supports.

The keyword list for C-style languages is a union of the keywords from all
the languages I've tested with.  It misidentifies some tokens as keywords,
e.g. "template", which is not a keyword in many languages.

Some languages (e.g. Java) have consistently observed naming conventions
that distinguish types, fields, locals, and constants.  Those conflict with
common naming conventions in other languages, e.g. C++.


REASONS AGAINST
===============
(1) Bloats code, due to keyword lists and code for languages that are not
used.  Could be mitigated by some kind of inheritance of definitions, or by
splitting into files.

(2) Complexity of install.  Mitigating (1) by splitting into multiple files
would make it harder to install.  Currently there is only one file to deal
with.

(3) Complexity of use.  Currently the API is very simple.  Could mitigate
by falling back to the existing behavior if no lang specified.


GOAL
====
Provide optional language tagging without bloating code.  Preference is
given to simplicity of use, so we will keep the single-file install.

DESIGN
======
The current scheme is complicated by the fact that we highlight around
tags, so that if the source includes links around class names, those are
preserved in the prettified output.

Instead of preserving those in stream as first-class tokens, we will
extract those out, keeping their position in the original stream so they
can be reinserted later.
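
A minimal sketch of the extract-and-reinsert idea, not the actual prettify
code.  The names extractTags and reinsertTags are hypothetical, and the tag
regex is a simplification that assumes no ">" appears inside attribute
values and ignores entity decoding.

```javascript
// Hypothetical helpers, for illustration only.
// Pull tags out of an HTML snippet, remembering where each one sat
// in the tag-free text so it can be put back later.
function extractTags(html) {
  var tags = [];               // entries of { tag, position }
  var text = '';
  var last = 0;
  var tagPattern = /<[^>]*>/g; // simplification: no ">" inside attributes
  var m;
  while ((m = tagPattern.exec(html)) !== null) {
    text += html.substring(last, m.index);
    // position is an offset into the tag-free text, not into html
    tags.push({ tag: m[0], position: text.length });
    last = m.index + m[0].length;
  }
  text += html.substring(last);
  return { text: text, tags: tags };
}

// Put the tags back at their recorded offsets in the tag-free text.
function reinsertTags(text, tags) {
  var out = '';
  var last = 0;
  for (var i = 0; i < tags.length; i++) {
    out += text.substring(last, tags[i].position) + tags[i].tag;
    last = tags[i].position;
  }
  return out + text.substring(last);
}

var extracted = extractTags('new <a href="Foo.html">Foo</a>();');
// extracted.text is the plain source that a lexer would see.
var restored = reinsertTags(extracted.text, extracted.tags);
```

A real implementation would also have to map the stored offsets onto the
token list, since the highlighted output contains extra markup around each
token.  The sketch skips that step.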

This will let us eliminate the current state machines which take a lot of
code, in favor of regular expressions.

We can inherit keyword lists by using one keyword list as the prototype of
another.
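
To illustrate keyword inheritance through the prototype chain, here is a
small sketch.  The keyword sets are hypothetical and deliberately tiny, and
Object.create is modern JavaScript, used here for brevity rather than
because the library does it this way.

```javascript
// Hypothetical base keyword set shared by several C-style languages.
var C_KEYWORDS = {
  'if': true, 'else': true, 'for': true, 'while': true, 'return': true
};

// C_KEYWORDS becomes the prototype of CPP_KEYWORDS, so every base
// keyword is visible through the chain without being copied.
var CPP_KEYWORDS = Object.create(C_KEYWORDS);
CPP_KEYWORDS['template'] = true;
CPP_KEYWORDS['class'] = true;

// Comparing against true keeps inherited Object.prototype members
// such as "toString" or "constructor" from counting as keywords.
function isKeyword(keywords, token) {
  return keywords[token] === true;
}

var inCpp = isKeyword(CPP_KEYWORDS, 'template'); // own property of the C++ set
var inC = isKeyword(C_KEYWORDS, 'template');     // not present in the base set
```

With per-language sets like these, "template" is only treated as a keyword
for the languages that list it, and the shared keywords are stored once.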

Comment 1 by mikesamuel, Aug 14, 2007
Language tags should be easy to recognize and remember.

Since we use class="prettyprint" to identify regions to prettyprint, I suggest the
following convention

class="prettyprint"  -- make a best guess as to language
class="prettyprint lang-java"  -- do java prettyprinting

The "lang-" prefix is followed by the filename extension commonly used for source
files in that language.  This avoids problems with names such as C#, which is not a
valid HTML identifier.  We will use cc for C++ since it is an identifier, and more
commonly used than cpp or cxx.
Comment 2 by mikesamuel, Aug 15, 2007
To flesh out the high level design, the prettify loop will be changed to:
(1) Extract tags and store [tag, position-in-string]
(2) Use a regex based lexer to lex the string sans tags
(3) Run a classifier over tokens
(4) Merge tags back into token list and join tokens to produce html
from the current loop:
(1) Split into chunks of tags | text
(2) Split text chunks into tokens using a state machine over a character iterator
that unescapes entities lazily
(3) Join token list to produce html



This will cut out the hand-coded state machines that iterate over characters,
replacing them with the regex-based lexers from step (2) of the new loop.
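
As a rough sketch of what a regex-based lexer can look like, the code below
tries an ordered list of patterns at the current offset.  The function name
createRegexLexer, the rule table, and the token classes are all
hypothetical.  It relies on the sticky "y" regex flag, which needs a modern
JavaScript engine.

```javascript
// Hypothetical lexer factory, for illustration only. Each rule pairs a
// token class with a sticky regex; rules are tried in order at the
// current offset and the first match wins.
function createRegexLexer(rules) {
  return function lex(source) {
    var tokens = [];
    var pos = 0;
    while (pos < source.length) {
      var matched = false;
      for (var i = 0; i < rules.length; i++) {
        var re = rules[i].pattern;
        re.lastIndex = pos;      // sticky: match must start exactly here
        var m = re.exec(source);
        if (m && m[0].length > 0) {
          tokens.push({ type: rules[i].type, text: m[0] });
          pos += m[0].length;
          matched = true;
          break;
        }
      }
      if (!matched) {
        // No rule applies: emit one character as plain text and move on.
        tokens.push({ type: 'plain', text: source.charAt(pos) });
        pos++;
      }
    }
    return tokens;
  };
}

// A deliberately small rule set for C-style source.
var lexCStyle = createRegexLexer([
  { type: 'comment', pattern: /\/\/[^\n]*|\/\*[\s\S]*?\*\//y },
  { type: 'string',  pattern: /"(?:[^"\\\n]|\\.)*"/y },
  { type: 'number',  pattern: /\d+(?:\.\d+)?/y },
  { type: 'word',    pattern: /[A-Za-z_$][\w$]*/y },
  { type: 'space',   pattern: /\s+/y }
]);

var tokens = lexCStyle('x = 42; // hi');
```

Because every character ends up in exactly one token, joining the token
texts gives back the original source, which is what the tag reinsertion
step depends on.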

We can then define a language handler as a { lexer, classifier } pair.

Define a language handler for C-style langs and one for markup langs to stay
backwards compatible.

Modify the main prettify function to look for a lang-\w+ class, and, if present,
choose the appropriate lexer.
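
A sketch of how the lookup could work, assuming a registry keyed by file
extension.  Everything here is hypothetical: the names registerHandler,
chooseHandler, and guessHandler, the stub handlers, the extension lists
other than cc and java, and the "starts with <" guess used when no
recognized lang- class is present.

```javascript
// Hypothetical registry mapping file extensions to language handlers.
var handlers = {};

function registerHandler(handler, extensions) {
  for (var i = 0; i < extensions.length; i++) {
    handlers[extensions[i]] = handler;
  }
}

// Stub { lexer, classifier } pair; a real handler would tokenize and
// classify properly. The name field exists only for this demo.
function makeStubHandler(name) {
  return {
    name: name,
    lexer: function (source) { return [source]; },
    classifier: function (tokens) {
      return tokens.map(function (t) { return { type: 'plain', text: t }; });
    }
  };
}

var cStyleHandler = makeStubHandler('c-style');
var markupHandler = makeStubHandler('markup');
registerHandler(cStyleHandler, ['c', 'cc', 'java', 'js', 'py', 'sh']);
registerHandler(markupHandler, ['html', 'xml']);

// Assumed fallback: treat source that starts with "<" as markup.
function guessHandler(source) {
  return /^\s*</.test(source) ? markupHandler : cStyleHandler;
}

// Look for a lang-<ext> class; fall back to guessing when the class is
// absent or names an extension with no registered handler.
function chooseHandler(className, source) {
  var m = /(?:^|\s)lang-(\w+)(?=\s|$)/.exec(className);
  if (m && Object.prototype.hasOwnProperty.call(handlers, m[1])) {
    return handlers[m[1]];
  }
  return guessHandler(source);
}

var tagged = chooseHandler('prettyprint lang-java', 'int x = 1;');
var untagged = chooseHandler('prettyprint', '<p>hi</p>');
```

Falling back to the guess for an unknown extension keeps existing pages
working unchanged, which matches the backwards-compatibility aim above.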

Implement a Lisp/Scheme lexer to demonstrate that new handlers can be added, and document how.

Implement other lexers as demanded.
Comment 3 by mikesamuel, Aug 31, 2007
Finished rewriting the existing lexers to use PR_createSimpleLexer which is regexp based.
Comment 4 by partdavid, Feb 07, 2008
I realize this would be an entirely different thing, but what about taking advantage
of a library of pre-written syntax highlighting rules, like VIM's? The
syntax-defining commands aren't that complicated. (Well, they don't seem to be, what
do I know?)
Comment 5 by mikesamuel, Jul 04, 2008
@r38
Status: Fixed