Creating this issue to spec out language tagging requirements.
The feature would provide a way to specify the language of a code block so
that it can be tokenized and highlighted appropriately.
CURRENTLY
=========
There are 2 lexers: one for C-style languages, and one for markup languages.
The C-style lexer does a decent job on some of the most commonly used
languages (incl. Python and Bash, but excl. Lisps and BASICs), and the
markup one handles XML, HTML, and various HTML-like templating languages.
The current lexing scheme allows descent into tokens with a different lexer.
REASONS FOR
===========
We do not handle other languages, notably VB, Perl, and OCaml, and cannot
without significant work. Determining the language of a snippet is hard,
and guessing wrong would make the library less reliable and less useful
for the languages it currently supports.
The keyword list for C-style languages is a union of the keywords from all
the languages I've tested with. It misidentifies some tokens as keywords,
e.g. "template", which is not a keyword in many languages.
Some languages (e.g. Java) have consistently observed naming conventions
that distinguish types, fields, locals, and constants. These conflict with
naming conventions common in other languages, e.g. C++.
REASONS AGAINST
===============
(1) Bloats code, due to keyword lists and code for languages that are not
used. Could be mitigated by some kind of inheritance of definitions, or by
splitting into files.
(2) Complexity of install. Mitigating (1) by splitting into multiple files
would make it harder to install. Currently there is only one file to deal
with.
(3) Complexity of use. Currently the API is very simple. Could be mitigated
by falling back to the existing behavior if no language is specified.
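The fallback in (3) could look roughly like the sketch below. All names here
(LEXERS, registerLexer, guessLexer, highlight) are hypothetical and are not
taken from the existing code. The two stub lexers and the first-character
guess only stand in for whatever the library does today.

```javascript
// Hypothetical registry mapping a language tag to a lexer function.
// The two default entries are stubs standing in for the existing lexers.
var LEXERS = {
  'default-code': function (src) { return 'c-style:' + src; },
  'default-markup': function (src) { return 'markup:' + src; }
};

// Register one lexer under one or more language tags.
function registerLexer(lexer, langs) {
  for (var i = 0; i < langs.length; i++) {
    LEXERS[langs[i]] = lexer;
  }
}

// Assumed stand-in for the existing behavior: treat input that starts
// with '<' as markup and everything else as C-style code.
function guessLexer(src) {
  return /^\s*</.test(src) ? LEXERS['default-markup'] : LEXERS['default-code'];
}

// optLang is optional. A missing or unknown tag falls back to guessing,
// so callers that never pass a language see no change.
function highlight(src, optLang) {
  var known = Object.prototype.hasOwnProperty.call(LEXERS, optLang);
  var lexer = known ? LEXERS[optLang] : guessLexer(src);
  return lexer(src);
}
```

Existing callers would keep calling highlight(src) unchanged, and only
callers that want a specific language would pass the second argument.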
GOAL
====
Provide optional language tagging without bloating code. Preference is
given to simplicity of use, so we will retain the single-file install
property.
DESIGN
======
The current scheme is complicated by the fact that we highlight around
tags, so that if the source includes links around class names, those are
preserved in the prettified output.
Instead of preserving those in the stream as first-class tokens, we will
extract them, recording their positions in the original stream so they
can be reinserted later.
This will let us eliminate the current state machines, which take a lot of
code, in favor of regular expressions.
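A minimal sketch of the extract-and-reinsert idea follows. The function
names extractTags and reinsertTags are hypothetical. The sketch assumes that
a literal '<' in the source is entity-escaped and that tags contain no '>'
inside attribute values, and it does not decode entities.

```javascript
// Split HTML into plain text plus a list of [offset, tag] pairs, where
// offset is the position in the plain text at which the tag occurred.
function extractTags(html) {
  var text = '';
  var tags = [];
  // Match a whole tag, a run of non-tag text, or a stray '<'.
  var parts = html.match(/<[^>]*>|[^<]+|</g) || [];
  for (var i = 0; i < parts.length; i++) {
    var part = parts[i];
    if (part.length > 1 && part.charAt(0) === '<') {
      tags.push([text.length, part]);
    } else {
      text += part;  // ordinary text, or a stray '<' kept as text
    }
  }
  return { text: text, tags: tags };
}

// Put the extracted tags back at their recorded offsets.
function reinsertTags(text, tags) {
  var out = '';
  var pos = 0;
  for (var i = 0; i < tags.length; i++) {
    out += text.substring(pos, tags[i][0]) + tags[i][1];
    pos = tags[i][0];
  }
  return out + text.substring(pos);
}

// With the link set aside, a plain regular expression sees "new Foo()".
var extracted = extractTags('new <a href="Foo.html">Foo</a>()');
var words = extracted.text.match(/\w+/g);
```

In a full implementation the highlighter's own markup would be merged in
during the same reinsertion pass. That step is omitted here.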
We can inherit keyword lists by using one keyword list as the prototype of
another.
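As a rough illustration of prototype-based keyword inheritance, assuming
keyword lists are stored as objects keyed by keyword: the list names and
the isKeyword helper are hypothetical, and the keyword sets are abbreviated
examples, not complete lists.

```javascript
// Shared base list (abbreviated for illustration).
var C_KEYWORDS = {
  'if': true, 'else': true, 'for': true, 'while': true, 'return': true
};

// Object.create makes C_KEYWORDS the prototype of the new object, so
// lookups for shared keywords fall through to it without copying.
var CPP_KEYWORDS = Object.create(C_KEYWORDS);
CPP_KEYWORDS['template'] = true;
CPP_KEYWORDS['class'] = true;

var JAVA_KEYWORDS = Object.create(C_KEYWORDS);
JAVA_KEYWORDS['class'] = true;
JAVA_KEYWORDS['final'] = true;

// Compare against true so that properties inherited from Object.prototype
// (e.g. "constructor") are not mistaken for keywords.
function isKeyword(keywords, token) {
  return keywords[token] === true;
}
```

Each derived list stores only its own additions, which keeps per-language
definitions small, and "template" is a keyword only where it is declared.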