Add support for expanding HTML5 named character references #116

icedtoast · 2016-09-20T03:28:11Z

Add support for expanding the HTML5 named character references, for example  

The full list can be found here: https://www.w3.org/TR/html5/syntax.html#named-character-references
and a JSON form can also be found on that page.


zeux · 2016-09-20T07:58:45Z

This is outside the scope of pugixml and will not happen. If you need to support HTML5 character references, I recommend pre- or post-processing the data. Pre-processing means expanding the HTML5 references in the document before parsing. Post-processing means expanding the individual value strings as you read them from the document after parsing; to do this correctly, parse with escape processing disabled (that is, without the parse_escapes flag) so that you can unescape the values manually.
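A minimal sketch of the post-processing route, assuming the values have been read from a document parsed with escape processing turned off. `expand_named_refs` and `sample_table` are hypothetical helpers written for illustration, not part of pugixml, and the table holds only a handful of entries rather than the full HTML5 list. Numeric references such as `&#160;` and unknown names are left untouched here, so a real post-processor would need to handle those as well.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical sample table: the five XML built-ins plus a few HTML5 names.
// Values are UTF-8 byte sequences.
const std::unordered_map<std::string, std::string>& sample_table()
{
    static const std::unordered_map<std::string, std::string> table = {
        {"amp", "&"}, {"lt", "<"}, {"gt", ">"}, {"quot", "\""}, {"apos", "'"},
        {"nbsp", "\xC2\xA0"},                 // U+00A0
        {"copy", "\xC2\xA9"},                 // U+00A9
        {"nLt", "\xE2\x89\xAA\xE2\x83\x92"},  // U+226A U+20D2
    };
    return table;
}

// Hypothetical helper: expands every "&name;" whose name is in `table`.
// Anything else (numeric references, unknown names, stray '&') is copied as is.
std::string expand_named_refs(const std::string& text,
                              const std::unordered_map<std::string, std::string>& table)
{
    std::string out;
    out.reserve(text.size());

    std::size_t pos = 0;
    while (pos < text.size())
    {
        std::size_t amp = text.find('&', pos);
        if (amp == std::string::npos)
        {
            out.append(text, pos, std::string::npos);
            break;
        }

        // Copy the literal run that precedes the '&'.
        out.append(text, pos, amp - pos);

        std::size_t semi = text.find(';', amp + 1);
        if (semi != std::string::npos)
        {
            auto it = table.find(text.substr(amp + 1, semi - amp - 1));
            if (it != table.end())
            {
                // Known reference: emit the expansion and skip past ';'.
                out.append(it->second);
                pos = semi + 1;
                continue;
            }
        }

        // Not a known reference: keep the '&' and continue after it.
        out.push_back('&');
        pos = amp + 1;
    }

    return out;
}
```

With pugixml, the input strings would presumably come from a document loaded with options such as `pugi::parse_default & ~pugi::parse_escapes`, with each value then passed through a function like the one above. Because the output is a new `std::string`, the expansion is free to be longer than the original reference.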

There are three issues with supporting this:

1. pugixml is not an HTML parser; a number of features required to correctly parse HTML documents are missing, as outlined in Allow parsing of html #106. The value of adding named reference expansion specifically is probably lower than that of any other feature, because its absence is easy to work around.
2. The HTML character reference table is HUGE. The pugixml parser core (sufficient to parse an XML document and pretty-print it to stdout) compiles to ~40 kB on Linux x64. I tried to make a reasonably size-efficient encoding that can be searched quickly, and the tables alone add up to ~67 kB. This is a non-starter for a feature that's compiled in. Obviously it can be compiled out by default, but at that point a pre-processor works just as well.
3. pugixml has a fundamental design assumption that the document can be parsed in place. For this assumption to hold, any reference expansion has to produce a byte sequence that is no longer than the input. Two named references from the HTML table, &nLt; and &nGt;, are 5 bytes long as written but expand to 6 bytes when encoded using UTF-8, which makes it impossible to parse some documents containing these references.
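The in-place constraint in the last point comes down to a byte-length comparison. The sketch below uses a hypothetical helper, `fits_in_place`, which is not pugixml code: it reports whether a UTF-8 expansion fits in the bytes occupied by the reference it replaces.

```cpp
#include <cstring>

// Hypothetical helper: true when the expansion is no longer than the
// reference text, so it can overwrite the reference inside the same buffer.
bool fits_in_place(const char* reference, const char* utf8_expansion)
{
    return std::strlen(utf8_expansion) <= std::strlen(reference);
}

// Sample data for illustration (UTF-8 byte sequences):
//   &amp;  (5 bytes) -> "&"              (1 byte)
//   &nbsp; (6 bytes) -> U+00A0           (2 bytes)
//   &nLt;  (5 bytes) -> U+226A U+20D2    (6 bytes)
//   &nGt;  (5 bytes) -> U+226B U+20D2    (6 bytes)
const char* const kNLtUtf8 = "\xE2\x89\xAA\xE2\x83\x92";
const char* const kNGtUtf8 = "\xE2\x89\xAB\xE2\x83\x92";
```

Calling `fits_in_place("&nLt;", kNLtUtf8)` returns false, while the same check for `&amp;` or `&nbsp;` returns true. A parser that rewrites its buffer in place has nowhere to put that extra byte.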

zeux · 2016-09-20T08:21:03Z

Here's the aforementioned size-efficient encoding: https://gist.github.com/zeux/0c12a521b1c10637a2179126f1688782

This is not necessarily very performant; it'd probably be beneficial to replace the first phase lookup with a switch. Regardless, the tables are too big, and two of the references can't even be parsed in all cases.

zeux closed this as completed Sep 20, 2016

zeux added enhancement wontfix labels Sep 20, 2016
