Add support for expanding HTML5 named character references #116

Closed
icedtoast opened this issue Sep 20, 2016 · 2 comments

Comments

@icedtoast

Add support for expanding the HTML5 named character references, for example  

The full list can be found here: https://www.w3.org/TR/html5/syntax.html#named-character-references
and a JSON form can also be found on that page.

@zeux
Owner

zeux commented Sep 20, 2016

This is outside the scope of pugixml and will not happen. If you need to support HTML5 character references, I recommend pre- or post-processing the data. Pre-processing means expanding the HTML5 references in the document before parsing. Post-processing means expanding the individual value strings as you read them from the document after parsing; to do this correctly, parse with escape expansion disabled (that is, without the parse_escapes flag) so that you can unescape the values manually.
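As a rough illustration of the post-processing route, here is a minimal sketch that expands named references in a single value string. It uses only the standard library; the function name `expand_named_refs` and the tiny table are hypothetical and not part of pugixml. A real table would need all of the HTML5 entries (over two thousand), and numeric references such as `&#160;` are not handled here.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <unordered_map>

// Hypothetical, deliberately tiny table; values are UTF-8 byte sequences.
// The real HTML5 list is far larger.
static const std::unordered_map<std::string_view, std::string_view>& named_refs() {
    static const std::unordered_map<std::string_view, std::string_view> table = {
        {"amp", "&"},
        {"lt", "<"},
        {"gt", ">"},
        {"quot", "\""},
        {"apos", "'"},
        {"nbsp", "\xC2\xA0"},                 // U+00A0
        {"copy", "\xC2\xA9"},                 // U+00A9
        {"nLt", "\xE2\x89\xAA\xE2\x83\x92"},  // U+226A U+20D2
    };
    return table;
}

// Expands "&name;" sequences found in the table. Unknown references and
// stray ampersands are copied through unchanged. Expanded text is never
// rescanned, so "&amp;nbsp;" becomes the literal text "&nbsp;".
std::string expand_named_refs(std::string_view in) {
    std::string out;
    out.reserve(in.size());
    std::size_t i = 0;
    while (i < in.size()) {
        if (in[i] == '&') {
            std::size_t semi = in.find(';', i + 1);
            if (semi != std::string_view::npos) {
                auto it = named_refs().find(in.substr(i + 1, semi - i - 1));
                if (it != named_refs().end()) {
                    out.append(it->second);
                    i = semi + 1;
                    continue;
                }
            }
        }
        out.push_back(in[i++]);
    }
    return out;
}
```

The idea would be to call something like this on each attribute or text value read from a document that was parsed with escape expansion turned off. Because the output is a new `std::string`, it does not matter whether an expansion is longer than the reference it replaces.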

There are three issues with supporting this:

  1. pugixml is not an HTML parser; a number of features required to correctly parse HTML documents are missing, as outlined in "Allow parsing of html" (#106). Adding named reference expansion specifically is probably less valuable than adding any of the other missing features, because its absence is easy to work around.
  2. The HTML character reference table is HUGE. pugixml parser core (sufficient to parse an XML document and pretty-print it to stdout) compiles to ~40 kB on Linux x64. I tried to make a reasonably size-efficient encoding that can be quickly searched through and the tables alone add up to ~67 kB. This is a non-starter as a feature that's compiled in. Obviously it can be compiled out by default, but at this point a pre-processor works just as well.
  3. pugixml has a fundamental design assumption that the document is parseable in place. For this assumption to hold, any reference expansion has to result in a byte sequence that is not longer than the input. Two named references in the HTML table, &nLt; (≪⃒) and &nGt; (≫⃒), are 5 bytes long in the source but expand to 6 bytes when encoded using UTF-8, which makes it impossible to parse some documents with these references.
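The byte-count argument in point 3 can be checked with a few lines of code. This is only a sketch of the size constraint, not pugixml's actual parsing logic; `fits_in_place` is a hypothetical helper name.

```cpp
#include <string_view>

// A reference written as "&name;" occupies name.size() + 2 bytes in the
// source buffer. In-place expansion only works if the UTF-8 expansion is
// no longer than that.
constexpr bool fits_in_place(std::string_view name, std::string_view utf8_expansion) {
    return utf8_expansion.size() <= name.size() + 2;
}

// "&amp;" is 5 bytes and expands to 1 byte.
static_assert(fits_in_place("amp", "&"));
// "&nbsp;" is 6 bytes and expands to U+00A0, 2 bytes in UTF-8.
static_assert(fits_in_place("nbsp", "\xC2\xA0"));
// "&nLt;" is 5 bytes but expands to U+226A U+20D2, 6 bytes in UTF-8.
static_assert(!fits_in_place("nLt", "\xE2\x89\xAA\xE2\x83\x92"));
// "&nGt;" is 5 bytes but expands to U+226B U+20D2, 6 bytes in UTF-8.
static_assert(!fits_in_place("nGt", "\xE2\x89\xAB\xE2\x83\x92"));
```

Each of the two code points takes 3 bytes in UTF-8, so the pair is one byte longer than the 5-byte reference it would have to overwrite.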

@zeux
Owner

zeux commented Sep 20, 2016

Here's the aforementioned size-efficient encoding: https://gist.github.com/zeux/0c12a521b1c10637a2179126f1688782

This is not necessarily very performant; it'd probably be beneficial to replace the first phase lookup with a switch. Regardless, the tables are too big, and two of the references can't even be parsed in all cases.
