This is outside the scope of pugixml and will not happen. I recommend pre- or post-processing the data if you need to support HTML5 character references. Pre-processing means expanding the HTML5 references in the document before parsing. Post-processing means expanding the individual value strings as you read them from the document after parsing; doing this correctly involves parsing with escape handling turned off (clearing parse_escapes from the parse flags, for example parse_default & ~parse_escapes) so that you can unescape the values manually.
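As a rough illustration of the post-processing route, here is a minimal sketch that expands named references in a single value string. Everything in it is an assumption made for the example: `expand_html_refs` is a hypothetical helper rather than anything pugixml provides, the table holds only a handful of entries instead of the full HTML5 list, and numeric references such as `&#169;` are left untouched.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical helper, not part of pugixml: expands a small illustrative
// subset of HTML5 named references in a value string (UTF-8 output).
// A real table would be generated from the full HTML5 reference list.
std::string expand_html_refs(const std::string& in)
{
    static const std::unordered_map<std::string, std::string> table = {
        {"amp", "&"},
        {"lt", "<"},
        {"gt", ">"},
        {"quot", "\""},
        {"apos", "'"},
        {"nbsp", "\xC2\xA0"},       // U+00A0
        {"copy", "\xC2\xA9"},       // U+00A9
        {"hellip", "\xE2\x80\xA6"}, // U+2026
    };

    std::string out;
    out.reserve(in.size());

    for (std::size_t i = 0; i < in.size(); )
    {
        if (in[i] == '&')
        {
            // Look for the terminating ';' and try the name in between.
            std::size_t end = in.find(';', i + 1);
            if (end != std::string::npos)
            {
                auto it = table.find(in.substr(i + 1, end - i - 1));
                if (it != table.end())
                {
                    out += it->second;
                    i = end + 1;
                    continue;
                }
            }
        }
        // Not a known reference: copy the character through unchanged.
        out += in[i++];
    }
    return out;
}
```

With escape handling disabled at parse time, a value read from a node or attribute could be passed through a function like this before use. The scan is single-pass, so `&amp;lt;` becomes `&lt;` and is not expanded a second time.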
There are three issues with supporting this:
1. pugixml is not an HTML parser; there are a number of missing features required to correctly parse HTML documents, as outlined in Allow parsing of html #106. The value of adding named reference expansion specifically is probably lower than that of any other feature, because its absence is easy to work around.
2. The HTML character reference table is HUGE. The pugixml parser core (sufficient to parse an XML document and pretty-print it to stdout) compiles to ~40 kB on Linux x64. I tried to make a reasonably size-efficient encoding that can be searched quickly, and the tables alone add up to ~67 kB. This is a non-starter as a feature that's compiled in. Obviously it can be compiled out by default, but at that point a pre-processor works just as well.
3. pugixml has a fundamental design assumption that the document can be parsed in place. For this assumption to hold, any reference expansion has to result in a byte sequence that is no longer than the input. Two named references in the HTML table, &nLt; (≪⃒) and &nGt; (≫⃒), are 5 bytes long as written but expand to 6 bytes when encoded as UTF-8, which makes it impossible to parse some documents containing these references.
The encoding I tried is not necessarily very performant; it would probably be beneficial to replace the first-phase lookup with a switch. Regardless, the tables are too big, and two of the references can't even be parsed in all cases.
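The in-place constraint can be made concrete with a small sketch that compares the byte length of a reference as written against the UTF-8 length of its expansion. The function names `utf8_length` and `fits_in_place` are hypothetical and only serve this illustration; they are not taken from pugixml.

```cpp
#include <cstddef>
#include <initializer_list>
#include <string>

// Number of bytes needed to encode one Unicode code point in UTF-8.
std::size_t utf8_length(char32_t cp)
{
    if (cp < 0x80) return 1;
    if (cp < 0x800) return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

// Hypothetical check: true if the expansion is no longer than the reference
// text itself, so it could overwrite the reference in the input buffer.
bool fits_in_place(const std::string& reference,
                   std::initializer_list<char32_t> expansion)
{
    std::size_t bytes = 0;
    for (char32_t cp : expansion)
        bytes += utf8_length(cp);
    return bytes <= reference.size();
}
```

For instance, `fits_in_place("&amp;", {U'&'})` holds because one byte replaces five, whereas `fits_in_place("&nLt;", {U'\u226A', U'\u20D2'})` does not: the two code points take three bytes each, six in total, against a five-byte reference.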
Add support for expanding the HTML5 named character references.
The full list, along with a JSON form, can be found here: https://www.w3.org/TR/html5/syntax.html#named-character-references