A Ruby/Python HTML parser/tokenizer based on the WHATWG HTML5 specification, for maximum compatibility with major desktop web browsers.
0.11 Release Features
- Parses valid and invalid HTML documents to a tree
- Support for minidom, ElementTree (including cElementTree and lxml.etree), BeautifulSoup and custom simpletree output formats
- DOM to SAX converter
- Reports parse errors
- Character encoding detection
- XML mode for working with ill-formed XML, e.g. feeds
- Filtering and serializing of trees
- HTML+CSS sanitizer
- Many unit tests
- Faster than before :)
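As a rough illustration of the tree-building, error-reporting and serializing features listed above, here is a minimal sketch. It assumes a recent html5lib release for Python, whose API may differ from the 0.11 API; the names `SAMPLE_MARKUP` and `parse_with_errors` are made up for this example.

```python
import html5lib

# Deliberately sloppy input: no doctype and two unclosed <p> elements.
SAMPLE_MARKUP = "<title>Demo</title><p>One<p>Two"


def parse_with_errors(markup):
    """Parse markup into an ElementTree element and collect parse errors."""
    parser = html5lib.HTMLParser(
        tree=html5lib.getTreeBuilder("etree"),
        namespaceHTMLElements=False,  # keep plain tag names such as "p"
    )
    root = parser.parse(markup)
    # Each entry in parser.errors is a (position, error_code, details) tuple.
    return root, list(parser.errors)


root, errors = parse_with_errors(SAMPLE_MARKUP)

# The parser closes the first <p> implicitly, so two paragraphs are found.
paragraphs = [p.text for p in root.iter("p")]
print(paragraphs)

# Report what the parser complained about (for example the missing doctype).
for position, code, _details in errors:
    print(position, code)

# Turn the tree back into markup.
print(html5lib.serialize(root, tree="etree"))
```

Passing a different name to `getTreeBuilder` (for example `"dom"` for minidom) selects another output format; which builders are available depends on the installed release.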
Known Issues (0.11)
- Under Python 2.3, several encoding-related tests fail unless the cjkcodecs module is installed
- With some Python builds (notably the default OS X Python), one test may fail; this is not believed to be significant
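To see whether an interpreter is likely to hit the encoding-related failures, one option is to check which CJK codecs it can find. The sketch below uses only the standard `codecs` module; the encoding list is an illustrative assumption, not the exact set the test suite exercises, and `missing_codecs` is a hypothetical helper name.

```python
import codecs

# Illustrative CJK encodings; the encodings actually used by the tests
# are an assumption here.
CJK_ENCODINGS = ["shift_jis", "euc_jp", "iso2022_jp", "gb2312", "big5", "euc_kr"]


def missing_codecs(names):
    """Return the encoding names this interpreter has no codec for."""
    missing = []
    for name in names:
        try:
            codecs.lookup(name)
        except LookupError:
            missing.append(name)
    return missing


# An empty list means every listed codec is available.
print(missing_codecs(CJK_ENCODINGS))
```

If the list is not empty on Python 2.3, installing cjkcodecs should make the missing codecs available.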
Documentation
Getting help/getting involved
- IRC: the #whatwg channel on the Freenode IRC server
- Mailing list: html5lib-discuss
