A Ruby/Python HTML parser/tokenizer that follows the WHATWG HTML5 specification, for maximum compatibility with major desktop web browsers.
0.10 Release Features
- Parses valid and invalid HTML documents to a tree
- Support for minidom, ElementTree (including cElementTree and lxml.etree), BeautifulSoup and custom simpletree output formats
- DOM to SAX converter
- Reports parse errors
- Character encoding detection
- XML mode for working with ill-formed XML, e.g. feeds
- Filtering and serializing of trees
- HTML+CSS sanitizer
- Many unit tests
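To give a feel for what the character encoding detection feature involves, here is a minimal sketch of one possible first step: checking for a byte order mark (BOM) and falling back to a default encoding. This is not html5lib's implementation. The names `sniff_encoding` and `decode_document` and the `windows-1252` fallback are assumptions made for illustration, and a real detector would typically also consider transport-level information and `<meta>` declarations.

```python
import codecs

# BOMs are checked longest-first so that a UTF-32 LE BOM
# (FF FE 00 00) is not mistaken for a UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_encoding(data, fallback="windows-1252"):
    """Return (encoding_name, bom_length) for a byte string.

    Hypothetical helper: only looks at the BOM, nothing else.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    # No BOM found: use the (assumed) default encoding.
    return fallback, 0


def decode_document(data, fallback="windows-1252"):
    """Decode bytes to text, dropping the BOM if one was found."""
    encoding, skip = sniff_encoding(data, fallback)
    # "replace" keeps decoding going on invalid input, in the same
    # spirit as parsing invalid HTML instead of rejecting it.
    return data[skip:].decode(encoding, "replace")
```

For example, `decode_document(codecs.BOM_UTF8 + b"<p>hi")` decodes the bytes as UTF-8 and strips the three BOM bytes, while input without a BOM is decoded using the fallback.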
Known Issues (0.10)
- On Python 2.3, several encoding-related tests fail unless the cjkcodecs module is installed
Documentation
Getting help/getting involved
- IRC: the #whatwg channel on the Freenode IRC server
- Mailing list: html5lib-discuss
