html5lib
Library for working with HTML5 documents
Updated Feb 11, 2008 by geoffers
Labels: Documentation-User
UserDocumentation  
Documentation for using the library

Using html5lib

Installation

html5lib is packaged with setuptools. To install it use:

 $ python setup.py install

Tests

You may wish to check that your installation succeeded by running the test suite. All the tests can be run by invoking runtests.py in the tests/ directory or by running:

$ python setup.py test

Parsing HTML

Simple usage follows this pattern:

import html5lib
f = open("mydocument.html")
parser = html5lib.HTMLParser()
document = parser.parse(f)

This will return a tree in a custom "simpletree" format. More interesting is the ability to use a variety of standard tree formats; currently minidom, ElementTree and BeautifulSoup formats are supported by default. To do this you need to pass a TreeBuilder class as the "tree" argument to the HTMLParser. For the built-in treebuilders this can be conveniently obtained from the treebuilders.getTreeBuilder function, e.g. for minidom:

import html5lib
from html5lib import treebuilders

f = open("mydocument.html")
parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("dom"))
minidom_document = parser.parse(f)
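
Since the result is described as a minidom document, the usual xml.dom.minidom methods should apply to it. The following is a minimal sketch of that idea. To keep it runnable without html5lib installed, minidom.parseString on well-formed markup stands in for parser.parse(f); the sample markup and the link_targets helper are made up for illustration and are not part of the library.

```python
from xml.dom import minidom

# Stand-in for `minidom_document = parser.parse(f)`; the markup is invented.
minidom_document = minidom.parseString(
    '<html><head><title>Example</title></head>'
    '<body><p>See <a href="http://example.org/">this</a> and '
    '<a href="/about">that</a>.</p></body></html>')


def link_targets(document):
    """Return the href of every <a> element, in document order."""
    return [a.getAttribute("href")
            for a in document.getElementsByTagName("a")
            if a.hasAttribute("href")]


# The text of the first <title> element.
title = minidom_document.getElementsByTagName("title")[0]
print(title.firstChild.data)
print(link_targets(minidom_document))
```

With a tree produced by html5lib the same calls would be made on the object returned by parser.parse.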

For a BeautifulSoup tree replace the string "dom" with "beautifulsoup". For ElementTree the procedure is slightly more involved as there are many libraries that support the ElementTree API. Therefore getTreeBuilder accepts a second argument which is the ElementTree implementation that is desired (in the future this may be extended, for example to allow multiple DOM libraries to be used):

import html5lib
from html5lib import treebuilders
from xml.etree import cElementTree

f = open("mydocument.html")
parser = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("etree", cElementTree))
etree_document = parser.parse(f)

SAX Events

The WHATWG spec is not very streaming-friendly, as it requires rearrangement of subtrees in some situations. However, html5lib allows SAX events to be created from a DOM tree using html5lib.treebuilders.dom.dom2sax.
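
To show what producing SAX events from a DOM tree means in practice, here is a standard-library sketch of the general technique. It is not the dom2sax implementation and does not use its signature; dom_to_sax and EventRecorder are hypothetical names, and the tree is built with minidom.parseString rather than html5lib.

```python
from xml.dom import minidom, Node
from xml.sax.handler import ContentHandler
from xml.sax.xmlreader import AttributesImpl


def dom_to_sax(node, handler):
    """Walk a minidom node recursively and fire SAX events on handler."""
    if node.nodeType == Node.DOCUMENT_NODE:
        handler.startDocument()
        for child in node.childNodes:
            dom_to_sax(child, handler)
        handler.endDocument()
    elif node.nodeType == Node.ELEMENT_NODE:
        attrs = dict(node.attributes.items())
        handler.startElement(node.nodeName, AttributesImpl(attrs))
        for child in node.childNodes:
            dom_to_sax(child, handler)
        handler.endElement(node.nodeName)
    elif node.nodeType == Node.TEXT_NODE:
        handler.characters(node.data)
    # Comments, doctypes and other node types are ignored in this sketch.


class EventRecorder(ContentHandler):
    """Hypothetical handler that records the events it receives."""

    def __init__(self):
        ContentHandler.__init__(self)
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        self.events.append(("text", content))


recorder = EventRecorder()
dom_to_sax(minidom.parseString("<p>Hello <b>World</b></p>"), recorder)
print(recorder.events)
```

Because the whole tree already exists before any event fires, the subtree rearrangement mentioned above has already happened by the time a handler sees the document.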

Character encoding

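Parsed trees are always Unicode. However, a large variety of input encodings are supported. The encoding of the document is determined in the following way: the encoding may be specified explicitly by passing its name as the encoding parameter to HTMLParser.parse. If no encoding is specified, the parser will attempt to detect the encoding from a meta element in the first 512 bytes of the document. If no encoding can be found and the chardet library is available, an attempt will be made to sniff the encoding from the byte pattern. If all else fails, the default encoding (usually Windows-1252) will be used.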

Examples

Explicit encoding specification:

import html5lib
import urllib2
p = html5lib.HTMLParser()
# The encoding is an argument to parse(), not to urlopen()
p.parse(urllib2.urlopen("http://yahoo.co.jp").read(), encoding="euc-jp")

Automatic detection from a meta element:

import html5lib
import urllib2
p = html5lib.HTMLParser()
p.parse(urllib2.urlopen("http://www.mozilla-japan.org/").read())
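
As a rough illustration of what detection from a meta element involves, the sketch below scans the start of a byte string for a charset declaration. It is not html5lib's algorithm: the regular expression, the 512-byte window, the windows-1252 fallback and the sniff_meta_charset name are all simplifying assumptions.

```python
import re

# Simplified pattern: a charset=... declaration inside a <meta ...> tag.
META_CHARSET = re.compile(
    br'<meta[^>]*charset\s*=\s*["\']?([a-zA-Z0-9_\-]+)', re.IGNORECASE)


def sniff_meta_charset(data, limit=512, default="windows-1252"):
    """Return the charset named in a meta element near the start of data.

    Only the first `limit` bytes are examined; `default` is returned when
    no declaration is found there.
    """
    match = META_CHARSET.search(data[:limit])
    if match:
        return match.group(1).decode("ascii").lower()
    return default


print(sniff_meta_charset(b'<html><head><meta charset="EUC-JP"></head>'))
print(sniff_meta_charset(
    b'<meta http-equiv="Content-Type" '
    b'content="text/html; charset=Shift_JIS">'))
print(sniff_meta_charset(b'<html><body>no declaration</body></html>'))
```

A real parser has to cope with cases this ignores, such as declarations inside comments or conflicting declarations.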

Sanitizing Tokenizer

When building web applications it is often necessary to remove unsafe markup and CSS from user-entered content. html5lib provides a custom tokenizer for this purpose. It only allows known safe element tokens through and converts others to text. Similarly, a variety of unsafe CSS constructs are removed from the stream. For more details on the default configuration of the sanitizer, see http://wiki.whatwg.org/wiki/Sanitization_rules. The sanitizer can be used by passing it as the tokenizer argument to the parser:

import html5lib
from html5lib import sanitizer

p = html5lib.HTMLParser(tokenizer=sanitizer.HTMLSanitizer)
p.parse("<script>alert('foo');</script>")

XML Parsing

html5lib comes with two classes for parsing XML documents (one specifically for XHTML). These are essentially the HTMLParser without the HTML-specific logic. The two available classes are html5lib.XMLParser and html5lib.XHTMLParser.

Example

Parsing an Atom feed:

import html5lib
import urllib2
p = html5lib.XMLParser()
p.parse(urllib2.urlopen("http://blog.whatwg.org/feed/").read())

Treewalkers

Treewalkers provide a streaming view of a tree. They are useful for filtering and serializing the stream. html5lib provides a variety of treewalkers for working with different tree types. For example, to stream a dom tree:

from html5lib import treewalkers
walker = treewalkers.getTreeWalker("dom")

# stream is an iterable representing each token in the tree
stream = walker(dom_tree)

Treewalkers are available for all the tree types supported by the HTMLParser, plus xml.dom.pulldom ("pulldom"), genshi streams ("genshi") and an lxml-optimized elementtree ("lxml"). As with the treebuilders, treewalkers.getTreeWalker takes a second argument, implementation, containing an object that implements the ElementTree API.
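
To make the idea of a streaming view of a tree concrete, here is a standard-library sketch of a walker over a minidom tree. It is not html5lib's walker: the walk function is hypothetical, and the token dictionaries use a simplified shape that may differ from what the library's treewalkers emit.

```python
from xml.dom import minidom, Node


def walk(node):
    """Yield one token per tag or run of text, in document order."""
    for child in node.childNodes:
        if child.nodeType == Node.ELEMENT_NODE:
            yield {"type": "StartTag", "name": child.nodeName,
                   "attrs": dict(child.attributes.items())}
            # Recurse so that children appear between the start and end tags.
            for token in walk(child):
                yield token
            yield {"type": "EndTag", "name": child.nodeName}
        elif child.nodeType == Node.TEXT_NODE:
            yield {"type": "Characters", "data": child.data}


dom_tree = minidom.parseString(
    '<p class="x"><strong>Hello</strong> World</p>')
tokens = list(walk(dom_tree))
for token in tokens:
    print(token)
```

Because walk is a generator, a consumer can filter or serialize tokens one at a time without building a second copy of the tree.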

Sanitization using treewalkers

You may wish to sanitize content which has been parsed into a tree by some other code. This may be done using the sanitizer filter:

from html5lib import treewalkers, filters
from html5lib.filters import sanitizer

walker = treewalkers.getTreeWalker("dom")

stream = walker(dom_tree)
clean_stream = sanitizer.Filter(stream)
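
The filter wraps one token stream and yields another. The sketch below shows that pattern in isolation, using only plain Python. It is not the sanitizer.Filter implementation: the sanitize function, the three-element whitelist and the simplified token dictionaries are all assumptions made for illustration.

```python
# Illustrative whitelist; a real sanitizer's defaults are far more extensive.
ALLOWED_ELEMENTS = frozenset(["p", "strong", "em"])


def sanitize(stream, allowed=ALLOWED_ELEMENTS):
    """Pass allowed tags through; turn every other tag into a text token."""
    for token in stream:
        is_tag = token["type"] in ("StartTag", "EndTag")
        if is_tag and token["name"] not in allowed:
            slash = "/" if token["type"] == "EndTag" else ""
            yield {"type": "Characters",
                   "data": "<%s%s>" % (slash, token["name"])}
        else:
            yield token


# A hand-written stream standing in for the output of a treewalker.
stream = [
    {"type": "StartTag", "name": "p"},
    {"type": "StartTag", "name": "script"},
    {"type": "Characters", "data": "alert('foo');"},
    {"type": "EndTag", "name": "script"},
    {"type": "EndTag", "name": "p"},
]
clean_stream = list(sanitize(stream))
for token in clean_stream:
    print(token)
```

The script tags survive only as text tokens, so a serializer that escapes text would write them out as harmless character data.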

Serialization of Streams

html5lib provides HTML and XHTML serializers which work on streams produced by the treewalkers. These are implemented as generators, with each item in the generator representing a single tag or run of text. A full example of parsing and serializing content looks like:

import html5lib
from html5lib import treebuilders, treewalkers, serializer
from html5lib.filters import sanitizer

p = html5lib.HTMLParser(tree=treebuilders.getTreeBuilder("dom"))

dom_tree = p.parse("<p><strong>Hello</strong> World</p>")

walker = treewalkers.getTreeWalker("dom")

stream = walker(dom_tree)

s = serializer.htmlserializer.HTMLSerializer(omit_optional_tags=False)
output_generator = s.serialize(stream)

for item in output_generator:
    print item

<html>
<head>
</head>
<body>
<p>
<strong>
Hello
</strong>
 
World
</p>
</body>
</html>
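
The generator style of serializer can be sketched in a few lines of plain Python. This is an illustration of the pattern only, not the HTMLSerializer implementation: the serialize function and the simplified token dictionaries are hypothetical, and none of the serializer's options (such as omit_optional_tags) are modelled.

```python
from xml.sax.saxutils import escape, quoteattr


def serialize(stream):
    """Yield one string per token: a start tag, an end tag, or escaped text."""
    for token in stream:
        if token["type"] == "StartTag":
            # Sort attributes so that the output is deterministic.
            attrs = "".join(
                " %s=%s" % (name, quoteattr(value))
                for name, value in sorted(token.get("attrs", {}).items()))
            yield "<%s%s>" % (token["name"], attrs)
        elif token["type"] == "EndTag":
            yield "</%s>" % token["name"]
        elif token["type"] == "Characters":
            yield escape(token["data"])


# A hand-written stream standing in for the output of a treewalker.
stream = [
    {"type": "StartTag", "name": "p", "attrs": {"class": "x"}},
    {"type": "StartTag", "name": "strong"},
    {"type": "Characters", "data": "Hello"},
    {"type": "EndTag", "name": "strong"},
    {"type": "Characters", "data": " World & more"},
    {"type": "EndTag", "name": "p"},
]
output = "".join(serialize(stream))
print(output)
```

Joining the items gives a single string, whereas iterating over them one by one gives the item-per-line output shown above.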

Validation

The html5lib.filters.validator module contains a partial implementation of an HTML5 validator.

Bugs

Please report any bugs on the issue tracker: http://code.google.com/p/html5lib/issues/list

Get Involved

Contributions to code or documentation are actively encouraged. Submit patches to the issue tracker or discuss changes on IRC in the #whatwg channel on freenode.net.


Comment by serjux, Jun 18, 2008

Some practically examples of using html5lib with minidom for example should be nice , some complete code example would be nice

