Issue 69: charsUntil is slow (Python)
charsUntil takes up a large share (40–50% in some cases) of html5lib's processing time. It includes a comment suggesting one or two things, quoted below:

#This method is currently 40-50% of our total runtime and badly needs
#optimizing
#Possible improvements:
# - use regexp to find characters that match the required character set
#   (with regexp cache since we do the same searches many many times)
# - improve EOF handling for fewer if statements

The attached patch starts on the former, though it certainly has bugs (it fails a great many test cases).
May 27, 2008
I've committed (r1154) some changes based partly on this patch, to store strings more often as strings (rather than lists) and to use regexps over them. Parsing the HTML5 spec ("time python parse.py spec.html --treebuilder=lxml --no-html"), those changes reduce the time by 15%, from 25.8s to 22.0s. Parsing Project Gutenberg's HTML version of "The Iliad by Homer", they reduce the time by 25%, from 22.0s to 16.8s. charsUntil and char still take a large amount of time, so I'm leaving this issue open in the hope that someone will fix it better.
Jun 2, 2008
(No comment was entered for this change.)
Apr 9, 2013
Meh, it's good enough.