Issue 69: charsUntil is slow (Python)
Status:  WFM
Owner:  ----
Closed:  Apr 2013
Cc:  geoffers


Project Member Reported by geoffers, May 15, 2008
charsUntil takes up a large share (40–50% in some cases) of html5lib's processing time. The
method carries a comment suggesting two possible improvements, quoted below:

#This method is currently 40-50% of our total runtime and badly needs
#optimizing
#Possible improvements:
# - use regexp to find characters that match the required character set
#   (with regexp cache since we do the same searches many many times)
# - improve EOF handling for fewer if statements

The attached patch makes a start on the former (the regexp approach), though it certainly has
bugs: it fails a large number of test cases.
Attachment: re.patch (2.5 KB)
May 27, 2008
#1 exc...@gmail.com
I've committed (r1154) some changes based partly on this patch, to store strings 
more often as strings (rather than lists) and to use regexps over them.

Parsing the HTML5 spec ("time python parse.py spec.html --treebuilder=lxml --no-html"),
those changes reduce the time by 15%, from 25.8s to 22.0s.

Parsing Project Gutenberg's HTML version of "The Iliad by Homer", they reduce the 
time by 25%, from 22.0s to 16.8s.

charsUntil and char still take a large amount of time, so I'm leaving this issue 
open in the hope that someone will fix it better.
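One way to check how much of a run a single function still accounts for is to profile it with the standard library's cProfile and pstats. The sketch below is only an illustration under stated assumptions: `own_time_share`, `scan_until` and `workload` are hypothetical names, and `scan_until` is a stand-in character-by-character scanner rather than html5lib's charsUntil. To measure the real thing, one would pass a callable that runs the parser and match on "charsUntil" instead.

```python
import cProfile
import pstats


def own_time_share(workload, name_fragment):
    """Profile workload() and return the fraction of total own time spent
    in functions whose name contains name_fragment."""
    profiler = cProfile.Profile()
    profiler.runcall(workload)
    stats = pstats.Stats(profiler)
    if stats.total_tt == 0:
        return 0.0
    # stats.stats maps (filename, lineno, funcname) to
    # (primitive calls, total calls, own time, cumulative time, callers).
    matching = sum(
        tt
        for (filename, lineno, funcname), (cc, nc, tt, ct, callers)
        in stats.stats.items()
        if name_fragment in funcname
    )
    return matching / stats.total_tt


def scan_until(data, stop):
    # Stand-in for a slow scanner: advance one character at a time.
    pos = 0
    while pos < len(data) and data[pos] not in stop:
        pos += 1
    return pos


def workload():
    text = "x" * 2000 + "<"
    for _ in range(200):
        scan_until(text, "<&")


share = own_time_share(workload, "scan_until")
print("scan_until share of own time: %.0f%%" % (share * 100))
```

Own time excludes time spent in callees such as `len`, so the figure is a lower bound on what the function costs overall.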
Jun 2, 2008
#2 jgraham....@googlemail.com
(No comment was entered for this change.)
Labels: Port-Python
Apr 9, 2013
Project Member #3 geoffers
Meh, it's good enough.
Status: WFM