html5lib parses extremely slow certain type of content #15076

Closed
DartBot opened this issue Nov 14, 2013 · 11 comments
Labels: area-pkg (used for miscellaneous pkg/ packages not associated with specific area- teams)

Comments

@DartBot

DartBot commented Nov 14, 2013

This issue was originally filed by kaisellgren@gmail.com


The following code demonstrates a very slow parsing speed:

  http.read('https://www.facebook.com/MARCA').then((c) {
    htmlParser.parse(c);
  });

For most websites parsing is fast, but for a few, and this site in particular, it takes around 20-25 seconds.

@sethladd
Contributor

Is this when running on the VM or via dart2js?

Routing to VM because they might be interested in performance reports like this.

(FWIW I've heard the same report when trying to parse a large SVG document. Fast via dart2js but slower via VM.)


cc @jmesserly.
Added Area-VM, Triaged labels.

@DartBot
Author

DartBot commented Nov 16, 2013

This comment was originally written by kaisellgren@gmail.com


Sorry I wasn't clear. This happens on the VM. Latest stable.

@iposva-google
Contributor

Can you please give a bit more context on what your example is doing? In particular, I am missing how you allocated the htmlParser object and where you got its class from.

Since this is filed against parsing, I assume we can just download the example URL once and feed a test from the file, correct?


Added NeedsInfo label.

@DartBot
Author

DartBot commented Nov 26, 2013

This comment was originally written by kaisellgren@gmail.com


Sure, I don't see why we can't save it to a file. Here's a full code snippet to try against:

  library test;

  import 'package:http/http.dart' as http;
  import 'package:html5lib/parser.dart' as htmlParser;

  void main() {
    http.read('https://www.facebook.com/MARCA').then((c) {
      var w = new Stopwatch()..start();
      htmlParser.parse(c);
      print(w.elapsed);
    });
  }

It prints 11½ seconds for me. This happens consistently.

@sethladd
Contributor

Here is a directory with all the pub setup ready to go.

unzip
run pub get
dart bin/xml.dart


Attachment:
testxml.zip (2.31 KB)

@iposva-google
Contributor

Thanks for the reproduction instructions.

It turns out that for the sample in question we spend a lot of time copying strings due to tokenizer.dart appending characters to a string with interpolation:

currentStringToken.data = "${currentStringToken.data}-${data}";

Using a StringBuffer would be the right way to go in this situation. Also there are many other places like this in this particular source file.
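
To illustrate the difference, here is a minimal, self-contained sketch. It is not code from tokenizer.dart: the function names and the chunk count are made up for the comparison, and actual timings will vary by machine and Dart version.

```dart
// Sketch only: compares repeated interpolation with a StringBuffer.
// Function names and the chunk count are illustrative assumptions,
// not code from html5lib's tokenizer.dart.

String appendByInterpolation(List<String> chunks) {
  var data = '';
  for (final chunk in chunks) {
    // Each pass allocates a new string and copies everything built so far,
    // so the total work grows quadratically with the output length.
    data = '$data$chunk';
  }
  return data;
}

String appendWithBuffer(List<String> chunks) {
  final buffer = StringBuffer();
  for (final chunk in chunks) {
    // The buffer collects the pieces; the final string is built once.
    buffer.write(chunk);
  }
  return buffer.toString();
}

void main() {
  final chunks = List<String>.filled(50000, '-');

  final slow = Stopwatch()..start();
  final a = appendByInterpolation(chunks);
  slow.stop();

  final fast = Stopwatch()..start();
  final b = appendWithBuffer(chunks);
  fast.stop();

  print('same result: ${a == b}');
  print('interpolation: ${slow.elapsed}, StringBuffer: ${fast.elapsed}');
}
```

Both functions return the same string; only the amount of copying differs.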


Set owner to @jmesserly.
Removed Area-VM label.
Added Area-Pkg, Library-Html5lib, Accepted labels.

@sethladd
Contributor

sethladd commented Dec 2, 2013

Thanks for the analysis.

@jmesserly

Nice tracking that down! Wow, turns out this is a really old bug :)
dart-lang/html#3

Patches are welcome for this. Probably pretty easy now that we have proper token classes and APIs use types. Originally the Python code used tokens in a very untyped way (they were just maps, "data" had different meaning in different places depending on the tokenizer state) ... now that it's sorted out into "currentStringToken" and APIs are typed, it's probably pretty easy to find the .data concatenations.
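
For anyone picking this up, one possible shape for such a patch is sketched below. The class name BufferedStringToken and its add method are hypothetical, invented for illustration; the real token classes in html5lib may look quite different. The idea is that a call site such as the "${currentStringToken.data}-${data}" concatenation would call add for each piece instead of rebuilding the whole string.

```dart
// Hypothetical token class for illustration only; html5lib's actual
// token classes may differ in name, fields, and API.
class BufferedStringToken {
  final StringBuffer _buffer = StringBuffer();
  String? _cached;

  // Appends text without copying what has been accumulated so far.
  void add(String text) {
    _buffer.write(text);
    _cached = null; // Invalidate the cached string.
  }

  // Builds the string on demand and caches it until the next add().
  String get data => _cached ??= _buffer.toString();
}

void main() {
  final token = BufferedStringToken();
  token
    ..add('some comment text')
    ..add('-')
    ..add('more text');
  print(token.data);
}
```

Caching the result keeps repeated reads of data cheap, which matters if the tokenizer inspects the token between appends.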


Removed the owner.
Added PatchesWelcome label.

@kevmoo
Member

kevmoo commented Feb 13, 2014

Added Pkg-Html5Lib label.

@kevmoo
Member

kevmoo commented Feb 13, 2014

Removed Library-Html5lib label.

@DartBot added the Type-Defect and area-pkg labels Feb 13, 2014
@DartBot
Author

DartBot commented Jun 4, 2015

This issue has been moved to dart-lang/html#18.

5 participants