
abot - issue #77
HtmlAgilityPack throws StackOverflowException on pages with lots of nested tags
This occurs during HtmlDocument.LoadHtml(string). Attached are 2 HTML files whose content throws a StackOverflowException when loaded.
- HtmlAgilityPackStackOverflow1.html 919.97KB
- HtmlAgilityPackStackOverflow2.html 409.99KB
Comment #1
Posted on Mar 8, 2013 by Helpful Dog
This issue was closed by revision r281.
Comment #2
Posted on Mar 8, 2013 by Helpful Dog
Patched HtmlAgilityPack to fix this issue. Added HtmlDocument.OptionMaxNestedChildNodes, which can be set to prevent StackOverflowExceptions caused by deeply nested tags. When the limit is exceeded, it throws an ApplicationException with the message "Document has more than X nested tags. This is likely due to the page not closing tags properly."
Usage:
    HtmlDocument hapDoc = new HtmlDocument();
    hapDoc.OptionMaxNestedChildNodes = 5000;
    try
    {
        hapDoc.LoadHtml(RawContent);
    }
    catch (Exception)
    {
        // Nesting limit exceeded (or another load failure): fall back to an empty document.
        hapDoc.LoadHtml("");
    }
Attached new HtmlAgilityPack.dll assembly. Will submit this patch to the HtmlAgilityPack project site.
- HtmlAgilityPack.dll 132.5KB
Comment #3
Posted on Mar 8, 2013 by Helpful Dog
Added all source and binary to the HAP project site...
Comment #4
Posted on Mar 8, 2013 by Helpful Dog
Attached the full patch zip submitted to the HAP project.
- HapStackOverFlowPatch.zip 5.11MB
Comment #5
Posted on Mar 16, 2013 by Massive Panda
Comment deleted
Comment #6
Posted on Sep 3, 2013 by Swift Lion
Hello. Although I am using your HtmlAgilityPack.dll as recommended, I am still getting the StackOverflowException. Please check the screenshot.
I hope you can help with this.
Thanks in advance.
- stackoverflow.png 234.86KB
Comment #7
Posted on Sep 3, 2013 by Helpful Dog
Hi, can you narrow it down to a single page/URL? HAP uses many stacks in its implementation. I only fixed the one related to nested HTML tags, so it is likely that other conditions can still cause stack overflows.
Comment #8
Posted on Sep 4, 2013 by Swift Lion
Hello, I was running the crawler against http://www.gesetze-im-internet.de/aktuell.html and fetching the XML files within it. The site has over 200,000 pages with nested HTML tags. I suspect the problem is related to the Visual Studio stack. I will test this today, just wanted to let you know :)
Comment #9
Posted on Jul 10, 2015 by Helpful Dog
It turns out that setting HtmlDocument.OptionFixNestedTags = true solves this issue without needing the patched version.
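For anyone landing here, a minimal sketch of that setting with the stock HtmlAgilityPack assembly might look like the following. The class name and the sample markup are made up for illustration, and this only shows where the option is set. It does not demonstrate that every stack overflow case is covered.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class FixNestedTagsSketch
{
    static void Main()
    {
        // Hypothetical input for illustration: tags are opened but never closed.
        string rawContent = "<html><body><div><p>one<div><p>two</body></html>";

        HtmlDocument hapDoc = new HtmlDocument();

        // Set the option before loading so it applies to this parse.
        hapDoc.OptionFixNestedTags = true;
        hapDoc.LoadHtml(rawContent);

        // Walk the resulting tree to confirm the document loaded.
        Console.WriteLine(hapDoc.DocumentNode.Descendants().Count());
    }
}
```

If a particular page still overflows with this option set, narrowing it down to a single URL, as asked in Comment #7, is probably the next step.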
Status: Fixed
Labels:
Type-Defect
Priority-High