abot - issue #77

HtmlAgilityPack throws StackOverflowException on pages with lots of nested tags


Posted on Mar 8, 2013 by Helpful Dog

HtmlAgilityPack throws StackOverflowException on pages with lots of nested tags. This occurs during HtmlDocument.LoadHtml(string). Attached are 2 HTML files whose content throws a StackOverflowException when loaded.

Comment #1

Posted on Mar 8, 2013 by Helpful Dog

This issue was closed by revision r281.

Comment #2

Posted on Mar 8, 2013 by Helpful Dog

Patched HtmlAgilityPack to fix this issue. Added HtmlDocument.OptionMaxNestedChildNodes, which can be set to prevent StackOverflowExceptions caused by deeply nested tags. When the limit is exceeded, it throws an ApplicationException with the message "Document has more than X nested tags. This is likely due to the page not closing tags properly."

Usage:

    HtmlDocument hapDoc = new HtmlDocument();
    hapDoc.OptionMaxNestedChildNodes = 5000;
    try
    {
        hapDoc.LoadHtml(RawContent);
    }
    catch (Exception e)
    {
        hapDoc.LoadHtml("");
    }

Attached new HtmlAgilityPack.dll assembly. Will submit this patch to the HtmlAgilityPack project site.

Comment #3

Posted on Mar 8, 2013 by Helpful Dog

Added all source and binaries to the HAP project site:

http://www.codeplex.com/site/users/view/sjdirect

Comment #4

Posted on Mar 8, 2013 by Helpful Dog

Attached the full patch zip that was submitted to the HAP project.

Comment #5

Posted on Mar 16, 2013 by Massive Panda

Comment deleted

Comment #6

Posted on Sep 3, 2013 by Swift Lion

Hello, although I am using your HtmlAgilityPack.dll as recommended, I am still getting the StackOverflowException. Please check the screenshot.

Hope you can help in this.

Thanks in advance.

Comment #7

Posted on Sep 3, 2013 by Helpful Dog

Hi, can you narrow it down to a single page/URL? HAP uses many stacks in its implementation. I only fixed the one related to nested HTML tags, so it is likely that there are other conditions that can cause stack overflows.

Comment #8

Posted on Sep 4, 2013 by Swift Lion

Hello, I was running the crawler against the following site: http://www.gesetze-im-internet.de/aktuell.html, fetching the XML files within it. It's over 200,000 pages with nested HTML tags. I suspect it's related to the Visual Studio stack. I will test this today, just wanted to let you know :)

Comment #9

Posted on Jul 10, 2015 by Helpful Dog

Turns out that setting HtmlDocument.OptionFixNestedTags = true solves this issue without needing the patched version.
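For reference, a minimal sketch of what that could look like with the stock HtmlAgilityPack assembly. The class name, the LoadWithFixNestedTags helper and the sample markup are hypothetical and only for illustration. The fallback to an empty document mirrors the usage from Comment #2.

```csharp
using System;
using HtmlAgilityPack;

public static class NestedTagExample
{
    // Hypothetical helper: parses raw HTML with OptionFixNestedTags enabled.
    // The option is set before LoadHtml so that it is in effect during parsing.
    public static HtmlDocument LoadWithFixNestedTags(string rawContent)
    {
        var doc = new HtmlDocument();
        doc.OptionFixNestedTags = true;
        try
        {
            doc.LoadHtml(rawContent);
        }
        catch (Exception)
        {
            // Covers ordinary parse exceptions only. A StackOverflowException
            // cannot be caught by try/catch in .NET; it terminates the process.
            doc = new HtmlDocument();
            doc.LoadHtml("");
        }
        return doc;
    }

    public static void Main()
    {
        // Sample markup with unclosed <p> tags, made up for this sketch.
        var doc = LoadWithFixNestedTags("<div><p>one<p>two</div>");
        Console.WriteLine(doc.DocumentNode.OuterHtml);
    }
}
```

Because a stack overflow cannot be recovered from inside the process, the try/catch is only a safety net for other failures. The actual protection here comes from the option itself.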

Status: Fixed

Labels: Type-Defect, Priority-High