Posted on Feb 10, 2011 by Quick Wombat

What steps will reproduce the problem?

1. Use the HTMLHighlighter to extract the relevant html-code from a page:

    final BoilerpipeExtractor extractor = CommonExtractors.ARTICLE_EXTRACTOR;
    final HTMLHighlighter hh = HTMLHighlighter.newExtractingInstance();
    System.out.println(hh.process(url, extractor));

2. Try to parse this page: http://www.golem.de/1102/81290.html

What is the expected output? What do you see instead?

This should be the output:

    <H2> Daniel Domscheit-Berg </H2> <H1> Wikileaks-Aussteiger haben Unterlagen mitgenommen </H1> ...

But actually I get this:

    Daniel Domscheit-Berg </H2> Wikileaks-Aussteiger haben Unterlagen mitgenommen </H1> ...
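To make the symptom easier to spot on other pages, here is a minimal sketch in plain Java (no boilerpipe dependency) that lists closing tags with no opening tag before them. The class name `UnmatchedTagCheck` and the method `unmatchedClosingTags` are hypothetical helpers made up for this illustration, and the regex-based tag scan is a deliberately naive assumption: it ignores void elements such as `<br>`, comments, and scripts, so it is only meant for eyeballing extractor output, not for validating HTML.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UnmatchedTagCheck {
    // Matches simple opening and closing tags such as <H2>, <td class="x"> or </H2>.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>");

    /** Returns the names of closing tags that have no matching opening tag before them. */
    public static List<String> unmatchedClosingTags(String html) {
        Deque<String> open = new ArrayDeque<>();
        List<String> unmatched = new ArrayList<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            if (m.group().endsWith("/>")) {
                continue; // self-closing tag, nothing to balance
            }
            String name = m.group(2).toUpperCase(Locale.ROOT);
            boolean closing = !m.group(1).isEmpty();
            if (!closing) {
                open.push(name);
            } else if (!open.isEmpty() && open.peek().equals(name)) {
                open.pop();
            } else {
                // Closing tag whose opening tag was never seen (hypothetical check, not boilerpipe API).
                unmatched.add(name);
            }
        }
        return unmatched;
    }

    public static void main(String[] args) {
        String expected = "<H2> Daniel Domscheit-Berg </H2> "
                + "<H1> Wikileaks-Aussteiger haben Unterlagen mitgenommen </H1>";
        String actual = "Daniel Domscheit-Berg </H2> "
                + "Wikileaks-Aussteiger haben Unterlagen mitgenommen </H1>";
        System.out.println(unmatchedClosingTags(expected)); // prints []
        System.out.println(unmatchedClosingTags(actual));   // prints [H2, H1]
    }
}
```

Feeding the string returned by `hh.process(url, extractor)` into such a check would flag pages where the opening tags went missing, like the two outputs quoted above.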

What version of the product are you using? On what operating system?

- boilerpipe 1.1.0 binary
- OS: SUSE

Is it possible to generate exactly the output that the Web API produces? Other tags, such as <TABLE> and <TD>, also seem to be missing.

Comment #1

Posted on Feb 10, 2011 by Quick Wombat

Actually this issue also affects other elements (sometimes) - like "". Have a look at this page: http://www.n-tv.de/politik/Bewaehrungsstrafe-fuer-Tims-Vater-article2575771.html

I think that somehow the detection of text blocks must be buggy.

Comment #2

Posted on Feb 10, 2011 by Happy Cat

Thanks for reporting.

This bug has been fixed in boilerpipe 1.2, which will be released in the next few days.

Comment #3

Posted on Feb 10, 2011 by Quick Wombat

Superb! I guess you are already using the fixed version for your Website API, then. I'm already looking forward to giving the new release a trial!

boilerpipe - issue #17
