What steps will reproduce the problem?
1. Use the HTMLHighlighter to extract the relevant HTML code from a page:
   final BoilerpipeExtractor extractor = CommonExtractors.ARTICLE_EXTRACTOR;
   final HTMLHighlighter hh = HTMLHighlighter.newExtractingInstance();
   System.out.println(hh.process(url, extractor));
2. Try to parse this page: http://www.golem.de/1102/81290.html
What is the expected output? What do you see instead?
This should be the output:
<H2> Daniel Domscheit-Berg </H2> <H1> Wikileaks-Aussteiger haben Unterlagen mitgenommen </H1> ...
But actually I get this:
Daniel Domscheit-Berg </H2> Wikileaks-Aussteiger haben Unterlagen mitgenommen </H1> ...
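The symptom above (closing tags such as </H2> with no opening tag before them) can be checked mechanically. The following is a minimal sketch using only the Java standard library; the class UnmatchedTagCheck and its method findUnmatchedClosingTags are hypothetical helpers written for illustration and are not part of boilerpipe. It uses a simple regular expression rather than a real HTML parser, so it is only meant for quick sanity checks on highlighter output.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of boilerpipe): finds closing tags that have no opening tag. */
final class UnmatchedTagCheck {

    // Group 1 is "/" for a closing tag, group 2 is the tag name,
    // group 3 is "/" for a self-closing tag such as <br/>.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*?(/?)>");

    /** Returns the upper-cased names of closing tags with no earlier unclosed opening tag. */
    static List<String> findUnmatchedClosingTags(String html) {
        Deque<String> open = new ArrayDeque<>();
        List<String> unmatched = new ArrayList<>();
        Matcher m = TAG.matcher(html);
        while (m.find()) {
            String name = m.group(2).toUpperCase(Locale.ROOT);
            boolean closing = !m.group(1).isEmpty();
            boolean selfClosing = !m.group(3).isEmpty();
            if (selfClosing) {
                continue; // neither opens nor closes anything
            }
            if (!closing) {
                open.push(name); // remember the opening tag
            } else if (!open.remove(name)) {
                // No pending opening tag with this name: report it.
                unmatched.add(name);
            }
        }
        return unmatched;
    }

    public static void main(String[] args) {
        // Strings taken from the report above.
        String expected = "<H2> Daniel Domscheit-Berg </H2> "
                + "<H1> Wikileaks-Aussteiger haben Unterlagen mitgenommen </H1>";
        String actual = "Daniel Domscheit-Berg </H2> "
                + "Wikileaks-Aussteiger haben Unterlagen mitgenommen </H1>";
        System.out.println(findUnmatchedClosingTags(expected)); // prints []
        System.out.println(findUnmatchedClosingTags(actual));   // prints [H2, H1]
    }
}
```

Passing the string returned by hh.process(url, extractor) to findUnmatchedClosingTags should give an empty list once the opening tags are emitted correctly.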
What version of the product are you using? On what operating system?
- boilerpipe 1.1.0 binary
- OS: SUSE
Is it possible to generate exactly the same output that the Web API produces? Other tags, such as <TABLE> and <TD>, also seem to be missing.
Comment #1
Posted on Feb 10, 2011 by Quick Wombat
Actually, this issue also (sometimes) affects other elements, like "
". Have a look at this page: - http://www.n-tv.de/politik/Bewaehrungsstrafe-fuer-Tims-Vater-article2575771.html
I think that somehow the detection of text blocks must be buggy.
Comment #2
Posted on Feb 10, 2011 by Happy Cat
Thanks for reporting.
This bug has been fixed in boilerpipe 1.2, which will be released in the next few days.
Comment #3
Posted on Feb 10, 2011 by Quick Wombat
Superb! I guess you are already using the fixed version for your Website API, then. I'm looking forward to trying out the new release!
Status: Fixed
Labels:
Type-Defect
Priority-Medium