FAQ
Frequently Asked Questions

Will it work with any web page?

Probably not, but you should give it a try. The algorithms used in Boilerpipe should be largely content-independent. Start with English text, preferably from news or blog sites, since the algorithms have been trained on such data. Then attempt to extract text from other sources and see how it goes.

If you extract non-English text, you might need to change some parameters (this is not yet automated). See the WSDM 2010 paper for details.

If your page is really short, i.e. it contains only a few words (or just one sentence per paragraph, etc.), you might get worse results than for longer pages.

If your page does not contain sufficient HTML text (because the content is in PDF, Flash or JavaScript), there is nothing Boilerpipe can do about it at the moment. Try converting that content to HTML or to TextDocuments and feeding the result to a Boilerpipe extractor.

It does not work for page X

Maybe it is an HTML parsing problem? Boilerpipe uses the NekoHTML parser library to get a valid SAX tree (to skip elements like SCRIPT, OPTION etc. and to detect text linked by A). Boilerpipe user Kris Jirapinyo reported a case where NekoHTML failed to parse the input HTML correctly and caused low-quality extraction results (e.g., for this page). This was fixed in boilerpipe 1.0.2. Boilerpipe now also actively monitors what comes from the HTML parser and throws a BoilerpipeProcessingException if it receives input that is known to be incorrect.

So, if you are not happy with the extraction of a particular web page, please try to clean the page's HTML code before sending it to boilerpipe. If you have Firefox with the Web Developer Extension, just click "View Generated Source" and save that HTML to disk. If extraction then works, the problem is probably a bug in NekoHTML.

In any case, please file an issue with all necessary information (e.g., the URL and the type of error). If possible, please also attach the HTML in question.
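Before filing an issue, it can help to check whether the saved HTML contains enough visible text at all, since short pages and pages whose content only exists inside scripts are the cases described above as problematic. The following is a minimal sketch of such a pre-check using only the Java standard library. It is not part of Boilerpipe and does not reproduce its algorithms: the class name VisibleTextCheck, its methods, and any word threshold you pass in are assumptions made for illustration, and the regex-based tag stripping is deliberately crude (it is not an HTML parser).

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Boilerpipe: a rough estimate of how much
// visible text an HTML string contains.
public class VisibleTextCheck {
    // Matches <script>...</script> and <style>...</style> blocks with their content.
    private static final Pattern SCRIPT_OR_STYLE = Pattern.compile(
            "(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");
    // Matches any remaining tag, e.g. <p>, </div>, <br/>.
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]*>");

    /** Rough count of whitespace-separated words left after dropping markup. */
    public static int countVisibleWords(String html) {
        String noScripts = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        String text = TAG.matcher(noScripts).replaceAll(" ").trim();
        return text.isEmpty() ? 0 : text.split("\\s+").length;
    }

    /** True if the page has fewer visible words than minWords (an arbitrary threshold). */
    public static boolean looksTooShort(String html, int minWords) {
        return countVisibleWords(html) < minWords;
    }

    public static void main(String[] args) {
        String scriptOnly =
                "<html><body><script>render('lots of words here');</script></body></html>";
        String article =
                "<html><body><h1>Title</h1><p>Some <code>inline</code> text here.</p></body></html>";
        System.out.println(countVisibleWords(scriptOnly)); // 0
        System.out.println(countVisibleWords(article));    // 5
    }
}
```

If the count is close to zero for HTML saved straight from the server but much higher for the "View Generated Source" version, the page probably builds its content with JavaScript, and the generated source is the better input for an extractor.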
This is one of the best algorithms I have found for text extraction. I have been using it for news extraction in Spanish and it works fairly well with the default parameters. Congratulations!
You should make a page to list who is using it. As a start, you could add the WebLab project (http://weblab.ow2.org).
Is there anywhere to submit test cases that show poor extraction results? I have had little success on a reasonably straightforward page of my own website, here: gfxmonk.net/2010/09/26/why-zero-install-will-succeed.html
Most notably, almost all of the classifiers seem to throw away parts of HTML elements (e.g., some of the parts contained in a given <p> tag will appear, but others in the same tag will not). Also, most of the classifiers seem uninterested in keeping the <h1> in the extracted content.
@gfxmonk:
This will be fixed in boilerpipe 1.2.0. The problem was that "CODE" was not treated as an inline element. See http://boilerpipe-web.appspot.com/ for a pre-release version, and try it with your page. (PS: Please file an issue for such problems; the comment system is not meant for that.)
Cheers, Christian
Is anyone using Boilerpipe with Italian documents? I've done some tests with Italian content, and the results with the Article filter are not so good. I'm looking for a way to change the Boilerpipe parameters.
Is there a way to see an HTML tag label from filters such as LabelToContentFilter? I want to clean the extracted document a little by rejecting content with a certain div#id.
I am trying to extract drink recipes from several websites. I have already downloaded 20000 in HTML. My idea is to use boilerpipe (kudos on the name) to clean out those nasty banners and excess data. The final file should be an XML with tags for the name of the drink, the ingredients and the type of glass.
Could you please clarify the difference between ArticleExtractor and ArticleSentencesExtractor? What precisely is the difference between what each aims to do? Thanks.