QuickStart  
How to get started
Updated Aug 19, 2010 by ckkohl79

Installation

  1. Get a binary and optionally also the source tarball from the Downloads page
  2. Extract the files somewhere:

        tar xvzf boilerpipe-VERSION-bin.tar.gz
        tar xvzf boilerpipe-VERSION-src.tar.gz

  3. Add boilerpipe-VERSION.jar, nekohtml-1.9.13.jar and xerces-2.9.1.jar to your Java classpath (these jar files are included in the binary tarball).

(Replace VERSION with boilerpipe's version number, e.g. 1.0.3.)
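
If you are unsure whether all three jars actually ended up on the classpath, a small probe like the following may help. It is only a sketch: the class name ClasspathCheck is made up for this example, and the three probe class names are assumptions about what each jar contains (one representative class per jar), so adjust them if your versions differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Reports which of the three jars from the installation steps seem to be visible. */
final class ClasspathCheck {

    /** Returns true if the named class can be located (without initializing it). */
    static boolean isAvailable(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    /** Maps each jar (by the short name used in this guide) to whether its probe class was found. */
    static Map<String, Boolean> check() {
        // Assumption: one representative class per jar.
        Map<String, String> probes = new LinkedHashMap<>();
        probes.put("boilerpipe", "de.l3s.boilerpipe.extractors.ArticleExtractor");
        probes.put("nekohtml", "org.cyberneko.html.HTMLConfiguration");
        probes.put("xerces", "org.apache.xerces.parsers.SAXParser");

        Map<String, Boolean> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> probe : probes.entrySet()) {
            result.put(probe.getKey(), isAvailable(probe.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        check().forEach((jar, found) ->
                System.out.println(jar + ": " + (found ? "found" : "MISSING")));
    }
}
```

Run it with the same classpath you intend to use for your own program; any jar reported as MISSING is not being picked up.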

Usage

Once you have boilerpipe on your classpath, extracting the "main" content from a Web page is really simple:

   import java.net.URL;
   import de.l3s.boilerpipe.extractors.ArticleExtractor;

   URL url = new URL("http://www.example.com/some-location/index.html");
   // NOTE: Use ArticleExtractor unless DefaultExtractor gives better results for you
   String text = ArticleExtractor.INSTANCE.getText(url);

Please also have a look at the demo classes.

Even though there is a DefaultExtractor, which should work reasonably well on any type of content, other extractors may be more suitable for particular kinds of content. For example, ArticleExtractor adds some heuristics to extract the main content from a news page (this is the usual scenario for boilerplate removal, so use it unless you are absolutely sure what you are doing). There is also NumWordsRulesExtractor, which resembles the number-of-words-based decision tree presented in the WSDM 2010 paper (Algorithm 2).
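
To give a feel for what a number-of-words-based decision tree looks like, here is a deliberately simplified sketch. It is not Algorithm 2 from the paper and not boilerpipe's code: the class WordCountRuleSketch, the Block type, the use of link density as a second feature, and all three thresholds are assumptions chosen purely for illustration.

```java
/** Toy classifier: decides per text block whether it is content or boilerplate. */
final class WordCountRuleSketch {

    // Hypothetical thresholds, for illustration only.
    static final double MAX_LINK_DENSITY = 0.33;
    static final int LONG_BLOCK_WORDS = 15;
    static final int CONTEXT_WORDS = 5;

    /** Features of one text block: word count and share of words inside links. */
    static final class Block {
        final int numWords;
        final double linkDensity;

        Block(int numWords, double linkDensity) {
            this.numWords = numWords;
            this.linkDensity = linkDensity;
        }
    }

    /** Classifies the current block using its own features and those of its neighbours. */
    static boolean isContent(Block prev, Block curr, Block next) {
        if (curr.linkDensity > MAX_LINK_DENSITY) {
            return false; // mostly link text, e.g. a navigation bar
        }
        if (curr.numWords > LONG_BLOCK_WORDS) {
            return true; // long running text
        }
        // Short block: keep it only when both neighbours carry some text.
        return prev.numWords > CONTEXT_WORDS && next.numWords > CONTEXT_WORDS;
    }

    public static void main(String[] args) {
        Block nav = new Block(4, 0.9);
        Block headline = new Block(8, 0.0);
        Block paragraph = new Block(60, 0.02);

        System.out.println("paragraph: " + isContent(nav, paragraph, nav));
        System.out.println("nav: " + isContent(paragraph, nav, paragraph));
        System.out.println("headline in text: " + isContent(paragraph, headline, paragraph));
    }
}
```

The point of the sketch is only the shape of the rules: cheap, shallow features of a block and its immediate neighbours, combined in a small tree.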

There are also other ways to call the Extractor (using Reader, InputSource, String etc.).

See the API Javadocs for details.
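
As a rough illustration of those alternative inputs, the sketch below prepares a String, a Reader and an InputSource using only JDK classes. The helper class ExtractorInputs and its method names are made up for this example, and decoding as UTF-8 is an assumption about your input. The extractor calls themselves appear as comments because they need boilerpipe on the classpath.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import org.xml.sax.InputSource;

/** Builds the input forms an extractor can consume from HTML you already hold. */
final class ExtractorInputs {

    /** Decodes raw bytes with an explicit charset instead of the platform default. */
    static Reader readerFor(byte[] rawHtml) {
        return new InputStreamReader(new ByteArrayInputStream(rawHtml), StandardCharsets.UTF_8);
    }

    /** Wraps already-decoded HTML as a SAX InputSource. */
    static InputSource inputSourceFor(String html) {
        return new InputSource(new StringReader(html));
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><body><p>caf\u00e9</p></body></html>";
        byte[] raw = html.getBytes(StandardCharsets.UTF_8);

        Reader reader = readerFor(raw);
        InputSource source = inputSourceFor(html);

        // With boilerpipe on the classpath, each form can be handed to an extractor:
        //   String a = ArticleExtractor.INSTANCE.getText(html);
        //   String b = ArticleExtractor.INSTANCE.getText(reader);
        //   String c = ArticleExtractor.INSTANCE.getText(source);

        System.out.println("InputSource has a character stream: "
                + (source.getCharacterStream() != null));
        reader.close();
    }
}
```

Passing a Reader or String that you decoded yourself keeps the choice of character encoding in your own hands.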

Building from Source

To work with the source, please either check out a version from the SVN repository or download the tarballs of the latest release. In the latter case, you will need both the binary and the source tarball, since the binary contains the dependency libraries. Just extract them into the same directory.

To build the distribution jars and archives from the sources, just run the build.xml ant script:

   ant

To build only the jar files, run:

   ant jars

Maven Repository

You may download the jar from my local Maven Repository:

http://boilerpipe.googlecode.com/svn/repo/

Customizing

You are invited to improve, customize and extend Boilerpipe. See here for details on how the boilerpipe components work together.

Comment by kkrugler...@transpac.com, Dec 14, 2009

It would be useful to have a "building from source" section. Only issue currently is that you have to create a lib/ sub-dir in the source distribution's directory, then copy over the NekoHTML and Xerces jars from the binary distribution.

Comment by project member ckkohl79, Dec 14, 2009

Added "Building from Source" section -- thanks Ken!

Comment by cepeda.n...@gmail.com, Apr 29, 2011

Is it possible to extract title and content separately? Using ArticleExtractor.INSTANCE.getText returns only one string. Thanks for the help

Comment by litchfie...@gmail.com, May 17, 2011

It would be awesome if you could include a simple command line utility to output the response for a given URL, for those of us who don't speak java

Comment by horia.cr...@gmail.com, Oct 3, 2011

If you could write a detailed step by step tutorial for non Java users it would be very much appreciated. I get "Exception in thread main" when trying to compile and run Oneliner.java. Classpath complexities make it more difficult.

Comment by ajcar1...@gmail.com, Nov 17, 2011

I'll second the request for a command line feature.

Comment by akpr...@gmail.com, Nov 22, 2011

How about adding the latest version to the maven central repo?

Comment by kim.oliv...@googlemail.com, Jan 25, 2012

Third request for a command-line-option! :-) Great work!

Comment by o.m.osma...@gmail.com, Feb 14, 2012

You have duplicate classes in nekohtml-1.9.13.jar and boilerpipe-1.2.0.jar : org.cyberneko.html.HTMLElements.java org.cyberneko.html.HTMLTagBalancer.java

Comment by project member ckkohl79, Feb 15, 2012

Hi o.m.osmanov, yes, the ones included in boilerpipe are the patched versions.

Comment by subd...@gmail.com, Apr 17, 2012

Hi Christian,

Do you have any examples of how to swap nekohtml out and to use Tagsoup instead?

Thanks Rodders

Comment by Michael....@gmail.com, Apr 26, 2012

This page is utterly incomplete. Getting started with Boilerpipe is painful, be prepared for that. It requires all kinds of dependencies that are all optional and never documented. I am over an hour into getting it to work and it still cannot find Neko classes. Thumbs down to the author.

Comment by Michael....@gmail.com, Apr 26, 2012

A list of dependencies that worked for me with Boilerpipe 1.2.0: https://gist.github.com/ada3f0d60e3f175a7362. Documenting things like this is a must for the QuickStart guide.

Comment by chu...@gmail.com, Jul 3, 2012

How do you format the output as JSON?

Comment by thisisno...@gmail.com, Sep 7, 2012

Hello Christian, I wanted to know how to use TextDocumentStat class for simpleEstimator. Particularly, for

TextDocument textDoc = new TextDocument();
boolean contentOnly = false;
TextDocumentStatistics tdBefore = new TextDocumentStatistics(textDoc, contentOnly);
// What should be the value of contentOnly flag here for proper estimation

ArticleExtractor.INSTANCE.process(textDoc);
TextDocumentStatistics tdAfter = new TextDocumentStatistics(textDoc, true);
// I believe here i am supposed to pass it as true boolean

qualityFlag = SimpleEstimator.isLowQuality(tdBefore, tdAfter);

Comment by w...@fig-books.com, Nov 6, 2012

Christian, Boilerpipe is great but the quality of this 'Quick Start' guide is poor.

For those wanting a simpler solution using Python as a wrapper, check out: https://github.com/misja/python-boilerpipe/

Comment by han...@gmail.com, Mar 26, 2013

Hi, I wonder if anyone has tried this library on Wikipedia? What might be the best extractor to use? The default extractor seems to work ok (though it includes "see also" links -- I'm not really sure they count as boilerplate), but I was wondering if I should use a customized extractor? Thanks!

Comment by mobern...@gmail.com, Apr 29, 2013

This version is NOT the same as in the online demo, which works significantly?

Why?

Consider this point before losing time testing this package.

Comment by mobern...@gmail.com, Apr 29, 2013

+: which works significantly better as the version to download?

Comment by mobern...@gmail.com, May 2, 2013

Also, after struggling with UTF8 encoding support, it seems to be somehow. Simply check those 2 utf8 urls with online boilerpipe:

http://boilerpipe-web.appspot.com/

test1: http://www.tecfinance.fr/blog/question-rachat-de-credit/rachat-de-credit-et-anciennete-professionnelle test2: http://www.aidefinanciere.net/regroupement-de-credit-auto-entrepreneur-rachat-de-credit-autoentrepreneur/

One will work, and the other will output some replacement characters.

This issue remains with the version to download after setting utf8 following the instructions given by the package creator.

Comment by masterAc...@gmail.com, Jul 9, 2013

I added the three jars to my classpath. but when I run a simple test code it gives me ClassDefNotFoundError...for any extractor type...i don't get it

Comment by navra...@gmail.com, Oct 5, 2013

I have not had any success with Japanese URLs like: http://d.hatena.ne.jp/mkusunok/20130817/p1 Though the same works perfectly on the web api. I tried all the tricks mentioned in StackOF, like changing code.

Comment by mark...@gmail.com, Apr 21, 2014

If NekoHTML and Xerces are required for boilerpipe then please add them as dependencies in the pom file.

Comment by fccoelho, Aug 30, 2014

A little command-line interface would be very nice

