How to get started
Updated Aug 19, 2010 by ckkohl79


  1. Get a binary and optionally also the source tarball from the Downloads page.
  2. Extract the files somewhere:
        tar xvzf boilerpipe-VERSION-bin.tar.gz
        tar xvzf boilerpipe-VERSION-src.tar.gz
  3. Add boilerpipe-VERSION.jar, nekohtml-1.9.13.jar and xerces-2.9.1.jar to your Java classpath (these jar files are included in the binary tarball).
(Replace VERSION with boilerpipe's version number, e.g. 1.0.3.)
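
If classes still cannot be found after setting the classpath, a small check like the following may help narrow down which jar is missing. This is only a sketch: the class name ClasspathCheck is hypothetical, and the three fully qualified names are assumptions based on the usual package layout of boilerpipe, NekoHTML and Xerces (one representative class per jar), so adjust them if your versions differ.

```java
import java.util.ArrayList;
import java.util.List;

/** Reports which of the listed classes cannot be loaded from the current classpath. */
final class ClasspathCheck {

    // One representative class per jar (assumed names, see the note above).
    static final String[] REQUIRED = {
        "de.l3s.boilerpipe.extractors.ArticleExtractor", // boilerpipe-VERSION.jar
        "org.cyberneko.html.parsers.SAXParser",          // nekohtml-1.9.13.jar
        "org.apache.xerces.parsers.SAXParser"            // xerces-2.9.1.jar
    };

    /** Returns the subset of classNames that is not loadable. */
    static List<String> missing(String... classNames) {
        List<String> result = new ArrayList<>();
        for (String name : classNames) {
            try {
                // initialize=false: only locate the class, do not run static initializers
                Class.forName(name, false, ClasspathCheck.class.getClassLoader());
            } catch (ClassNotFoundException | LinkageError e) {
                // LinkageError covers a class whose own dependencies are absent
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> absent = missing(REQUIRED);
        if (absent.isEmpty()) {
            System.out.println("All required classes found.");
        } else {
            System.out.println("Missing from classpath: " + absent);
        }
    }
}
```

Run it with the same classpath you intend to use for your own program. Note that the NekoHTML parser class depends on Xerces, so a missing Xerces jar will typically cause both entries to be reported.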


Once you have boilerpipe on your classpath, extracting the "main" content from a Web page is really simple:

   import java.net.URL;
   import de.l3s.boilerpipe.extractors.ArticleExtractor;

   URL url = new URL("");
   // NOTE: Use ArticleExtractor unless DefaultExtractor gives better results for you
   String text = ArticleExtractor.INSTANCE.getText(url);

Please also have a look at the demo classes.

Even though there is a DefaultExtractor which should work reasonably well on any type of content, there are other extractors that may be more suitable for particular content scenarios. For example, ArticleExtractor adds some heuristics to extract the main content from a news page (this is the usual scenario for boilerplate removal, so use this unless you are absolutely sure what you are doing). There is also NumWordsRulesExtractor, which resembles the number-of-words-based decision tree presented in the WSDM 2010 paper (Algorithm 2).

There are also other ways to call the Extractor (using Reader, InputSource, String etc.).

See the API Javadocs for details.
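
As a rough illustration of the alternative input types mentioned above, the sketch below prepares a String, a Reader and an InputSource from raw HTML bytes with an explicit character encoding, using only JDK classes. The helper class ExtractorInputs and its method names are hypothetical. The commented-out lines assume that the extractor offers getText overloads for these types, so please check the Javadocs for the exact signatures.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import org.xml.sax.InputSource;

/** Builds the different input forms from a byte stream with a known charset. */
final class ExtractorInputs {

    /** Wraps a byte stream in a Reader that decodes with the given charset. */
    static Reader asReader(InputStream in, Charset charset) {
        return new InputStreamReader(in, charset);
    }

    /** Wraps a byte stream in a SAX InputSource with an explicit encoding. */
    static InputSource asInputSource(InputStream in, Charset charset) {
        InputSource source = new InputSource(in);
        source.setEncoding(charset.name());
        return source;
    }

    /** Reads the whole stream into a String using the given charset. */
    static String asString(InputStream in, Charset charset) {
        StringWriter out = new StringWriter();
        try (Reader reader = asReader(in, charset)) {
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        byte[] html = "<html><body><p>Gr\u00fc\u00dfe</p></body></html>"
                .getBytes(StandardCharsets.UTF_8);
        String decoded = asString(new ByteArrayInputStream(html), StandardCharsets.UTF_8);
        System.out.println(decoded);
        // With boilerpipe on the classpath, each form could then be handed to an
        // extractor (assumed overloads, see the Javadocs):
        //   ArticleExtractor.INSTANCE.getText(decoded);
        //   ArticleExtractor.INSTANCE.getText(asReader(stream, StandardCharsets.UTF_8));
        //   ArticleExtractor.INSTANCE.getText(asInputSource(stream, StandardCharsets.UTF_8));
    }
}
```

Decoding the bytes yourself is mainly useful when the encoding of a page is known in advance and should not be left to detection.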

Building from Source

To work with the source, please either check out a version from the SVN repository or download the tarballs of the latest release. In the latter case, you will need both the binary and the source tarball, since the binary contains the dependency libraries. Just extract them into the same directory.

To build the distribution jars and archives from the sources just run the build.xml ant script:


To build only the jar files, run:

   ant jars

Maven Repository

You may download the jar from my local Maven Repository:


You are invited to improve, customize and extend Boilerpipe. See here for details on how the boilerpipe components work together.

Comment by, Dec 14, 2009

It would be useful to have a "building from source" section. Only issue currently is that you have to create a lib/ sub-dir in the source distribution's directory, then copy over the NekoHTML and Xerces jars from the binary distribution.

Comment by project member ckkohl79, Dec 14, 2009

Added "Building from Source" section -- thanks Ken!

Comment by, Apr 29, 2011

Is it possible to extract title and content separately? ArticleExtractor.INSTANCE.getText returns only one string. Thanks for the help.

Comment by, May 17, 2011

It would be awesome if you could include a simple command line utility to output the response for a given URL, for those of us who don't speak java

Comment by, Oct 3, 2011

If you could write a detailed step-by-step tutorial for non-Java users it would be very much appreciated. I get "Exception in thread main" when trying to compile and run. Classpath complexities make it more difficult.

Comment by, Nov 17, 2011

I'll second the request for a command line feature.

Comment by, Nov 22, 2011

How about adding the latest version to the maven central repo?

Comment by, Jan 25, 2012

Third request for a command-line-option! :-) Great work!

Comment by, Feb 14, 2012

You have duplicate classes in nekohtml-1.9.13.jar and boilerpipe-1.2.0.jar :

Comment by project member ckkohl79, Feb 15, 2012

Hi o.m.osmanov, yes, the ones included in boilerpipe are the patched versions.

Comment by, Apr 17, 2012

Hi Christian,

Do you have any examples of how to swap nekohtml out and to use Tagsoup instead?

Thanks Rodders

Comment by, Apr 26, 2012

This page is utterly incomplete. Getting started with Boilerpipe is painful, be prepared for that. It requires all kinds of dependencies that are all optional and never documented. I am over an hour into getting it to work and it still cannot find Neko classes. Thumbs down to the author.

Comment by, Apr 26, 2012

A list of dependencies that worked for me with Boilerpipe 1.2.0: Documenting things like this is a must for the QuickStart guide.

Comment by, Jul 3, 2012

How do you format the output as JSON?

Comment by, Sep 7, 2012

Hello Christian, I wanted to know how to use TextDocumentStat class for simpleEstimator. Particularly, for

   TextDocument textDoc = new TextDocument();
   boolean contentOnly = false;
   TextDocumentStatistics tdBefore = new TextDocumentStatistics(textDoc, contentOnly);
   // What should be the value of contentOnly flag here for proper estimation

   TextDocumentStatistics tdAfter = new TextDocumentStatistics(textDoc, true);
   // I believe here I am supposed to pass it as true boolean

Comment by, Nov 6, 2012

Christian, Boilerpipe is great but the quality of this 'Quick Start' guide is poor.

For those wanting a simpler solution using Python as a wrapper, check out:

Comment by, Mar 26, 2013

Hi, I wonder if anyone has tried this library on Wikipedia? What might be the best extractor to use? The default extractor seems to work ok (though it includes "see also" links, and I'm not really sure they count as boilerplate), but I was wondering if I should use a customized extractor? Thanks!

Comment by, Apr 29, 2013

This version is NOT the same as in the online demo, which works significantly?


Consider this point before losing time testing this package.

Comment by, Apr 29, 2013

+: which works significantly better than the version to download?

Comment by, May 2, 2013

Also, after struggling with UTF8 encoding support, it seems to be somehow. Simply check those 2 utf8 urls with online boilerpipe:

test1: test2:

One will work, and the other will output some replacement characters.

This issue remains with the version to download after setting utf8 following the instructions given by the package creator.

Comment by, Jul 9, 2013

I added the three jars to my classpath, but when I run a simple test code it gives me NoClassDefFoundError... for any extractor type... I don't get it

Comment by, Oct 5, 2013

I have not had any success with Japanese URLs like: Though the same works perfectly on the web API. I tried all the tricks mentioned on Stack Overflow, like changing code.

Comment by, Apr 21, 2014

If NekoHTML and Xerces are required for boilerpipe then please add them as dependencies in the pom file.

Comment by fccoelho, Aug 30, 2014

A little command-line interface would be very nice
