How to get started
tar xvzf boilerpipe-VERSION-bin.tar.gz
tar xvzf boilerpipe-VERSION-src.tar.gz
(Here, VERSION must be replaced by boilerpipe's version number, e.g. 1.0.3.)
Once you have boilerpipe on your classpath, extracting the "main" content from a Web page is really simple:
import java.net.URL;
import de.l3s.boilerpipe.extractors.ArticleExtractor;

URL url = new URL("http://www.example.com/some-location/index.html");
// NOTE: Use ArticleExtractor unless DefaultExtractor gives better results for you
String text = ArticleExtractor.INSTANCE.getText(url);
Please also have a look at the demo classes.
Even though there is a DefaultExtractor, which should work reasonably well on any type of content, there are other extractors that may be more suitable for particular content scenarios. For example, ArticleExtractor adds some heuristics to extract the main content from a news page (this is the usual scenario for boilerplate removal, so use it unless you are absolutely sure what you are doing). There is also NumWordsRulesExtractor, which resembles the number-of-words-based decision tree presented in the WSDM 2010 paper (Algorithm 2).
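If you are unsure which extractor fits your pages, one way to decide is to run several of them on the same input and compare the results. The following is only a sketch: the class CompareExtractors and the method extractWithAll are made-up names, and it assumes that boilerpipe and its dependency libraries are on the classpath and that each of the three extractors offers an INSTANCE singleton with a getText(String) method (check the Javadocs of your version).

```java
import java.util.LinkedHashMap;
import java.util.Map;

import de.l3s.boilerpipe.BoilerpipeExtractor;
import de.l3s.boilerpipe.BoilerpipeProcessingException;
import de.l3s.boilerpipe.extractors.ArticleExtractor;
import de.l3s.boilerpipe.extractors.DefaultExtractor;
import de.l3s.boilerpipe.extractors.NumWordsRulesExtractor;

// Hypothetical helper class, not part of boilerpipe.
public class CompareExtractors {

    /** Runs three extractors on the same HTML and returns their output keyed by extractor name. */
    static Map<String, String> extractWithAll(String html) throws BoilerpipeProcessingException {
        // LinkedHashMap keeps the extractors in insertion order.
        Map<String, BoilerpipeExtractor> extractors = new LinkedHashMap<String, BoilerpipeExtractor>();
        extractors.put("ArticleExtractor", ArticleExtractor.INSTANCE);
        extractors.put("DefaultExtractor", DefaultExtractor.INSTANCE);
        extractors.put("NumWordsRulesExtractor", NumWordsRulesExtractor.INSTANCE);

        Map<String, String> results = new LinkedHashMap<String, String>();
        for (Map.Entry<String, BoilerpipeExtractor> e : extractors.entrySet()) {
            results.put(e.getKey(), e.getValue().getText(html));
        }
        return results;
    }

    public static void main(String[] args) throws Exception {
        // Placeholder markup; use the HTML of a real page to get meaningful differences.
        String html = "<html><body><p>Replace this with the HTML of a real page.</p></body></html>";
        for (Map.Entry<String, String> e : extractWithAll(html).entrySet()) {
            System.out.println("== " + e.getKey() + " ==");
            System.out.println(e.getValue());
        }
    }
}
```

On very short or artificial markup the extractors may return little or no text, so compare them on pages that are representative of your data.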
There are also other ways to call the Extractor (using Reader, InputSource, String etc.).
See the API Javadocs for details.
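As an illustration of the alternative inputs mentioned above, the sketch below extracts text from HTML that is already in memory, once from a String and once from a Reader. The class and method names (ExtractFromMemory, fromString, fromReader) are hypothetical, and the example assumes boilerpipe and its dependency libraries are on the classpath.

```java
import java.io.Reader;
import java.io.StringReader;

import de.l3s.boilerpipe.BoilerpipeProcessingException;
import de.l3s.boilerpipe.extractors.ArticleExtractor;

// Hypothetical helper class, not part of boilerpipe.
public class ExtractFromMemory {

    /** Extracts the main content from HTML held in a String. */
    static String fromString(String html) throws BoilerpipeProcessingException {
        return ArticleExtractor.INSTANCE.getText(html);
    }

    /** Extracts the main content from any character stream, e.g. a FileReader. */
    static String fromReader(Reader reader) throws BoilerpipeProcessingException {
        return ArticleExtractor.INSTANCE.getText(reader);
    }

    public static void main(String[] args) throws Exception {
        // Placeholder markup; in practice this would come from a file, a cache or an HTTP client.
        String html = "<html><body><p>Replace this with the HTML of a real page.</p></body></html>";

        String viaString = fromString(html);
        String viaReader = fromReader(new StringReader(html));

        // Both calls see the same markup, so they should yield the same text.
        System.out.println(viaString.equals(viaReader));
    }
}
```

Passing a String or Reader is useful when you fetch pages yourself, for example to control timeouts, character encoding or caching, instead of letting the extractor open the URL.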
To work with the source, please either check out a version from the SVN repository or download the tarballs of the latest release. In the latter case, you will need both the binary and the source tarball, since the binary contains the dependency libraries. Just extract them into the same directory.
To build the distribution jars and archives from the sources just run the build.xml ant script:
To just build the jar files run
You may download the jar from my local Maven Repository:
You are invited to improve, customize and extend Boilerpipe. See here for details on how the boilerpipe components work together.