What is HBase-Writer?

HBase-Writer is an extension to Heritrix, the open source crawler written by the Internet Archive (http://crawler.archive.org/). It enables Heritrix to store crawled content directly into HBase tables (http://hbase.org/) running on the Hadoop Distributed FileSystem (http://hadoop.apache.org/core/). HBase-Writer writes crawled content into a given HBase table as individual records, or "rowkeys". In turn, these tables can be processed directly by the MapReduce framework via HBase and Hadoop. HBase-Writer's goal is to facilitate fast, large, distributed crawls using Heritrix and to save and manage Web-scale content using HBase.

News

January 22nd, 2012

HBase-Writer 0.90.4 has now been released. This version is a major bug fix for the previous version, which inadvertently removed resource and connection pooling. Greg Lu was good enough to catch this and nice enough to share the patch with HBase-Writer. A big thank you to Greg Lu for doing great work maintaining the integrity of this plugin. Under the new implementation, HBase clients are reused and connections are closed. Next on the list is to create some unit tests to ensure pooling works as expected.

November 16th, 2011

HBase-Writer 0.90.3 has now been released. This version is a compatibility upgrade to support the latest versions of HBase and Heritrix, which today are HBase-0.90.3 and Heritrix-3.1.0. Many thanks to Karthik MV with Infiniti-Research for submitting a compatibility patch!

March 29th, 2010

HBase-Writer 0.9-SNAPSHOT has now been released. This version is compatible with both Heritrix 2.X and Heritrix 3.X. Many thanks to Greg Lu for spearheading this effort and sending in the initial patch. Once Heritrix has an official 3.0.0-RELEASE, HBase-Writer will release version 0.9-RELEASE. Thanks again, Greg!