Updated Feb 16, 2009 by ryan.justin.smith
README  

This is a processor for heritrix that writes fetched pages to hbase.

Introduction

The layout of this contribution is modeled after Doug Judd's heritrix-hadoop-dfs-processor available off the heritrix home page.

This software is licensed under the LGPL. See accompanying LICENSE.txt document.

The hbase-writer is an extension to Heritrix, the open source crawler written by the Internet Archive (http://crawler.archive.org/). It enables Heritrix to store crawled content directly into HBase tables (http://hbase.org/) running on the Hadoop Distributed FileSystem (http://lucene.apache.org/hadoop/). HBase-Writer writes each crawled page into a given hbase table as a record. Because HBase tables are directly supported by the Map/Reduce framework, Map/Reduce jobs can then be run over the crawled content.

The current version of hbase-writer assumes version 2.0.x of Heritrix, version 0.19.x of Hadoop/HBase and Java 1.6. Newer versions of Hadoop and Heritrix may continue to work with this connector as long as the pertinent APIs have not changed. Just replace the jar files with the newer versions.

SETUP

The following jar files need to be in the ${HERITRIX_HOME}/lib/ directory:

hbase-writer-x.x.x.jar
hbase-x.x.x.jar
hadoop-x.x.x-core.jar
log4j-x.x.x.jar
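
As a quick sanity check on the setup, a small shell function can confirm that one jar of each kind is present. This is only a sketch: the function name check_required_jars is made up for illustration, the glob patterns are guesses based on the jar names listed above, and it assumes the jars belong in ${HERITRIX_HOME}/lib/, as the cp commands later in this README suggest.

```shell
#!/bin/sh
# Hypothetical helper (not part of hbase-writer): report which of the
# four required jars are missing from a lib directory.
# Usage: check_required_jars LIB_DIR
check_required_jars() {
    lib_dir=$1
    missing=0
    # One glob per required jar. 'hbase-[0-9]*' avoids matching the
    # hbase-writer jar; 'hadoop-*core*' accepts both hadoop-x.x.x-core.jar
    # and hadoop-core-x.x.x.jar naming.
    for pattern in 'hbase-writer-*.jar' 'hbase-[0-9]*.jar' \
                   'hadoop-*core*.jar' 'log4j-*.jar'; do
        # Expand the glob; if nothing matches, $1 stays the literal pattern
        # and the -e test below fails.
        set -- "${lib_dir}"/${pattern}
        if [ ! -e "$1" ]; then
            echo "missing jar matching ${pattern} in ${lib_dir}" >&2
            missing=1
        fi
    done
    return "${missing}"
}
```

For example, check_required_jars "${HERITRIX_HOME}/lib" returns a non-zero status and prints one line per missing jar.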

CONFIGURING HERITRIX

On the "Settings for sheet 'global'" page, configure the following hbase-writer settings:

master

The host and port of the hbase master server.

table

Which table to crawl into. The table does not need to exist; it will get auto-created.

only-new-records

Set to "false" by default. In default mode, heritrix will crawl all urls regardless of existing rowkeys (urls). By setting this to "true" you ensure that only new urls (rowkeys) are written to the crawl table. Also, if set to "true", heritrix doesn't download any content that already exists as a record in the hbase table.
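
The effect of only-new-records can be pictured with a toy shell model. This is not how hbase-writer is implemented and it does not touch HBase: a plain text file with one url per line stands in for the table's rowkeys, and the function name should_fetch is invented for this sketch.

```shell
#!/bin/sh
# Toy model of the only-new-records decision (hypothetical, illustration only).
# Usage: should_fetch ONLY_NEW_RECORDS URL EXISTING_KEYS_FILE
# Returns 0 if the url would be downloaded and written, 1 if it is skipped.
should_fetch() {
    only_new_records=$1
    url=$2
    existing_keys_file=$3
    # With only-new-records="true", a url that already exists as a rowkey
    # is neither downloaded nor written again.
    if [ "${only_new_records}" = "true" ] &&
       grep -qxF "${url}" "${existing_keys_file}"; then
        return 1
    fi
    # Default mode ("false"), or a url not seen before: crawl as usual.
    return 0
}
```

With the setting at "false" every url is fetched; with "true" only urls missing from the file are.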

COMPILING THE SOURCE

The source is built using Maven2.

mvn clean compile

BUILDING THE JAR

mvn clean package

The hbase-writer-x.x.x.jar should be in the target/ directory.

OBTAINING THE DEPENDENCY JARS

You can get the hadoop (http://hadoop.apache.org/core/), hbase (http://hbase.org/) and log4j jars by downloading the releases from their project sites. You can also get the hadoop, hbase and log4j dependency jars from your ${HOME}/.m2/repository/ directory after you have built the project using maven. For example:

cp ${HOME}/.m2/repository/org/apache/hadoop/hbase/0.19.0/hbase-0.19.0.jar ${HERITRIX_HOME}/lib/
cp ${HOME}/.m2/repository/org/apache/hadoop/hadoop-core/0.19.0/hadoop-core-0.19.0.jar ${HERITRIX_HOME}/lib/
cp ${HOME}/.m2/repository/log4j/log4j/1.2.15/log4j-1.2.15.jar ${HERITRIX_HOME}/lib/

You can also get them by visiting the hbase-writer archive repository: http://repo1.opensourcemasters.org:8081/nexus/
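
If you copy the jars from the local maven repository more than once, or with different versions, the cp commands above can be wrapped in a small function. This is a sketch only: the function name copy_dependency_jars is made up, and the repository layout is assumed to match the paths shown in the example cp commands.

```shell
#!/bin/sh
# Hypothetical helper: copy the hbase, hadoop-core and log4j jars from a
# local maven repository into a Heritrix lib directory.
# Usage: copy_dependency_jars M2_REPO HERITRIX_LIB HBASE_VER HADOOP_VER LOG4J_VER
copy_dependency_jars() {
    m2_repo=$1
    heritrix_lib=$2
    hbase_version=$3
    hadoop_version=$4
    log4j_version=$5

    # Repository-relative paths, following the cp examples in this README.
    for jar in \
        "org/apache/hadoop/hbase/${hbase_version}/hbase-${hbase_version}.jar" \
        "org/apache/hadoop/hadoop-core/${hadoop_version}/hadoop-core-${hadoop_version}.jar" \
        "log4j/log4j/${log4j_version}/log4j-${log4j_version}.jar"
    do
        # Stop at the first jar that maven has not downloaded yet.
        if [ ! -f "${m2_repo}/${jar}" ]; then
            echo "missing: ${m2_repo}/${jar}" >&2
            return 1
        fi
        cp "${m2_repo}/${jar}" "${heritrix_lib}/" || return 1
    done
}
```

The three cp commands above would then correspond to: copy_dependency_jars "${HOME}/.m2/repository" "${HERITRIX_HOME}/lib" 0.19.0 0.19.0 1.2.15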

UPGRADING TO NEW HADOOP/HBASE/HERITRIX VERSIONS

The hbase-writer project will be keeping up to date with the latest hadoop, hbase and heritrix versions, but if you would like to compile and build hbase-writer against different versions, you are able to do so with maven2 without changing any versioned files.

To build hbase-writer with new versions of hadoop, hbase or heritrix (or any of the dependencies), use a ${HOME}/.m2/settings.xml file.

A sample settings.xml file:

<?xml version="1.0" encoding="UTF-8"?>
<settings>
  <profiles>
    <profile>
      <id>myBuild</id>
      <properties>
        <heritrix.version>2.0.2</heritrix.version>
        <hbase.version>0.18.1</hbase.version>
        <hadoop.version>0.18.1</hadoop.version>
      </properties>
    </profile>
  </profiles>
</settings>

Place this file in your ${HOME}/.m2/ directory and run the maven build command:

mvn clean package -PmyBuild

To verify that you are overriding the properties correctly, run:

mvn help:effective-pom -PmyBuild

This dumps the resolved pom.xml to stdout, where you can check the property values.
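
Checking the effective pom by eye can be replaced with a small grep-based helper. This is a sketch under two assumptions: the output of mvn help:effective-pom has been saved to a file, and each overridden property appears in it as a single <name>value</name> element. The function name check_pom_property is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if POM_FILE contains
# <PROPERTY>EXPECTED_VALUE</PROPERTY> as a literal string.
# Usage: check_pom_property POM_FILE PROPERTY EXPECTED_VALUE
check_pom_property() {
    pom_file=$1
    property=$2
    expected=$3
    # -F treats the pattern as fixed text, so the dots in property names
    # and version numbers are not interpreted as regex wildcards.
    grep -qF "<${property}>${expected}</${property}>" "${pom_file}"
}
```

For example, after running mvn help:effective-pom -PmyBuild > effective-pom.xml, the call check_pom_property effective-pom.xml hbase.version 0.18.1 should succeed if the sample profile above is active.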

BUILDING THE SITE REPORT

mvn clean site

PING BACK

Thanks to Questio for the time and support for allowing the release and maintenance of this project. (http://questio.com)

