Updated Oct 25, 2009 by ryan.justin.smith
README  

This is a processor for heritrix that writes fetched pages to hbase.

Introduction

The layout of this contribution is modeled after Doug Judd's heritrix-hadoop-dfs-processor available off the heritrix home page.

This software is licensed under the LGPL. See accompanying LICENSE.txt document.

hbase-writer is an extension to Heritrix, the open source crawler written by the Internet Archive (http://crawler.archive.org/). It enables Heritrix to store crawled content directly in HBase tables (http://hbase.org/) running on the Hadoop Distributed File System (http://lucene.apache.org/hadoop/). hbase-writer writes crawled content into a given HBase table as records. Because HBase exposes its tables to the Map/Reduce framework, Map/Reduce jobs can then be run directly on the crawled data.

The current version of hbase-writer assumes version 2.0.x of Heritrix, version 0.19.x of Hadoop/HBase and Java 1.6. Newer versions of Hadoop and Heritrix may continue to work with this connector as long as the pertinent APIs have not changed. Just replace the jar files with the newer versions.

SETUP

Copy the following jar files into the Heritrix lib directory (${HERITRIX_HOME}/lib/):

hbase-writer-x.x.x.jar
hbase-x.x.x.jar
zookeeper-x.x.x.jar
hadoop-x.x.x-core.jar
log4j-x.x.x.jar
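
A quick way to confirm that a lib directory contains every jar in the list above is a small shell check. The following is only a sketch: the function name check_jars and the glob patterns are assumptions made for illustration, and the hadoop pattern is deliberately loose so that it matches both hadoop-x.x.x-core.jar and the hadoop-core-x.x.x.jar naming used in the Maven repository.

```shell
#!/bin/sh
# Hypothetical helper (not part of hbase-writer): report which of the
# required jar families are missing from a lib directory.
# Usage: check_jars /path/to/lib
check_jars() {
    lib_dir="$1"
    missing=0
    for pattern in 'hbase-writer-*.jar' 'hbase-[0-9]*.jar' 'zookeeper-*.jar' 'hadoop-*.jar' 'log4j-*.jar'; do
        # $pattern is left unquoted on purpose so the shell expands the glob.
        # If nothing matches, the pattern stays literal and ls fails.
        if ! ls "$lib_dir"/$pattern >/dev/null 2>&1; then
            echo "missing: $pattern"
            missing=1
        fi
    done
    # Exit status 0 when every pattern matched, 1 otherwise.
    return "$missing"
}
```

For example, check_jars "${HERITRIX_HOME}/lib" prints one "missing:" line per absent jar family and returns a non-zero status if any are absent.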

CONFIGURING HERITRIX

On the "Settings for sheet 'global'":

zkquorum

The zookeeper quorum that serves the hbase master address. Since hbase-0.20.0, the master server's address is returned by the zookeeper quorum, so this value is a comma-separated list of the zk quorum hosts, e.g.: zkHost1,zkHost2,zkHost3

table

Which table in HBase to write the crawl to. This table will be created automatically if it doesn't exist, e.g.: Webtable

write-only-new-records

Set to "false" by default. In default mode, heritrix will crawl all urls regardless of existing rowkeys (urls). By setting this to "true" you ensure that only new urls (rowkeys) are written to the crawl table.

process-only-new-records

Set to "false" by default. In default mode, heritrix will process (fetch and parse) all urls regardless of existing rowkeys (urls). By setting this to "true" you ensure that only new urls (rowkeys) are processed by heritrix. In that case heritrix also doesn't download any content that already exists as a record in the hbase table.
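
The difference between write-only-new-records and process-only-new-records can be summarised as a small decision table. The shell function below is purely illustrative and is not hbase-writer code: the name decide_action, its arguments and the printed labels are all made up here, and it reflects one reading of the two descriptions above (an existing row is either not fetched at all, fetched but not written, or fetched and written).

```shell
#!/bin/sh
# Hypothetical illustration only; this is not how hbase-writer is implemented.
# Usage: decide_action WRITE_ONLY_NEW PROCESS_ONLY_NEW ROW_EXISTS
# Each argument is the string "true" or "false".
decide_action() {
    write_only_new="$1"
    process_only_new="$2"
    row_exists="$3"
    if [ "$row_exists" = "true" ] && [ "$process_only_new" = "true" ]; then
        # The url is already a rowkey and processing is limited to new urls.
        echo "skip fetch"
    elif [ "$row_exists" = "true" ] && [ "$write_only_new" = "true" ]; then
        # The url is fetched, but the existing record is not rewritten.
        echo "fetch, skip write"
    else
        # New url, or both options left at their "false" default.
        echo "fetch and write"
    fi
}
```

For example, decide_action true false true prints "fetch, skip write".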

COMPILING THE SOURCE

The source is built using Maven2.

mvn clean compile

BUILDING THE JAR

mvn clean package

The hbase-writer-x.x.x.jar should be in the target/ directory.

OBTAINING THE DEPENDENCY JARS

You can get the hadoop (http://hadoop.apache.org/core/), hbase (http://hbase.org/) and log4j jars by downloading the releases from their respective sites. You can also copy the dependency jars from your ${HOME}/.m2/repository/ directory after you have built the project using maven. For example:

cp ${HOME}/.m2/repository/org/apache/hadoop/hbase/0.20.1/hbase-0.20.1.jar ${HERITRIX_HOME}/lib/
cp ${HOME}/.m2/repository/org/apache/hadoop/zookeeper/3.2.1/zookeeper-3.2.1.jar ${HERITRIX_HOME}/lib/
cp ${HOME}/.m2/repository/org/apache/hadoop/hadoop-core/0.20.1/hadoop-core-0.20.1.jar ${HERITRIX_HOME}/lib/
cp ${HOME}/.m2/repository/log4j/log4j/1.2.15/log4j-1.2.15.jar ${HERITRIX_HOME}/lib/

You can also get them by visiting the hbase-writer archive repository: http://repo1.opensourcemasters.org:8081/nexus/

UPGRADING TO NEW HADOOP/HBASE/HERITRIX VERSIONS

The hbase-writer project will be keeping up to date with the latest hadoop, hbase and heritrix versions, but if you would like to compile and build hbase-writer against different versions, you are able to do so with maven2 without changing any versioned files.

To build hbase-writer with new versions of hadoop, hbase or heritrix (or any of the dependencies), use a ${HOME}/.m2/settings.xml file.

A sample settings.xml file:

<?xml version="1.0" encoding="UTF-8"?>
<settings>
  <profiles>
    <profile>
      <id>myBuild</id>
      <properties>
        <heritrix.version>2.0.2</heritrix.version>
        <hbase.version>0.20.2</hbase.version>
        <hadoop.version>0.20.2</hadoop.version>
        <zookeeper.version>3.2.1</zookeeper.version>
      </properties>
    </profile>
  </profiles>
</settings>

Place this file in your ${HOME}/.m2/ directory and run the maven build command:

mvn clean package -PmyBuild

To check the result, run:

mvn help:effective-pom -PmyBuild

This dumps the resolved pom.xml to stdout, where you can verify that the properties are being overridden correctly.
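
The resolved pom.xml can be long, so it may help to filter it down to the overridden properties. The sketch below assumes the project's pom.xml defines the same property names used in the sample settings.xml; the function name show_versions is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: keep only the lines of an effective pom (read from
# stdin) that carry one of the version properties from the sample profile.
show_versions() {
    grep -E '<(heritrix|hbase|hadoop|zookeeper)\.version>'
}
```

Usage: mvn help:effective-pom -PmyBuild | show_versions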

BUILDING THE SITE REPORT

mvn clean site


