README
This is a processor for Heritrix that writes fetched pages to HBase.

INTRODUCTION

The layout of this contribution is modeled after Doug Judd's heritrix-hadoop-dfs-processor, available from the Heritrix home page. This software is licensed under the LGPL. See the accompanying LICENSE.txt document.

The hbase-writer is an extension to the Heritrix open source crawler written by the Internet Archive (http://crawler.archive.org/). It enables Heritrix to store crawled content directly in HBase tables (http://hbase.org/) running on the Hadoop Distributed FileSystem (http://lucene.apache.org/hadoop/). hbase-writer writes crawled content into a given HBase table as records. Because HBase tables are directly supported by the Map/Reduce framework, Map/Reduce jobs can then be run over the crawled records.

The current version of hbase-writer assumes version 2.0.x of Heritrix, version 0.19.x of Hadoop/HBase and Java 1.6. Newer versions of Hadoop and Heritrix may continue to work with this connector as long as the pertinent APIs have not changed. In that case, just replace the jar files with the newer versions.

SETUP
The following jars need to be present in ${HERITRIX_HOME}/lib/:

    hbase-writer-x.x.x.jar
    hbase-x.x.x.jar
    zookeeper-x.x.x.jar
    hadoop-x.x.x-core.jar
    log4j-x.x.x.jar
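
As a quick sanity check for the jar list above, the following is a minimal sketch of a shell helper that reports which of the required jars are missing from a lib directory. The function names has_jar and check_heritrix_jars are hypothetical (they are not part of hbase-writer or Heritrix), and the sketch assumes jar file names carry a numeric version. It accepts both hadoop-x.x.x-core.jar and the Maven-style hadoop-core-x.x.x.jar name.

```shell
#!/bin/sh
# Hypothetical helper, not shipped with hbase-writer.
# has_jar DIR GLOB... : succeeds if any file in DIR matches one of the globs.
has_jar() {
    dir="$1"
    shift
    for pattern in "$@"; do
        # $pattern is left unquoted on purpose so the shell expands the glob.
        for jar in "$dir"/$pattern; do
            if [ -e "$jar" ]; then
                return 0
            fi
        done
    done
    return 1
}

# check_heritrix_jars DIR : prints one "missing: NAME" line per absent jar
# and returns non-zero if anything is missing.
check_heritrix_jars() {
    lib="$1"
    rc=0
    has_jar "$lib" 'hbase-writer-[0-9]*.jar' || { echo "missing: hbase-writer"; rc=1; }
    has_jar "$lib" 'hbase-[0-9]*.jar'        || { echo "missing: hbase"; rc=1; }
    has_jar "$lib" 'zookeeper-[0-9]*.jar'    || { echo "missing: zookeeper"; rc=1; }
    has_jar "$lib" 'hadoop-[0-9]*-core.jar' 'hadoop-core-[0-9]*.jar' \
                                             || { echo "missing: hadoop"; rc=1; }
    has_jar "$lib" 'log4j-[0-9]*.jar'        || { echo "missing: log4j"; rc=1; }
    return "$rc"
}
```

Assuming HERITRIX_HOME is set, it could be run as: check_heritrix_jars "${HERITRIX_HOME}/lib"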

CONFIGURING HERITRIX

On the "Settings for sheet 'global'":

zkquorum
    The ZooKeeper quorum that serves the HBase master address. Since hbase-0.20.0, the master server's address is returned by the ZooKeeper quorum, so this value is a comma-separated list of the quorum hosts, e.g.: zkHost1,zkHost2,zkHost3

table
    The HBase table to write the crawl to. This table is created automatically if it doesn't exist, e.g.: Webtable

write-only-new-records
    Set to "false" by default. In default mode, Heritrix will crawl all URLs regardless of existing rowkeys (URLs). Setting this to "true" ensures that only new URLs (rowkeys) are written to the crawl table.

process-only-new-records
    Set to "false" by default. In default mode, Heritrix will process (fetch and parse) all URLs regardless of existing rowkeys (URLs). Setting this to "true" ensures that only new URLs (rowkeys) are processed by Heritrix. When set to "true", Heritrix also does not download any content that already exists as a record in the HBase table.

COMPILING THE SOURCE

The source is built using Maven2.

    mvn clean compile

BUILDING THE JAR

    mvn clean package

The hbase-writer-x.x.x.jar should be in the target/ directory.

OBTAINING THE DEPENDENCY JARS

You can get the Hadoop (http://hadoop.apache.org/core/), HBase (http://hbase.org/) and log4j jars by downloading the releases from these two sites. You can also get the Hadoop, HBase and log4j dependency jars from your ${HOME}/.m2/repository/ directory after you have built the project using Maven.
For example:

    cp ${HOME}/.m2/repository/org/apache/hadoop/hbase/0.20.1/hbase-0.20.1.jar ${HERITRIX_HOME}/lib/
    cp ${HOME}/.m2/repository/org/apache/hadoop/zookeeper/3.2.1/zookeeper-3.2.1.jar ${HERITRIX_HOME}/lib/
    cp ${HOME}/.m2/repository/org/apache/hadoop/hadoop-core/0.20.1/hadoop-core-0.20.1.jar ${HERITRIX_HOME}/lib/
    cp ${HOME}/.m2/repository/log4j/log4j/1.2.15/log4j-1.2.15.jar ${HERITRIX_HOME}/lib/

You can also get them by visiting the hbase-writer archive repository: http://repo1.opensourcemasters.org:8081/nexus/

UPGRADING TO NEW HADOOP/HBASE/HERITRIX VERSIONS

The hbase-writer project will keep up to date with the latest Hadoop, HBase and Heritrix versions. If you would like to compile and build hbase-writer against different versions, you can do so with Maven2 without changing any versioned files. To build hbase-writer with new versions of Hadoop, HBase or Heritrix (or any of the dependencies), use a ${HOME}/.m2/settings.xml file.

A sample settings.xml file:

    <?xml version="1.0" encoding="UTF-8"?>

Place this file in your ${HOME}/.m2/ directory and run the Maven build command:

    mvn clean package -PmyBuild

By typing the command:

    mvn help:effective-pom -PmyBuild

you will get the resolved pom.xml dumped to stdout. Here you can verify that you are overriding the properties correctly.

BUILDING THE SITE REPORT

    mvn clean site
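
For the settings.xml override described under UPGRADING TO NEW HADOOP/HBASE/HERITRIX VERSIONS, a profile matching the -PmyBuild flag might look roughly like the sketch below. Only the profile id myBuild is taken from the commands above. The property names hbase.version and hadoop.version are assumptions for illustration and need to be replaced with the property names actually defined in the project's pom.xml. The version values simply reuse those from the cp example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<settings>
  <profiles>
    <profile>
      <!-- Activated on the command line with -PmyBuild -->
      <id>myBuild</id>
      <properties>
        <!-- Hypothetical property names: check pom.xml for the real ones -->
        <hbase.version>0.20.1</hbase.version>
        <hadoop.version>0.20.1</hadoop.version>
      </properties>
    </profile>
  </profiles>
</settings>
```

Running mvn help:effective-pom -PmyBuild afterwards is a reasonable way to check that the overridden values appear in the resolved pom.xml.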