Crowl


A Hadoop-based web crawler and data mining tool

The project focuses on creating a scalable crawler with robust data mining and machine learning capabilities.

News

5th July 2011: Crowl 0.11 Released

19th May 2011: Crowl 0.1 Released

Introduction

At present, Crowl can only crawl RSS and Atom feeds. It does not yet have distributed crawling capabilities, and its machine learning functionality is still limited.
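
As background, the sketch below shows the core step such a crawler performs: fetching and parsing a single feed with the ROME library, which handles both RSS and Atom. Using ROME here is an assumption for illustration only; it is not confirmed as Crowl's internal parser:

    import com.rometools.rome.feed.synd.SyndEntry;
    import com.rometools.rome.feed.synd.SyndFeed;
    import com.rometools.rome.io.SyndFeedInput;
    import com.rometools.rome.io.XmlReader;
    import java.net.URL;

    public class FeedFetchExample {
        public static void main(String[] args) throws Exception {
            String url = "http://news.google.com/news?pz=1&cf=all&ned=us&hl=en&output=rss";
            // Download and parse the feed; ROME detects RSS vs. Atom automatically
            SyndFeed feed = new SyndFeedInput().build(new XmlReader(new URL(url)));
            // Print the title and link of each entry in the feed
            for (SyndEntry entry : feed.getEntries()) {
                System.out.println(entry.getTitle() + " -> " + entry.getLink());
            }
        }
    }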

Crowl 0.11 has the following features:

  • It can crawl a single feed URL, e.g. "http://news.google.com/news?pz=1&cf=all&ned=us&hl=en&output=rss"
  • It can crawl a collection of feed URLs passed as a List of URLs
  • It extracts the HTML-free content of a feed entry
  • It extracts the image URLs found in a feed
  • It can store the crawled feeds in MongoDB
  • It can store the image thumbnails found in feeds in a local directory specified through config.properties
  • config.properties can be used to specify image thumbnail dimensions and MongoDB server properties (see the sketch after this list)
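
A minimal sketch of reading such settings with java.util.Properties follows. The key names (mongo.host, mongo.port, thumbnail.width, thumbnail.height, thumbnail.dir) are illustrative assumptions, not Crowl's documented keys:

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.Properties;

    public class ConfigExample {
        public static void main(String[] args) throws IOException {
            Properties config = new Properties();
            // Load crawler settings from config.properties in the working directory
            try (FileInputStream in = new FileInputStream("config.properties")) {
                config.load(in);
            }

            // MongoDB server properties (key names are assumptions)
            String mongoHost = config.getProperty("mongo.host", "localhost");
            int mongoPort = Integer.parseInt(config.getProperty("mongo.port", "27017"));

            // Image thumbnail settings (key names are assumptions)
            int thumbWidth = Integer.parseInt(config.getProperty("thumbnail.width", "100"));
            int thumbHeight = Integer.parseInt(config.getProperty("thumbnail.height", "100"));
            String thumbDir = config.getProperty("thumbnail.dir", "thumbnails");

            System.out.printf("MongoDB at %s:%d, %dx%d thumbnails stored in %s%n",
                    mongoHost, mongoPort, thumbWidth, thumbHeight, thumbDir);
        }
    }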

Getting Started

Find It Here

Future Work (0.2 release)

  • Provide Javadocs
  • Better exception handling
  • Implement a URL revisit policy based on feed change rate (see the sketch after this list)
  • Group similar feed items (partially implemented in 0.11)
  • Respect the robots.txt policy (partially implemented in 0.11)
  • Support focused (topical) crawling
  • Include jsoup
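
One possible shape for such a revisit policy is sketched below: the revisit interval shrinks for feeds whose fetches usually contain new entries and grows for static ones. The class and method names are hypothetical, and this is not part of Crowl 0.11:

    import java.time.Duration;
    import java.time.Instant;

    // Hypothetical sketch of a change-rate-based revisit policy;
    // feeds that change often are revisited sooner.
    public class RevisitPolicy {
        private static final Duration MIN_INTERVAL = Duration.ofMinutes(15);
        private static final Duration MAX_INTERVAL = Duration.ofHours(24);

        // checks  = number of recent fetches of this feed
        // changes = how many of those fetches contained new entries
        public static Instant nextVisit(Instant lastVisit, int checks, int changes) {
            // Estimated probability that a fetch finds new content
            double changeRate = (checks == 0) ? 1.0 : (double) changes / checks;
            // Interpolate the interval inversely with the change rate:
            // changeRate 1.0 -> MIN_INTERVAL, changeRate 0.0 -> MAX_INTERVAL
            long span = MAX_INTERVAL.toMillis() - MIN_INTERVAL.toMillis();
            long millis = MIN_INTERVAL.toMillis() + (long) (span * (1.0 - changeRate));
            return lastVisit.plus(Duration.ofMillis(millis));
        }
    }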


Project Information

Labels:
hadoop, data-mining, machine-learning, crawler