
CRAWL-E is a web crawling framework that seamlessly supports distributed crawling across multiple threads as well as multiple machines.

CRAWL-E was designed to crawl the web as fast as possible with as little development time as possible. It is only a framework, and requires the development of a Handler module in order to function.

The CRAWL-E developers are very familiar with how TCP and HTTP work and, using that knowledge, have written a web crawler intended to maximize TCP throughput. This benefit is realized when crawling web servers that support persistent HTTP connections: numerous requests are made over a single TCP connection, which avoids repeated connection setup and thus increases throughput.
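
To illustrate what a persistent connection buys, here is a minimal sketch using only the Python standard library. It is not CRAWL-E's code or API; the names EchoHandler, fetch_all, and demo are hypothetical and exist only for this example. A throwaway local server stands in for a remote web server, and the client sends several GET requests over one TCP connection, recording the local socket port each time to show that the connection is reused.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoHandler(BaseHTTPRequestHandler):
    """Hypothetical test server handler: replies with the requested path."""

    # Speaking HTTP/1.1 lets the standard-library server keep connections open.
    protocol_version = "HTTP/1.1"

    def do_GET(self):
        body = self.path.encode("utf-8")
        self.send_response(200)
        # An explicit Content-Length tells the client where the response ends,
        # which is required for the connection to be reused.
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def fetch_all(host, port, paths):
    """Fetch every path over one TCP connection.

    Returns a list of (status, body, local_port) tuples.
    """
    conn = http.client.HTTPConnection(host, port, timeout=5)
    results = []
    try:
        for path in paths:
            conn.request("GET", path)
            resp = conn.getresponse()
            # The body must be read fully before the next request is sent.
            body = resp.read().decode("utf-8")
            local_port = conn.sock.getsockname()[1]
            results.append((resp.status, body, local_port))
    finally:
        conn.close()
    return results


def demo():
    """Start a local server on a free port and fetch three paths from it."""
    server = HTTPServer(("127.0.0.1", 0), EchoHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return fetch_all("127.0.0.1", server.server_address[1], ["/a", "/b", "/c"])
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    for status, body, local_port in demo():
        print(status, body, local_port)
```

Every tuple returned by demo() carries the same local port, meaning all three requests travelled over a single TCP connection. A client that opened a new connection per request would pay a fresh TCP handshake each time.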

CRAWL-E also supports multiple HTTP request methods, the most basic being GET, POST, PUT, DELETE, and HEAD.

CRAWL-E has been utilized in the data collection of:









Hosted by Google Code