CRAWL-E is a web crawling framework that seamlessly supports distributed crawling across multiple threads as well as multiple machines.
CRAWL-E was designed to crawl the web as fast as possible with as little development time as possible. It is only a framework: it requires you to develop a Handler module in order to function.
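To illustrate the division of labor between a crawling framework and a user-written handler, here is a minimal, self-contained sketch. It does not use CRAWL-E and does not reflect its actual interface: the `Handler` base class, its `process` method, the `crawl` driver, and the in-memory `FAKE_PAGES` table are all hypothetical names chosen for this example.

```python
import re
from collections import deque

# Hypothetical stand-in for the network: URL -> page body.
FAKE_PAGES = {
    "http://site.test/": '<a href="http://site.test/a">a</a> <a href="http://site.test/b">b</a>',
    "http://site.test/a": '<a href="http://site.test/b">b</a>',
    "http://site.test/b": "no links here",
}


class Handler:
    """Hypothetical handler interface: the user supplies the page logic."""

    def process(self, url, body):
        """Inspect one fetched page and return the URLs to crawl next."""
        raise NotImplementedError


class LinkFollowingHandler(Handler):
    """Example handler: records page sizes and follows every href it finds."""

    def __init__(self):
        self.page_sizes = {}

    def process(self, url, body):
        self.page_sizes[url] = len(body)
        return re.findall(r'href="([^"]+)"', body)


def crawl(seed_urls, handler, fetch):
    """Framework side (sketch): fetch pages, delegate to the handler, avoid repeats."""
    queue = deque(seed_urls)
    seen = set(seed_urls)
    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)
        for new_url in handler.process(url, fetch(url)):
            if new_url not in seen:
                seen.add(new_url)
                queue.append(new_url)
    return visited


if __name__ == "__main__":
    handler = LinkFollowingHandler()
    print(crawl(["http://site.test/"], handler, FAKE_PAGES.__getitem__))
```

In this sketch the driver owns fetching and queueing, while everything specific to a site lives in the handler subclass. That separation is the general idea behind a handler-based framework; consult the CRAWL-E source for the real class and method names.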
The CRAWL-E developers are very familiar with how TCP and HTTP work and, using that knowledge, have written a web crawler intended to maximize TCP throughput. This benefit is realized when crawling web servers that support persistent HTTP connections, because numerous requests are made over a single TCP connection, which avoids repeated connection setup and increases throughput.
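The effect of a persistent connection can be sketched with Python's standard library alone. This is not CRAWL-E's code: `EchoPathHandler`, `fetch_all`, and `demo` are hypothetical names, and the local test server exists only so the example runs without network access. The sketch sends several GET requests through one `http.client.HTTPConnection` and records the client's local port after each request; a single port means every request reused the same TCP connection.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class EchoPathHandler(BaseHTTPRequestHandler):
    """Tiny local server used only to make the sketch self-contained."""

    protocol_version = "HTTP/1.1"  # lets the server keep connections open

    def do_GET(self):
        body = self.path.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence per-request logging


def fetch_all(host, port, paths):
    """Fetch every path over one connection; return (bodies, local ports used)."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    bodies = []
    local_ports = set()
    try:
        for path in paths:
            conn.request("GET", path)
            # A new TCP connection would show up here as a new local port.
            local_ports.add(conn.sock.getsockname()[1])
            response = conn.getresponse()
            bodies.append(response.read().decode("utf-8"))
    finally:
        conn.close()
    return bodies, local_ports


def demo():
    """Start the local server, fetch three paths, then shut the server down."""
    server = HTTPServer(("127.0.0.1", 0), EchoPathHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        return fetch_all("127.0.0.1", server.server_address[1], ["/a", "/b", "/c"])
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    bodies, ports = demo()
    print(bodies, "distinct local ports:", len(ports))
```

Each response body must be read in full before the next request is sent on the same connection, which is why `response.read()` is called inside the loop.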
CRAWL-E also supports multiple HTTP request methods, the most basic being GET, POST, PUT, DELETE, and HEAD.
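As a rough illustration of what selecting a request method looks like in code, the following sketch builds request objects with Python's standard `urllib.request.Request`. It is independent of CRAWL-E; `BASIC_METHODS`, `build_request`, and the `http://example.com/` URL are assumptions made for the example, and no network traffic is sent.

```python
import urllib.request

# The five methods named in the text above.
BASIC_METHODS = ("GET", "POST", "PUT", "DELETE", "HEAD")


def build_request(url, method, body=None):
    """Return an unsent Request for one of the basic HTTP methods."""
    if method not in BASIC_METHODS:
        raise ValueError("unsupported method: %r" % (method,))
    return urllib.request.Request(url, data=body, method=method)


if __name__ == "__main__":
    for name in BASIC_METHODS:
        # Only POST and PUT carry a payload in this sketch.
        payload = b"key=value" if name in ("POST", "PUT") else None
        request = build_request("http://example.com/", name, payload)
        print(request.get_method(), request.full_url)
```

HEAD is often useful in a crawler because it retrieves only the response headers, which allows checking a resource's type or size before downloading it.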
CRAWL-E has been used to collect data for the following publications:
- Can Social Networks Improve e-Commerce: a Study on Social Marketplaces, Gayatri Swamynathan, Christo Wilson, Bryce Boe, Kevin C. Almeroth and Ben Y. Zhao, WOSN'08
- User Interactions in Social Networks and their Implications, Christo Wilson, Bryce Boe, Alessandra Sala, Krishna P. N. Puttaswamy and Ben Y. Zhao, EuroSys'09