Code license: Apache License 2.0
Labels: Crawler, Java, WebCrawler
Project owners:
  ganjisaffar

Crawler4j is a Java library that provides a simple interface for crawling the web. Using it, you can set up a multi-threaded web crawler in 5 minutes!

Sample Usage

First, you need to create a crawler class that extends WebCrawler. This class decides which URLs should be crawled and handles the downloaded page. The following is a sample implementation:

import java.util.ArrayList;
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

	// URLs ending in these extensions (stylesheets, scripts, images,
	// audio, video, PDFs, archives) are skipped.
	private static final Pattern FILTERS = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g"
		+ "|png|tiff?|mid|mp2|mp3|mp4"
		+ "|wav|avi|mov|mpeg|ram|m4v|pdf"
		+ "|rm|smil|wmv|swf|wma|zip|rar|gz))$");

	// Decides whether the given URL should be crawled.
	@Override
	public boolean shouldVisit(WebURL url) {
		String href = url.getURL().toLowerCase();
		if (FILTERS.matcher(href).matches()) {
			return false;
		}
		// Only follow links under the seed site.
		return href.startsWith("http://ics.uci.edu/");
	}

	// Handles a downloaded page.
	@Override
	public void visit(Page page) {
		int docid = page.getWebURL().getDocid();
		String url = page.getWebURL().getURL();
		String text = page.getText();
		ArrayList<WebURL> links = page.getURLs();
		// Process docid, url, text and links here.
	}
}

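As can be seen in the above code, there are two main methods that should be overridden: shouldVisit, which decides whether a given URL should be crawled, and visit, which handles a downloaded page and has access to its docid, URL, text and links.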
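
To see which URLs the sample shouldVisit logic accepts, the following minimal sketch applies the same extension filter and prefix check to plain strings. It deliberately avoids the crawler4j classes so that it runs on its own; the class name UrlFilterSketch and its static shouldVisit(String) helper are hypothetical and are not part of the library.

```java
import java.util.regex.Pattern;

// Stand-alone sketch: mirrors the decision made in MyCrawler.shouldVisit,
// but takes a raw URL string instead of crawler4j's WebURL.
final class UrlFilterSketch {

    // Same extension filter as in the sample crawler.
    private static final Pattern FILTERS = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g"
            + "|png|tiff?|mid|mp2|mp3|mp4"
            + "|wav|avi|mov|mpeg|ram|m4v|pdf"
            + "|rm|smil|wmv|swf|wma|zip|rar|gz))$");

    // Hypothetical helper, not a crawler4j method.
    static boolean shouldVisit(String rawUrl) {
        String href = rawUrl.toLowerCase();
        // Reject URLs that end in a filtered extension.
        if (FILTERS.matcher(href).matches()) {
            return false;
        }
        // Accept only URLs under the seed site.
        return href.startsWith("http://ics.uci.edu/");
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://ics.uci.edu/index.html",
            "http://ics.uci.edu/logo.PNG",
            "http://www.ics.uci.edu/",
            "http://example.com/page"
        };
        for (String s : samples) {
            System.out.println(shouldVisit(s) + "  " + s);
        }
    }
}
```

Only the first sample is accepted. The second ends in a filtered extension once lower-cased, and the last two do not start with the exact prefix http://ics.uci.edu/, so a site that also serves pages from a www host would need an additional prefix check.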

You should also implement a controller class that specifies the seeds of the crawl, the folder in which crawl data should be stored, and the number of concurrent threads:

import edu.uci.ics.crawler4j.crawler.CrawlController;

public class Controller {
	public static void main(String[] args) throws Exception {
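		// Folder in which crawl data is stored.
		CrawlController controller = new CrawlController("/data/crawl/root");
		// Seed URL from which the crawl starts.
		controller.addSeed("http://ics.uci.edu/");
		// Start the crawl with 10 concurrent crawler threads.
		controller.start(MyCrawler.class, 10);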
	}
}

Dependencies

The following libraries are used in the implementation of crawler4j. To make life easier, all of them are bundled in the package:
