Featured downloads:
crawler4j-1.0.5.jar crawler4j-dependencies-lib.zip crawler4j-example-src-1.0.4.zip
Crawler4j is a Java library that provides a simple interface for crawling the web. With it, you can set up a multi-threaded web crawler in 5 minutes!
Sample Usage
First, you need to create a crawler class that extends WebCrawler. This class decides which URLs should be crawled and handles the downloaded page. The following is a sample implementation:
import java.util.ArrayList;
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.url.WebURL;

public class MyCrawler extends WebCrawler {

    // URLs ending in one of these extensions (stylesheets, scripts, media, archives) are skipped.
    Pattern filters = Pattern.compile(".*(\\.(css|js|bmp|gif|jpe?g"
            + "|png|tiff?|mid|mp2|mp3|mp4"
            + "|wav|avi|mov|mpeg|ram|m4v|pdf"
            + "|rm|smil|wmv|swf|wma|zip|rar|gz))$");

    public MyCrawler() {
    }

    // Decides whether the given URL should be crawled.
    public boolean shouldVisit(WebURL url) {
        String href = url.getURL().toLowerCase();
        if (filters.matcher(href).matches()) {
            return false;
        }
        // Only follow links under http://ics.uci.edu/.
        if (href.startsWith("http://ics.uci.edu/")) {
            return true;
        }
        return false;
    }

    // Called after a page has been downloaded successfully.
    public void visit(Page page) {
        int docid = page.getWebURL().getDocid();
        String url = page.getWebURL().getURL();
        String text = page.getText();
        ArrayList<WebURL> links = page.getURLs();
    }
}
As the code above shows, there are two main methods that should be overridden:
- shouldVisit: This method decides whether the given URL should be crawled or not.
- visit: This method is called after the content of a URL has been downloaded successfully. You can easily get the text, links, URL, and docid of the downloaded page.
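To see which URLs the sample shouldVisit would accept without starting a crawl, the same two checks can be tried in isolation. The following is a minimal sketch, not part of crawler4j: the class name UrlFilterSketch and the constant SCOPE_PREFIX are hypothetical, and the method takes a plain String instead of a WebURL so that it depends only on the JDK.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of crawler4j. It repeats the two checks from
// MyCrawler.shouldVisit on a plain String so they can be run on their own.
final class UrlFilterSketch {

    // Same extension filter as in the MyCrawler sample.
    private static final Pattern FILTERS = Pattern.compile(
            ".*(\\.(css|js|bmp|gif|jpe?g"
            + "|png|tiff?|mid|mp2|mp3|mp4"
            + "|wav|avi|mov|mpeg|ram|m4v|pdf"
            + "|rm|smil|wmv|swf|wma|zip|rar|gz))$");

    // Crawl scope used by the sample; change it for your own site.
    private static final String SCOPE_PREFIX = "http://ics.uci.edu/";

    static boolean shouldVisit(String url) {
        String href = url.toLowerCase();
        // Reject URLs that end in a filtered extension.
        if (FILTERS.matcher(href).matches()) {
            return false;
        }
        // Accept only URLs inside the crawl scope.
        return href.startsWith(SCOPE_PREFIX);
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://ics.uci.edu/about/",
            "http://ics.uci.edu/logo.PNG",
            "http://www.example.com/",
            "http://ics.uci.edu/style.css?v=2"
        };
        for (String sample : samples) {
            System.out.println(sample + " -> " + shouldVisit(sample));
        }
    }
}
```

Because the pattern is anchored at the end of the string, a filtered extension followed by a query string (the last sample) is not rejected. If that matters for your crawl, you may want to strip the query part before matching.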
You should also implement a controller class that specifies the seeds of the crawl, the folder in which crawl data should be stored, and the number of concurrent threads:
import edu.uci.ics.crawler4j.crawler.CrawlController;

public class Controller {
    public static void main(String[] args) throws Exception {
        CrawlController controller = new CrawlController("/data/crawl/root");
        controller.addSeed("http://ics.uci.edu/");
        controller.start(MyCrawler.class, 10);
    }
}
Dependencies
The following libraries are used in the implementation of crawler4j. To make life easier, all of them are bundled in the package: