Enhance parallel execution #16

nicosensei · 2013-09-22T12:39:50Z

Right now, each crawler thread runs the full crawl workflow:

Fetch a number of URLs from the frontier
For each candidate URL, process it:
- HTTP GET
- test whether it should be visited
- if so, parse it
- extract outgoing links and schedule them
- visit, e.g. process the payload

I believe separating the steps of URL processing would improve crawl speed (that's basically the approach Internet Archive's Heritrix takes):

Have configurable thread pools for:

- executing the HTTP request
- parsing
- link extraction and frontier scheduling
- visiting
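A rough sketch of what one pool per stage could look like, using only `java.util.concurrent`. Everything named here (`StagedCrawlSketch`, `fetch`, `parse`, `extractLinks`, `visit`, the pool sizes) is hypothetical and not taken from the existing code; the stage bodies are placeholders, and the "should visit" test is left out for brevity.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StagedCrawlSketch {
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]+)\"");

    // One pool per stage; the sizes are illustrative and would come from configuration.
    private final ExecutorService fetchPool = Executors.newFixedThreadPool(8);
    private final ExecutorService parsePool = Executors.newFixedThreadPool(2);
    private final ExecutorService linkPool = Executors.newFixedThreadPool(2);
    private final ExecutorService visitPool = Executors.newFixedThreadPool(2);

    // Stand-ins for the real frontier and for whatever the visit step produces.
    final ConcurrentLinkedQueue<String> frontier = new ConcurrentLinkedQueue<>();
    final ConcurrentLinkedQueue<String> visited = new ConcurrentLinkedQueue<>();

    // Placeholder for the HTTP GET: returns a fake page with one outgoing link.
    String fetch(String url) {
        return "<a href=\"" + url + "/next\">next</a>";
    }

    // Placeholder for parsing: a real parser would build a document model.
    String parse(String body) {
        return body.trim();
    }

    // Collects every href="..." value found in the parsed document.
    List<String> extractLinks(String doc) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(doc);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    // Placeholder for payload processing: only records that the URL was visited.
    void visit(String url, String doc) {
        visited.add(url);
    }

    // Chains the stages so that each one runs on its own pool.
    CompletableFuture<Void> process(String url) {
        CompletableFuture<String> parsed = CompletableFuture
                .supplyAsync(() -> fetch(url), fetchPool)
                .thenApplyAsync(this::parse, parsePool);
        // Link scheduling and visiting both depend only on the parsed document,
        // so they can proceed independently of each other.
        CompletableFuture<Void> scheduled =
                parsed.thenAcceptAsync(doc -> frontier.addAll(extractLinks(doc)), linkPool);
        CompletableFuture<Void> visit =
                parsed.thenAcceptAsync(doc -> visit(url, doc), visitPool);
        return CompletableFuture.allOf(scheduled, visit);
    }

    void shutdown() {
        fetchPool.shutdown();
        parsePool.shutdown();
        linkPool.shutdown();
        visitPool.shutdown();
    }

    public static void main(String[] args) {
        StagedCrawlSketch crawler = new StagedCrawlSketch();
        crawler.process("http://example.org").join();
        System.out.println("frontier=" + crawler.frontier + " visited=" + crawler.visited);
        crawler.shutdown();
    }
}
```

With this layout a slow download only occupies a fetch thread, so the fetch pool can be made much larger than the CPU-bound pools without affecting them.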

This involves having a shared component to store the HTTP response contents between the moment they are downloaded and the moment they have been visited. My initial guess is that Berkeley DB looks like a damn good candidate ;-)
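To make the shared component concrete, here is a minimal sketch of the contract it might expose. `ResponseStore` and `InMemoryResponseStore` are hypothetical names, and the in-memory map is only a stand-in: a Berkeley DB backed implementation would sit behind the same interface and is deliberately not shown here.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical contract between the download stage and the later stages.
interface ResponseStore {
    // Called by the fetch stage once a response body has been downloaded.
    void put(String url, byte[] body);

    // Called by the parse, link extraction and visit stages.
    Optional<byte[]> get(String url);

    // Called once the URL has been visited and the body is no longer needed.
    void remove(String url);
}

// In-memory stand-in; a persistent store would replace the map.
class InMemoryResponseStore implements ResponseStore {
    private final ConcurrentMap<String, byte[]> bodies = new ConcurrentHashMap<>();

    @Override
    public void put(String url, byte[] body) {
        // Defensive copy so later changes by the caller do not affect the stored bytes.
        bodies.put(url, body.clone());
    }

    @Override
    public Optional<byte[]> get(String url) {
        byte[] body = bodies.get(url);
        return body == null ? Optional.empty() : Optional.of(body.clone());
    }

    @Override
    public void remove(String url) {
        bodies.remove(url);
    }

    public static void main(String[] args) {
        ResponseStore store = new InMemoryResponseStore();
        store.put("http://example.org", "hello".getBytes(StandardCharsets.UTF_8));
        System.out.println(store.get("http://example.org").isPresent());
        store.remove("http://example.org");
        System.out.println(store.get("http://example.org").isPresent());
    }
}
```

Keying by URL and removing the entry after the visit keeps the store bounded by the number of in-flight pages rather than by the size of the crawl.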
