Right now we have the full crawl workflow in a single crawler thread (sketched below):

- Fetch a number of URLs from the frontier
- For each candidate URL, process it:
  - HTTP GET
  - test whether it should be visited
  - if so, parse it
  - extract outgoing links and schedule them
  - visit it, e.g. process the payload
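For reference, a minimal sketch of that sequential flow, assuming a plain `HttpClient` for the GET and placeholder hooks (`shouldVisit`, `parse`, `extractLinks`, `schedule`, `visit`) standing in for the real crawler code:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;

// Every step for every URL runs on the same crawler thread.
public class SequentialCrawl {

    private final HttpClient http = HttpClient.newHttpClient();

    void crawlBatch(List<String> urlsFromFrontier) throws Exception {
        for (String url : urlsFromFrontier) {
            // HTTP GET
            HttpResponse<byte[]> resp = http.send(
                    HttpRequest.newBuilder(URI.create(url)).GET().build(),
                    HttpResponse.BodyHandlers.ofByteArray());
            // test whether the page should be visited
            if (!shouldVisit(url, resp)) continue;
            // parse, extract outgoing links, schedule them, then visit
            Object doc = parse(resp.body());
            schedule(extractLinks(doc));
            visit(doc);
        }
    }

    // Placeholder hooks; the real implementations live elsewhere in the crawler.
    boolean shouldVisit(String url, HttpResponse<byte[]> resp) { return true; }
    Object parse(byte[] payload) { return payload; }
    List<String> extractLinks(Object doc) { return List.of(); }
    void schedule(List<String> outlinks) { /* push back into the frontier */ }
    void visit(Object doc) { /* user-defined payload processing */ }
}
```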
I believe separating the different steps of URL processing would enhance the crawl speed (that's basically the approach Internet Archive's Heritrix takes). Have configurable thread pools (see the sketch after this list) for:

- executing the HTTP request
- parsing
- link extraction and frontier scheduling
- visiting
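A rough sketch of how that split could look, using one `ExecutorService` per stage and `CompletableFuture` to hand each URL from pool to pool; pool sizes, class names and hooks are illustrative, not an actual design:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// One configurable pool per stage; a URL flows fetch -> parse -> extract -> visit.
public class PipelinedCrawl {

    private final HttpClient http = HttpClient.newHttpClient();

    // Configurable thread pools, one per stage of URL processing (sizes are arbitrary here).
    private final ExecutorService fetchPool   = Executors.newFixedThreadPool(20);
    private final ExecutorService parsePool   = Executors.newFixedThreadPool(4);
    private final ExecutorService extractPool = Executors.newFixedThreadPool(2);
    private final ExecutorService visitPool   = Executors.newFixedThreadPool(4);

    CompletableFuture<Void> process(String url) {
        return CompletableFuture
                // stage 1: HTTP GET on the fetch pool (the shouldVisit test would sit here too)
                .supplyAsync(() -> fetch(url), fetchPool)
                // stage 2: parsing on the parse pool
                .thenApplyAsync(this::parse, parsePool)
                // stage 3: link extraction and frontier scheduling on its own pool
                .thenApplyAsync(doc -> { schedule(extractLinks(doc)); return doc; }, extractPool)
                // stage 4: visiting, i.e. payload processing, on the visit pool
                .thenAcceptAsync(this::visit, visitPool);
    }

    byte[] fetch(String url) {
        try {
            return http.send(HttpRequest.newBuilder(URI.create(url)).GET().build(),
                             HttpResponse.BodyHandlers.ofByteArray()).body();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // Placeholder hooks standing in for the real parse/extract/schedule/visit logic.
    Object parse(byte[] payload) { return payload; }
    List<String> extractLinks(Object doc) { return List.of(); }
    void schedule(List<String> outlinks) { /* back to the frontier */ }
    void visit(Object doc) { /* user-defined payload processing */ }
}
```

The point of the split is that the slowest stage (usually the HTTP requests) no longer blocks parsing or visiting, and each pool can be sized independently to match the workload.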
This involves having a shared component to store the HTTP response contents between the moment they are downloaded and the moment they have been visited. My initial guess is that Berkeley DB looks like a damn good candidate ;-)
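A minimal sketch of what that shared store could look like on top of Berkeley DB JE; the database name, directory and put/take contract are arbitrary choices for the example. The fetch pool writes the downloaded body keyed by URL, and the visit pool reads and removes it once processed:

```java
import com.sleepycat.je.Database;
import com.sleepycat.je.DatabaseConfig;
import com.sleepycat.je.DatabaseEntry;
import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;
import com.sleepycat.je.LockMode;
import com.sleepycat.je.OperationStatus;
import java.io.File;
import java.nio.charset.StandardCharsets;

// Shared store for HTTP response contents between download and visit,
// backed by Berkeley DB JE so pending payloads don't have to stay on the heap.
public class ResponseStore implements AutoCloseable {

    private final Environment env;
    private final Database db;

    public ResponseStore(File dir) {
        EnvironmentConfig envConfig = new EnvironmentConfig();
        envConfig.setAllowCreate(true);
        env = new Environment(dir, envConfig);

        DatabaseConfig dbConfig = new DatabaseConfig();
        dbConfig.setAllowCreate(true);
        db = env.openDatabase(null, "pending-responses", dbConfig);
    }

    /** Called by the fetch stage once the body has been downloaded. */
    public void put(String url, byte[] body) {
        db.put(null,
               new DatabaseEntry(url.getBytes(StandardCharsets.UTF_8)),
               new DatabaseEntry(body));
    }

    /** Called by the visit stage; removes the entry once it has been handed over. */
    public byte[] take(String url) {
        DatabaseEntry key = new DatabaseEntry(url.getBytes(StandardCharsets.UTF_8));
        DatabaseEntry value = new DatabaseEntry();
        if (db.get(null, key, value, LockMode.DEFAULT) != OperationStatus.SUCCESS) {
            return null;
        }
        db.delete(null, key);
        return value.getData();
    }

    @Override
    public void close() {
        db.close();
        env.close();
    }
}
```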