Updated Jan 1, 2012 by ganjisaffar@gmail.com

Frequently Asked Questions

  • Should I keep track of visited URLs to make sure that crawler4j doesn't crawl them twice? No, crawler4j manages that.
  • I have noticed that the crawler doesn't treat http://some.url.com/sub as the same as http://some.url.com/sub/ and crawls them separately. These URLs are technically different. The first can refer to a file called "sub", while the second can refer to http://some.url.com/sub/index.htm or something similar, so web crawlers treat them as different. If you are crawling a domain where you are sure that the two URLs are equivalent, you can add the required logic to your own code.
  • I see "I/O exception (org.apache.http.NoHttpResponseException) caught when processing request: The target server failed to respond" in the logs. Is this an error? No. It means that the server hosting the page you requested was not able to process your request. This can be a transient problem caused by load on the server. crawler4j retries several times for each URL for which this happens. Still, in any crawl, some percentage of the pages may fail to download because of network or server problems, so don't expect to see 100% of the requested pages in your results.
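
For the trailing-slash question above, the "required logic" could look something like the following. This is only a minimal sketch using the standard java.net.URI class; the class name TrailingSlashNormalizer and the method normalize are made up for this illustration and are not part of crawler4j. It assumes that, on the domain you are crawling, a path with and without a trailing slash really does serve the same page.

```java
import java.net.URI;

// Hypothetical helper (not part of crawler4j): maps ".../sub/" and ".../sub"
// to the same string so that your own code can treat them as one page.
final class TrailingSlashNormalizer {

    static String normalize(String url) {
        URI uri;
        try {
            uri = URI.create(url);
        } catch (IllegalArgumentException e) {
            return url; // not parseable as a URI: leave it untouched
        }
        String path = uri.getRawPath();
        // Leave relative/opaque URIs, the root path "/", and paths that
        // do not end in a slash exactly as they are.
        if (uri.getScheme() == null || uri.getRawAuthority() == null
                || path == null || path.length() <= 1 || !path.endsWith("/")) {
            return url;
        }
        // Rebuild the URL from its raw (still percent-encoded) parts,
        // dropping only the final slash of the path.
        StringBuilder sb = new StringBuilder();
        sb.append(uri.getScheme()).append("://").append(uri.getRawAuthority());
        sb.append(path, 0, path.length() - 1);
        if (uri.getRawQuery() != null) {
            sb.append('?').append(uri.getRawQuery());
        }
        if (uri.getRawFragment() != null) {
            sb.append('#').append(uri.getRawFragment());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("http://some.url.com/sub/"));
        System.out.println(normalize("http://some.url.com/sub"));
    }
}
```

One way to use such a helper would be to keep your own set of normalized URLs and skip a page whose normalized form is already in the set. Where exactly to hook this into your crawler depends on the crawler4j version you use.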
Comment by rahman...@gmail.com, Apr 22, 2012

Is it possible to download pdf/office-related files with this crawler?

