Help Center

Crawl and Index > Host Load Schedule

Maximum Number of URLs to Crawl

Your license specifies the maximum number of URLs you can crawl. If you do not yet have as many URLs as your license allows, you can specify a smaller maximum. Entering a number that is less than the maximum specified by the license can improve system performance. After you click the Save Schedule and Host Loads button, the system crawls up to approximately 10% more URLs than the number you specified, so that after it eliminates duplicates, the number of pages closely matches your maximum.

Note: If you leave this box blank, the system continuously crawls URLs to the limit of your license.
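As a rough illustration of the arithmetic above, the following Python sketch estimates the crawl ceiling for a given setting. The function name and parameters are hypothetical, and the 10% margin is approximate, so treat the result as an estimate rather than the exact number the appliance uses. The sketch does not model what happens if you enter a number larger than your license allows.

```python
def approx_crawl_ceiling(configured_max, license_max):
    """Estimate how many URLs the system may crawl before duplicate removal.

    configured_max: the value entered in the box, or None if left blank.
    license_max: the URL limit stated by the license.
    """
    if configured_max is None:
        # A blank box means the crawl continues to the license limit.
        return license_max
    # Roughly 10% over the configured number (integer arithmetic).
    return configured_max + configured_max // 10


print(approx_crawl_ceiling(500_000, 1_000_000))
print(approx_crawl_ceiling(None, 1_000_000))
```

With a setting of 500,000, the sketch reports a ceiling of 550,000 URLs before duplicates are eliminated.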

Web Server Host Load

The Web Server Host Load value specifies the maximum number of concurrent connections that the crawler opens to each web server. We recommend that you start with a value of 4 connections and increase it gradually, only when you are confident that your web servers can handle the load. If you are uncertain of a web server's load capacity, check with the webmaster of the site you crawl.

The appliance handles host load differently for file servers and for web servers behind a proxy server. In these cases, the appliance treats multiple servers as a single host and applies one host load setting to all of them. For example, with a host load setting of 4 in an environment of 10 file servers, the appliance opens connections to no more than four servers at a time and crawls all 10 servers in the order specified by your crawl queue.
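The shared limit in the file server example can be pictured as a single pool of connection slots. The Python sketch below is only an analogy, not the appliance's implementation: it uses an asyncio semaphore to bound concurrency across 10 hypothetical file servers at a host load of 4, and records the highest number of simultaneous connections. The server names and the crawl_all function are invented for illustration.

```python
import asyncio


async def crawl_all(servers, host_load):
    """Crawl every server while sharing one pool of host_load slots."""
    slots = asyncio.Semaphore(host_load)  # one limit for the whole group
    active = 0
    peak = 0
    visited = []

    async def crawl(server):
        nonlocal active, peak
        async with slots:
            active += 1
            peak = max(peak, active)
            visited.append(server)
            await asyncio.sleep(0)  # stand-in for fetching documents
            active -= 1

    await asyncio.gather(*(crawl(s) for s in servers))
    return peak, visited


servers = [f"fileserver{i}" for i in range(1, 11)]
peak, visited = asyncio.run(crawl_all(servers, host_load=4))
print(f"servers crawled: {len(visited)}, peak concurrent connections: {peak}")
```

All 10 servers are visited, but the peak never exceeds the shared limit of 4.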

Warning: Some servers may not be able to handle a high load.

If the crawler deems that a server cannot handle the host load defined, it reduces the crawl rate until an acceptable response time is achieved.

Note: The system attempts to maintain the number of concurrent connections you specify here, but the actual number may occasionally be lower, depending on your system activity.

Exceptions to Web Server Host Load

Exceptions to Web Server Host Load lets you assign a different maximum host load to specified web servers for specified time periods. For time periods when you do not specify a host load exception, the default web server host load applies.

For example, you may have three web servers that can handle more crawl load during the night. For these three web servers, you can specify a higher load than the default host load setting of 4 for 12 a.m. to 6 a.m.

To minimize the host load on servers during the day, you might set an exception value of 0 between 9:00 a.m. and 5:00 p.m., when the servers cannot handle the extra load.
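The two examples above can be sketched as a simple lookup: find an exception that covers the current host and hour, and otherwise fall back to the default. This Python sketch is illustrative only. The LoadException type, the hour-based time ranges, and the host name www.example.com are assumptions made for the example, not the format the appliance stores.

```python
from typing import NamedTuple


class LoadException(NamedTuple):
    host: str
    start_hour: int  # inclusive, 0-23
    end_hour: int    # exclusive, 1-24
    load: float


def effective_host_load(host, hour, default_load, exceptions):
    """Return the exception load covering this host and hour, else the default."""
    for exc in exceptions:
        if exc.host == host and exc.start_hour <= hour < exc.end_hour:
            return exc.load
    return default_load


exceptions = [
    LoadException("www.example.com", 0, 6, 8.0),   # higher load overnight
    LoadException("www.example.com", 9, 17, 0.0),  # minimal load in the day
]

for hour in (3, 12, 20):
    print(hour, effective_host_load("www.example.com", hour, 4, exceptions))
```

At 3 a.m. the exception load of 8 applies, at noon the load is 0, and at 8 p.m. no exception matches, so the default of 4 is used.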

The host name you enter must be fully qualified. You can enter it either as an ASCII host name or as an IP address.

When sites are crawled using a proxy, the same host load is used to crawl all sites behind the proxy. The host load used will be the maximum host load specified for any URL pattern crawled using the proxy. You should do one of the following:

  • Specify no host load for sites that you wish to crawl using the proxy, in which case the maximum host load is used.
  • Specify a host load that is small enough so as not to affect the performance of any proxied sites.
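The proxy rule above amounts to taking the largest host load among the proxied URL patterns. The Python sketch below shows that calculation. The function name is hypothetical, and the fallback when no load is specified assumes that "the maximum host load" refers to the default Web Server Host Load value, which the text does not state explicitly.

```python
def proxy_host_load(specified_loads, default_load):
    """Host load applied to every site behind one proxy.

    specified_loads: host loads set for URL patterns crawled via the proxy.
    default_load: assumed fallback when no load is specified (see lead-in).
    """
    if not specified_loads:
        return default_load
    # The highest load specified for any proxied pattern applies to all.
    return max(specified_loads)


print(proxy_host_load([0.5, 2, 6], default_load=4))
print(proxy_host_load([], default_load=4))
```

Because the largest value wins, a single high setting raises the load on every site behind the proxy, which is why the second option recommends keeping each value small.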

The following rules also apply to entries on this page:

  • Only one host name entry is permitted per line.
  • A host load of zero (0) means that the crawler will access the server only a few times per hour.
  • You may specify the load factor as a decimal value, for example: .5, 1, 2.0

    Note: A value of 2 indicates that, on average, two concurrent connections per host are used. Similarly, a value of .25 indicates that, on average, a connection to the web server is in use only 25% of the time.
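To make the load factor rules concrete, the following Python sketch parses a value such as .5 or 2.0 and describes what it means, following the rules and the note above. Both function names are hypothetical, and rejecting negative values is an assumption, since the page does not say how the appliance validates input.

```python
def parse_load_factor(text):
    """Parse a load factor such as '.5', '1', or '2.0'."""
    value = float(text.strip())
    if value < 0:
        # Assumption: negative loads are not meaningful.
        raise ValueError(f"host load must not be negative: {text!r}")
    return value


def describe_load_factor(value):
    """Summarize a load factor in the terms used on this page."""
    if value == 0:
        return "server accessed only a few times per hour"
    if value >= 1:
        return f"about {value:g} concurrent connections per host on average"
    return f"a connection in use about {value:.0%} of the time"


for raw in (".25", "1", "2.0", "0"):
    print(raw, "->", describe_load_factor(parse_load_factor(raw)))
```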


 

 
© Google Inc. 2007