Help Center

Crawl and Index > HTTP Headers

User Agent Name

The gsa-crawler is the search appliance robot that crawls web sites. With every page it downloads from a web server, the crawler identifies itself by sending a user agent string, which webmasters can record in their web server log files.

The identifier used by the crawler consists of:

  • The user agent name, which, by default, is set to gsa-crawler.
  • A unique identifier that is assigned for each search appliance.
  • The problem email address you entered in Administration > System Settings.

If you keep the default user agent name gsa-crawler, the web servers being crawled might see an identifier such as:

    gsa-crawler (Enterprise; GID01065; yourname@yourcompany.com)

The email address is a required part of the identifier. It allows webmasters to contact you if the search appliance affects them negatively by crawling their sites too rapidly.
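
A webmaster who wants to pick the three parts of the identifier out of a log entry could use something like the following Python sketch. The pattern is inferred only from the single example identifier shown above, and the names CrawlerIdentity and parse_crawler_identifier are hypothetical; real identifiers may vary in ways this sketch does not cover.

```python
import re
from typing import NamedTuple, Optional


class CrawlerIdentity(NamedTuple):
    """The three parts of the crawler identifier described above."""
    agent_name: str
    appliance_id: str
    contact_email: str


# Assumed layout, based on the one example in the text:
#   <agent name> (Enterprise; <appliance id>; <email>)
_IDENTIFIER_RE = re.compile(
    r"^(?P<name>[^\s(]+)\s*"
    r"\(\s*Enterprise\s*;\s*"
    r"(?P<gid>[^;\s]+)\s*;\s*"
    r"(?P<email>[^;\s)]+)\s*\)$"
)


def parse_crawler_identifier(user_agent: str) -> Optional[CrawlerIdentity]:
    """Return the identifier parts, or None if the string does not match."""
    match = _IDENTIFIER_RE.match(user_agent.strip())
    if match is None:
        return None
    return CrawlerIdentity(match["name"], match["gid"], match["email"])


identity = parse_crawler_identifier(
    "gsa-crawler (Enterprise; GID01065; yourname@yourcompany.com)"
)
if identity is not None:
    print(identity.agent_name, identity.appliance_id, identity.contact_email)
```

Returning None for non-matching strings lets the caller skip log entries from other user agents without raising an error.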

There may be pages or sites in your organization that you do not want the search appliance to crawl, such as password-protected directories with information that you want to keep private. To prevent the gsa-crawler from accessing the information on these servers, you can either:

  • Enter their URL patterns in Do Not Crawl URLs with the Following Patterns.
  • Create a robots.txt file and place it in the root directory of the server. A robots.txt file consists of the user-agent name and one or more lines of instruction for the robot.

    For example:

    # /robots.txt file for gsa-crawler
    # Names the user agent that the rules below target.
    User-agent: gsa-crawler
    # The gsa-crawler is not allowed to crawl any CGI files.
    Disallow: /*.cgi
    # The gsa-crawler is not allowed to crawl any Perl scripts.
    Disallow: /*.pl
    # The gsa-crawler is allowed to crawl the site root URL ("$" anchors the end of the URL).
    Allow: /$
    # The gsa-crawler is not allowed to crawl anything else on the site.
    Disallow: /

For more information, see A Standard for Robot Exclusion, which explains robots.txt files in detail.
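
Before deploying a robots.txt file, it can help to check how a parser reads its rules. The following Python sketch uses the standard library's urllib.robotparser for that purpose. It is only an approximation of how the search appliance evaluates the file: it uses a plain path prefix rather than wildcard patterns, because wildcard handling varies between parsers. The /private/ rule, the intranet.example.com host, and the can_crawl helper are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content that uses a plain path prefix only.
ROBOTS_TXT = """\
User-agent: gsa-crawler
Disallow: /private/
"""


def can_crawl(robots_txt: str, agent: str, url: str) -> bool:
    """Report whether the given agent may fetch the URL under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


for url in (
    "http://intranet.example.com/private/report.html",
    "http://intranet.example.com/public/index.html",
):
    print(url, can_crawl(ROBOTS_TXT, "gsa-crawler", url))
```

Parsing from a string rather than fetching the file over the network keeps the check usable before the file is published.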

Additional HTTP Headers for Crawler

Specify HTTP headers that will be included in all HTTP requests made during crawling.

The HTTP headers specified in this window must follow the formats specified by
http://www.w3.org/Protocols/rfc2616/rfc2616-sec4.html#sec4.2
and
http://www.w3.org/Protocols/rfc2616/rfc2616-sec5.html#sec5.3.

Two examples of valid HTTP headers are Authorization and Proxy-Authorization. Be sure to read about them before using them.

Caution: The crawler uses certain HTTP headers for its normal operation, such as Host, Connection, Accept, From, and User-Agent. Any values entered here for these headers will override the crawler's standard headers and may cause unexpected behavior.

You may use nonstandard headers to pass information that your servers require, but make sure that all nonstandard headers are valid for your servers. Otherwise, search results may be returned in an unpredictable manner.
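
The format rule and the caution above can be checked before the headers are pasted into the Admin Console. The following Python sketch is one way to do that, not part of the search appliance itself. It checks each line against the name ":" value form from RFC 2616 section 4.2 and flags the headers named in the caution. The check_headers function is hypothetical, and the reserved list contains only the headers named above, which may not be the complete set the crawler uses.

```python
import re
from typing import List

# Characters allowed in an RFC 2616 "token" (the header field name).
_TOKEN_RE = re.compile(r"^[!#$%&'*+\-.^_`|~0-9A-Za-z]+$")

# Only the headers named in the caution above; the full list is an unknown.
RESERVED_HEADERS = frozenset({"host", "connection", "accept", "from", "user-agent"})


def check_headers(text: str) -> List[str]:
    """Return a list of problems found, one message per offending line."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines between headers
        name, separator, _value = line.partition(":")
        if not separator:
            problems.append(f"line {number}: missing ':' separator")
        elif not _TOKEN_RE.match(name):
            problems.append(f"line {number}: invalid header name {name!r}")
        elif name.lower() in RESERVED_HEADERS:
            problems.append(f"line {number}: {name} overrides a standard crawler header")
    return problems


for problem in check_headers("X-Internal-Token: abc\nUser-Agent: mybot\nBad Header: x"):
    print(problem)
```

Header names are compared in lowercase because HTTP field names are case-insensitive.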

To specify additional HTTP headers:

  1. Click Crawl and Index and then click HTTP Headers.
  2. In the Additional HTTP Headers for Crawler box, enter a new header.
  3. To add more headers, press Enter to start a new line.
  4. After all the headers are specified, click the Update Header Settings button.

Example header:

Authorization: Basic c29tZXVzZXI6c29tZXBhc3M=
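
The value in the example header is the Base64 encoding of the string someuser:somepass, which is how HTTP Basic authentication carries a user name and password. A header line of this form can be generated with a short Python sketch such as the one below. The function name basic_authorization_header is hypothetical. Note that Base64 is an encoding, not encryption, so the credentials can be decoded by anyone who sees the header.

```python
import base64


def basic_authorization_header(username: str, password: str) -> str:
    """Build an Authorization header line for HTTP Basic authentication."""
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return f"Authorization: Basic {token}"


print(basic_authorization_header("someuser", "somepass"))
# prints: Authorization: Basic c29tZXVzZXI6c29tZXBhc3M=
```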


 
© Google Inc. 2007