Administering Crawl for Web and File Share Content: Advanced Topics

Google Search Appliance software version 4.6
Google Mini software version 4.6
Posted July 2007

This document provides an overview of how the Google Search Appliance and the Google Mini crawl and index enterprise content.

For the Google Search Appliance, information about continuous crawl applies to software version 4.2, and information about full crawl and file system crawl applies to software version 4.6 and later.

For the Google Mini, all information applies to software version 4.4 and later.

Contents

  1. Identifying the User Agent
    1. User Agent Name
    2. User Agent Email Address
  2. Freshness Tuning
  3. Crawling over Proxy Servers
  4. Preventing Crawling of Duplicate Hosts
  5. Configuring Web Server Host Load Schedules
  6. Removing Documents from the Index
  7. Using Collections
    1. Default Collection
    2. Changing URL Patterns in a Collection
  8. Forcing Crawling of URLs in JavaScript

Identifying the User Agent

Web servers see various client applications, including Web browsers and the search appliance crawler, as "user agents." When the search appliance crawler visits a Web server, the crawler identifies itself to the server by its User-Agent identifier, which is sent as part of the HTTP request.

The User-Agent identifier includes the following elements, each described below:

User Agent Name

The default user agent name for both the Google Search Appliance and the Google Mini is "gsa-crawler." In a Web server's logs, the server administrator can identify each visit by the search appliance crawler to a Web server by this user agent name.

To view or change the User-Agent name, or to enter additional HTTP headers for the search appliance crawler to send, use the Crawl and Index > HTTP Headers page in the Admin Console.
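
A Web server administrator who wants to see how often the crawler visits can filter the access log by the user agent name. The following is a minimal sketch, not a supported tool. It assumes the server writes logs in the common "combined" format, in which the User-Agent is the last quoted field on each line. The function name crawler_hits and the sample log lines (including the text that follows "gsa-crawler") are hypothetical.

```python
import re
from collections import Counter

# Assumption: combined log format, where the User-Agent is the last
# double-quoted field on the line and the client host is the first field.
LOG_LINE = re.compile(r'^(?P<host>\S+) .*"(?P<agent>[^"]*)"\s*$')


def crawler_hits(lines, agent_name="gsa-crawler"):
    """Count requests per client host whose User-Agent contains agent_name."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match and agent_name in match.group("agent"):
            hits[match.group("host")] += 1
    return hits


if __name__ == "__main__":
    # Hypothetical log lines, for illustration only.
    sample_log = [
        '10.0.0.5 - - [10/Jul/2007:10:00:00 +0000] "GET /a.html HTTP/1.1" '
        '200 512 "-" "gsa-crawler (Enterprise; admin@example.com)"',
        '192.168.1.9 - - [10/Jul/2007:10:00:01 +0000] "GET /a.html HTTP/1.1" '
        '200 512 "-" "Mozilla/5.0"',
    ]
    for host, count in crawler_hits(sample_log).items():
        print(host, count)
```

If the user agent name has been changed on the Crawl and Index > HTTP Headers page, pass the new name as agent_name.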

User Agent Email Address

Including an email address in the User-Agent identifier enables a webmaster to contact the search appliance administrator in case the site is adversely affected by crawling that is too rapid, or if the webmaster does not want certain pages crawled at all. The email address is a required element of the search appliance User-Agent identifier.

For complete information about the Crawl and Index > HTTP Headers page, click Help Center > Crawl and Index > HTTP Headers in the Admin Console.

Freshness Tuning

You can improve the performance of a continuous crawl using URL patterns on the Crawl and Index > Freshness Tuning page in the Admin Console. The Crawl and Index > Freshness Tuning page provides four categories of URL patterns, as described in the following table.

URL Pattern
  Description

Crawl Frequently
  Use Crawl Frequently patterns for URLs that are dynamic and change frequently. You can use the Crawl Frequently patterns to give hints to the search appliance crawler during the early stages of crawling, before the search appliance has a history of how frequently URLs actually change. Any URL that matches one of the Crawl Frequently patterns is scheduled to be recrawled at least once every day. A URL can be scheduled to be recrawled at most once every 15 minutes; in other words, the minimum wait time is 15 minutes. For a URL to be recrawled at that rate, its content must change at least every 30 minutes. If too many URLs match Crawl Frequently patterns, the wait time increases.

Crawl Infrequently
  Use Crawl Infrequently patterns for URLs that are relatively static and do not change frequently. Any URL that matches one of the Crawl Infrequently patterns is not crawled more than once every 90 days, regardless of its Enterprise PageRank or how frequently it changes. You can use this feature for Web pages that do not change and do not need to be recrawled. You can also use it for Web pages where a small part of the content changes frequently but the important parts do not change.

Always Force Recrawl
  Use Always Force Recrawl patterns to prevent the search appliance from crawling a URL from cache.

Recrawl these URL Patterns
  Use Recrawl these URL Patterns to submit a URL to be recrawled. URLs that you enter here are recrawled as soon as possible.

For complete information about the Crawl and Index > Freshness Tuning page, click Help Center > Crawl and Index > Freshness Tuning in the Admin Console.
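
The timing limits stated for the Crawl Frequently and Crawl Infrequently categories can be read as bounds on the wait between two crawls of the same URL. The following sketch only illustrates those bounds; it is not the search appliance's scheduling algorithm. The function name clamp_recrawl_interval, the category labels, and the idea of an "estimated" interval passed in by the caller are assumptions made for the example.

```python
from datetime import timedelta

# Limits taken from the category descriptions above.
MIN_WAIT = timedelta(minutes=15)       # shortest wait between recrawls
FREQUENT_MAX = timedelta(days=1)       # Crawl Frequently: at least once a day
INFREQUENT_MIN = timedelta(days=90)    # Crawl Infrequently: at most once per 90 days


def clamp_recrawl_interval(estimated, category=None):
    """Apply the documented bounds to a hypothetical estimated recrawl interval.

    category is "frequently", "infrequently", or None for a URL that matches
    neither pattern list (the text states no bounds for that case, so the
    estimate is returned unchanged).
    """
    if category == "frequently":
        return max(MIN_WAIT, min(estimated, FREQUENT_MAX))
    if category == "infrequently":
        return max(estimated, INFREQUENT_MIN)
    return estimated
```

For example, an estimate of five minutes for a Crawl Frequently URL is raised to the 15-minute minimum, and an estimate of one week for a Crawl Infrequently URL is raised to 90 days.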

Crawling over Proxy Servers

If you want the search appliance to crawl outside your internal network and include the crawled data in your index, use the Crawl and Index > Proxy Servers page in the Admin Console. For complete information about the Crawl and Index > Proxy Servers page, click Help Center > Crawl and Index > Proxy Servers in the Admin Console.

Preventing Crawling of Duplicate Hosts

Many organizations have mirrored servers or duplicate hosts for such purposes as production, testing, and load balancing. Duplicate hosts also arise when multiple aliases are used or when a Web site has changed names, which often happens when companies or departments merge.

Allowing the search appliance to recrawl content on mirrored servers has disadvantages, such as crawling and indexing the same content more than once.

To prevent crawling of duplicate hosts, you can specify one or more "canonical," or standard, hosts using the Crawl and Index > Duplicate Hosts page.

For complete information about the Crawl and Index > Duplicate Hosts page, click Help Center > Crawl and Index > Duplicate Hosts in the Admin Console.
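
The idea behind a canonical host can be illustrated by mapping every duplicate host name to one standard name, so that URLs for the same content compare as equal. The following is a minimal sketch of that idea, not the appliance's implementation. The function name canonicalize and the host names in the usage note are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def canonicalize(url, canonical_hosts):
    """Rewrite url so that a known duplicate host is replaced by its canonical host.

    canonical_hosts maps a lowercase duplicate host name to the canonical
    host name. Hosts that are not in the mapping are left as they are.
    Any user name or password in the URL is dropped in this sketch.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""          # hostname is already lowercased
    canonical = canonical_hosts.get(host, host)
    netloc = canonical if parts.port is None else f"{canonical}:{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


def unique_documents(urls, canonical_hosts):
    """Return the set of distinct URLs after canonicalizing the host names."""
    return {canonicalize(url, canonical_hosts) for url in urls}
```

With a mapping such as {"mirror.example.com": "www.example.com"}, the URLs http://mirror.example.com/docs/a.html and http://www.example.com/docs/a.html reduce to a single document.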

Configuring Web Server Host Load Schedules

A Web server can handle several concurrent requests from the search appliance. The number of concurrent requests is known as the Web server's "host load." If the search appliance is crawling through a proxy, the host load limits the maximum number of concurrent connections that can be made through the proxy. The default number of concurrent requests is four.

Increasing the host load can speed up the crawl rate, but it also puts more load on your Web servers. It is recommended that you experiment with the host load settings at off-peak time or in controlled environments so that you can monitor the effect it has on your Web servers.

To configure a Web server host load schedule, use the Crawl and Index > Host Load Schedule page.

For complete information about the Crawl and Index > Host Load Schedule page, click Help Center > Crawl and Index > Host Load Schedule in the Admin Console.
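
Host load is, in effect, a cap on the number of requests in flight to one Web server at a time. The following sketch shows that cap using a thread pool whose size equals the host load. It is an illustration under assumptions, not the crawler's code: the function name crawl_with_host_load is hypothetical, and fetch is a stand-in callable supplied by the caller rather than a real HTTP client.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

DEFAULT_HOST_LOAD = 4  # default number of concurrent requests stated above


def crawl_with_host_load(urls, fetch, host_load=DEFAULT_HOST_LOAD):
    """Call fetch(url) for every URL with at most host_load calls in flight.

    Returns (results, peak), where results are in the same order as urls and
    peak is the highest number of simultaneous fetch calls observed.
    """
    lock = threading.Lock()
    in_flight = 0
    peak = 0

    def tracked(url):
        nonlocal in_flight, peak
        with lock:
            in_flight += 1
            peak = max(peak, in_flight)
        try:
            return fetch(url)
        finally:
            with lock:
                in_flight -= 1

    # The pool size is what enforces the concurrency limit.
    with ThreadPoolExecutor(max_workers=host_load) as pool:
        results = list(pool.map(tracked, urls))
    return results, peak
```

Raising host_load lets more requests run at once, which shortens the crawl but increases the load on the server, as described above.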

Removing Documents from the Index

To remove a document from the index, add the full URL of the document to Do Not Crawl URLs on the Crawl and Index > Crawl URLs page in the Admin Console.

Using Collections

Collections are subsets of the index used to serve different search results to different users. For example, a collection can be organized by geography, product, job function, and so on. Collections can overlap, so one document can be relevant to several different collections, depending on its content. Collections also allow users to search targeted content more quickly and efficiently than searching the entire index.

For information about using the Crawl and Index > Collections page to create and manage collections, click Help Center > Crawl and Index > Collections in the Admin Console.

Default Collection

During initial crawling, the search appliance establishes the default_collection, which contains all crawled content. You can redefine the default_collection, but doing so is not advisable because crawl diagnostics are organized by collection. Troubleshooting using the Status and Reports > Crawl Diagnostics page becomes much harder if you cannot see all URLs crawled.

Changing URL Patterns in a Collection

Documents that are added to the index receive a tag for each collection whose URL patterns they match. If you change the URL patterns for a collection, the search appliance immediately starts a process that runs across all the crawled URLs and retags them according to the change in the URL patterns. This process usually completes in a few minutes but can take up to an hour on heavily loaded appliances. Search results for the collection are corrected after the process finishes.
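
The tagging and retagging described above can be sketched as follows. This is a simplified illustration, not the appliance's implementation: it assumes each URL pattern is a plain URL prefix, which is only one of the pattern forms the search appliance accepts, and the function names and collection names other than default_collection are hypothetical.

```python
def collections_for(url, collection_patterns):
    """Return the names of all collections with a pattern that matches url.

    collection_patterns maps a collection name to a list of URL prefixes.
    A URL can match several collections, so collections can overlap.
    """
    return {
        name
        for name, prefixes in collection_patterns.items()
        if any(url.startswith(prefix) for prefix in prefixes)
    }


def retag(crawled_urls, collection_patterns):
    """Recompute the collection tags for every crawled URL."""
    return {url: collections_for(url, collection_patterns) for url in crawled_urls}
```

After the patterns for a collection change, calling retag again over all crawled URLs yields the corrected tags, which mirrors the retagging pass that the search appliance starts automatically.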

Forcing Crawling of URLs in JavaScript

The search appliance does not crawl URLs contained within JavaScript code. In the following example, the link to products.html is written to the page by a JavaScript document.write call, so the crawler does not follow it:

<script language="JavaScript">
  document.write("<a href=products.html>html code here</a>");
</script>

If your enterprise content relies on menus driven by JavaScript, use jump pages or basic HTML site maps to force crawling of URLs that appear only in JavaScript. For example, to force crawling of products.html, list it on a jump page or site map.
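
A jump page is an ordinary HTML page whose only job is to list links in plain anchor tags that the crawler can follow. The following sketch generates such a page from a list of URLs. The function name build_jump_page and the page layout are assumptions made for the example; any basic HTML page that contains the links serves the same purpose.

```python
from html import escape


def build_jump_page(urls, title="Site map"):
    """Return a basic HTML page that links to every URL in urls."""
    items = "\n".join(
        f'  <li><a href="{escape(url, quote=True)}">{escape(url)}</a></li>'
        for url in urls
    )
    return (
        "<html>\n"
        f"<head><title>{escape(title)}</title></head>\n"
        "<body>\n"
        "<ul>\n"
        f"{items}\n"
        "</ul>\n"
        "</body>\n"
        "</html>\n"
    )


if __name__ == "__main__":
    # products.html is the page from the example above.
    print(build_jump_page(["products.html"]))
```

Save the generated page on the Web server at a URL that the search appliance is already configured to crawl, so that the crawler reaches the listed links from it.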
