Google Search Appliance software version 4.6
Google Mini software version 4.6
Posted July 2007
This document provides an overview of how the Google Search Appliance and the Google Mini crawl and index enterprise content.
For the Google Search Appliance, information about continuous crawl applies to software version 4.2, and information about full crawl and file system crawl applies to software version 4.6 and later.
For the Google Mini, all information applies to software version 4.4 and later.
The following table lists search appliance crawl administration tasks. For each task, the table gives a reference to a section in this document that describes it, as well as the page in the Admin Console that you use to accomplish the task.
| Task | Reference | Admin Console Page |
|---|---|---|
| Prepare your data for crawling: robots.txt, Robots META tags, googleoff/googleon tags, no_crawl directories, shared folders, and jump pages | Preparing Data for a Crawl | |
| Set up the crawl path: start URLs, follow and crawl URLs, do not crawl URLs | Configuring a Crawl | Crawl and Index > Crawl URLs |
| Test URL patterns in the crawl path | Testing Your URL Patterns | |
| Select a crawl mode: continuous crawl or full crawl | Selecting a Crawl Mode | Crawl and Index > Crawl Schedule |
| Schedule a full crawl | Scheduling a Full Crawl | Crawl and Index > Crawl Schedule |
| Configure a continuous crawl: URLs to crawl frequently, URLs to crawl infrequently, URLs to always force recrawl | Freshness Tuning | Crawl and Index > Freshness Tuning |
| Pause or restart a continuous crawl | Stopping, Pausing, or Resuming a Crawl | Status and Reports > Crawl Status |
| Stop a full crawl | Stopping, Pausing, or Resuming a Crawl | Status and Reports > Crawl Status |
| Submit a URL to be recrawled | Freshness Tuning | Crawl and Index > Freshness Tuning |
| Submit a URL to be recrawled | Submitting a URL for Recrawl | Status and Reports > Crawl Diagnostics |
| Set up proxies for Web servers | Crawling Over Proxy Servers | Crawl and Index > Proxy Servers |
| Locate or change the user-agent name | Identifying the User Agent Name | Crawl and Index > HTTP Headers |
| Enter additional HTTP headers for the search appliance crawler to use | Identifying the User Agent Name | Crawl and Index > HTTP Headers |
| Prevent recrawling of content that resides on duplicate hosts | Preventing Crawling of Duplicate Hosts | Crawl and Index > Duplicate Hosts |
| Define rules for the search appliance crawler to use as it indexes documents | Defining Document Date Rules | Crawl and Index > Document Dates |
| Specify the maximum number of URLs to crawl for a host and the average number of concurrent connections to open to each Web server for crawling | Configuring Web Server Host Load Schedules | Crawl and Index > Host Load Schedule |
| View the current crawl mode and summary information about events of the past 24 hours in a crawl | Using the Admin Console to Monitor a Crawl | Status and Reports > Crawl Status |
| View crawl history for all hosts, a specific host, or a specific file | Using the Admin Console to Monitor a Crawl | Status and Reports > Crawl Diagnostics |
| Define and view a snapshot of uncrawled URLs in the crawl queue | Using the Admin Console to Monitor a Crawl | Status and Reports > Crawl Queue |
| View summary information about files that have been crawled | Using the Admin Console to Monitor a Crawl | Status and Reports > Content Statistics |
| View current license information | What Is the Search Appliance License Limit? | Administration > License |
The following table lists search appliance Admin Console pages that are used to administer a crawl. For each Admin Console page, the table provides a reference to a section in this document that describes using the page.
| Admin Console Page | Reference |
|---|---|
| Crawl and Index > Crawl URLs | Configuring a Crawl |
| Crawl and Index > Crawl Schedule | Selecting a Crawl Mode |
| Crawl and Index > Crawl Schedule | Scheduling a Full Crawl |
| Crawl and Index > Proxy Servers | Crawling Over Proxy Servers |
| Crawl and Index > HTTP Headers | Identifying the User Agent |
| Crawl and Index > Duplicate Hosts | Preventing Crawling of Duplicate Hosts |
| Crawl and Index > Document Dates | Defining Document Date Rules |
| Crawl and Index > Host Load Schedule | Configuring Web Server Host Load Schedules |
| Crawl and Index > Freshness Tuning | Freshness Tuning |
| Status and Reports > Crawl Status | Using the Admin Console to Monitor a Crawl |
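| Status and Reports > Crawl Diagnostics | Using the Admin Console to Monitor a Crawl |
| Status and Reports > Crawl Queue | Using the Admin Console to Monitor a Crawl |
| Status and Reports > Content Statistics | Using the Admin Console to Monitor a Crawl |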