Google Search Appliance software version 6.0
Posted June 2009
Crawling is the process by which the Google Search Appliance discovers enterprise content to index. This chapter provides reference information about crawl administration tasks.
The following table lists Google Search Appliance crawl features. For each feature, the table lists the page in the Admin Console where you can use the feature and a reference to a section in this document that describes it.
| Feature | Admin Console Page | Reference |
|---|---|---|
| Always force recrawl URLs | Crawl and Index > Freshness Tuning | Freshness Tuning |
| Content statistics | Status and Reports > Content Statistics | Using the Admin Console to Monitor a Crawl |
| Continuous crawl | Crawl and Index > Crawl Schedule | Selecting a Crawl Mode |
| Crawl diagnostics | Status and Reports > Crawl Diagnostics | Using the Admin Console to Monitor a Crawl |
| Crawl frequently URLs | Crawl and Index > Freshness Tuning | Freshness Tuning |
| Crawl infrequently URLs | Crawl and Index > Freshness Tuning | Freshness Tuning |
| Crawl modes | Crawl and Index > Crawl Schedule | Selecting a Crawl Mode |
| Crawl queue snapshots | Status and Reports > Crawl Queue | Using the Admin Console to Monitor a Crawl |
| Crawl schedule | Crawl and Index > Crawl Schedule | Scheduling a Crawl |
| Crawl status | Status and Reports > Crawl Status | Using the Admin Console to Monitor a Crawl |
| Crawl URLs | Crawl and Index > Crawl URLs | Configuring a Crawl |
| Do not crawl URLs | Crawl and Index > Crawl URLs | Configuring a Crawl |
| Document dates | Crawl and Index > Document Dates | Defining Document Date Rules |
| Duplicate hosts | Crawl and Index > Duplicate Hosts | Preventing Crawling of Duplicate Hosts |
| Follow and crawl only URLs | Crawl and Index > Crawl URLs | Configuring a Crawl |
| Freshness tuning | Crawl and Index > Freshness Tuning | Freshness Tuning |
| Host load exceptions | Crawl and Index > Host Load Schedule | Configuring Web Server Host Load Schedules |
| Host load schedule | Crawl and Index > Host Load Schedule | Configuring Web Server Host Load Schedules |
| HTTP headers | Crawl and Index > HTTP Headers | Identifying the User Agent Name |
| Maximum number of URLs to crawl | Crawl and Index > Host Load Schedule | Configuring Web Server Host Load Schedules |
| Proxy servers | Crawl and Index > Proxy Servers | Crawling Over Proxy Servers |
| Recrawl URLs | Crawl and Index > Freshness Tuning | Freshness Tuning |
| Scheduled crawl | Crawl and Index > Crawl Schedule | Selecting a Crawl Mode |
| Start crawling from the following URLs | Crawl and Index > Crawl URLs | Configuring a Crawl |
| Web server host load | Crawl and Index > Host Load Schedule | Configuring Web Server Host Load Schedules |
The following table lists Google Search Appliance crawl administration tasks. For each task, the table gives a reference to a section in this document that describes it, as well as the page in the Admin Console that you use to accomplish the task.
| Task | Reference | Admin Console Page |
|---|---|---|
| Prepare your data for crawling: robots.txt, Robots META tags, googleoff/googleon tags, no_crawl directories, shared folders, and jump pages | Preparing Data for a Crawl | |
| Set up the crawl path: start URLs, follow and crawl URLs, do not crawl URLs | Configuring a Crawl | Crawl and Index > Crawl URLs |
| Test URL patterns in the crawl path | Testing Your URL Patterns | |
| Select a crawl mode: continuous crawl or scheduled crawl | Selecting a Crawl Mode | Crawl and Index > Crawl Schedule |
| Schedule a crawl | Scheduling a Crawl | Crawl and Index > Crawl Schedule |
| Configure a continuous crawl: URLs to crawl frequently, URLs to crawl infrequently, URLs to always force recrawl | Freshness Tuning | Crawl and Index > Freshness Tuning |
| Pause or restart a continuous crawl | Stopping, Pausing, or Resuming a Crawl | Status and Reports > Crawl Status |
| Stop a scheduled crawl | Stopping, Pausing, or Resuming a Crawl | Status and Reports > Crawl Status |
| Submit a URL to be recrawled | Freshness Tuning | Crawl and Index > Freshness Tuning |
| Submit a URL to be recrawled | Submitting a URL for Recrawl | Status and Reports > Crawl Diagnostics |
| Set up proxies for Web servers | Crawling Over Proxy Servers | Crawl and Index > Proxy Servers |
| Locate or change the user-agent name | Identifying the User Agent Name | Crawl and Index > HTTP Headers |
| Enter additional HTTP headers for the search appliance crawler to use | Identifying the User Agent Name | Crawl and Index > HTTP Headers |
| Prevent recrawling of content that resides on duplicate hosts | Preventing Crawling of Duplicate Hosts | Crawl and Index > Duplicate Hosts |
| Define rules that the search appliance crawler uses to determine document dates as it indexes documents | Defining Document Date Rules | Crawl and Index > Document Dates |
| Specify the maximum number of URLs to crawl for a host and the average number of concurrent connections to open to each Web server for crawling | Configuring Web Server Host Load Schedules | Crawl and Index > Host Load Schedule |
| View the current crawl mode and summary information about events of the past 24 hours in a crawl | Using the Admin Console to Monitor a Crawl | Status and Reports > Crawl Status |
| View crawl history for all hosts, a specific host, or a specific file | Using the Admin Console to Monitor a Crawl | Status and Reports > Crawl Diagnostics |
| Define and view a snapshot of uncrawled URLs in the crawl queue | Using the Admin Console to Monitor a Crawl | Status and Reports > Crawl Queue |
| View summary information about files that have been crawled | Using the Admin Console to Monitor a Crawl | Status and Reports > Content Statistics |
| View current license information | What Is the Search Appliance License Limit? | Administration > License |
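As an illustration of the first task in the table above (preparing data with robots.txt), the following minimal Python sketch checks which URLs a robots.txt file would permit a crawler to fetch. It uses the standard library's `urllib.robotparser`, not any search appliance API. The robots.txt content, the host `intranet.example.com`, the helper name `allowed`, and the user-agent name `gsa-crawler` are assumptions for illustration; confirm the actual user-agent name on the Crawl and Index > HTTP Headers page.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. The paths and the user-agent name
# "gsa-crawler" are assumptions for illustration only.
ROBOTS_TXT = """\
User-agent: gsa-crawler
Disallow: /private/

User-agent: *
Disallow: /
"""


def allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file content as an iterable of lines,
    # so no network request is made.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("gsa-crawler", "http://intranet.example.com/docs/index.html"),
        ("gsa-crawler", "http://intranet.example.com/private/report.html"),
        ("otherbot", "http://intranet.example.com/docs/index.html"),
    ]:
        print(agent, url, allowed(agent, url))
```

Under these assumed rules, the named crawler may fetch everything except `/private/`, while all other user agents are excluded from the whole site.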
The following table lists Google Search Appliance Admin Console pages that are used to administer a crawl. For each Admin Console page, the table provides a reference to a section in this document that describes using the page.
| Admin Console Page | Reference |
|---|---|
| Crawl and Index > Crawl URLs | Configuring a Crawl |
| Crawl and Index > Crawl Schedule | Selecting a Crawl Mode |
| Crawl and Index > Crawl Schedule | Scheduling a Crawl |
| Crawl and Index > Proxy Servers | Crawling Over Proxy Servers |
| Crawl and Index > HTTP Headers | Identifying the User Agent Name |
| Crawl and Index > Duplicate Hosts | Preventing Crawling of Duplicate Hosts |
| Crawl and Index > Document Dates | Defining Document Date Rules |
| Crawl and Index > Host Load Schedule | Configuring Web Server Host Load Schedules |
| Crawl and Index > Freshness Tuning | Freshness Tuning |
| Status and Reports > Crawl Status | Using the Admin Console to Monitor a Crawl |
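| Status and Reports > Crawl Diagnostics | Using the Admin Console to Monitor a Crawl |
| Status and Reports > Crawl Queue | Using the Admin Console to Monitor a Crawl |
| Status and Reports > Content Statistics | Using the Admin Console to Monitor a Crawl |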