Administering Crawl for Web and File Share Content: Crawl Quick Reference

Google Search Appliance software version 4.6
Google Mini software version 4.6
Posted July 2007

This document provides an overview of how the Google Search Appliance and the Google Mini crawl and index enterprise content.

For the Google Search Appliance, information about continuous crawl applies to software version 4.2, and information about full crawl and file system crawl applies to software version 4.6 and later.

For the Google Mini, all information applies to software version 4.4 and later.

Contents

  1. Crawl Administration Tasks
  2. Admin Console Crawl Pages

Crawl Administration Tasks

The following table lists search appliance crawl administration tasks. For each task, the table gives a reference to a section in this document that describes it, as well as the page in the Admin Console that you use to accomplish the task.

Task | Reference | Admin Console Page
Prepare your data for crawling: robots.txt, Robots META tags, googleoff/googleon tags, no_crawl directories, shared folders, and jump pages | Preparing Data for a Crawl | (none)
Set up the crawl path: start URLs, follow and crawl URLs, do not crawl URLs | Configuring a Crawl | Crawl and Index > Crawl URLs
Test URL patterns in the crawl path | Testing Your URL Patterns | Crawl and Index > Crawl URLs
Select a crawl mode: continuous crawl or full crawl | Selecting a Crawl Mode | Crawl and Index > Crawl Schedule
Schedule a full crawl | Scheduling a Full Crawl | Crawl and Index > Crawl Schedule
Configure a continuous crawl: URLs to crawl frequently, URLs to crawl infrequently, URLs to always force recrawl | Freshness Tuning | Crawl and Index > Freshness Tuning
Pause or restart a continuous crawl | Stopping, Pausing, or Resuming a Crawl | Status and Reports > Crawl Status
Stop a full crawl | Stopping, Pausing, or Resuming a Crawl | Status and Reports > Crawl Status
Submit a URL to be recrawled | Freshness Tuning | Crawl and Index > Freshness Tuning
Submit a URL to be recrawled | Submitting a URL for Recrawl | Status and Reports > Crawl Diagnostics
Set up proxies for Web servers | Crawling Over Proxy Servers | Crawl and Index > Proxy Servers
Locate or change the user-agent name | Identifying the User Agent Name | Crawl and Index > HTTP Headers
Enter additional HTTP headers for the search appliance crawler to use | Identifying the User Agent Name | Crawl and Index > HTTP Headers
Prevent recrawling of content that resides on duplicate hosts | Preventing Crawling of Duplicate Hosts | Crawl and Index > Duplicate Hosts
Define rules for the search appliance crawler to use as it indexes documents | Defining Document Date Rules | Crawl and Index > Document Dates
Specify the maximum number of URLs to crawl for a host and the average number of concurrent connections to open to each Web server for crawling | Configuring Web Server Host Load Schedules | Crawl and Index > Host Load Schedule
View the current crawl mode and summary information about events of the past 24 hours in a crawl | Using the Admin Console to Monitor a Crawl | Status and Reports > Crawl Status
View crawl history for all hosts, a specific host, or a specific file | Using the Admin Console to Monitor a Crawl | Status and Reports > Crawl Diagnostics
Define and view a snapshot of uncrawled URLs in the crawl queue | Using the Admin Console to Monitor a Crawl | Status and Reports > Crawl Queue
View summary information about files that have been crawled | Using the Admin Console to Monitor a Crawl | Status and Reports > Content Statistics
View current license information | What Is the Search Appliance License Limit? | Administration > License
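
The first task in the table, preparing data with robots.txt, can be checked before a crawl starts. The following is a minimal sketch, not part of the search appliance software, that uses Python's standard urllib.robotparser module to test whether a set of robots.txt rules would allow a crawler to fetch a URL. The sample rules, the host intranet.example.com, the helper name is_crawl_allowed, and the user-agent name gsa-crawler are assumptions made for illustration; use the user-agent name shown on the Crawl and Index > HTTP Headers page of your own appliance.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: gsa-crawler
Disallow: /drafts/
"""


def is_crawl_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a sequence of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "gsa-crawler" is an assumed user-agent name, not confirmed by this document.
    for path in ("/drafts/plan.html", "/docs/index.html"):
        url = "http://intranet.example.com" + path
        print(url, is_crawl_allowed(ROBOTS_TXT, "gsa-crawler", url))
```

A group that names a specific user agent takes precedence over the wildcard group for that agent, so in this sample the /private/ rule applies only to crawlers that are not matched by the gsa-crawler group.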

 


Admin Console Crawl Pages

The following table lists search appliance Admin Console pages that are used to administer a crawl. For each Admin Console page, the table provides a reference to a section in this document that describes using the page.

Admin Console Page | Reference
Crawl and Index > Crawl URLs | Configuring a Crawl
Crawl and Index > Crawl Schedule | Selecting a Crawl Mode
Crawl and Index > Crawl Schedule | Scheduling a Full Crawl
Crawl and Index > Proxy Servers | Crawling Over Proxy Servers
Crawl and Index > HTTP Headers | Identifying the User Agent
Crawl and Index > Duplicate Hosts | Preventing Crawling of Duplicate Hosts
Crawl and Index > Document Dates | Defining Document Date Rules
Crawl and Index > Host Load Schedule | Configuring Web Server Host Load Schedules
Crawl and Index > Freshness Tuning | Freshness Tuning
Status and Reports > Crawl Status | Using the Admin Console to Monitor a Crawl
Status and Reports > Crawl Diagnostics | Using the Admin Console to Monitor a Crawl
Status and Reports > Crawl Queue | Using the Admin Console to Monitor a Crawl
Status and Reports > Content Statistics | Using the Admin Console to Monitor a Crawl

 

