Google Search Appliance

Administering Crawl for Web and File Share Content: Crawl Quick Reference

Google Search Appliance software version 6.0
Posted June 2009

Crawling is the process by which the Google Search Appliance discovers enterprise content to index. This chapter provides reference information about crawl administration tasks.

Contents

  1. Crawl Features
  2. Crawl Administration Tasks
  3. Admin Console Crawl Pages

Crawl Features

The following table lists Google Search Appliance crawl features. For each feature, the table lists the page in the Admin Console where you can use the feature and a reference to a section in this document that describes it.

Feature | Admin Console Page | Reference
Always force recrawl URLs | Crawl and Index > Freshness Tuning | Freshness Tuning
Content statistics | Status and Reports > Content Statistics | Using the Admin Console to Monitor a Crawl
Continuous crawl | Crawl and Index > Crawl Schedule | Selecting a Crawl Mode
Crawl diagnostics | Status and Reports > Crawl Diagnostics | Using the Admin Console to Monitor a Crawl
Crawl frequently URLs | Crawl and Index > Freshness Tuning | Freshness Tuning
Crawl infrequently URLs | Crawl and Index > Freshness Tuning | Freshness Tuning
Crawl modes | Crawl and Index > Crawl Schedule | Selecting a Crawl Mode
Crawl queue snapshots | Status and Reports > Crawl Queue | Using the Admin Console to Monitor a Crawl
Crawl schedule | Crawl and Index > Crawl Schedule | Scheduling a Crawl
Crawl status | Status and Reports > Crawl Status | Using the Admin Console to Monitor a Crawl
Crawl URLs | Crawl and Index > Crawl URLs | Configuring a Crawl
Do not crawl URLs | Crawl and Index > Crawl URLs | Configuring a Crawl
Document dates | Crawl and Index > Document Dates | Defining Document Date Rules
Duplicate hosts | Crawl and Index > Duplicate Hosts | Preventing Crawling of Duplicate Hosts
Follow and crawl only URLs | Crawl and Index > Crawl URLs | Configuring a Crawl
Freshness tuning | Crawl and Index > Freshness Tuning | Freshness Tuning
Host load exceptions | Crawl and Index > Host Load Schedule | Configuring Web Server Host Load Schedules
Host load schedule | Crawl and Index > Host Load Schedule | Configuring Web Server Host Load Schedules
HTTP headers | Crawl and Index > HTTP Headers | Identifying the User Agent Name
Maximum number of URLs to crawl | Crawl and Index > Host Load Schedule | Configuring Web Server Host Load Schedules
Proxy servers | Crawl and Index > Proxy Servers | Crawling Over Proxy Servers
Recrawl URLs | Crawl and Index > Freshness Tuning | Freshness Tuning
Scheduled crawl | Crawl and Index > Crawl Schedule | Selecting a Crawl Mode
Start crawling from the following URLs | Crawl and Index > Crawl URLs | Configuring a Crawl
Web server host load | Crawl and Index > Host Load Schedule | Configuring Web Server Host Load Schedules

Crawl Administration Tasks

The following table lists Google Search Appliance crawl administration tasks. For each task, the table gives a reference to a section in this document that describes it, as well as the page in the Admin Console that you use to accomplish the task.

Task | Reference | Admin Console Page
Prepare your data for crawling: robots.txt, Robots META tags, googleoff/googleon tags, no_crawl directories, shared folders, and jump pages | Preparing Data for a Crawl | -
Set up the crawl path: start URLs, follow and crawl URLs, do not crawl URLs | Configuring a Crawl | Crawl and Index > Crawl URLs
Test URL patterns in the crawl path | Testing Your URL Patterns | Crawl and Index > Crawl URLs
Select a crawl mode: continuous crawl or scheduled crawl | Selecting a Crawl Mode | Crawl and Index > Crawl Schedule
Schedule a crawl | Scheduling a Crawl | Crawl and Index > Crawl Schedule
Configure a continuous crawl: URLs to crawl frequently, URLs to crawl infrequently, URLs to always force recrawl | Freshness Tuning | Crawl and Index > Freshness Tuning
Pause or restart a continuous crawl | Stopping, Pausing, or Resuming a Crawl | Status and Reports > Crawl Status
Stop a scheduled crawl | Stopping, Pausing, or Resuming a Crawl | Status and Reports > Crawl Status
Submit a URL to be recrawled | Freshness Tuning | Crawl and Index > Freshness Tuning
Submit a URL to be recrawled | Submitting a URL for Recrawl | Status and Reports > Crawl Diagnostics
Set up proxies for Web servers | Crawling Over Proxy Servers | Crawl and Index > Proxy Servers
Locate or change the user-agent name | Identifying the User Agent Name | Crawl and Index > HTTP Headers
Enter additional HTTP headers for the search appliance crawler to use | Identifying the User Agent Name | Crawl and Index > HTTP Headers
Prevent recrawling of content that resides on duplicate hosts | Preventing Crawling of Duplicate Hosts | Crawl and Index > Duplicate Hosts
Define document date rules for the search appliance crawler to use as it indexes documents | Defining Document Date Rules | Crawl and Index > Document Dates
Specify the maximum number of URLs to crawl for a host and the average number of concurrent connections to open to each Web server for crawling | Configuring Web Server Host Load Schedules | Crawl and Index > Host Load Schedule
View the current crawl mode and summary information about events of the past 24 hours in a crawl | Using the Admin Console to Monitor a Crawl | Status and Reports > Crawl Status
View crawl history for all hosts, a specific host, or a specific file | Using the Admin Console to Monitor a Crawl | Status and Reports > Crawl Diagnostics
Define and view a snapshot of uncrawled URLs in the crawl queue | Using the Admin Console to Monitor a Crawl | Status and Reports > Crawl Queue
View summary information about files that have been crawled | Using the Admin Console to Monitor a Crawl | Status and Reports > Content Statistics
View current license information | What Is the Search Appliance License Limit? | Administration > License


Admin Console Crawl Pages

The following table lists Google Search Appliance Admin Console pages that are used to administer a crawl. For each Admin Console page, the table provides a reference to a section in this document that describes using the page.

Admin Console Page | Reference
Crawl and Index > Crawl URLs | Configuring a Crawl
Crawl and Index > Crawl Schedule | Selecting a Crawl Mode
Crawl and Index > Crawl Schedule | Scheduling a Crawl
Crawl and Index > Proxy Servers | Crawling Over Proxy Servers
Crawl and Index > HTTP Headers | Identifying the User Agent Name
Crawl and Index > Duplicate Hosts | Preventing Crawling of Duplicate Hosts
Crawl and Index > Document Dates | Defining Document Date Rules
Crawl and Index > Host Load Schedule | Configuring Web Server Host Load Schedules
Crawl and Index > Freshness Tuning | Freshness Tuning
Status and Reports > Crawl Status | Using the Admin Console to Monitor a Crawl
Status and Reports > Crawl Diagnostics | Using the Admin Console to Monitor a Crawl
Status and Reports > Crawl Queue | Using the Admin Console to Monitor a Crawl
Status and Reports > Content Statistics | Using the Admin Console to Monitor a Crawl
