Google Search Appliance

Administering Crawl for Web and File Share Content: Running a Crawl

Google Search Appliance software version 6.0
Posted June 2009

Crawling is the process by which the Google Search Appliance discovers enterprise content to index. This chapter tells search appliance administrators how to start a crawl.

Contents

  1. Selecting a Crawl Mode
  2. Scheduling a Crawl
  3. Stopping, Pausing, or Resuming a Crawl
  4. Submitting a URL to Be Recrawled
  5. Starting a Database Crawl

Selecting a Crawl Mode

Before crawling starts, you must use the Crawl and Index > Crawl Schedule page in the Admin Console to select one of the following crawl modes:

  • Continuous crawl
  • Scheduled crawl

If you select Scheduled crawl mode, you must schedule a time for crawling to start and a duration for the crawl. If you select and save Continuous crawl mode, crawling starts and a link to the Freshness Tuning page appears.

For complete information about the Crawl and Index > Crawl Schedule page, click Help Center > Crawl and Index > Crawl Schedule in the Admin Console.

Scheduling a Crawl

The search appliance starts crawling in scheduled crawl mode according to a schedule that you specify using the Crawl and Index > Crawl Schedule page in the Admin Console. Using this page, you can specify:

  • The day, hour, and minute when crawling should start
  • Maximum duration for crawling
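To make the relationship between the start time and the maximum duration concrete, here is a minimal sketch in Python. It is only an illustration, not appliance code: the function name `crawl_window` is hypothetical, and the sketch assumes the schedule repeats weekly on the chosen day, which this chapter does not state.

```python
from datetime import datetime, timedelta

def crawl_window(now, weekday, hour, minute, max_duration):
    """Return (start, end) of the next crawl window at or after `now`.

    weekday uses Python's convention: 0 = Monday ... 6 = Sunday.
    Assumption: the schedule repeats weekly (not confirmed by this chapter).
    """
    days_ahead = (weekday - now.weekday()) % 7
    start = (now + timedelta(days=days_ahead)).replace(
        hour=hour, minute=minute, second=0, microsecond=0
    )
    if start < now:
        # The start time on the chosen day has already passed this week.
        start += timedelta(days=7)
    return start, start + max_duration

# Example: a crawl scheduled for Saturday at 01:30 with a six-hour limit.
start, end = crawl_window(
    datetime(2009, 6, 1, 12, 0), weekday=5, hour=1, minute=30,
    max_duration=timedelta(hours=6),
)
```

In this sketch, crawling would have to finish by `end` even if some content had not yet been reached, which is the role of the maximum duration.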

Stopping, Pausing, or Resuming a Crawl

Using the Status and Reports > Crawl Status page in the Admin Console, you can:

  • Stop crawling (scheduled crawl mode)
  • Pause crawling (continuous crawl mode)
  • Resume crawling (continuous crawl mode)
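The pairing of actions and crawl modes in the list above can be summarized as a small lookup table. The following Python sketch is a hypothetical model for illustration only; the names `ACTIONS_BY_MODE` and `can_perform` are not part of the Admin Console or any appliance interface.

```python
# Hypothetical model of which Crawl Status action applies in which crawl mode,
# taken directly from the list above.
ACTIONS_BY_MODE = {
    "scheduled": {"stop"},
    "continuous": {"pause", "resume"},
}

def can_perform(mode: str, action: str) -> bool:
    """Return True if `action` is listed for `mode` in this simplified model."""
    return action in ACTIONS_BY_MODE.get(mode, set())
```

For example, `can_perform("continuous", "pause")` is true in this model, while `can_perform("scheduled", "pause")` is not.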

When you stop crawling:

  • The documents that were crawled remain in the index
  • The index contains some old documents and some newly crawled documents

When you pause crawling, the Google Search Appliance stops only the crawling of documents in the index. Connectivity tests for Start URLs still run every 30 minutes, and you may notice this activity in access logs.

For complete information about the Status and Reports > Crawl Status page, click Help Center > Status and Reports > Crawl Status in the Admin Console.

Submitting a URL to Be Recrawled

Occasionally, you may want a recently changed URL to be recrawled sooner than the Google Search Appliance has scheduled it for recrawling. Provided that the URL has been crawled previously, you can submit it for immediate recrawling from the Admin Console.

URLs that you submit for recrawling are treated the same way as new, uncrawled URLs in the crawl queue. They are scheduled to be crawled in order of Enterprise PageRank, and before any URLs that the search appliance has automatically scheduled for recrawling.
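The ordering described above can be illustrated with a priority queue. This Python sketch is a simplified model, not the appliance's actual scheduler. It assumes that a higher Enterprise PageRank is crawled first and that automatically scheduled URLs are also ordered by PageRank among themselves; the source states neither detail. All names are hypothetical.

```python
import heapq

# Queue classes: a lower number is crawled earlier.
SUBMITTED_OR_NEW = 0   # new, uncrawled URLs and URLs submitted for recrawling
AUTO_RECRAWL = 1       # URLs the appliance scheduled for recrawling itself

def crawl_order(entries):
    """Return URLs in crawl order.

    entries: iterable of (url, pagerank, queue_class) tuples.
    Assumption: within a class, higher PageRank is crawled first.
    """
    # Negate PageRank so the min-heap pops the highest-ranked URL first.
    heap = [(queue_class, -pagerank, url) for url, pagerank, queue_class in entries]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]

order = crawl_order([
    ("http://intranet.example/a", 0.9, AUTO_RECRAWL),
    ("http://intranet.example/b", 0.2, SUBMITTED_OR_NEW),
    ("http://intranet.example/c", 0.7, SUBMITTED_OR_NEW),
])
```

In this model, both submitted URLs come before the automatically scheduled one, even though the latter has the highest PageRank.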

When you trigger a URL to be recrawled using the Admin Console:

  1. The document's change interval is reset to the default, that is, a recrawling frequency of between three and 20 days.
  2. Up to 20 hours later, the search appliance schedules recrawling of the URL and the change interval is corrected.

As a result, there may be a time lag of up to 20 hours before the URL that you submitted is recrawled.

Starting a Database Crawl

The process of crawling a database is called "synchronizing" a database. After you configure database crawling, you can start synchronizing a database by using the Crawl and Index > Databases page in the Admin Console.

To synchronize a database:

  1. Click Crawl and Index > Databases.
  2. In the Current Databases section of the page, click the Sync link next to the database that you want to synchronize.

    The database synchronization runs until it is complete.

For more information about starting a database crawl, refer to Database Crawling and Serving.

 
