Google Search Appliance software version 6.0
Posted June 2009
Crawling is the process by which the Google Search Appliance discovers enterprise content to index. This chapter tells search appliance administrators how to start a crawl.
Before crawling starts, you must use the Crawl and Index > Crawl Schedule page in the Admin Console to select one of the following crawl modes:
If you select scheduled crawl mode, you must specify a start time and a duration for the crawl. If you select and save Continuous crawl mode, crawling starts and a link to the Freshness Tuning page appears.
For complete information about the Crawl and Index > Crawl Schedule page, click Help Center > Crawl and Index > Crawl Schedule in the Admin Console.
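To illustrate how a start time and a duration together define a crawl window, here is a minimal sketch. It is not the appliance's own scheduling logic: the function name `in_crawl_window` is hypothetical, and the sketch assumes a schedule that repeats daily with a duration shorter than 24 hours, which the text above does not state.

```python
from datetime import datetime, time, timedelta


def in_crawl_window(now: datetime, start: time, duration: timedelta) -> bool:
    """Return True if `now` falls inside a crawl window.

    Assumption (not from the product documentation): the window
    repeats every day and lasts less than 24 hours.
    """
    # Check the window that opens today and the one that opened
    # yesterday, because a window may run past midnight.
    for day_offset in (0, -1):
        window_day = now.date() + timedelta(days=day_offset)
        window_start = datetime.combine(window_day, start)
        if window_start <= now < window_start + duration:
            return True
    return False


if __name__ == "__main__":
    # A window that opens at 22:00 and lasts four hours.
    print(in_crawl_window(datetime(2009, 6, 2, 1, 30), time(22, 0), timedelta(hours=4)))
```

The end of the window is treated as exclusive, so a check made exactly at start time plus duration reports that crawling is no longer scheduled.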
The search appliance starts crawling in scheduled crawl mode according to a schedule that you specify using the Crawl and Index > Crawl Schedule page in the Admin Console. Using this page, you can specify:
Using the Status and Reports > Crawl Status page in the Admin Console, you can:
When you stop crawling:
When you pause crawling, the Google Search Appliance stops only the crawling of documents in the index. Connectivity tests for Start URLs still run every 30 minutes, and you may notice this activity in access logs.
For complete information about the Status and Reports > Crawl Status page, click Help Center > Status and Reports > Crawl Status in the Admin Console.
Occasionally, you may want a recently changed URL to be recrawled sooner than the Google Search Appliance has scheduled it. Provided that the URL has been previously crawled, you can submit it for immediate recrawling from the Admin Console using one of the following methods:
URLs that you submit for recrawling are treated the same way as new, uncrawled URLs in the crawl queue. They are scheduled to be crawled in order of Enterprise PageRank, and before any URLs that the search appliance has automatically scheduled for recrawling.
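The ordering described above can be modeled with a short sketch. This is an illustration only, not the appliance's crawl queue: the function `crawl_order` and the numeric rank values are hypothetical, "in order of Enterprise PageRank" is assumed to mean highest rank first, and automatically scheduled URLs are simply kept in the order given because the text does not say how they are ordered among themselves.

```python
def crawl_order(submitted, auto_scheduled):
    """Return URLs in the order they would be crawled under this model.

    Both arguments are lists of (url, enterprise_pagerank) pairs.
    Hypothetical model: submitted URLs go first, highest rank first;
    automatically scheduled URLs follow in the order supplied.
    """
    # sorted() is stable, so URLs with equal rank keep their input order.
    submitted_first = sorted(submitted, key=lambda entry: -entry[1])
    return [url for url, _ in submitted_first] + [url for url, _ in auto_scheduled]


if __name__ == "__main__":
    submitted = [("http://intranet/a", 3), ("http://intranet/b", 7)]
    auto = [("http://intranet/c", 9)]
    for url in crawl_order(submitted, auto):
        print(url)
```

Note that in this model a submitted URL with a low rank is still crawled before an automatically scheduled URL with a high rank, which matches the statement that submitted URLs come before any automatically scheduled ones.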
When you trigger a URL to be recrawled using the Admin Console:
So there may be a time lag of up to 20 hours before the URL that you have submitted is recrawled.
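As a worked example of the 20-hour figure, the sketch below computes the latest time by which a submitted URL could be expected to be recrawled. The function name and the constant are hypothetical; only the 20-hour upper bound comes from the text above.

```python
from datetime import datetime, timedelta

# Upper bound on the lag stated in this chapter.
MAX_RECRAWL_LAG = timedelta(hours=20)


def latest_expected_recrawl(submitted_at: datetime) -> datetime:
    """Return the latest time the recrawl is expected, given the 20-hour bound."""
    return submitted_at + MAX_RECRAWL_LAG


if __name__ == "__main__":
    # A URL submitted at 09:00 may not be recrawled until 05:00 the next day.
    print(latest_expected_recrawl(datetime(2009, 6, 1, 9, 0)))
```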
The process of crawling a database is called "synchronizing" a database. After you configure database crawling, you can start synchronizing a database by using the Crawl and Index > Databases page in the Admin Console.
To synchronize a database:
The database synchronization runs until it is complete.
For more information about starting a database crawl, refer to Database Crawling and Serving.