Google Search Appliance software version 6.0
Posted June 2009
Crawling is the process by which the Google Search Appliance discovers enterprise content to index. This chapter tells search appliance administrators how to monitor a crawl. It also describes how to troubleshoot some common problems that may occur during a crawl.
The Admin console provides Status and Reports pages that enable you to monitor crawling. The following table describes monitoring tasks that you can perform using these pages.
| Task | Admin Console Page | Comments |
|---|---|---|
| Monitor crawling status | Status and Reports > Crawl Status | While the Google Search Appliance is crawling, you can view summary information about events of the past 24 hours using the Status and Reports > Crawl Status page. You can also use this page to stop a scheduled crawl, or to pause or restart a continuous crawl. |
| Monitor crawl history | Status and Reports > Crawl Diagnostics | While the Google Search Appliance is crawling, you can view its history using the Status and Reports > Crawl Diagnostics page. Crawl diagnostics, as well as search logs and search reports, are organized by collection. When the Status and Reports > Crawl Diagnostics page first appears, it shows the crawl history for the current domain. It shows each URL that has been fetched and timestamps for the last 10 fetches. If a fetch was not successful, an error message is also listed. From the domain level, you can navigate to lower levels that show the history for a particular host, directory, or URL. At each level, the page displays information that is pertinent to the selected level. At the URL level, the page shows summary information as well as a detailed Crawl History. You can also use this page to submit a URL for recrawl. |
| Take a snapshot of the crawl queue | Status and Reports > Crawl Queue | At any time while the Google Search Appliance is crawling, you can define and view a snapshot of the queue using the Status and Reports > Crawl Queue page. A crawl queue snapshot displays URLs that are waiting to be crawled, as of the moment of the snapshot. For each URL, the snapshot shows: |
| View information about crawled files | Status and Reports > Content Statistics | At any time while the Google Search Appliance is crawling, you can view summary information about files that have been crawled using the Status and Reports > Content Statistics page. You can also use this page to export the summary information to a comma-separated values file. |
In the Crawl History for a specific URL on the Status and Reports > Crawl Diagnostics page, the Crawl Status column lists various messages, as described in the following table.
| Crawl Status Message | Description |
|---|---|
| Crawled: New Document | The Google Search Appliance successfully fetched this URL. |
| Crawled: Cached Version | The Google Search Appliance crawled the cached version of the document. The search appliance sent an If-Modified-Since header field in its HTTP request and received a 304 response, indicating that the document is unchanged since the last crawl. |
| Retrying URL: Connection Timed Out | The Google Search Appliance set up a connection to the Web server and sent its request, but the Web server did not respond within three minutes or the HTTP transaction did not complete within three minutes. |
| Retrying URL: Host Unreachable while trying to fetch robots.txt | The Google Search Appliance could not connect to a Web server when trying to fetch robots.txt. |
| Retrying URL: Received 500 server error | The Google Search Appliance received a 500 status message from the Web server, indicating that there was an internal error on the server. |
| Excluded: Document not found (404) | The Google Search Appliance did not successfully fetch this URL. The Web server responded with a 404 status, which indicates that the document was not found. If a URL gets a status 404 when it is recrawled, it is removed from the index within 30 minutes. |
| Cookie Server Failed | The Google Search Appliance did not successfully fetch a cookie using the cookie rule. Before crawling any Web pages that match patterns defined for Forms Authentication, the search appliance executes the cookie rules. |
| Error: Permanent DNS failure | The Google Search Appliance cannot resolve the host. One possible cause is a change in your DNS servers while the appliance is still trying to access the previously cached IP address. The crawler caches the results of DNS queries for a long time, regardless of the TTL values specified in the DNS response. A workaround is to save and then revert a pattern change on the Crawl and Index > Proxy Servers page. Saving changes on this page causes internal processes to restart, which flushes the DNS cache. |
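The relationship between HTTP responses and the status messages in the table above can be illustrated with a short Python sketch. It is a simplified illustration, not the appliance's implementation: the function names are hypothetical, and only the four messages whose HTTP status code is stated in the table are mapped.

```python
from email.utils import formatdate

# Status messages copied from the table above. The one-to-one mapping from
# HTTP status code to message is a simplification made for this sketch.
CRAWL_STATUS_BY_HTTP_CODE = {
    200: "Crawled: New Document",
    304: "Crawled: Cached Version",
    404: "Excluded: Document not found (404)",
    500: "Retrying URL: Received 500 server error",
}


def conditional_headers(last_crawl_epoch: float) -> dict:
    """Build the request header for a conditional GET.

    last_crawl_epoch is the time of the previous crawl in seconds since
    the Unix epoch. The value is formatted as an HTTP date in GMT.
    """
    return {"If-Modified-Since": formatdate(last_crawl_epoch, usegmt=True)}


def crawl_status(http_code: int) -> str:
    """Return the table's status message for an HTTP status code."""
    return CRAWL_STATUS_BY_HTTP_CODE.get(
        http_code, f"Unmapped HTTP status {http_code}"
    )
```

A Web server that honors the If-Modified-Since header answers with 304 when the document has not changed since that date, which corresponds to the Crawled: Cached Version message.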
When crawling, the Google Search Appliance tests network connectivity by attempting to fetch every start URL every 30 minutes. If fewer than 10% of the start URLs return OK responses, the search appliance assumes that there are network connectivity issues with a content server. It then slows down or stops the crawl and displays the following message: "Crawl has stopped because network connectivity test of Start URLs failed." The crawl restarts when the start URL connectivity test returns an HTTP 200 response.
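The 10% rule can be expressed as a small Python check. This is a sketch of the rule as described above, not the appliance's code: the function name is hypothetical, and treating only HTTP 200 as an OK response and an empty result list as a failure are assumptions.

```python
def connectivity_test_passes(status_codes, min_ok_percent=10):
    """Return True if enough start URLs returned an OK response.

    status_codes holds one HTTP status code per start URL, or None when
    the fetch failed without a response. Only 200 counts as OK here,
    which is an assumption of this sketch.
    """
    total = len(status_codes)
    if total == 0:
        # Assumption: with no start URLs to test, report a failure.
        return False
    ok_count = sum(1 for code in status_codes if code == 200)
    # Integer arithmetic avoids floating-point rounding at the threshold.
    return ok_count * 100 >= min_ok_percent * total
```

For example, with 20 start URLs the test fails when only one returns 200 (5%) and passes when two do (10%).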
The Status and Reports > Crawl Status page in the Admin Console displays the Current Crawl Rate, which is the number of URLs being crawled per second. Slow crawling may be caused by the following factors:
These factors are described in the following sections.
The Google Search Appliance converts non-HTML documents, such as PDF files and Microsoft Office documents, to HTML before indexing them. This is a CPU-intensive process that can take up to five seconds per document. If more than 100 documents are queued up for conversion to HTML, the search appliance stops fetching more URLs.
You can see the HTML that is produced by this process by clicking the cached link for a document in the search results.
If the search appliance is crawling a single UNIX/Linux Web server, you can run the tail command-line utility on the server access logs to see what was recently crawled. The tail utility copies the last part of a file. You can also run the tcpdump command to create a dump of network traffic that you can use to analyze a crawl.
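As an alternative to reading raw tail output, the following Python sketch keeps only the most recent access-log lines that were produced by the crawler. It assumes that the crawler's requests can be recognized by a user agent string containing gsa-crawler; confirm the user agent configured on your search appliance before relying on this filter. The function name and the log path in the usage note are hypothetical.

```python
from collections import deque


def recent_crawler_hits(log_lines, user_agent="gsa-crawler", limit=10):
    """Return the last `limit` log lines that mention the crawler.

    log_lines is any iterable of access-log lines, such as an open file.
    The default user_agent value is an assumption; adjust it to match
    the user agent that your search appliance sends.
    """
    matching = (line.rstrip("\n") for line in log_lines if user_agent in line)
    # A bounded deque keeps only the newest matches, like `tail -n`.
    return list(deque(matching, maxlen=limit))
```

A typical use is to open the Web server's access log (for example, a hypothetical /var/log/httpd/access_log) and print the result of `recent_crawler_hits(log_file)` one line at a time.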
If the search appliance is crawling multiple Web servers, it can crawl through a proxy.
Crawling many complex documents can cause a slow crawl rate.
To ensure that static complex documents are not recrawled as often as dynamic documents, add the URL patterns to the Crawl Infrequently URLs on the Crawl and Index > Freshness Tuning page.
If the Google Search Appliance crawler receives many temporary server errors (500 status codes) when crawling a host, crawling slows down.
To speed up crawling, you may need to increase the value of concurrent connections to the Web server by using the Crawl and Index > Hostload Schedule page.
Network problems, such as latency, packet loss, or reduced bandwidth can be caused by several factors, including:
To find out what is causing a network problem, you can run tests from a device on the same network as the search appliance.
Use the wget program (available on most operating systems) to retrieve some large files from the Web server, with both crawling running and crawling paused. If it takes significantly longer with crawling running, you may have network problems.
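If you prefer a script to running wget by hand, the same comparison can be sketched in Python. The function names are hypothetical, and the choice of the median as the summary statistic is an assumption made to reduce the effect of a single unusually slow or fast download.

```python
import time
import urllib.request
from statistics import median


def time_download(url: str, timeout: float = 180.0) -> float:
    """Download url completely and return the elapsed time in seconds."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Read in chunks so that large files are not held in memory.
        while response.read(64 * 1024):
            pass
    return time.perf_counter() - start


def slowdown_ratio(paused_seconds, running_seconds) -> float:
    """Compare download times measured with crawling paused and running.

    Returns the median time with crawling running divided by the median
    time with crawling paused. A value well above 1.0 suggests that
    crawling is competing for network capacity.
    """
    return median(running_seconds) / median(paused_seconds)
```

Collect several `time_download` measurements for the same large file in each state, then pass the two lists to `slowdown_ratio`.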
Run the traceroute network tool from a device on the same network as the search appliance and the Web server. If your network does not permit Internet Control Message Protocol (ICMP), then you can use tcptraceroute. You should run the traceroute with both crawling running and crawling paused. If it takes significantly longer with crawling running, you may have network performance problems.
Packet loss is another indicator of a problem. You can narrow down the network hop that is causing the problem by seeing if there is a jump in the times taken at one point on the route.
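Finding the jump can be done by eye, but for long routes a few lines of Python help. The sketch below takes one round-trip time per hop, copied from traceroute output, and reports the hop with the largest increase over the previous responding hop. The function name and the input format are assumptions for illustration.

```python
def largest_latency_jump(hop_rtts_ms):
    """Return (hop_number, increase_ms) for the largest latency increase.

    hop_rtts_ms lists one round-trip time in milliseconds per hop, in
    route order. Use None for hops that did not answer. Hop numbers
    start at 1. Returns (None, 0.0) if no hop shows an increase.
    """
    best_hop, best_increase = None, 0.0
    previous_rtt = 0.0
    for hop_number, rtt in enumerate(hop_rtts_ms, start=1):
        if rtt is None:
            # Skip silent hops and compare against the last hop that answered.
            continue
        increase = rtt - previous_rtt
        if increase > best_increase:
            best_hop, best_increase = hop_number, increase
        previous_rtt = rtt
    return best_hop, best_increase
```

For round-trip times of 1.0, 1.5, no answer, 42.0, and 43.0 milliseconds, the function points to hop 4, where the time rises by 40.5 milliseconds.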
If response times are slow, you may have a slow Web server. To find out if your Web server is slow, use the wget command to retrieve some large files from the Web server. If it takes approximately the same time using wget as it does while crawling, you may have a slow Web server.
You can also log in to a Web server to determine whether there are any internal bottlenecks.
If you have a slow host, the search appliance crawler fetches lower-priority URLs from other hosts while continuing to crawl the slower host.
The crawl processes on the search appliance are run at a lower priority than the processes that serve results. If the search appliance is heavily loaded serving search queries, the crawl rate drops.
During continuous crawling, you may find that the Google Search Appliance is not recrawling URLs as quickly as specified by scheduled crawl times in the crawl queue snapshot. The amount of time that a URL has been in the crawl queue past its scheduled recrawl time is the URL's "wait time."
Wait times can occur when your enterprise content includes:
If the search appliance crawler needs four hours to catch up to the URLs in the crawl queue whose scheduled crawl time has already passed, the wait time for crawling the URLs is four hours. In extreme cases, wait times can be several days. The search appliance cannot recrawl a URL more frequently than the wait time.
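The definition of wait time can be made concrete with a worked example in Python. The timestamps are invented for illustration and the function name is hypothetical; the sketch only applies the definition to a list of scheduled crawl times and does not reflect any appliance interface.

```python
from datetime import datetime, timedelta


def longest_wait_time(now, scheduled_crawl_times):
    """Return the longest wait time among the given URLs.

    A URL's wait time is how far `now` is past its scheduled crawl time.
    URLs scheduled in the future have no wait time and are ignored.
    """
    overdue = [now - scheduled for scheduled in scheduled_crawl_times
               if scheduled < now]
    return max(overdue, default=timedelta(0))


# Invented example: three URLs observed at noon.
now = datetime(2009, 6, 1, 12, 0)
scheduled = [
    datetime(2009, 6, 1, 8, 0),    # four hours overdue
    datetime(2009, 6, 1, 11, 30),  # thirty minutes overdue
    datetime(2009, 6, 1, 13, 0),   # not yet due
]
wait = longest_wait_time(now, scheduled)
```

Here `wait` is four hours, which matches the four-hour example above.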
It is not possible for an administrator to view the maximum wait time for URLs in the crawl queue or to view the number of URLs in the queue whose scheduled crawl time has passed. However, you can use the Status and Reports > Crawl Queue page to create a crawl queue snapshot, which shows:
If the Google Search Appliance receives an error when fetching a URL, it records the error in Status and Reports > Crawl Diagnostics and schedules a retry after a certain time interval. The search appliance maintains an error count for each URL, and the time interval between retries increases as the error count rises. The maximum retry interval is three weeks.
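The following Python sketch shows one way that a retry interval can grow with the error count and stop at the three-week maximum. The documentation does not state the actual schedule, so the one-hour starting interval and the doubling rule are assumptions chosen purely for illustration; only the three-week cap comes from the text above.

```python
from datetime import timedelta

# From the text above: the maximum retry interval is three weeks.
MAX_RETRY_INTERVAL = timedelta(weeks=3)


def retry_interval(error_count, base=timedelta(hours=1)):
    """Return a retry interval that grows with the error count.

    Assumption: the interval starts at `base` and doubles with each
    additional error. The appliance's real schedule is not documented
    here and may differ.
    """
    if error_count < 1:
        raise ValueError("error_count must be at least 1")
    interval = base
    for _ in range(error_count - 1):
        interval *= 2
        if interval >= MAX_RETRY_INTERVAL:
            # Stop early so that very large error counts cannot overflow.
            return MAX_RETRY_INTERVAL
    return min(interval, MAX_RETRY_INTERVAL)
```

Under these assumptions, the first error schedules a retry after one hour, the fourth after eight hours, and from the tenth error onward the interval stays at three weeks.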
The search appliance crawler distinguishes between permanent and temporary errors. There is a lower retry interval for temporary errors than for permanent errors.
Permanent errors occur when the document is no longer reachable using the URL. When the search appliance encounters a permanent error, it removes the document from the crawl queue and the index, if present.
Temporary errors occur when the URL is unavailable because of a temporary move or a temporary user or server error. When the search appliance encounters a temporary error, it retains the document in the crawl queue and the index, with the intention of recrawling it at a later time.
The following table lists permanent and temporary Web server errors.