Google Search Appliance software version 4.6
Google Mini software version 4.6
Posted July 2007
This document provides an overview of how the Google Search Appliance and the Google Mini crawl and index enterprise content.
For the Google Search Appliance, information about continuous crawl applies to software version 4.2 (unless otherwise noted), and information about full crawl and file system crawl applies to software version 4.6 and later.
For the Google Mini, all information applies to software version 4.4 and later.
Before anyone can use the Google Search Appliance or Google Mini to search your enterprise content, the search appliance must build the search index, which enables search queries to be quickly matched to results. To build the search index, the search appliance must browse, or "crawl" your enterprise content, as illustrated in the following example.
The administration at Missitucky University plans to offer its staff, faculty, and students simple, fast, and secure search across all their content using the Google Search Appliance. To achieve this goal, the search appliance must crawl their content, starting at the Missitucky University Web site's home page.
Missitucky University has a Web site that provides categories of information such as Admissions, Class Schedules, Events, and News Stories. The Web site's home page lists hyperlinks to other URLs for pages in each of these categories. For example, the News Stories hyperlink on the home page points to a URL for a page that contains hyperlinks to all recent news stories. Similarly, each news story contains hyperlinks that point to other URLs.
The relations among the hyperlinks within the Missitucky University Web site constitute a virtual web, or pathway that connects the URLs to each other. Starting at the home page and following this pathway, the search appliance can crawl from URL to URL, browsing content as it goes.
Crawling Missitucky University's content actually begins with a list of URLs ("start URLs") where the search appliance should start browsing; in this example, the first start URL is the Missitucky University home page.
The search appliance visits the Missitucky University home page, then it:
By repeating these steps for each URL in the crawl queue, the search appliance can crawl all of Missitucky University's content. As a result, the search appliance gathers the information that it needs to build the search index, and ultimately, to serve search results to end users.
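The crawl loop in this example can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the appliance's implementation: the `LINKS` dictionary stands in for fetching a page and extracting its hyperlinks, and every URL and name in it is hypothetical.

```python
from collections import deque

# Hypothetical link graph: each URL maps to the URLs that its page links to.
HOME = "http://www.missitucky.example/"
LINKS = {
    HOME: [HOME + "admissions", HOME + "news"],
    HOME + "news": [HOME + "news/story1", HOME],
    HOME + "admissions": [],
    HOME + "news/story1": [],
}


def crawl(start_urls):
    """Visit every URL reachable from the start URLs, each exactly once."""
    queue = deque(start_urls)   # the crawl queue
    seen = set(start_urls)      # URLs already queued or crawled
    crawled = []
    while queue:
        url = queue.popleft()
        crawled.append(url)                 # stands in for fetching the page
        for link in LINKS.get(url, []):     # follow the hyperlinks on the page
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return crawled
```

Calling `crawl([HOME])` visits the home page first and then each linked page once, even though the news page links back to the home page.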
Because Missitucky University's content changes constantly, the search appliance continuously crawls it to keep the search index and the search results up-to-date.
Both the Google Search Appliance and Google Mini support two modes of crawling: continuous crawl and full crawl.
For information about choosing a crawl mode and starting a crawl see Running a Crawl.
In continuous crawl mode, the search appliance is crawling your enterprise content at all times, ensuring that newly added or updated content is added to the index as quickly as possible. After the Google Search Appliance or Google Mini is installed, it defaults to continuous crawl mode and establishes the default collection.
The search appliance can automatically determine which URLs change often and should be crawled frequently, and which URLs seldom change and should be crawled infrequently. Each URL is crawled twice per "change interval," which is how often the URL is observed to change. For example, if your corporate intranet portal page changes every 24 hours, the search appliance crawls it every 12 hours.
The search appliance does not recrawl any URLs until all new URLs have been discovered or the license limit has been reached. A URL in the index is recrawled even if there are no longer any links to that URL from other pages in the index.
In full crawl mode, the search appliance crawls all of your enterprise content once, at a scheduled time.
Both the Google Search Appliance and Google Mini can crawl and index content that is stored in the following types of sources:
Crawling FTP is not supported on either the Google Search Appliance or the Google Mini.
The next sections describe each type of source.
Public Web content is available to all users. The Google Search Appliance and Google Mini can crawl and index both public and secure enterprise content that resides on a variety of Web servers, including these:
Secure Web content is protected by authentication mechanisms and is available only to users who are members of certain authorized groups. Both the Google Search Appliance and Google Mini can crawl and index secure content protected by:
Only the Google Search Appliance can crawl and index content protected by forms-based single sign-on systems.
For HTTPS websites, the search appliance uses a serving certificate as a client certificate when crawling. You can upload a new serving certificate using the Admin Console. Some Web servers do not accept client certificates unless they are signed by trusted Certificate Authorities.
Both the Google Search Appliance and Google Mini can also crawl several file formats, including Microsoft Word, Excel, and Adobe PDF, that reside on network file shares. The shares can use these protocols:
For a complete list of supported file formats, refer to Help Center > More Information > Crawling and Indexing in the Admin Console.
File system crawling does not support the Google Search Appliance and Google Mini's serve-time security features. All content retrieved during file system crawling will be available to all search users, whether or not they present authentication credentials. If it is necessary to control access to search results that cover content on a network file share, consider setting up directory browsing on a Web server that can access this file share. This enables the search appliance to crawl the file share as Web content.
The Google Search Appliance and Google Mini do not crawl or index enterprise content that is excluded by these mechanisms:
Also the search appliance cannot:
The following sections describe all these exclusions.
A search appliance administrator can prohibit the search appliance crawler from following and indexing particular URLs. For example, any URL that should not appear in search results or be counted as part of the search appliance license limit should be excluded from crawling. For more information, refer to Configuring a Crawl.
To prohibit any crawler from accessing all or some of the content on an HTTP or HTTPS site, a content server administrator or webmaster typically adds a robots.txt file to the content server or Web site. This file tells the crawlers to ignore all or some files and directories on the server or site. Documents crawled using other protocols, such as SMB, are not affected by the restrictions of robots.txt.
The search appliance crawler always obeys the rules in robots.txt. You cannot override this feature. Before crawling HTTP or HTTPS URLs on a host, the search appliance fetches the robots.txt file. For example, before crawling any URLs on http://www.mycompany.com/ or https://www.mycompany.com/, the search appliance fetches http://www.mycompany.com/robots.txt.
When the search appliance requests the robots.txt file, the host returns an HTTP response that determines whether or not the search appliance can crawl the site. The following table lists HTTP responses and how the search appliance crawler responds to them.
| HTTP Response | File Returned? | Search Appliance Crawler Response |
|---|---|---|
| 200 OK | Yes | The search appliance crawler obeys exclusions specified by robots.txt when fetching URLs on the site. |
| 404 Not Found | No | The search appliance crawler assumes that there are no exclusions to crawling the site and proceeds to fetch URLs. |
| Other responses | | The search appliance crawler assumes that it is not permitted to crawl the site and does not fetch URLs. |
When crawling, the search appliance caches robots.txt files and periodically tests sites for changes to these files. If changes to a robots.txt file prohibit access to documents that have already been indexed, those documents are removed from the index. If the search appliance can no longer access robots.txt on a particular site, all the URLs on that site are removed from the index.
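The response handling in the table above can be sketched in Python. This is an illustration only: the function names, the returned labels, and the `example-crawler` user agent are hypothetical, and the standard library's `urllib.robotparser` stands in for whatever parser the appliance actually uses.

```python
from urllib.robotparser import RobotFileParser


def robots_policy(status_code):
    """Map the HTTP status of a robots.txt request to a crawl decision."""
    if status_code == 200:
        return "obey-robots-txt"   # file returned: apply its exclusions
    if status_code == 404:
        return "crawl-all"         # no file: assume there are no exclusions
    return "do-not-crawl"          # any other response: assume crawling is not permitted


def allowed(robots_txt, user_agent, url):
    """Check one URL against the rules in a robots.txt body (200 OK case)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

For example, with a robots.txt body of `User-agent: *` followed by `Disallow: /private/`, `allowed` rejects URLs under `/private/` and accepts other URLs on the same host.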
For general information on how robots.txt files and Robots META tags work, see the Robots exclusion standard at: http://www.robotstxt.org/wc/exclusion.html.
The search appliance does not crawl a Web page if it has been marked with the nofollow Robots META tag.
The search appliance does not crawl links that are embedded within an area tag. The HTML area tag is used to define a mouse-sensitive region on a page, which can contain a hyperlink. When the user moves the pointer into a region defined by an area tag, the arrow pointer changes to a hand and the URL of the associated hyperlink appears at the bottom of the window.
For example, the following HTML defines a region that contains a link:
<map name="n5BDE56.Body.1.4A70">
  <area shape="rect" coords="0,116,311,138" id="TechInfoCenter" href="http://www.bbb.com/main/help/ourcampaign/ourcampaign.htm" alt="">
</map>
When the search appliance crawler follows newly discovered links in URLs, it does not follow the link (http://www.bbb.com/main/help/ourcampaign/ourcampaign.htm) within this area tag.
Because the search appliance crawler discovers new content by following links within documents, it cannot find a URL that is not linked from another document through this process.
You can enable the search appliance crawler to discover any unlinked URLs in your enterprise content by:
Before crawling starts, the search appliance administrator configures the crawl path, which includes URLs where crawling should start, as well as URL patterns that the crawler should follow and should not follow. Other information that webmasters, content owners, and search appliance administrators typically prepare before crawling starts includes:
This section describes how both the Google Search Appliance and Google Mini crawl Web and network file share content as it applies to both full crawl and continuous crawl modes.
This section contains data flow diagrams, used to illustrate how the search appliance crawls enterprise content. The following table describes the symbols used in these diagrams.
| Symbol | Definition | Example |
|---|---|---|
| ![]() | Start state or Stop state | Start crawl, end crawl |
| ![]() | Process | Follow links within the document |
| | Data store, which can be a database, file system, or any other type of data store | Crawl queue |
| | Data flow among processes, data stores, and external interactors | URLs |
| | External input or terminator, which can be a process in another diagram | Delete URL |
| ![]() | Callout | |
The following diagram provides an overview of the following major crawling processes:
The sections following the diagram provide details about each of these major processes.

The crawl queue is a list of URLs that the search appliance will crawl. The search appliance associates each URL in the crawl queue with a priority, typically based on estimated Enterprise PageRank. Enterprise PageRank is a measure of the relative importance of a Web page within the set of your enterprise content. It is calculated using a link-analysis algorithm similar to the one used to calculate PageRank on google.com.
The order in which the search appliance crawls URLs is determined by the crawl queue. The following table gives an overview of the priorities assigned to URLs in the crawl queue.
| Source of URL | Basis for Priority |
|---|---|
| Start URLs (highest) | Fixed priority |
| New URLs that have never been crawled | Estimated Enterprise PageRank |
| Newly discovered URLs | For a new crawl, estimated Enterprise PageRank. For a recrawl, estimated Enterprise PageRank and a factor that ensures that new documents are crawled before previously indexed content. |
| URLs that are already in the index (lowest) | Enterprise PageRank, the last time it was crawled, and estimated change frequency |
By crawling URLs in this priority order, the search appliance ensures that the freshest, most relevant enterprise content appears in the index.
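The priority ordering in the table above can be sketched with a heap. This is a simplified illustration: the tier constants, the `CrawlQueue` class, and the single numeric score per URL are assumptions made for the example, and the real priority for indexed URLs also depends on the last crawl time and estimated change frequency.

```python
import heapq

# Hypothetical tiers: a lower number is crawled first.
TIER_START, TIER_NEW, TIER_INDEXED = 0, 1, 2


class CrawlQueue:
    """A toy crawl queue ordered by tier, then by descending score."""

    def __init__(self):
        self._heap = []

    def add(self, url, tier, score=0.0):
        # heapq pops the smallest tuple, so the score is negated
        # to pop the highest-scoring URL within a tier first.
        heapq.heappush(self._heap, (tier, -score, url))

    def pop(self):
        _tier, _neg_score, url = heapq.heappop(self._heap)
        return url

    def __len__(self):
        return len(self._heap)
```

Start URLs come out first regardless of score, followed by new URLs in descending score order, and then URLs that are already in the index.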
Tip: Although it is not possible to view the Enterprise PageRank for a URL, you can view the RK (Ranking Value) tag in XML search results. The RK tag is an approximate indicator of the Enterprise PageRank of a URL. To request XML search results, use the output parameter in a search request, for example: output=xml.
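As a rough sketch, a search request URL that includes output=xml could be assembled as follows. The host name, the `/search` path, and the `q` query parameter are assumptions made for this example, and any other parameters that your deployment requires are omitted.

```python
from urllib.parse import urlencode


def xml_search_url(host, query):
    """Build a search request URL that asks for XML results."""
    # "q" and the "/search" path are assumed for illustration;
    # "output=xml" is the parameter described in the tip above.
    params = {"q": query, "output": "xml"}
    return f"http://{host}/search?{urlencode(params)}"
```

For a hypothetical host, `xml_search_url("search.example.com", "class schedules")` yields a URL whose query string ends in `output=xml`.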
After configuring the crawl path and preparing content for crawling, the search appliance administrator starts a continuous or full crawl. The following diagram provides an overview of starting the crawl and populating the crawl queue.

When crawling begins, the search appliance populates the crawl queue with URLs. The following table lists the contents of the crawl queue for a new crawl and a recrawl.
| Type of Crawl | Crawl Queue Contents |
|---|---|
| New crawl | The start URLs that the search appliance administrator has configured. |
| Recrawl | The start URLs that the search appliance administrator has configured and the complete set of URLs contained in the current index. |
The search appliance crawler attempts to fetch the URL with the highest priority in the crawl queue. The following diagram provides an overview of this process.

If the search appliance successfully fetches a URL, it downloads the document and caches it for indexing. Generally, if the search appliance fails to fetch a URL, it deletes the URL from the crawl queue. Depending on several factors, the search appliance may take further action when it fails to fetch a URL.
When fetching documents from a slow server, the search appliance paces the process so that it does not cause server problems. The search appliance administrator can also adjust the number of concurrent connections to a server by configuring the web server host load schedule.
When the search appliance successfully fetches a document, it caches a copy of the document. To detect changes to a cached document when recrawling it, the search appliance:
Web servers that support if-modified-since fields in HTTP headers return an HTTP 304 (Not Modified) response if the content has not changed since the if-modified-since date. If the search appliance receives a 304 response, it takes the content from its cache rather than from the remote Web server. The Status and Reports > Crawl Diagnostics page in the Admin Console shows that the URL has been crawled from cache.
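The conditional fetch described above can be sketched with the Python standard library. The helper names are hypothetical, and the sketch only builds the request and interprets the status code; it does not contact a server.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from urllib.request import Request


def conditional_request(url, last_fetched):
    """Build a request that asks for the body only if it changed since last_fetched."""
    # HTTP dates are expressed in GMT, so convert the timestamp to UTC first.
    stamp = format_datetime(last_fetched.astimezone(timezone.utc), usegmt=True)
    return Request(url, headers={"If-Modified-Since": stamp})


def use_cached_copy(status_code):
    """Return True when the server answered 304 (Not Modified)."""
    return status_code == 304
```

When `use_cached_copy` returns True, the crawler in this sketch would reuse the cached document instead of downloading it again.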
When the search appliance fetches a URL from a file share, the object that it actually retrieves and the method of processing it depend on the type of object that is requested. For each type of object requested, the following table provides an overview of the process that the search appliance follows. For information on how these objects are counted as part of the search appliance license limit, refer to When Is a Document Counted as Part of the License Limit?
| Requested Object | Search Appliance Process Overview |
|---|---|
| Document | |
| Directory | |
| Share | |
| Host | |
When the search appliance successfully fetches a document, it determines the size and type of the file. If the file is:
For each document that it indexes, the search appliance follows newly discovered URLs (HTML links) within that document.
Before following a newly discovered link, the search appliance checks the URL against:
If the URL passes these checks, the search appliance adds the URL to the crawl queue, and eventually crawls it. If the URL does not pass these checks, the search appliance deletes it from the crawl queue. The following diagram provides an overview of this process.

The search appliance crawler only follows HTML links in the following format:
<a href="/page2.html">link to page 2</a>
It follows HTML links in PDF files, Word documents, and Shockwave documents. The search appliance crawler does not follow HTML links embedded in JavaScript code.
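The link-following rule can be illustrated with a small extractor that collects `href` values from `a` tags only and skips every other tag, including `area`. This is a sketch using Python's `html.parser`; the class and function names are hypothetical and the appliance's own parser is not documented here.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AnchorLinkExtractor(HTMLParser):
    """Collect absolute URLs from <a href="..."> tags only."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return  # ignore <area> and every other tag
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links such as "/page2.html" against the page URL.
            self.links.append(urljoin(self.base_url, href))


def extract_links(base_url, html):
    parser = AnchorLinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

Given a page that contains both an `a` link and an `area` link, `extract_links` returns only the `a` link, resolved to an absolute URL.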
The search appliance administrator can end a continuous crawl by pausing it.
The search appliance administrator can configure a full crawl to end at a scheduled end time or when the crawl runs to completion.
A full crawl also ends when the license limit is reached. The following table provides more details about the conditions that cause a full crawl to end.
| Condition | Description |
|---|---|
| Scheduled end time | Crawling stops at its scheduled end time. |
| Crawl to completion | There are no more URLs in the crawl queue. The search appliance crawler has discovered and attempted to fetch all reachable content that matches the configured URL patterns. |
| The license limit is reached | The search appliance limits the number of URLs in the index to 20% more than the license limit (this figure is known as the "hard limit"). When the search appliance reaches the hard limit, it stops crawling new URLs, even if they are above the Enterprise PageRank "threshold," which is the lowest Enterprise PageRank of a URL that is within the license limit. The search appliance removes the excess URLs from the crawl queue. |
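As a quick worked example of the hard limit, the 20% margin can be computed as follows. The function name is hypothetical, and integer arithmetic is used so that the result stays a whole number of URLs.

```python
def hard_limit(license_limit):
    """Return the hard limit: the license limit plus 20%."""
    return license_limit + license_limit // 5
```

Under this reading, a license limit of 3 million URLs gives a hard limit of 3.6 million URLs.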
For both full crawls and continuous crawls, documents usually appear in search results approximately 30 minutes after they are crawled. This period can increase if the system is under a heavy load, or if there are many non-HTML documents.
For a recrawl, if an older version of a document is cached in the index from a previous crawl, the search results refer to the cached document until the new version is available.
During a continuous crawl, the search appliance starts recrawling previously crawled URLs when either of the following conditions is true:
In continuous crawl mode, the search appliance tries to recrawl a URL twice as frequently as its contents change. The maximum frequency of a recrawl is every 15 minutes.
To determine how often a URL should be recrawled, the search appliance crawler uses a URL's change interval, which is an estimate of the time between changes to the URL. In the initial stages of a crawl, the search appliance does not have a record of the change interval of a URL, so it assigns each document a default change interval sometime between these values:
Every time the search appliance fetches a document, it records a checksum of the document's contents, including HTML tags and white space. Then the change interval is adjusted closer to the length of time between the last two fetches of the document.
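The two ideas in the preceding paragraph can be sketched as follows. This is an assumption-laden illustration: the hash algorithm (SHA-256), the function names, and the `weight` smoothing factor are all hypothetical, because the text only states that a checksum is recorded and that the interval moves closer to the time between the last two fetches.

```python
import hashlib
from datetime import timedelta


def checksum(content: bytes) -> str:
    """Checksum of the raw document bytes, including tags and white space."""
    return hashlib.sha256(content).hexdigest()


def adjust_interval(current, observed_gap, weight=0.5):
    """Move the change interval part of the way toward the observed gap.

    weight is a hypothetical smoothing factor between 0 and 1.
    """
    return current + (observed_gap - current) * weight
```

Because the checksum covers the raw bytes, two documents that differ only in white space produce different checksums.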
Tip: It is not possible to view the change interval for a URL, or the time that it is scheduled to be recrawled. However, your Web server access logs show a content-length value for a URL that the crawler has fetched. This value can help you to get an idea of how often the URL changes.
The search appliance automatically schedules recrawling of URLs. Each URL is scheduled for recrawling at a specific time, which is calculated using the following formula:
new-crawl-time = last-crawl-time + 0.5 × (change interval)
For example, if a URL is crawled on Friday, February 4 at 09:28, and its change interval is two days and four hours, then it is scheduled to be recrawled one day and two hours later, on Saturday, February 5 at 11:28.
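The scheduling formula and the worked example can be checked with a few lines of Python. The function and constant names are hypothetical, and the 15-minute floor reflects the maximum recrawl frequency stated for continuous crawl mode. The year in the usage note is an assumption chosen so that February 4 falls on a Friday.

```python
from datetime import datetime, timedelta

# A URL is recrawled at most once every 15 minutes.
MIN_RECRAWL_GAP = timedelta(minutes=15)


def next_crawl_time(last_crawl, change_interval):
    """Schedule the next crawl half a change interval after the last crawl."""
    gap = max(change_interval * 0.5, MIN_RECRAWL_GAP)
    return last_crawl + gap
```

With a last crawl of `datetime(2005, 2, 4, 9, 28)` and a change interval of `timedelta(days=2, hours=4)`, the function returns February 5 at 11:28, which matches the example.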
The search appliance administrator can submit a URL for recrawl or adjust the crawl interval using the Freshness Tuning page.
When crawling, the search appliance tests network connectivity by attempting to fetch every start URL every 30 minutes. If approximately 10% or more of the start URLs return HTTP 200 (OK) responses, the search appliance assumes that there are no network connectivity issues. If fewer than 10% return OK responses, the search appliance assumes that there are network connectivity issues with a content server and slows down or stops crawling.
During a temporary network outage, slowing or stopping a crawl prevents the search appliance from removing URLs that it cannot reach from the index. The crawl speeds up or restarts when the start URL connectivity test returns an HTTP 200 response.
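The connectivity test can be sketched as a simple ratio check. The function name and the treatment of an empty start URL list are assumptions made for this example; the 10% threshold comes from the description above.

```python
def connectivity_ok(start_url_statuses, threshold_percent=10):
    """Return True if enough start URLs answered with HTTP 200."""
    total = len(start_url_statuses)
    if total == 0:
        return False  # assumption: with no start URLs there is nothing to confirm
    ok = sum(1 for status in start_url_statuses if status == 200)
    # Integer comparison avoids floating-point rounding at the threshold.
    return ok * 100 >= total * threshold_percent
```

In this sketch, one successful response out of ten start URLs is enough for the crawl to continue at normal speed.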
Your Google search appliance license determines the number of documents that can appear in your index. The limit on the number of documents differs for the Google Search Appliance and the Google Mini.
For a Google Search Appliance, between 3 million and 30 million documents can appear in the index, depending on your model.
For example, if the license limit is 3 million, the search appliance crawler attempts to put the 3 million documents with the highest Enterprise PageRank in the index. During a recrawl, when the crawler discovers a new URL, it must decide whether to crawl the document. If the crawler discovers at least 3 million documents with higher Enterprise PageRank than a new URL, it does not crawl the new URL. If not, the search appliance crawls the new document and adds it to the index. If there are more than 3 million documents in the index when the crawl completes, a periodic process removes the excess URLs with the lowest Enterprise PageRank.
For the Google Mini, up to 300,000 documents can appear in the index.
For example, if the license limit is 50,000, the search appliance crawler attempts to put the 50,000 documents with the highest Enterprise PageRank in the index. During a recrawl, when the crawler discovers a new URL, it must decide whether to crawl the document. If the crawler discovers at least 50,000 documents with higher Enterprise PageRank than a new URL, it does not crawl the new URL. If not, the search appliance crawls the new document and adds it to the index. If there are more than 50,000 documents in the index when the crawl completes, a periodic process removes the excess URLs with the lowest Enterprise PageRank.
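The decision described in these examples can be sketched as follows. The function names and the use of plain numbers as Enterprise PageRank values are assumptions for illustration, and small limits are used so that the behavior is easy to follow.

```python
def should_crawl(new_url_rank, known_ranks, license_limit):
    """Crawl a new URL unless license_limit documents already outrank it."""
    higher = sum(1 for rank in known_ranks if rank > new_url_rank)
    return higher < license_limit


def trim_index(ranked_urls, license_limit):
    """Keep only the license_limit URLs with the highest rank."""
    ordered = sorted(ranked_urls.items(), key=lambda item: item[1], reverse=True)
    return dict(ordered[:license_limit])
```

With a limit of 3 and known ranks of 0.9, 0.8, and 0.7, a new URL ranked 0.5 is skipped, while a new URL ranked 0.85 is crawled and the lowest-ranked document is later trimmed.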
Generally, when the search appliance successfully fetches a document, it is counted as part of the license limit. If the search appliance does not successfully fetch a document, it is not counted as part of the license limit. The following table provides an overview of the conditions that determine whether or not a document is counted as part of the license limit.
| Condition | Counted as Part of the License Limit? |
|---|---|
| The search appliance fetches a URL without errors. This includes HTTP responses 200 (success), 301 (redirect, URL moved permanently), 302 (redirect, URL moved temporarily), and 304 (not modified). | The URL is counted as part of the license limit. |
| The search appliance cannot fetch a URL. Instead, the search appliance receives an HTTP error response, such as 404 (document not found) or 500 (temporary server error). | The URL is not counted as part of the license limit. |
| The search appliance fetches two URLs that contain exactly the same content without errors. | Both URLs are counted as part of the license limit, but the one with the lower Enterprise PageRank is automatically filtered out of search results. It is not possible to override this automatic filtering. |
| The search appliance fetches a document from a file share. | The document is counted as part of the license limit. |
| The search appliance retrieves a list of files and subdirectories in a file share and converts it to a directory listings page. | Each directory in the list is counted as part of the license limit, even if the directory is empty. |
| The search appliance retrieves a list of file shares on a host and converts it to a share listings page. | Each share in the list is counted as part of the license limit. |
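The HTTP rows of the table above can be expressed as a small lookup. This is a simplified sketch: the names are hypothetical, only the status codes listed in the table are treated as counted, and treating every other status as not counted is an assumption.

```python
# Status codes that the table lists as fetched without errors.
COUNTED_STATUSES = {200, 301, 302, 304}


def counts_toward_license(status_code):
    """Return True if a fetch with this status counts toward the license limit."""
    return status_code in COUNTED_STATUSES


def count_licensed(fetch_results):
    """Count licensed URLs in a mapping of URL to HTTP status code."""
    return sum(1 for status in fetch_results.values() if counts_toward_license(status))
```

Note that two URLs with identical content both count in this sketch, which mirrors the duplicate-content row of the table.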
To view license information for your search appliance, use the Administration > License page in the Admin Console.
To enable search results to be sorted and presented based on dates, the search appliance extracts dates from documents according to rules configured by the search appliance administrator.
In Google Search Appliance software version 4.4.68 and later, document dates are extracted from Web pages when the document is indexed.
The search appliance extracts the first date for a document with a matching URL pattern that fits the date format associated with the rule. If a date is written in an ambiguous format, the search appliance assumes that it matches the most common format among URLs that match each rule for each domain that is crawled. For this purpose, a domain is one level above the top level. For example, mycompany.com is a domain, but intranet.mycompany.com is not a domain.
The search appliance periodically runs a process that calculates which of the supported date formats is the most common for a rule and a domain. After calculating the statistics for each rule and domain, the process may modify the dates in the index. The process first runs 12 hours after the search appliance is installed, and thereafter, every seven days. The process also runs each time you change the document date rules.
The search appliance will not change which date is most common for a rule until after the process has run. Regardless of how often the process runs, the search appliance will not change the date format more than once a day. The search appliance will not change the date format unless 5,000 documents have been crawled since the process last ran.
If you import a configuration file with new document dates after the process has first run, then you may have to wait at least seven days for the dates to be extracted correctly. The reason is that the date formats associated with the new rules are not calculated until the process runs. If no dates were found the first time the process ran, then no dates are extracted until the process runs again.
If no date is found, the search appliance indexes the document without a date.
Normally, document dates appear in search results about 30 minutes after they are extracted. In larger indexes, the process can take several hours to complete because the process may have to look at the contents of every document.
The search appliance index includes all the documents it has crawled. These documents remain in the index and the search appliance continues to crawl them until either one of the following conditions is true:
The search appliance administrator can also remove documents from the index manually.
Removing all links to a document in the index does not remove the document from the index.
Every six hours, the search appliance runs a process that removes documents from the index. The following table describes the conditions that cause documents to be removed from the index.
| Condition | Description |
|---|---|
| The license limit is exceeded | If the number of documents in the index exceeds the license limit, documents with the lowest Enterprise PageRank are removed from the index until the license limit is met. The limit on the number of URLs in the index is the smaller of the two following values from the Admin Console: This process occurs up to six hours after the crawl completes. Until the removal process runs, the additional documents are returned in search results. |
| The crawl pattern is changed | To determine which content should be included in the index, the search appliance uses the follow and crawl and do not crawl URL patterns specified on the Crawl and Index > Crawl URLs page. If these URL patterns are modified, the search appliance examines each document in the index to determine whether it should be retained or removed. If the URL does not match any follow and crawl patterns, or if it matches any do not crawl patterns, it is removed from the index. Document URLs disappear from search results between 15 minutes and six hours after the pattern changes, depending on system load. |
| The robots.txt file is changed | If the robots.txt file for a content server or web site has changed to prohibit search appliance crawler access, URLs for the server or site are removed from the index. |
| Document is not found (404) | If the search appliance receives a 404 (Document not found) error from the Web server when attempting to fetch a document, the document is removed from the index. |
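The crawl pattern rule in the table above can be sketched as a retention check. This example uses plain prefix matching as a stand-in for the appliance's URL pattern syntax, which is richer and is not reproduced here; the function name and the patterns are hypothetical.

```python
def keep_in_index(url, follow_patterns, do_not_crawl_patterns):
    """Keep a URL only if it matches a follow pattern and no exclusion pattern."""
    # Prefix matching is a simplification made for this sketch.
    matches_follow = any(url.startswith(p) for p in follow_patterns)
    matches_exclude = any(url.startswith(p) for p in do_not_crawl_patterns)
    return matches_follow and not matches_exclude
```

After a pattern change, running a check like this over every indexed URL identifies the documents that should be removed.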
Note: Search appliance software versions prior to 4.6 removed documents from the index using a process called the "remove doc ripper."