Google Search Appliance software version 6.0
Posted June 2009
The Google Search Appliance enables you to provide universal search to your users. You can get the most from your Google Search Appliance by using some or all of its many features to fine-tune and enhance universal search. Become familiar with the Google Search Appliance's features by reading this document and apply those features that best suit your search solution.

After the Google Search Appliance has been set up, you can configure the search appliance to crawl the content sources that you identified during the planning phase, as described in Identifying Content Sources.
Crawl is the process by which the Google Search Appliance discovers enterprise content and creates a master index. The resulting index consists of all of the words, phrases, and metadata in the crawled documents. When users search for information, their queries are executed against the index rather than against the documents themselves.
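The idea of running queries against an index rather than the documents can be illustrated with a minimal sketch. This is not the search appliance's implementation; the function names and sample URLs are hypothetical, and the index here records only single words, not phrases or metadata.

```python
import re
from collections import defaultdict

def build_index(documents):
    """Map each lowercase word to the set of document URLs that contain it."""
    index = defaultdict(set)
    for url, text in documents.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(url)
    return index

def search(index, query):
    """Return URLs that contain every query word, consulting only the index."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical sample documents standing in for crawled content.
docs = {
    "http://intranet.example.com/hr.html": "Vacation policy and benefits",
    "http://intranet.example.com/it.html": "Password policy for laptops",
}
index = build_index(docs)
matches = search(index, "password policy")
```

Once the index is built, `search` never reads `docs` again, which is the property the paragraph above describes.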
The Google Search Appliance can crawl:
The Google Search Appliance is also capable of indexing:
This section briefly describes how the Google Search Appliance indexes each type of content.
Public content is not restricted in any way; users don't need credentials to view it. Some of the most common forms of public content include:
The Google Search Appliance supports crawling of many types of formats, including word processing, spreadsheet, presentation, and others.
The Google Search Appliance crawls content on web sites or file systems according to crawl patterns that you specify by using the Admin Console. As the search appliance crawls public content sources, it indexes the documents that it finds. To find more documents, the crawler follows links within the documents that it indexes. The search appliance does not crawl content that you exclude from the index.
The following figure provides an overview of crawling public content.

The Google Search Appliance does not crawl unlinked URLs or links that are embedded within an area tag. Also, the search appliance does not crawl or index content that is excluded by these mechanisms:
Typically, webmasters, content owners, and search appliance administrators create robots.txt files and add META tags to documents before a search appliance starts crawling.
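Before a crawl starts, it can be useful to check which URLs a robots.txt file would exclude. The following sketch uses Python's standard `urllib.robotparser` module to do that. The robots.txt content, the URLs, and the user-agent string `gsa-crawler` are illustrative assumptions; use the user agent configured for your own search appliance.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an intranet site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def allowed_urls(robots_txt, user_agent, urls):
    """Return the URLs that the given robots.txt permits for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]

candidates = [
    "http://intranet.example.com/index.html",
    "http://intranet.example.com/private/salaries.html",
]
crawlable = allowed_urls(ROBOTS_TXT, "gsa-crawler", candidates)
```

This checks robots.txt rules only. It does not evaluate META tags in documents or the crawl patterns defined in the Admin Console.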
To configure a search appliance to crawl a content source, use the Crawl and Index > Crawl URLs page in the Admin Console to specify the top-level URLs, directory addresses, and links that the search appliance should follow. In addition to specifying start URLs, you can specify URLs that the search appliance should not follow or crawl.
By default, the search appliance crawls in continuous crawl mode. This means that after the Google Search Appliance creates the index, it continually crawls content sources looking for new or modified content and updates the index to ensure that it contains the freshest listings. The search appliance can also crawl content according to a schedule.
Configure continuous crawl by performing the following steps with the Admin Console:

After you save the URL patterns, the search appliance begins crawling in continuous mode.
If you prefer to have the search appliance crawl at scheduled times, you must also perform the following additional tasks by using the Crawl and Index > Crawl Schedule page in the Admin Console:
To schedule crawling times for a specific host, you can change the host load and times on the Crawl and Index > Host Load Schedule page. If you set a host load of 0, the crawler does not crawl that host during the configured time period.
To add a document to the crawl queue right away, enter its URL in Re-Crawl These URL Patterns on the Crawl and Index > Freshness Tuning page.
For in-depth information about public crawl, configuring a search appliance to crawl, and starting a crawl, refer to Administering Crawl for Web and File Share Content.
For a complete list of file types that the search appliance can crawl, refer to Indexable File Formats.
Controlled-access content is secure content--it is restricted so that not all users have access to it. For access to controlled-access content, users need authorization.
A search appliance discovers and indexes controlled-access content in the same way that it indexes all other content: by performing a crawl through the content sources. However, the search appliance requires access credentials to discover and index controlled-access content. Once you set up the search appliance with access credentials, it maintains a copy of all crawled content in the index.
The following figure provides an overview of crawling controlled-access content.
Controlled-access methods that the Google Search Appliance supports for crawl include:
If the content files you want crawled and indexed are in a location that requires a login, create a special user account on your network for the search appliance. When you configure crawl on the Admin Console, provide the user name and password for that account. The search appliance presents those credentials before crawling files in that location.
Configure a search appliance to crawl controlled-access content by performing the following steps with the Admin Console:
When a user issues a search request for controlled-access content, the search appliance verifies the user's identity and determines whether the user has authorization to view the content. This check is performed before the search appliance displays any content in search results. By performing the results access control checks in real-time, the Google Search Appliance ensures that users only see results they are authorized to view.
A search appliance can use the following methods to establish the user's identity:
Once the user's identity has been established, a search appliance attempts to determine whether the user has access to the secure content that matches their search. The authorization check is performed based on the security configuration of the search appliance:
If the authorization check is successful, the secure content that matches the search query is included in the user's search results.
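The serve-time check described above can be sketched as a filter that runs over candidate results before they are displayed. This is a simplified illustration, not the search appliance's logic: the `acl` dictionary, the `secure` flag, and the user names are hypothetical stand-ins for whatever authorization mechanism is configured.

```python
def filter_results(results, user, acl):
    """Keep public results plus secure results that the user may view."""
    visible = []
    for result in results:
        if not result["secure"]:
            # Public content needs no authorization check.
            visible.append(result)
        elif user in acl.get(result["url"], set()):
            # Secure content is shown only to authorized users.
            visible.append(result)
    return visible

# Hypothetical candidate results and access control list.
results = [
    {"url": "http://intranet.example.com/news.html", "secure": False},
    {"url": "http://intranet.example.com/payroll.html", "secure": True},
]
acl = {"http://intranet.example.com/payroll.html": {"alice"}}

alice_view = filter_results(results, "alice", acl)
bob_view = filter_results(results, "bob", acl)
```

Because the filter runs on every search request, a change in a user's permissions takes effect on the next query without re-indexing the content.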
The process for configuring serve of controlled-access content is dependent on the security method you want to use, as described in the following list:
For complete information about configuring a search appliance to crawl and serve controlled-access content, refer to Managing Search for Controlled-Access Content.
If your organization has content that is stored in non-web repositories, such as Enterprise Content Management (ECM) systems, you can enable the Google Search Appliance to index and serve this content by using the connector framework.
The Google Search Appliance provides the indexing capabilities for the following content management systems:
Also, Google partners have developed connectors for other non-web repositories. For information about these connectors, visit Google Solutions Marketplace.
The connector manager is the central part of the connector framework for the Google Search Appliance. The connector manager handles the creation, instantiation, scheduling, and monitoring of connectors that supply content and provide authentication and authorization services to the Google Search Appliance. Connectors run on connector managers that reside in servlet containers installed on computers on your network. All Google-supported connectors are certified on Apache Tomcat 5.5.23.
When connecting to a document repository through an enterprise connector, the Google Search Appliance uses a process called "traversal." During traversal, the connector issues queries to the repository to retrieve document data to feed to the Google Search Appliance for indexing. The connector manager formats the content and any associated metadata for a feed to the Google Search Appliance, which then creates an index of the documents.
The following figure provides an overview of indexing content in non-web repositories.

You can also create a custom connector for the Google Search Appliance, as described in Developing Custom Connectors.
For public content in a repository, searches work the same way as they do with web and file-system content. The Google Search Appliance searches its index and returns relevant result sets to the user without any involvement by the connector.
To authorize access to private or protected content from a repository, the Google Search Appliance creates a connector instance at query time. The connector instance forwards authentication credentials to the repository for authorization checking. The connector manager recognizes identities passed from basic authentication, SAML authentication, and client certificates. If a SAML authentication provider is set up to support single sign-on (SSO), the connector manager also recognizes identities passed from the SSO provider.
To run a connector, you need the software for the connector manager and the connector. The following table lists methods for obtaining the software components that you need to use connectors, as well as the support provided for each component.
| Component | Obtain by | Support |
|---|---|---|
| Source code for the connector manager and connectors | Download the code from the Google Enterprise Connector Manager project on code.google.com. | The open-source software is for the development of third-party connectors. Developers using the resources provided in this project can create connectors for virtually any type of document-based repository. Google does not support the open-source software or changes you make to the open-source software. |
| An installer package that deploys Apache Tomcat, a connector manager, and a particular connector type | Download the package from Google Enterprise Support web site. | Google supports the installer and the software packaged with the installer. |
Before you configure a connector, install the following software components:
The specific process that you follow for configuring a connector depends on the type of connector. Generally, you can configure a connector by performing the following steps:

For in-depth information about connectors, refer to the following Google Search Appliance documents:
During crawl, the search appliance finds most of the content that it indexes by following links within documents. However, many organizations have content that cannot be found this way because it is not linked from other documents. If your organization has content that cannot be found through links on crawled web pages, you can ensure that the Google Search Appliance indexes it by using Feeds. Feeds are also useful for the following types of content:
You can also use feeds to delete data from the index on the search appliance.
The Google Search Appliance supports two types of feeds, as described in the following table.
| Type | Description |
|---|---|
| Web feed | A web feed does not provide content to the Google Search Appliance. Instead, a web feed provides a list of URLs to the search appliance. Optionally, a web feed may include metadata. The crawler queues the URLs listed in the web feed and fetches content for each document listed in the feed. Web feeds are incremental. The search appliance recrawls web feeds periodically, based on the crawl settings for your search appliance. |
| Content feed | A content feed provides both URLs and their content to the search appliance. A content feed may include metadata. A content feed can be either full or incremental. The search appliance only crawls content feeds when they are pushed. |
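The difference between the two feed types can be sketched by building the feed XML in Python. The element and attribute names used here (`gsafeed`, `header`, `datasource`, `feedtype`, `group`, `record`, `content`) and the `feedtype` values are assumptions for illustration, as are the data source names and URLs. Take the authoritative schema from the Feeds Protocol Developer's Guide before pushing a real feed.

```python
import xml.etree.ElementTree as ET

def build_feed(datasource, feedtype, records):
    """Build feed XML; records with a 'content' key carry their content inline."""
    root = ET.Element("gsafeed")
    header = ET.SubElement(root, "header")
    ET.SubElement(header, "datasource").text = datasource
    ET.SubElement(header, "feedtype").text = feedtype
    group = ET.SubElement(root, "group")
    for rec in records:
        record = ET.SubElement(
            group, "record", url=rec["url"], mimetype=rec["mimetype"]
        )
        if "content" in rec:
            ET.SubElement(record, "content").text = rec["content"]
    return ET.tostring(root, encoding="unicode")

# Web feed: URLs only; the crawler fetches the content later.
# The feedtype value "metadata-and-url" is an assumption.
web_feed = build_feed(
    "sample_web",
    "metadata-and-url",
    [{"url": "http://intranet.example.com/orphan.html", "mimetype": "text/html"}],
)

# Content feed: each record carries the document content itself.
content_feed = build_feed(
    "sample_content",
    "incremental",
    [{
        "url": "http://intranet.example.com/report.html",
        "mimetype": "text/html",
        "content": "<html><body>Quarterly report</body></html>",
    }],
)
```

`ElementTree` escapes the markup inside the `content` element, so the embedded HTML does not break the feed's own XML structure.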
The following figure provides an overview of indexing hard-to-find content by using feeds.

To push a content feed to the search appliance, you must provide the following components:
You can use one of the feed clients described in Feeds Protocol Developer's Guide or write your own. For information about writing a feed client, refer to Writing Applications with the Feeds Protocol.
URL Patterns and Trusted IP lists that you define with the Admin Console ensure that your index only lists content from desirable sources. When pushing URLs with a feed, you must verify that the Admin Console will accept the feed and allow your content through to the index. For a feed to succeed, it must be fed from a trusted IP address and at least one URL in the feed must pass the rules defined on the Admin Console.
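The two acceptance conditions above can be expressed as a small pre-flight check that a feed client might run before pushing. This is only a sketch: the trusted networks and URL prefixes are hypothetical, and simple prefix matching stands in for the Admin Console's URL pattern rules, which are richer than this.

```python
import ipaddress

def feed_would_be_accepted(sender_ip, feed_urls, trusted_networks, follow_prefixes):
    """Return True if the sender is trusted and at least one URL passes the rules."""
    sender = ipaddress.ip_address(sender_ip)
    is_trusted = any(
        sender in ipaddress.ip_network(network) for network in trusted_networks
    )
    # Assumption: a URL "passes" if it starts with one of the follow prefixes.
    has_matching_url = any(
        url.startswith(prefix) for url in feed_urls for prefix in follow_prefixes
    )
    return is_trusted and has_matching_url

# Hypothetical configuration values.
trusted = ["10.0.0.0/8"]
prefixes = ["http://intranet.example.com/"]
urls = ["http://intranet.example.com/report.html", "http://other.example.org/x.html"]

ok = feed_would_be_accepted("10.1.2.3", urls, trusted, prefixes)
untrusted_sender = feed_would_be_accepted("192.168.1.5", urls, trusted, prefixes)
```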
Push a content feed to the search appliance by performing the following steps:

For complete documentation on feeds, refer to the Feeds Protocol Developer's Guide.
The Google Search Appliance can also index records in a relational database. The Google Search Appliance supports indexing of the following relational database management systems:
The search appliance provides access to data stored in relational databases by crawling the content directly from the database and serving the content. The process of crawling a database is called "synchronizing a database." To access content in a database, the Google Search Appliance sends SQL (Structured Query Language) queries using JDBC (Java Database Connectivity) adapters provided by database companies. It crawls the contents of the database and then pushes records from a database into the search appliance’s index using feeds.
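The query-then-feed flow can be sketched as follows. The search appliance itself uses JDBC adapters; this example substitutes Python's built-in `sqlite3` module with an in-memory database purely for illustration. The table, column names, and record layout are hypothetical.

```python
import sqlite3

def rows_to_records(conn, query, key_column):
    """Run a SQL query and turn each row into a record ready to be fed."""
    conn.row_factory = sqlite3.Row  # allows access to columns by name
    records = []
    for row in conn.execute(query):
        fields = dict(row)
        records.append({"id": str(fields[key_column]), "fields": fields})
    return records

# Hypothetical in-memory database standing in for an enterprise database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (emp_id INTEGER PRIMARY KEY, name TEXT, dept TEXT)")
conn.executemany(
    "INSERT INTO employees (emp_id, name, dept) VALUES (?, ?, ?)",
    [(1, "Ada", "Engineering"), (2, "Grace", "Research")],
)

records = rows_to_records(conn, "SELECT * FROM employees ORDER BY emp_id", "emp_id")
```

Each record keeps the primary key as its identifier, so a later synchronization could match updated rows to records that are already indexed.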
The following figure provides an overview of indexing content in databases.

Synchronize a database by performing the following tasks with the Admin Console:

For in-depth information about how the Google Search Appliance indexes and serves database content, as well as a complete list of databases and JDBC adapter versions that the Google Search Appliance supports, refer to Database Crawling and Serving.
Google Apps provide your organization with tools for collaborating on documents, spreadsheets, presentations, and sites. You can integrate a search appliance with Google Apps, which enables the search appliance to index content and serve results from a Google Apps domain's public Google Docs and Google Sites content.
The following figure shows search results from Google Apps, listed above the other results.

When a Google Search Appliance integrates with Google Apps, the search appliance crawls and serves only public content from a Google Apps domain. Any Google Apps content that has not been published or shared is not crawled.
Before a search appliance can integrate with Google Apps, there must be a Google Apps domain with content that the search appliance can crawl and index. To sign up for Google Apps, visit the Google Apps editions page. Also, ensure that the content on Google Apps that you want to index is public.
Set up Google Apps integration by enabling it on the Google Apps > Google Apps Integration page in the Admin Console. The following figure shows the Google Apps > Google Apps Integration page.

For in-depth information about setting up and using Google Apps Integration, refer to Integrating with Google Apps.
Once the content has been crawled and indexed, you can ensure that it is searchable by using the Test Center. The Test Center enables you to test search across the indexed content, limit the search to specific collections or front ends, and verify that the correct content is indexed and that the results are what you expect.
You can find a link to the Test Center at the upper right side of the Admin Console. When you click the Test Center link, a new browser window opens and displays the Test Center page, as shown in the following figure.

Let Google know what you think about this document by sending feedback to gsadoc-gtm-feedback@google.com.