Google Search Appliance software version 6.0
Posted June 2009
With Google Apps Integration, you can use your Google Search Appliance to search your domain's Google Apps content. In a few clicks, you can enable your search appliance to crawl, index, and serve public Google Sites and Google Docs.

This document provides information about integrating a Google Search Appliance with Google Apps. This document is intended for search appliance administrators who need to understand:
This document also provides guidance for importing or exporting a search appliance configuration that includes a Google Apps integration, as well as controlling crawl and managing serving content in a Google Apps domain. The following table lists the major sections in this document.
| Section | Describes |
|---|---|
| Overview | How Google Apps integration enhances search results |
| Activating Google Apps Integration | How to add Google Apps Integration to a search appliance |
| Enabling or Disabling Google Apps Integration | How to enable a search appliance to integrate with Google Apps |
| Importing or Exporting Google Apps Integration Configurations | How to export a search appliance configuration that includes an integration with Google Apps and what happens to a search appliance when you import such a configuration |
| Writing URL Patterns for Google Apps Content | How to write valid URL patterns for managing crawl and serve of content in a Google Apps domain |
| Managing Crawl of Google Apps Content | How to control and monitor crawl of content in a Google Apps domain |
| Managing Serve of Google Apps Content | How to manipulate Google Apps URLs in search results and remove them from the search index |
Several sections in this document use a Google Apps example domain called nucleotraining.com. This domain is used by the training group at a fictional company called Nucleo Worldwide Systems and contains the following content:
Google Apps provides your organization with tools for collaborating on documents, spreadsheets, presentations, and sites. As a search appliance administrator, you can integrate your search appliance with Google Apps. This integration enables the search appliance to crawl, index, and serve results from a Google Apps domain's public Google Docs and Google Sites content.
The following steps provide an overview of the entire process of integrating a search appliance with Google Apps:
In this release, the Google Search Appliance integrates with a subset of Google Apps services, as listed in the following table.
| Google Apps Service | Integrated? |
|---|---|
| Google Docs (documents, spreadsheets, and presentations) | Yes |
| Google Sites | Yes |
| Gmail | No |
| Google Talk | No |
| Google Calendar | No |
| Google Video and other services | No |
To use Google Apps integration, a search appliance must be able to access Google.com. Before a search appliance can integrate with Google Apps, there must be a Google Apps domain with content that the search appliance can crawl and index.
To sign up for Google Apps, visit the Google Apps editions page. When you sign up for Google Apps, you select:
Each Google Apps domain can have multiple administrators. As a search appliance administrator, you must have a domain name, a Google Apps administrator username, and a password to enable Google Apps integration.
When a Google Search Appliance integrates with Google Apps, the search appliance crawls and serves only public content from a Google Apps domain. Any Google Apps content that has not been published or shared is not crawled. The following subsections describe how to make content in Google Docs and Google Sites public.
To make a Google Doc public, you must:
To share a document, presentation, or spreadsheet with everyone in the same domain:
To publish a document as a web page:
To publish a presentation as a web page:
To publish a spreadsheet as a web page:
For more information about publishing and sharing Google Docs, refer to the Google Docs Help Center.
To make a site public, you can:
To share a site with everyone in the same domain:
To make a new site public:
To make an existing site public:
As a Google Enterprise Labs feature, Google Apps Integration is hidden from view in the Google Search Appliance by default. To activate Google Apps Integration, enter the following URL in your browser:
http://<hostname>:8000/EnterpriseController?actionType=googleApps
where <hostname> is the hostname of your search appliance
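As a minimal sketch, the activation URL can be assembled from the hostname in a short Python helper. The function name `activation_url` and the hostname `gsa.example.com` are hypothetical and used only for illustration; the port, path, and `actionType` value are taken from the URL above.

```python
from urllib.parse import urlencode


def activation_url(hostname: str) -> str:
    """Build the URL that activates Google Apps Integration on an appliance."""
    # Port 8000 and the EnterpriseController path come from the URL shown above.
    query = urlencode({"actionType": "googleApps"})
    return f"http://{hostname}:8000/EnterpriseController?{query}"


# "gsa.example.com" is a placeholder hostname, not a real appliance.
print(activation_url("gsa.example.com"))
```

Paste the printed URL into a browser while signed in to the Admin Console.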
After you activate Google Apps Integration, Google Apps appears in the search appliance Admin Console navigation bar. You can access it by clicking Google Apps > Google Apps Integration.
As a search appliance administrator, you do not need to perform any configuration tasks to integrate a search appliance with Google Apps. You only need to enable Google Apps integration by using the Google Apps > Google Apps Integration page in the Admin Console.
When you enable Google Apps integration, the search appliance configures itself to:
For information about using the Google Apps Integration page, click Help Center > Google Apps > Google Apps Integration in the Admin Console.
For one search appliance, you can enable Google Apps integration with one Google Apps domain. Each time you enable Google Apps Integration, the search appliance downloads the latest list of Google Apps services and access control policies. Check this document for a description of the latest functionality. To enable Google Apps integration, you need the following information for the Google Apps domain:
For information about how to get this information, refer to Prerequisites for Using Google Apps Integration.
For example, suppose that you want to enable integration with the nucleotraining.com domain. Your Google Apps administrator name is admin5@nucleotraining.com and you have the Google Apps Administrator password for the domain.
To enable Google Apps integration for nucleotraining.com:
The Google Apps Integration page appears.
Even if you change the administrator password or the administrator account is deleted, the integration will continue to work.
To disable Google Apps integration:
The Google Apps Integration page appears.
The Admin Console displays the following message: Disabled Google Apps Integration.
To re-enable the integration, follow the procedure in Enabling an Integration.
When a search appliance configures itself to crawl Google Apps content, it does not display URLs for Google Docs or Google Sites in the Admin Console. However, as a search appliance administrator, you may need to enter Google Apps URL patterns to manage crawl or serve.
The following table lists example URL patterns for types of content in a Google Apps domain. For individual items in a Google Apps domain, such as a specific document, copy the URL from a listing on a search results page and paste it in the Admin Console page where you are working. Google Apps supports both public (http) and secure (https) sites.
| Content | URL Patterns |
|---|---|
| All documents, presentations, and spreadsheets in a domain | docs.google.com/ and spreadsheets.google.com/ |
| All documents in a domain | ^http://docs.google.com/a/domain_name.com/View and ^https://docs.google.com/a/domain_name.com/View |
| A specific document in a domain | The full URL of the document, for example: http://docs.google.com/a/domain_name.com/Doc?docid=dg4sgjw7_18cp3vsbfz&hl=en or https://docs.google.com/a/domain_name.com/Doc?docid=dg4sgjw7_18cp3vsbfz&hl=en |
| All presentations in a domain | ^http://docs.google.com/a/domain_name.com/SimplePresentationView and ^https://docs.google.com/a/domain_name.com/SimplePresentationView |
| A specific presentation in a domain | The full URL of the presentation, for example: ^http://docs.google.com/a/domain_name.com/Presentation?docid=dg4sgjw7_0d5m8vzgw&hl=en or ^https://docs.google.com/a/domain_name.com/Presentation?docid=dg4sgjw7_0d5m8vzgw&hl=en |
| All spreadsheets in a domain | spreadsheets.google.com/ |
| A specific spreadsheet in a domain | The full URL of the spreadsheet, for example: ^http://docs.google.com/a/domain_name.com/ccc?key=pugnm4XXrq5DeFcreLXRibQ&hl=en or ^https://docs.google.com/a/domain_name.com/ccc?key=pugnm4XXrq5DeFcreLXRibQ&hl=en |
| All sites in a domain | sites.google.com/ |
| A specific site in a domain | The URL of the site, for example: ^http://sites.google.com/a/domain_name.com/site_name/Home or ^https://sites.google.com/a/domain_name.com/site_name/Home |
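
To check a pattern from the table before entering it in the Admin Console, a simplified matcher can help. The sketch below assumes only two rules: a pattern that starts with `^` must match the beginning of the URL, and any other pattern matches if it appears anywhere in the URL. The search appliance supports more pattern forms than this, so treat the function as an approximation. The function name `matches` and the `docid=abc` value are hypothetical.

```python
def matches(pattern: str, url: str) -> bool:
    """Approximate URL pattern matching using two simplified rules."""
    if pattern.startswith("^"):
        # Anchored pattern: the URL must begin with the text after "^".
        return url.startswith(pattern[1:])
    # Unanchored pattern: the text may appear anywhere in the URL.
    return pattern in url


site_url = "http://sites.google.com/a/nucleotraining.com/NucleoTraining/Home"
doc_url = "http://docs.google.com/a/nucleotraining.com/View?docid=abc"

print(matches("sites.google.com/", site_url))
print(matches("^http://docs.google.com/a/nucleotraining.com/View", doc_url))
print(matches("^https://docs.google.com/a/nucleotraining.com/View", doc_url))
```

The last call returns `False` because the anchored pattern starts with `https` while the URL starts with `http`, which is why the table lists both forms for each content type.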
The Google Search Appliance crawls content in a Google Apps domain the same way that it crawls other content. For general information about how the Google Search Appliance crawls content, refer to Administering Crawl for Web and File Share Content.
In continuous crawl mode, the search appliance crawls Google Apps (and other) content at all times, ensuring that newly added or updated content is added to the index as quickly as possible. The starting point for crawling Google Apps is the docs publish index, which is updated once a day.
The search appliance can automatically determine URLs that often change and should be crawled frequently and URLs that seldom change and should be crawled infrequently. The search appliance can also crawl in scheduled crawl mode, where it crawls content at a scheduled time.
For a search appliance to crawl content in a Google Apps domain, you do not need to specify any follow and crawl URL patterns. In fact, the Google Apps integration crawl URLs are hidden and you cannot delete them. However, you can manage crawling of Google Apps as described in the following sections:
Each document, presentation, and spreadsheet that is crawled is counted against the search appliance's license limit. For sites, each page in a site that is crawled is counted against the license limit. Any public Google docs that are embedded in a sites page are considered separate pages and are recrawled.
As a search appliance administrator, you can control the content in a Google Apps domain that is crawled. To exclude URLs from crawling, use the Crawl and Index > Crawl URLs page in the Admin Console.
For example, in the domain nucleotraining.com, the search appliance crawls content that is of interest to all members of the training group. This content includes documents, presentations, and sites. However, because spreadsheets contain information that is only of interest to course registrars, the search appliance should not crawl spreadsheets.
To exclude all spreadsheets from a crawl:
The Crawl URLs page appears.
You can also exclude an individual URL from a crawl by typing it in this box. For information about valid Google Apps URLs, refer to Writing URL Patterns for Google Apps Content.
For more information about controlling crawl, refer to Administering Crawl for Web and File Share Content. For information about using the Crawl and Index > Crawl URLs page, click Help Center > Crawl and Index > Crawl URLs in the Admin Console.
While the search appliance is crawling, you can monitor a crawl's history on the Status and Reports > Crawl Diagnostics page in the Admin Console.
When this page first appears, it shows the crawl history for the current domain. From the domain level, you can navigate to lower levels that show crawl history for Google Apps URLs. URLs for content in Google Apps domains follow the patterns described in Writing URL Patterns for Google Apps Content.
For domain crawling, "Unknown" or "Crawled with empty body: Disallowed by robots.txt" crawl statuses do not indicate errors.
The following table lists the hierarchical levels that you can navigate to and describes the information that the Status and Reports > Crawl Diagnostics page displays at each level.
| Level | Page Shows |
|---|---|
| Domain | The number of URLs that have been crawled in all Google Apps hosts in the domain plus other information. Hosts include docs.google.com and sites.google.com. |
| Host | The number of URLs that have been crawled in the selected Google Apps host plus other information. For example, this level shows information for http://sites.google.com. |
| Directory | The crawl status for the Google Apps directory (http://sites.google.com/a/) or subdirectories (http://sites.google.com/domain/...). |
| URL | Detailed information about the crawled URL and a crawl history for the URL. You can also use this page to recrawl the current URL. |
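
The levels in the table can be pictured by splitting a Google Apps URL into its parts. The following sketch is illustrative only: the function name `diagnostics_levels` is hypothetical, and taking the last two labels of the hostname as the domain is a rough assumption that happens to fit `google.com`.

```python
from urllib.parse import urlsplit


def diagnostics_levels(url: str) -> dict:
    """Split a URL into the domain, host, directory, and URL levels."""
    parts = urlsplit(url)
    host = f"{parts.scheme}://{parts.netloc}"
    # Assumption: the domain is the last two labels of the hostname.
    domain = ".".join(parts.hostname.split(".")[-2:])
    segments = [s for s in parts.path.split("/") if s]
    directories = []
    path = host
    # Every path segment except the last is treated as a directory level.
    for segment in segments[:-1]:
        path = f"{path}/{segment}"
        directories.append(path + "/")
    return {"domain": domain, "host": host, "directories": directories, "url": url}


levels = diagnostics_levels(
    "http://sites.google.com/a/nucleotraining.com/NucleoTraining/Home"
)
for name, value in levels.items():
    print(name, value)
```

For this URL, the directory levels run from `http://sites.google.com/a/` down to `http://sites.google.com/a/nucleotraining.com/NucleoTraining/`.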
For example, suppose you want to monitor crawling of the site NucleoTraining in the nucleotraining.com domain. To monitor NucleoTraining, navigate to the Crawl Diagnostics page for the site's URL:
The domain-level Crawl Diagnostics page appears.
The host-level Crawl Diagnostics page appears.
The directory-level Crawl Diagnostics page appears.
The URL-level Crawl Diagnostics page for http://sites.google.com/a/nucleotraining.com/NucleoTraining appears.
For more information about monitoring a crawl, refer to Administering Crawl for Web and File Share Content. For information about using the Status and Reports > Crawl Diagnostics page, click Help Center > Status and Reports > Crawl Diagnostics in the Admin Console.
After a search appliance has integrated with Google Apps, it can return search results from a Google Apps domain to users. The following figure illustrates search results from a Google Apps domain.

You can manage serving content from a Google Apps domain the same way you manage serving other crawled content. For general information about how to manage serve, refer to Creating the Search Experience.
In listings, search results from Google Apps services are identified by icons, as illustrated in the following table.
| Icon | Identifies Result From |
|---|---|
| (icon) | Google Docs: document |
| (icon) | Google Docs: presentation |
| (icon) | Google Docs: spreadsheet |
| (icon) | Google Sites: site |
The framework for managing how the search appliance serves content from a Google Apps domain (or any source) to users is the front end. There are several search appliance features associated with a front end, including features that refine search results. With a single search appliance, you can create and deploy an unlimited number of front ends. This means that you can customize how the search appliance serves content from a Google Apps domain to different types of users.
To create a front end, use the Serving > Front Ends page in the Admin Console. For complete information about the Front Ends page, click Help Center > Serving > Front Ends in the Admin Console. For more information about using front ends, refer to Creating the Search Experience.
For example, in a search appliance that is integrated with nucleotraining.com, you might create two front ends:
In the NucleoGroupMembers front end, the search appliance serves all content in the nucleotraining.com domain. In the NucleoStudents front end, the search appliance only serves content that is appropriate for students, including course descriptions, course modules, course presentations, and class schedules.
The following table lists front end features that help you manage how content from a Google Apps domain is served to users. For more information about using a feature to manage serve, refer to the section listed in the Reference column.
| Front End Feature | Reference |
|---|---|
| Remove URLs | Preventing Google Apps content from appearing in search results; Removing Google Apps content from the search index |
| Result biasing | Biasing Google Apps content in search results |
| Filters | Filtering Google Apps content in search results |
A collection is another search appliance feature that can help you manage serve of Google Apps content. Collections are independent of front ends. However, you can use a custom front end with a specific collection to help improve searches and enhance results. For more information about using collections, refer to Creating Collections of Google Apps content.
For any front end, you can prevent URLs that match specific patterns from appearing in search results. To prevent URLs from appearing in search results, use the Serving > Front Ends > Remove URLs page in the Admin Console. Because removing URLs is specific to a front end, it can be aimed at a specific type of user, as shown in the following example.
The NucleoTraining site contains sensitive information about internal team projects, issues, and events. In the NucleoStudents front end, the URL for the NucleoTraining site should not appear; you need to prevent this URL from appearing in search results. In the front end for team members, you do not need to remove this URL.
To prevent the URL for the NucleoTraining site from appearing in search results in the NucleoStudents front end:
The Front Ends page appears.
The Output Format page appears.
The Remove URLs page appears.
For information about valid Google Apps URLs, refer to Writing URL Patterns for Google Apps Content.
For more information about removing URLs, refer to Creating the Search Experience. For information about using the Serving > Front Ends > Remove URLs page, click Help Center > Serving > Front Ends > Remove URLs in the Admin Console.
You can affect the order of Google Apps content in search results by using source biasing, which is a type of result biasing. Source biasing enables you to boost or weaken the relevancy score of content in the search index based on URL patterns. Boosting a score usually moves a document up in the result listings. Weakening a score usually moves it down.
Set up source biasing by performing the following tasks:
For more information about result biasing and source biasing, refer to Creating the Search Experience.
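The effect of source biasing on result order can be illustrated with a toy model. This sketch is not the search appliance's scoring algorithm, which is not described in this document. The function name `apply_source_bias`, the scores, and the multiplier are all made up for illustration, and patterns are matched as plain substrings.

```python
def apply_source_bias(results, bias_by_pattern):
    """Toy model: scale each score by the factor of every matching pattern.

    results: list of (url, score) pairs.
    bias_by_pattern: dict of URL substring -> multiplier
                     (>1 boosts, <1 weakens). All values are hypothetical.
    """
    biased = []
    for url, score in results:
        for pattern, factor in bias_by_pattern.items():
            if pattern in url:
                score *= factor
        biased.append((url, score))
    # Higher scores are listed first.
    return sorted(biased, key=lambda pair: pair[1], reverse=True)


results = [
    ("http://docs.google.com/a/nucleotraining.com/View?docid=abc", 0.6),
    ("http://sites.google.com/a/nucleotraining.com/NucleoTraining/Home", 0.5),
]
ranked = apply_source_bias(results, {"sites.google.com/": 1.5})
for url, score in ranked:
    print(score, url)
```

In this toy run, boosting `sites.google.com/` moves the site above the document even though the document started with the higher score.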
For example, in the nucleotraining.com domain, the team site contains important information that team members collaborate to keep up to date. In the NucleoGroupMembers front end, you might want to boost the relevancy scores for sites. To do this, you might create and configure a result biasing policy named Site and then select it for use with the NucleoGroupMembers front end. Because you can create unlimited front ends for a search appliance, you might have a different result biasing policy for each front end.
To create the Site result biasing policy:
The Result Biasing page appears.
The new policy's name, Site, appears in the list of result biasing policies and is selected.
For more information about using the Serving > Result Biasing page, click Help Center > Serving > Result Biasing.
To configure the Site result biasing policy:
The Serving > Result Biasing Edit page appears.
For information about valid Google Apps URLs, refer to Writing URL Patterns for Google Apps Content.
For more information about using the Serving > Result Biasing Edit page, click Help Center > Serving > Result Biasing Edit.
To enable the Site result biasing policy, apply it to the NucleoGroupMembers front end by performing the following steps:
The Front Ends page appears.
The Output Format page appears.
The Filters page appears.
For information about using the Serving > Filters page, click Help Center > Serving > Filters.
As an administrator, you can create custom filters for a front end to ensure that the search appliance serves appropriate results to users. The search appliance provides different types of filters, including domain, language, file type, and meta tag. To ensure that the search appliance only serves results from a Google Apps domain with a front end, use a domain filter. To create a domain filter, use the Serving > Front Ends > Filters page in the Admin Console.
For example, suppose nucleotraining.com is one of many domains at Nucleo Worldwide Systems. Other domains include www.nucleoworldwidesystems.com and www.nucleoworldwidesystems.com.uk. However, with the NucleoGroupMembers front end, you want the search appliance to serve results from nucleotraining.com only. To do this, you create a domain filter. Because you can create unlimited front ends for a search appliance, you might create different domain filters for different front ends.
To create a domain filter for nucleotraining.com:
The Front Ends page appears.
The Output Format page appears.
The Filters page appears.
For more information about using filters, refer to Creating the Search Experience. For information about using the Serving > Front Ends > Filters page, click Help Center > Serving > Front Ends > Filters in the Admin Console.
After a search appliance crawls a URL, it adds the URL to the search index, where it can be served to users in search results. However, there might be one or more Google Apps URLs that you want to remove from the search index. To remove a URL from the search index, use the Crawl and Index > Crawl URLs page in the Admin Console.
For example, suppose the nucleotraining.com domain contains an obsolete site called NucleoCoursewareAlpha and you want to remove it from the search index.
To remove the NucleoCoursewareAlpha site from the search index:
The Crawl URLs page appears.
For information about valid Google Apps URLs, refer to Writing URL Patterns for Google Apps Content.
For more information about removing URLs from the search index, refer to Administering Crawl for Web and File Share Content. For information about using the Crawl and Index > Crawl URLs page, click Help Center > Crawl and Index > Crawl URLs in the Admin Console.
As a search appliance administrator, you can create collections, which are subsets of the search index. A collection lets users:
For example, you might create a collection called GoogleApps that contains documents, presentations, spreadsheets, and sites. To narrow a search to include only documents, presentations, spreadsheets, and sites, and exclude all other content, a user could select the GoogleApps collection on the search page. All the results served from this collection would be from a Google Apps domain. To create a collection, use the Crawl and Index > Collections page in the Admin Console.
To create the GoogleApps collection:
The Collections page appears.
Either leave the Use default configuration option selected or click the Import configuration from file option.
The new collection's name, GoogleApps, appears in the list of collections and is selected.
The GoogleApps collection page appears.
For information about valid Google Apps URLs, refer to Writing URL Patterns for Google Apps Content.
For more information about collections, refer to Creating the Search Experience. For information about using the Crawl and Index > Collections page, click Help Center > Crawl and Index > Collections in the Admin Console.
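Once a collection exists, a search request can be restricted to it. The sketch below builds such a request URL, assuming the appliance's standard search request parameters `q` (query), `site` (collection), `client` (front end), and `output`. The function name `search_url` and the hostname `gsa.example.com` are hypothetical; check the parameter names against the search protocol reference for your software version.

```python
from urllib.parse import urlencode


def search_url(hostname: str, query: str, collection: str, front_end: str) -> str:
    """Build a search request limited to one collection and front end."""
    params = {
        "q": query,
        "site": collection,      # collection to search (assumed parameter name)
        "client": front_end,     # front end to serve with (assumed parameter name)
        "output": "xml_no_dtd",  # raw XML results
    }
    return f"http://{hostname}/search?{urlencode(params)}"


# "gsa.example.com" is a placeholder hostname.
print(search_url("gsa.example.com", "course schedule", "GoogleApps", "NucleoGroupMembers"))
```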
You import or export a Google Apps integration configuration by importing or exporting the configuration file for a search appliance. A search appliance configuration file contains information about front end configuration, collections, and general parameters in XML format. The default name for the configuration file is config.xml.
The search appliance configuration file contains the following information about a Google Apps integration:
To import or export a configuration file, use the Administration > Import/Export page in the Admin Console. For information about using this page, click Help Center > Administration > Import/Export in the Admin Console.
When the Google Search Appliance configures itself, it creates a Google Apps role account and password that it uses to access Google Apps. An exported configuration file does not include Google Apps role account credentials.
To export a configuration file:
The Import/Export page appears.
Usually, a passphrase is the same as the Admin Console password.
When you import a configuration file, the current settings for Google Apps integration on a search appliance might or might not be preserved, depending on settings in both the file and the search appliance.
To import a configuration file:
The Import/Export page appears.
If your configuration is complex, the import process can be very slow. A configuration that contains multiple megabytes of data, has hundreds of front ends, or creates hundreds of collections can require over an hour to import.
Refer to the following table for information about how Google Apps integration settings in a configuration file affect the Google Apps integration settings on a search appliance when you import a file.
| In the Configuration File | On the Search Appliance | When You Import the Configuration File |
|---|---|---|
| Google Apps integration is enabled for a specific domain, for example, domain1.com. | Google Apps integration is disabled. | The search appliance prompts you to enable Google Apps integration. |
| Google Apps integration is enabled for a specific domain, for example, domain1.com. | Google Apps integration is enabled for the same domain (domain1.com). | The integration continues to be enabled. |
| Google Apps integration is enabled for a specific domain, for example, domain1.com. | Google Apps integration is enabled for a different domain (domain2.com). | The search appliance prompts you to enable Google Apps integration. The domain in the configuration file overrides the domain on the search appliance. |
| Google Apps integration is disabled. | Google Apps integration is enabled. | The search appliance disables Google Apps integration. |
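
One way to read the import rules is as a small decision function. The sketch below assumes the table describes four cases: the file enables a domain while the appliance is disabled, the file and appliance enable the same domain, they enable different domains, and the file is disabled while the appliance is enabled. The function name `import_outcome` is hypothetical, `None` stands for "integration disabled", and the case where both sides are disabled is not covered by the table, so its result here is an assumption.

```python
from typing import Optional


def import_outcome(file_domain: Optional[str], appliance_domain: Optional[str]) -> str:
    """Return what happens to Google Apps integration on import.

    Each argument is the enabled domain, or None if integration is disabled.
    """
    if file_domain is None:
        if appliance_domain is None:
            # Not covered by the table; assumed to leave integration disabled.
            return "no change (assumed)"
        return "appliance disables integration"
    if appliance_domain is None:
        return "appliance prompts you to enable integration"
    if appliance_domain == file_domain:
        return "integration continues to be enabled"
    return "appliance prompts you to enable integration; file domain overrides"


print(import_outcome("domain1.com", None))
print(import_outcome("domain1.com", "domain1.com"))
print(import_outcome("domain1.com", "domain2.com"))
print(import_outcome(None, "domain1.com"))
```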