Google Search Appliance

Getting the Most from Your Google Search Appliance: Crawl and Index

Google Search Appliance software version 6.0
Posted June 2009

The Google Search Appliance enables you to provide universal search to your users. You can get the most from your Google Search Appliance by using some or all of its many features to fine-tune and enhance universal search. Become familiar with the Google Search Appliance's features by reading this document and apply those features that best suit your search solution.

Contents

  1. Crawling and Indexing
  2. Crawling Public Content
    1. What Content Is Not Crawled?
    2. Configuring Crawl of Public Content
    3. Learn More about Public Crawl
    4. Public Crawl Tips and Best Practices
  3. Crawling and Serving Controlled-Access Content
    1. Configuring Crawl of Controlled-Access Content
    2. Managing Serve of Controlled-Access Content
    3. Learn More about Controlled-Access Content
  4. Indexing Content in Non-Web Repositories
    1. Serving Results from a Content Management System
    2. Obtaining the Connector Manager and Connectors
    3. Configuring a Connector
    4. Learn More about Connectors
  5. Indexing Hard-to-Find Content
    1. Pushing a Feed to the Search Appliance
    2. Learn More about Feeds
  6. Indexing Database Content
    1. Synchronizing a Database
    2. Learn More about Database Synchronization
  7. Indexing Google Apps Content
    1. Setting Up Google Apps Integration
    2. Learn More about Google Apps Integration
  8. Testing Indexed Content
    1. Best Practices for Testing Indexed Content

Crawling and Indexing

After the Google Search Appliance has been set up, you can configure the search appliance to crawl the content sources that you identified during the planning phase, as described in Identifying Content Sources.

Crawl is the process by which the Google Search Appliance discovers enterprise content and creates a master index. The resulting index consists of all of the words, phrases, and meta-data in the crawled documents. When users search for information, their queries are executed against the index rather than the actual documents themselves.

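The Google Search Appliance can crawl:

  • Public content
  • Controlled-access content

The Google Search Appliance is also capable of indexing:

  • Content in non-web repositories
  • Hard-to-find content
  • Database content
  • Google Apps content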

This section briefly describes how the Google Search Appliance indexes each type of content.

Crawling Public Content

Public content is not restricted in any way; users don't need credentials to view it. Some of the most common forms of public content include:

  • Employee portals
  • Frequently Asked Questions
  • Employee policies
  • Benefits information
  • Product documentation
  • Marketing literature

The Google Search Appliance supports crawling of many types of formats, including word processing, spreadsheet, presentation, and others.

The Google Search Appliance crawls content on web sites or file systems according to crawl patterns that you specify by using the Admin Console. As the search appliance crawls public content sources, it indexes the documents that it finds. To find more documents, the crawler follows links within the documents that it indexes. The search appliance does not crawl content that you choose to exclude from the index.

The following figure provides an overview of crawling public content.

simple crawl diagram

What Content Is Not Crawled?

The Google Search Appliance does not crawl unlinked URLs or links that are embedded within an area tag. Also, the search appliance does not crawl or index content that is excluded by these mechanisms:

  • Do not follow and crawl URLs that you specify by using the Crawl and Index > Crawl URLs page in the Admin Console
  • robots.txt file--The Google Search Appliance always obeys the rules in robots.txt and it is not possible to override this feature. Before the search appliance crawls any content servers in your environment, check with the content server administrator or webmaster to ensure that robots.txt allows the search appliance user agent access to the appropriate content
  • nofollow robots META tags that appear in content sources

Typically, webmasters, content owners, and search appliance administrators create robots.txt files and add META tags to documents before a search appliance starts crawling.
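Before a crawl starts, a content server administrator could check a robots.txt file against the crawler's user agent. The following Python sketch uses the standard library's urllib.robotparser module. The user-agent string gsa-crawler, the host name, and the sample rules are assumptions for illustration; substitute the user agent that is configured on your search appliance.

```python
from urllib.robotparser import RobotFileParser

# Sample rules (assumed for illustration): the appliance's crawler may fetch
# everything except /private/, and all other robots are blocked.
SAMPLE_ROBOTS_TXT = """\
User-agent: gsa-crawler
Disallow: /private/

User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for path in ("/docs/faq.html", "/private/payroll.html"):
        url = "http://intranet.example.com" + path
        print(url, is_allowed(SAMPLE_ROBOTS_TXT, "gsa-crawler", url))
```

To check a live server instead of a string, RobotFileParser also offers set_url() and read(), which download the file before can_fetch() is called.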

Configuring Crawl of Public Content

To configure a search appliance to crawl a content source, you specify top-level URLs and directory addresses and links that the search appliance should follow by using the Crawl and Index > Crawl URLs page in the Admin Console. In addition to specifying start URLs, you can also specify URLs that the search appliance should not follow and crawl.

By default, the search appliance crawls in continuous crawl mode. This means that after the Google Search Appliance creates the index, it keeps crawling content sources, looking for new or modified content, and updates the index to ensure that it contains the freshest listings. The search appliance can also crawl content according to a schedule.

Configure continuous crawl by performing the following steps with the Admin Console:

  1. Specifying where to start the crawl by listing top-level URLs and directory addresses in the Start Crawling from the Following URLs section on the Crawl and Index > Crawl URLs page, shown in the following figure.
  2. Specifying links for the search appliance to follow and index by listing patterns in the Follow and Crawl Only URLs with the Following Patterns section.
  3. Listing any URLs that you don't want the search appliance to crawl in the Do Not Crawl URLs with the Following Patterns section.
  4. Saving the URL patterns.

crawl urls page

After you save the URL patterns, the search appliance begins crawling in continuous mode.
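The way the three sections of the Crawl URLs page interact can be modeled roughly as follows. This Python sketch is a simplification under stated assumptions: it uses plain prefix matching in place of the appliance's own URL pattern syntax, and the class name, field names, and URLs are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CrawlPatterns:
    """Hypothetical stand-in for the three sections of the Crawl URLs page."""

    start_urls: tuple              # Start Crawling from the Following URLs
    follow_prefixes: tuple         # Follow and Crawl Only URLs with the Following Patterns
    do_not_crawl_prefixes: tuple = ()  # Do Not Crawl URLs with the Following Patterns

    def should_crawl(self, url: str) -> bool:
        # A URL is crawled only if it matches a follow pattern
        # and does not match any do-not-crawl pattern.
        followed = any(url.startswith(p) for p in self.follow_prefixes)
        excluded = any(url.startswith(p) for p in self.do_not_crawl_prefixes)
        return followed and not excluded


patterns = CrawlPatterns(
    start_urls=("http://intranet.example.com/",),
    follow_prefixes=("http://intranet.example.com/",),
    do_not_crawl_prefixes=("http://intranet.example.com/archive/",),
)
```

In this model, a start URL that does not also match a follow pattern would never be crawled, which is a useful thing to check before saving a real configuration.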

If you prefer to have the search appliance crawl according to scheduled times, you must also perform the additional following tasks by using the Crawl and Index > Crawl Schedule page in the Admin Console:

  1. Selecting scheduled crawl mode.
  2. Creating a crawl schedule.
  3. Saving the crawl schedule.

To schedule crawling times for a specific host, you can change the host load and times on the Crawl and Index > Host Load Schedule page. If you set a host load of 0, the crawler does not crawl that host during the configured time period.

To add a document to the crawl queue right away, enter its URL in the Re-Crawl These URL Patterns section on the Crawl and Index > Freshness Tuning page.

Learn More about Public Crawl

For in-depth information about public crawl, configuring a search appliance to crawl, and starting a crawl, refer to Administering Crawl for Web and File Share Content.

For a complete list of file types that the search appliance can crawl, refer to Indexable File Formats.

Public Crawl Tips and Best Practices

 

  • If in doubt about what to crawl, have the search appliance crawl as much content as possible. The most common reason users are unable to find relevant information with the search appliance is that the content they're interested in wasn't crawled. If a document is important to the information flow of your organization, make sure it gets into your index. Additionally, the Google search algorithm gets better as you add more information. This practice applies to both public and controlled-access content.
  • In some situations, the Google Search Appliance may not be able to access all content on a particular site, due to JavaScript-based navigation or forms. In this situation, Google recommends that you include a site map or jump page as a starting point in the crawl patterns. A site map or jump page is a page of hypertext links that allows users or robots to navigate to all pages within a web site.
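A jump page can be generated from a list of URLs that are otherwise reachable only through JavaScript or forms. The following Python sketch shows one way to do this; the function name and the example URLs are hypothetical, and the resulting page would still need to be published on a server and added as a start URL.

```python
from html import escape


def build_jump_page(title, urls):
    """Return a minimal HTML page that links to every URL in urls."""
    items = "\n".join(
        f'      <li><a href="{escape(url, quote=True)}">{escape(url)}</a></li>'
        for url in urls
    )
    return (
        "<html>\n"
        f"  <head><title>{escape(title)}</title></head>\n"
        "  <body>\n"
        "    <ul>\n"
        f"{items}\n"
        "    </ul>\n"
        "  </body>\n"
        "</html>\n"
    )


page = build_jump_page(
    "Jump page for the crawler",
    [
        "http://intranet.example.com/app?id=1&view=full",
        "http://intranet.example.com/app?id=2&view=full",
    ],
)
```

Escaping each URL keeps characters such as & from producing invalid markup in the generated links.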

Crawling and Serving Controlled-Access Content

Controlled-access content is secure content--it is restricted so that not all users have access to it. For access to controlled-access content, users need authorization.

A search appliance discovers and indexes controlled-access content in the same way that it indexes all other content: by performing a crawl through the content sources. However, the search appliance requires access credentials to discover and index controlled-access content. Once you set up the search appliance with access credentials, it maintains a copy of all crawled content in the index.

The following figure provides an overview of crawling controlled-access content.

crawling secure content

Controlled-access methods that the Google Search Appliance supports for crawl include:

  • HTTP Basic Authorization
  • NTLM HTTP
  • Forms Authentication
  • X.509 certificate authorities

Configuring Crawl of Controlled-Access Content

If the content files you want crawled and indexed are in a location that requires a login, create a special user account on your network for the search appliance. When you configure crawl on the Admin Console, provide the user name and password for that account. The search appliance presents those credentials before crawling files in that location.

Configure a search appliance to crawl controlled-access content by performing the following steps with the Admin Console:

  1. Configuring the crawl as described in Configuring Crawl of Public Content, but also providing the search appliance with URL patterns that match the controlled content.
  2. Specifying access credentials for each URL pattern by using the appropriate Admin Console pages. The means by which you provide these credentials is different for each kind of authentication:
    • For HTTP Basic and NTLM HTTP, use the Crawl and Index > Crawler Access page
    • For HTTPS web sites, the search appliance uses a serving certificate as a client certificate when crawling. Upload a new certificate by using the Administration > Certificate Authorities page

      The following figure shows the Crawl and Index > Crawler Access page.

crawler access page

Managing Serve of Controlled-Access Content

When a user issues a search request for controlled-access content, the search appliance verifies the user's identity and determines whether the user has authorization to view the content. This check is performed before the search appliance displays any content in search results. By performing the results access control checks in real-time, the Google Search Appliance ensures that users only see results they are authorized to view.

A search appliance can use the following methods to establish the user's identity:

  • HTTP Basic or NTLM HTTP with authentication against an LDAP server
  • IWA (Integrated Windows Authentication) / Kerberos authentication against a domain controller
  • HTML Forms-based Authentication
  • The SAML Authentication and Authorization Service Provider Interface (SPI)
  • Digital Certificates and Certification Authorities

Once the user's identity has been established, a search appliance attempts to determine whether the user has access to the secure content that matches their search. The authorization check is performed based on the security configuration of the search appliance:

  • If the search appliance is configured to use the SAML Authentication and Authorization SPI, the search appliance sends a SAML authorization request to the Policy Decision Point, using the identity obtained for the user during serve authentication.

    For more information about the SAML Authentication and Authorization SPI, refer to Integrating with an Existing Access Control Infrastructure.
  • For secure content that was crawled using HTTP Basic or NTLM HTTP authentication, the search appliance performs a HEAD request for the document, using the credentials obtained for the user during serve authentication.
  • For secure content that was crawled using Forms Authentication, the search appliance performs a GET request for 0 bytes of the document, using the credentials obtained for the user during serve authentication.

If the authorization check is successful, the secure content that matches the search query is included in the user's search results.
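The HEAD-based check for content crawled with HTTP Basic authentication can be pictured with a short Python sketch. It only builds the request and interprets a status code; it is not the appliance's implementation. The function names are hypothetical, and treating any 2xx status as "authorized" is an assumption, because this document does not say how the search appliance interprets the response.

```python
import base64
import urllib.request


def build_head_request(url, username, password):
    """Build a HEAD request that carries the user's HTTP Basic credentials."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url, method="HEAD")
    request.add_header("Authorization", f"Basic {token}")
    return request


def is_authorized(status_code):
    """Assumption: a 2xx status means the user may view the document."""
    return 200 <= status_code < 300


request = build_head_request(
    "http://intranet.example.com/secure/report.html", "alice", "secret"
)
```

A HEAD request is a natural fit for this kind of check because the content server applies its normal access rules without returning the document body.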

Configuring Serve of Controlled-Access Content

The process for configuring serve of controlled-access content is dependent on the security method you want to use, as described in the following list:

  • To enable the search appliance to authenticate credentials against an LDAP server, use the Administration > LDAP Setup page in the Admin Console.
  • To enable the search appliance to use IWA/Kerberos authentication during secure serve, use the Crawl and Index > Crawler Access page.
  • To configure a search appliance to perform forms authentication, use the Serving > Forms Authentication page.
  • To configure the search appliance to use the Authentication or Authorization SPI, use the Serving > Access Control page.
  • To configure the search appliance to require X.509 Certificate Authentication for search requests from users, use the Administration > Certificate Authorities page.

Learn More about Controlled-Access Content

For complete information about configuring a search appliance to crawl and serve controlled-access content, refer to Managing Search for Controlled-Access Content.

Indexing Content in Non-Web Repositories

If your organization has content that is stored in non-web repositories, such as Enterprise Content Management (ECM) systems, you can enable the Google Search Appliance to index and serve this content by using the connector framework.

The Google Search Appliance provides the indexing capabilities for the following content management systems:

  • Microsoft SharePoint Portal Server
  • Microsoft SharePoint Services
  • EMC Documentum
  • Open Text LiveLink Enterprise Server
  • IBM FileNet Content Manager

Also, Google partners have developed connectors for other non-web repositories. For information about these connectors, visit Google Solutions Marketplace.

The connector manager is the central part of the connector framework for the Google Search Appliance. It manages the creation, instantiation, scheduling, and monitoring of connectors that supply content and provide authentication and authorization services to the Google Search Appliance. Connectors run on connector managers that reside in servlet containers installed on computers on your network. All Google-supported connectors are certified on Apache Tomcat 5.5.23.

When connecting to a document repository through an enterprise connector, the Google Search Appliance uses a process called "traversal." During traversal, the connector issues queries to the repository to retrieve document data to feed to the Google Search Appliance for indexing. The connector manager formats the content and any associated metadata for a feed to the Google Search Appliance, which then creates an index of the documents.

The following figure provides an overview of indexing content in non-web repositories.

connector overview

You can also create a custom connector for the Google Search Appliance, as described in Developing Custom Connectors.

Serving Results from a Content Management System

For public content in a repository, searches work the same way as they do with web and file-system content. The Google Search Appliance searches its index and returns relevant result sets to the user without any involvement by the connector.

To authorize access to private or protected content from a repository, the Google Search Appliance creates a connector instance at query time. The connector instance forwards authentication credentials to the repository for authorization checking. The connector manager recognizes identities passed from basic authentication, SAML authentication, and client certificates. If a SAML authentication provider is set up to support single sign-on (SSO), the connector manager also recognizes identities passed from the SSO provider.

Obtaining the Connector Manager and Connectors

To run a connector, you need the software for the connector manager and the connector. The following table lists methods for obtaining the software components that you need to use connectors, as well as the support provided for each component.

Component: Source code for the connector manager and connectors
Obtain by: Download the code from the Google Enterprise Connector Manager project on code.google.com. The open-source software is for the development of third-party connectors. Developers using the resources provided in this project can create connectors for virtually any type of document-based repository.
Support: Google does not support the open-source software or changes you make to the open-source software.

Component: An installer package that deploys Apache Tomcat, a connector manager, and a particular connector type
Obtain by: Download the package from the Google Enterprise Support web site.
Support: Google supports the installer and the software packaged with the installer.

Configuring a Connector

Before you configure a connector, install the following software components:

  • The appropriate Java Development Kit (JDK) for the content management system
  • Apache Tomcat 5.5.23
  • Native client libraries required by the content management system

The specific process that you follow for configuring a connector depends on the type of connector. Generally, you can configure a connector by performing the following steps:

  1. Installing a connector on a host running Apache Tomcat.
  2. Registering a connector manager by using the Connector Administration > Connector Managers page in the Admin Console.
  3. Adding a connector by using the Connector Administration > Connectors page, shown in the following figure.

    add connectors page

  4. Configuring crawl patterns by using the Crawl and Index > Crawl URLs page.
  5. If required by the connector, configuring feeds by using the Crawl and Index > Feeds page.
  6. If required by the connector, configuring secure crawling of the content management system by using the Admin Console page that is appropriate for the specific connector.
  7. Restarting the connector.
  8. Verifying that the search appliance is indexing URLs from the connector by using the Status and Reports > Crawl Diagnostics page.

Learn More about Connectors

For in-depth information about connectors, refer to the following Google Search Appliance documents:

Indexing Hard-to-Find Content

During crawl, the search appliance finds most of the content that it indexes by following links within documents. However, many organizations have content that cannot be found this way because it is not linked from other documents. If your organization has content that cannot be found through links on crawled web pages, you can ensure that the Google Search Appliance indexes it by using Feeds. Feeds are also useful for the following types of content:

  • Documents that should be crawled at specific times that are different from those set in the crawl schedule
  • Documents that could be crawled, but are much more quickly uploaded using feeds.

You can also use feeds to delete data from the index on the search appliance.

The Google Search Appliance supports two types of feeds, as described in the following table.

Type: Web feed
Description: A web feed does not provide content to the Google Search Appliance. Instead, a web feed provides a list of URLs to the search appliance. Optionally, a web feed may include metadata. The crawler queues the URLs listed in the web feed and fetches content for each document listed in the feed. Web feeds are incremental. The search appliance recrawls web feeds periodically, based on the crawl settings for your search appliance.

Type: Content feed
Description: A content feed provides both URLs and their content to the search appliance. A content feed may include metadata. A content feed can be either full or incremental. The search appliance only crawls content feeds when they are pushed.

The following figure provides an overview of indexing hard-to-find content by using feeds.

 

feeds

Pushing a Feed to the Search Appliance

To push a content feed to the search appliance, you must provide the following components:

  • Feed--An XML document that tells the search appliance about the contents that you want to push
  • Feed client--An application or web page that pushes the feed to a feeder process on the search appliance

You can use one of the feed clients described in Feeds Protocol Developer's Guide or write your own. For information about writing a feed client, refer to Writing Applications with the Feeds Protocol.
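To give a feel for what the feed XML document might look like, the following Python sketch builds a small content feed with the standard library. The element and attribute names (gsafeed, header, datasource, feedtype, group, record, content) and the feedtype value are assumptions about the feed format; check them against the feeds documentation before relying on them. The sketch also omits the XML declaration and DOCTYPE that a real feed may require, and it does not push anything to a search appliance.

```python
import xml.etree.ElementTree as ET


def build_content_feed(datasource, feedtype, documents):
    """Build feed XML from (url, text) pairs. Element names are assumed."""
    root = ET.Element("gsafeed")
    header = ET.SubElement(root, "header")
    ET.SubElement(header, "datasource").text = datasource
    ET.SubElement(header, "feedtype").text = feedtype
    group = ET.SubElement(root, "group")
    for url, text in documents:
        # Each record carries the document URL and its content.
        record = ET.SubElement(group, "record", url=url, mimetype="text/plain")
        ET.SubElement(record, "content").text = text
    return ET.tostring(root, encoding="unicode")


feed_xml = build_content_feed(
    datasource="hr_documents",  # hypothetical data source name
    feedtype="incremental",
    documents=[
        ("http://intranet.example.com/hr/unlinked-policy.txt", "Travel policy text"),
        ("http://intranet.example.com/hr/unlinked-faq.txt", "Benefits FAQ text"),
    ],
)
```

A feed client would then send this document to the feeder process on the search appliance.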

URL Patterns and Trusted IP lists that you define with the Admin Console ensure that your index only lists content from desirable sources. When pushing URLs with a feed, you must verify that the Admin Console will accept the feed and allow your content through to the index. For a feed to succeed, it must be fed from a trusted IP address and at least one URL in the feed must pass the rules defined on the Admin Console.
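The acceptance rule described above (a trusted sender, and at least one URL that passes the URL patterns) can be expressed as a small Python check. This is a sketch of the rule as stated, not of the appliance's internals; the function name, the network range, and the prefix-based pattern test are hypothetical.

```python
import ipaddress


def feed_is_accepted(sender_ip, trusted_networks, feed_urls, url_passes_patterns):
    """Return True if the sender is trusted and at least one URL passes."""
    sender = ipaddress.ip_address(sender_ip)
    trusted = any(sender in ipaddress.ip_network(net) for net in trusted_networks)
    return trusted and any(url_passes_patterns(url) for url in feed_urls)


def passes(url):
    # Hypothetical pattern test: only URLs under the intranet host pass.
    return url.startswith("http://intranet.example.com/")


accepted = feed_is_accepted(
    sender_ip="10.0.0.7",
    trusted_networks=["10.0.0.0/24"],
    feed_urls=["http://intranet.example.com/hr/policy.txt"],
    url_passes_patterns=passes,
)
```

Running a check like this before pushing a feed can help explain why a feed is rejected or why none of its documents reach the index.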

Push a content feed to the search appliance by performing the following steps:

  1. Adding the URL for the document defined in the Feed Client to crawl patterns by using the Crawl and Index > Crawl URLs page. URLs specified in the feed will only be crawled if they pass through the patterns specified on the Crawl and Index > Crawl URLs page.
  2. Configuring the search appliance to accept the feed by using the Crawl and Index > Feeds page, shown in the following figure. To prevent unauthorized additions to your index, feeds are only accepted from machines that are specified on this page.

    feeds page
  3. Running the feed client script.
  4. Monitoring the feed by using the Admin Console.
  5. Checking for search results from the feed within 30 minutes of running the feed client script.

Learn More about Feeds

For complete documentation on feeds, refer to the Feeds Protocol Developer's Guide.

Indexing Database Content

The Google Search Appliance can also index records in a relational database. The Google Search Appliance supports indexing of the following relational database management systems:

  • IBM DB2
  • MySQL
  • Oracle
  • Microsoft SQL Server
  • Sybase

The search appliance provides access to data stored in relational databases by crawling the content directly from the database and serving the content. The process of crawling a database is called "synchronizing a database." To access content in a database, the Google Search Appliance sends SQL (Structured Query Language) queries using JDBC (Java Database Connectivity) adapters provided by database companies. It crawls the contents of the database and then pushes records from a database into the search appliance’s index using feeds.

The following figure provides an overview of indexing content in databases.

database crawl

Synchronizing a Database

Synchronize a database by performing the following tasks with the Admin Console:

  1. Creating a new database source on the Crawl and Index > Databases page, shown in the following figure.

    databases page
  2. Setting URL patterns that enable the search appliance to crawl the database by using the Crawl and Index > Crawl URLs page.
  3. Starting a database synchronization by using the Crawl and Index > Databases page.
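Conceptually, a synchronization runs a SQL query and turns each returned row into a record for the index. The following Python sketch illustrates that idea only. It uses an in-memory SQLite database in place of a JDBC connection to one of the supported systems, and the table, column names, and helper function are hypothetical.

```python
import sqlite3


def fetch_records(connection, query):
    """Run a query and return each row as a dict keyed by column name."""
    cursor = connection.execute(query)
    columns = [description[0] for description in cursor.description]
    return [dict(zip(columns, row)) for row in cursor]


# Stand-in database with a few rows to synchronize.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, dept TEXT)"
)
connection.executemany(
    "INSERT INTO employees (name, dept) VALUES (?, ?)",
    [("Ada", "Engineering"), ("Grace", "Research")],
)

records = fetch_records(
    connection, "SELECT id, name, dept FROM employees ORDER BY id"
)
```

Each dictionary in records corresponds to one database row that would become a searchable result.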

Learn More about Database Synchronization

For in-depth information about how the Google Search Appliance indexes and serves database content, as well as a complete list of databases and JDBC adapter versions that the Google Search Appliance supports, refer to Database Crawling and Serving.

Indexing Google Apps Content

Google Apps provides your organization with tools for collaborating on documents, spreadsheets, presentations, and sites. You can integrate a search appliance with Google Apps, which enables the search appliance to index content and serve results from a Google Apps domain's public Google Docs and Google Sites content.

The following figure shows search results from Google Apps, listed above the other results.

results from Google Apps

When a Google Search Appliance integrates with Google Apps, the search appliance crawls and serves only public content from a Google Apps domain. Any Google Apps content that has not been published or shared is not crawled.

Setting Up Google Apps Integration

Before a search appliance can integrate with Google Apps, there must be a Google Apps domain with content that the search appliance can crawl and index. To sign up for Google Apps, visit the Google Apps editions page. Also, ensure that the content on Google Apps that you want to index is public.

Set up Google Apps integration by enabling it on the Google Apps > Google Apps Integration page in the Admin Console. The following figure shows the Google Apps > Google Apps Integration page.

Google apps integration page

Learn More about Google Apps Integration

For in-depth information about setting up and using Google Apps Integration, refer to Integrating with Google Apps.

Testing Indexed Content

Once the content has been crawled and indexed, you can ensure that it is searchable by using the Test Center. The Test Center enables you to test search across the indexed content, limiting it to specific collections or using specific front-ends and verifying that the correct content is indexed and that the results are what you expect.

You can find a link to the Test Center at the upper right side of the Admin Console. When you click the Test Center link, a new browser window opens and displays the Test Center page, as shown in the following figure.

test center

Best Practices for Testing Indexed Content

  • Run test queries against your new index. Enter keywords that seem likely to return specific documents in search results.
  • The search appliance might crawl more content than you expect, including sensitive content. You might want to use the Test Center to run some sample queries for certain key terms (SSN, social security number, raise, reduction in force, or others that make sense for your organization) to ensure that sensitive content is not accessible.
  • Test the relevancy of the search results. Do keyword searches return all the relevant documents? If not, you may want to add synonyms by using query expansion or use related queries.
  • Are important documents for the search terms appearing first in the search results? If not, you may want to use KeyMatches or result biasing to make adjustments to the rankings.
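If you want to repeat the same checks after each crawl, the sample queries for sensitive terms can be prepared as a list of search URLs. The following Python sketch only builds the URLs; it does not contact a search appliance. The host name, the /search path, and the parameter names and values (q, site, client, output) are assumptions about the appliance's search interface, so verify them against your own deployment.

```python
from urllib.parse import urlencode

# Terms taken from the best practice above; extend for your organization.
SENSITIVE_TERMS = ["SSN", "social security number", "raise", "reduction in force"]


def build_test_query_url(
    appliance_host,
    term,
    collection="default_collection",  # assumed collection name
    frontend="default_frontend",      # assumed front-end name
):
    """Return a search URL for one test term (parameter names are assumed)."""
    params = {
        "q": term,
        "site": collection,
        "client": frontend,
        "output": "xml_no_dtd",
    }
    return f"http://{appliance_host}/search?{urlencode(params)}"


urls = [build_test_query_url("search.example.com", term) for term in SENSITIVE_TERMS]
```

Any query in this list that returns results is a prompt to review the crawl patterns or the access controls on the matching content.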

 

Let Google know what you think about this document by sending feedback to gsadoc-gtm-feedback@google.com.
