Configuring the Google Enterprise Connector for Microsoft SharePoint

Google Search Appliance software version 5.0
Connector software versions 1.0 and 1.1.0
Posted October 2007
Revised December 2007: Added support for Microsoft SharePoint Portal Server 2003 (SPS 2003) and Microsoft Windows SharePoint Services 2.0 (WSS 2.0), added logging information, expanded connector configuration section.
Revised January 2008: Updates and corrections
Revised March 2008: Updates to manual installation section; correction to Indexing SharePoint Sites That Have Many Subsites; corrections to the section on upgrading
Revised June 2008: Clarified user privileges for user running the installer


This document is for Google Search Appliance administrators who want to set up and manage the Google Enterprise Connector for Microsoft SharePoint. Read this document and the following related documents:

This document is for SharePoint Server administrators and administrators who install and configure the Google Search Appliance. If you are working with the Google Enterprise Connector for Microsoft SharePoint and you are not familiar with SharePoint, work closely with a SharePoint system administrator to determine the correct values for installing and configuring the connector.

Contents

  1. Introduction
  2. Supported SharePoint Versions
  3. Supported Operating Systems
  4. How SharePoint Sites are Indexed
    1. Indexing SharePoint Sites That Have Many Subsites
  5. Objects that are Indexed
    1. Microsoft SharePoint Portal Server 2003 and Microsoft Windows SharePoint Services 2.0
    2. Microsoft Office SharePoint Server 2007 and Microsoft Windows SharePoint Services 3.0
  6. How Security is Supported
    1. Serve Time User Authorization and Document Access Control
    2. Required User Credentials for Traversal and Indexing
  7. How Host Aliases are Supported
  8. Configuring Microsoft SharePoint Server for the Connector
    1. Using a robots.txt File to Enable or Restrict Crawl
    2. Configuring SharePoint 2007 to Use Fully Qualified Domain Names
  9. Installing the Google Enterprise Connector for Microsoft SharePoint Server
    1. Upgrading the Connector
    2. Installing the Connector Using the Installer
    3. Installing the Connector Manually
  10. Configuring the Connector and Crawl Patterns on the Admin Console
    1. Registering the Connector Manager
    2. Configuring the Crawl Patterns
    3. Providing User Credentials for the Crawler
    4. Configuring a Connector Instance
    5. Scheduling the Connector
    6. Restarting the Connector
    7. Verifying that the Connector is Working
  11. Forcing a Recrawl of SharePoint Content
  12. Troubleshooting
    1. Logging
    2. Error Messages
  13. Related Documentation

Introduction

The Google Enterprise Connector for Microsoft SharePoint enables the Google Search Appliance to traverse documents and attachments on SharePoint sites. Instances of the connector fetch metadata and URLs for SharePoint documents and attachments using SharePoint Web services and direct the data to the Google Search Appliance as a metadata and URL feed.

Supported SharePoint Versions

This connector is supported on the following SharePoint versions:

Supported Operating Systems

The Google Enterprise Connector for Microsoft SharePoint Server 2003 and Microsoft SharePoint Server 2007 is supported on the following operating systems:

How SharePoint Sites are Indexed

The SharePoint connector is based on metadata and URL feeds. The connector sends URLs and related metadata to the Google Search Appliance through the connector manager. Those URLs are then crawled and indexed by the Google Search Appliance.
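As an illustration only, the following sketch builds a small metadata and URL feed document with the Python standard library. The element names reflect the general shape of Google Search Appliance feed XML and are an assumption here, as are the data source name, the document URL, and the metadata values. In a real deployment the connector manager builds and sends the feed for you.

```python
import xml.etree.ElementTree as ET


def build_feed(datasource, records):
    """Build a metadata-and-URL style feed as an XML string.

    records is a list of (url, mimetype, metadata_dict) tuples.
    """
    feed = ET.Element("gsafeed")
    header = ET.SubElement(feed, "header")
    ET.SubElement(header, "datasource").text = datasource
    ET.SubElement(header, "feedtype").text = "metadata-and-url"
    group = ET.SubElement(feed, "group")
    for url, mimetype, metadata in records:
        record = ET.SubElement(group, "record", url=url, mimetype=mimetype)
        meta_parent = ET.SubElement(record, "metadata")
        # Sort the names so the output is stable from run to run.
        for name, content in sorted(metadata.items()):
            ET.SubElement(meta_parent, "meta", name=name, content=content)
    return ET.tostring(feed, encoding="unicode")


# Hypothetical document: only the URL and metadata travel in the feed.
# The appliance fetches the content itself when it later crawls the URL.
example = build_feed(
    "sharepoint",
    [("http://moss_host1.yourdomain.com/Docs/plan.doc",
      "application/msword",
      {"Author": "jsmith", "Title": "Project plan"})],
)
```

The point of the sketch is that the feed carries no document content, which is why the crawl patterns and crawler credentials on the appliance must also cover the SharePoint URLs.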

When you configure a SharePoint connector, you designate a particular SharePoint site or subsite as a crawl URL. The connector uses that site or subsite URL as a starting point, then traverses sites under the instance and sites whose links are discovered in the SharePoint content. The links can be to sites on hosts other than the host on which the initial SharePoint instance is located.

The connector identifies sites on other hosts as SharePoint sites by calling the appropriate web services. When you configure the connector, you provide URL patterns that define locations the connector must traverse and locations the connector is prohibited from traversing. Use these patterns to include your company's domains and to exclude sites you do not control or do not want traversed.
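The traversal described above can be modeled as a breadth-first walk that follows a discovered link only when it matches an include pattern and no exclude pattern. The sketch below is a simplified model, not the connector's code: the link graph, host names, and patterns are hypothetical, and Python's re module stands in for the connector's own pattern matching.

```python
import re
from collections import deque


def traverse(start_url, links, include_patterns, exclude_patterns):
    """Return the sites reached from start_url, in discovery order.

    links maps each site URL to the site URLs it links to.
    A site is followed only if it matches an include pattern
    and matches no exclude pattern.
    """
    includes = [re.compile(p) for p in include_patterns]
    excludes = [re.compile(p) for p in exclude_patterns]

    def allowed(url):
        return (any(p.search(url) for p in includes)
                and not any(p.search(url) for p in excludes))

    seen = []
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if url in seen or not allowed(url):
            continue
        seen.append(url)
        queue.extend(links.get(url, []))
    return seen


# Hypothetical link graph: the start site links to a second internal host
# and, through a subsite, to an outside site that the patterns do not cover.
LINKS = {
    "http://sp1.acme.com/": ["http://sp1.acme.com/hr/", "http://sp2.acme.com/"],
    "http://sp1.acme.com/hr/": ["http://partner.example.org/"],
    "http://sp2.acme.com/": ["http://sp2.acme.com/archive/"],
}
found = traverse(
    "http://sp1.acme.com/",
    LINKS,
    include_patterns=[r"^http://[^/]+\.acme\.com/"],
    exclude_patterns=[r"/archive/"],
)
```

In this model, the site on the second acme.com host is traversed because the include pattern covers it, while the partner site and the archive subsite are skipped.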

Indexing SharePoint Sites That Have Many Subsites

Because of a known limitation in SharePoint, the connector cannot traverse SharePoint applications (root sites) that have more than one thousand (1,000) subsites. If you need to traverse a SharePoint application that has more than one thousand subsites, you must use version 1.1.2 of the SharePoint connector.

Objects that are Indexed

The Microsoft SharePoint connector can traverse different types of content, depending on the SharePoint version.

Regardless of which SharePoint version you use, the Google Search Appliance excludes images and graphics from indexing by default. When the Google Search Appliance tries to index SharePoint Picture Libraries, you see error messages on the Crawl and Index > Feeds page on the Admin Console. See Configuring the Crawl Patterns for instructions on including image and graphics files in the index.

Microsoft SharePoint Portal Server 2003 and Microsoft Windows SharePoint Services 2.0

Under Microsoft SharePoint Portal Server 2003 and Microsoft Windows SharePoint Services 2.0, the connector can traverse the following types of content:

Microsoft Office SharePoint Server 2007 and Microsoft Windows SharePoint Services 3.0

Under Microsoft Office SharePoint Server 2007 and Microsoft Windows SharePoint Services 3.0, the connector can traverse the following types of content:

How Security is Supported

The two sections that follow describe how security is supported during the serve process and during the traversal and indexing processes.

Serve Time User Authorization and Document Access Control

At serve time, the Google Search Appliance supports document-level authorization of each search user. Content in a SharePoint repository can be served as secure or public content.

The value of the Make Public check box on the Admin Console determines whether content is secure or public.

You can provide single sign-on (SSO) capabilities using the Google SAML Bridge for Windows. For more information, see Enabling Windows Integrated Authentication.

Required User Credentials for Traversal and Indexing

The Microsoft SharePoint connector and the Google Search Appliance require user credentials for traversal and indexing. Google recommends that you use a single user account for both. Any account you use for traversal and indexing must have Site Collection Administrator privileges in SharePoint and must be a member of the Windows local administrator group on the SharePoint host.

How Host Aliases are Supported

The SharePoint Site Alias Host name is similar to the Alternative Access Mapping in SharePoint. Both features allow multiple entry points to a particular web application, for example, a SharePoint instance used internally by one group of users and externally by partners and other trusted individuals. The entry points for the internal and external users are different URLs. In such a case, the connector uses the internal URL to traverse the SharePoint content, but the Google Search Appliance uses the external URL to crawl and serve the content. The host alias is defined for the appliance, for the crawl and serve processes, and the Crawl URL is defined for the connector to traverse.
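The effect of a host alias can be pictured as a rewrite of the host and port in each URL that the connector discovers. The following sketch shows that rewrite under stated assumptions: the function name and the internal host name are hypothetical, and the connector's actual substitution logic may differ in detail.

```python
from urllib.parse import urlsplit, urlunsplit


def apply_alias(url, alias_host, alias_port=None):
    """Replace the host (and optionally the port) of url with the alias."""
    parts = urlsplit(url)
    netloc = alias_host if alias_port is None else f"{alias_host}:{alias_port}"
    # Scheme, path, query, and fragment are carried over unchanged.
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


# The connector traverses the internal URL; the appliance crawls and
# serves the external (alias) form of the same document.
internal = "http://internal.acme.com:8080/sites/sales/Forms/AllItems.aspx"
external = apply_alias(internal, "external.acme.com", 80)
```

Here the internal URL is what the connector would traverse, and the rewritten URL is what the appliance would need to reach, so the crawl patterns on the appliance must match the alias form.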

Configuring Microsoft SharePoint Server for the Connector

This section contains instructions for configuring SharePoint for the connector.

Using a robots.txt File to Enable or Restrict Crawl

The Google Search Appliance cannot crawl the content unless a robots.txt file is present in the SharePoint site's root directory. Ensure that you create a robots.txt file and ensure that the file is public.

If you are preventing the Google Search Appliance from crawling particular content, put the correct URLs in the robots.txt file.

If you are allowing the Google Search Appliance to crawl all content, ensure that the robots.txt file includes the following syntax:

User-agent: *
Disallow:
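To check how a crawler is likely to read a given robots.txt file before you publish it, you can parse the rules offline. The sketch below uses the robots.txt parser in the Python standard library. The user agent name gsa-crawler and the document URLs are assumptions for illustration, and the appliance's own parser may treat edge cases differently.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_lines, url, user_agent="gsa-crawler"):
    """Return True if the robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


# The allow-all file shown above, and a variant that blocks one directory.
ALLOW_ALL = ["User-agent: *", "Disallow:"]
BLOCK_PRIVATE = ["User-agent: *", "Disallow: /private/"]

open_result = can_crawl(ALLOW_ALL, "http://moss_host1.yourdomain.com/private/doc.aspx")
blocked_result = can_crawl(BLOCK_PRIVATE, "http://moss_host1.yourdomain.com/private/doc.aspx")
```

An empty Disallow line places no restriction on the crawler, whereas a Disallow line with a path blocks every URL under that path.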

Managed Paths and the robots.txt File

When you create a robots.txt file, you must define a managed path for the file in SharePoint. The managed path is configured differently, depending on which SharePoint version you are using.

To define a managed path under SharePoint Portal Server 2003 and Windows SharePoint Services 2.0:

  1. Log in to the SharePoint Central Administration site.
  2. Under Virtual Server Configuration, click Configure virtual server settings.
  3. Select the correct virtual server, for example, Default Web Site.
  4. Under Virtual Server Management, click Define managed paths.
  5. Under Add a New Path, enter the following:

    /robots.txt

  6. Click Check URL.

    A browser window appears and shows a 404 error. This is the correct behavior.

  7. Select Excluded path and click OK.

    The robots.txt file is added to the excluded paths.

  8. Under Add a New Path, enter the following again:

    /robots.txt

  9. Click Check URL again.

    The contents of the robots.txt file are now displayed.

  10. Exit from the Central Administration Site.
  11. If you configured the managed path after feeds were sent to the Google Search Appliance, and you saw the error message Retrying URL: Host unreachable while trying to fetch the robots.txt file, open the Google Search Appliance Admin Console and refresh the crawl from the Crawl and Index > Freshness Tuning page.

    After some time passes, you see the SharePoint site content crawled. Note that the Freshness Tuning page is available only if the Google Search Appliance is in continuous crawl mode.

To define a managed path under SharePoint Server 2007 and Windows SharePoint Services 3.0:

  1. On the top link bar of the Central Administration Web site, click Application Management.
  2. On the Application Management page, in the SharePoint Web Application Management section, click Define managed paths.
  3. On the Define Managed Paths page, select the correct web application.
  4. On the Select Web Application page, click the web application for which you want to define managed paths.
  5. Under Add a New Path, enter the following:

    /robots.txt

  6. In the Type list, select Explicit inclusion and click OK.
  7. If you configured the managed path after feeds were sent to the Google Search Appliance, and you saw the error message Retrying URL: Host unreachable while trying to fetch the robots.txt file, open the Google Search Appliance Admin Console and refresh the crawl from the Crawl and Index > Freshness Tuning page.

    After some time passes, you see the SharePoint site content crawled. Note that the Freshness Tuning page is available only if the Google Search Appliance is in continuous crawl mode.

Configuring SharePoint 2007 to Use Fully Qualified Domain Names

This section applies only to Microsoft Office SharePoint Server 2007 and Microsoft Windows SharePoint Services 3.0.

The Google Search Appliance can crawl content only if URLs contain fully qualified host names. By default, Microsoft SharePoint Server 2007 uses short names for user access to SharePoint sites. If your SharePoint sites are configured with short names, URLs are sent to the Google Search Appliance with short names. The Google Search Appliance cannot process these short URLs. This section tells you how to configure MOSS to use fully-qualified host names.

To configure SharePoint sites to use fully-qualified domain names:

  1. Open the MOSS Central Administration tool from the Start menu.
  2. Navigate to Central Administration > Operations > Alternate Access Mappings.

    The Alternate Access Mappings dialog box displays several internal URLs for the SharePoint site and the admin site. The default settings are short URLs. If you type a fully qualified host name in the browser bar, you are redirected to a short name. For example, if you type http://moss_host1.yourdomain.com/, you are redirected to http://moss_host1/Default.aspx.

  3. Click a shortened URL.
  4. Edit the URL so that it is a fully-qualified domain name.

    For example, change http://moss_host1/ to http://moss_host1.yourdomain.com/.

  5. Click OK.
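After you change the mappings, a quick way to review a list of site URLs is to test whether each host name contains a domain part. The following sketch is a simple heuristic, not a DNS check: it treats any host name that contains a dot as fully qualified, so an IP address also passes. The function name is hypothetical.

```python
from urllib.parse import urlsplit


def has_fully_qualified_host(url):
    """Return True if the host part of url contains a domain, not just a short name."""
    host = urlsplit(url).hostname or ""
    # A short name such as "moss_host1" has no dot;
    # "moss_host1.yourdomain.com" does. A trailing dot is ignored.
    return "." in host.strip(".")
```

For example, has_fully_qualified_host("http://moss_host1/Default.aspx") is false, and has_fully_qualified_host("http://moss_host1.yourdomain.com/") is true.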

Installing the Google Enterprise Connector for Microsoft SharePoint Server

This section describes the installation prerequisites and installation process for the Google Enterprise Connector for Microsoft SharePoint Server 2007 and Microsoft SharePoint Server 2003. 

You can install the connector using an installer that installs Apache Tomcat, a connector manager, and the connector or you can install the three software components manually. Google recommends that you use the installer unless you are building the connector manager or connector from the source code or you are installing a patch release that is not packaged with an installer. Ensure that you complete the prerequisites whether you use the installer or install the connector manually.

Upgrading the Connector

If you are running version 1.0 of the connector, you cannot upgrade directly to version 1.1.0 using the installer. Instead, use the instructions in Administering Connectors to uninstall the existing connector and install the new connector.

Installing the Connector Using the Installer

Before installing the connector using the installer, ensure that Java Development Kit (JDK) 1.4.2 is installed on the host where you are installing the connector.

The instructions that follow are in two parts. In the first part, you download and uncompress the installer package. In the second, you install the software on the connector host.

The user running the installer must have the following user privileges on the connector host:

To download and uncompress the installation package:

  1. Log in to the host using an account with sufficient privileges to install the software.
  2. Start a web browser.
  3. Navigate to the Google Enterprise Technical Support web site and log in.
  4. In the left-hand navigation bar, click Connectors.
  5. Download the correct software distribution package to the host where you are installing the software.
  6. Unzip the package.
  7. If you are on Windows, skip step 8 and go to the instructions immediately below for installing Tomcat, a connector manager, and the connector.
  8. If you are on Linux, follow these instructions.
    1. Open a terminal window and go to the base directory of the GCI.bin file in the extracted folder.
    2. Give the GCI.bin file execute permission.
    3. To run the installer in graphical mode, execute the following command:

      ./GCI.bin LAX_VM path_to_java

      for example, ./GCI.bin LAX_VM /usr/java/j2sdk1.4.2_15/bin/java

    4. To run the installer in console mode, execute the command in Step 3 above with the -i console argument appended.
    5. Go to the following instructions and proceed from Step 2.

To install the connector and its supporting software:

  1. Double-click the installer executable to start the installer.
  2. Click Next.

    The License Agreement panel appears.

  3. Indicate whether you accept or decline the terms of the license and click Next:
    • To accept the license, click I accept the terms of the License Agreement.
    • To decline the terms, click I do NOT accept the terms of the License Agreement.
  4. On the Select Connector panel, select Microsoft SharePoint and click Next.
  5. If the Connector Selection panel is displayed, choose Install new Google Connector and click Next.
  6. On the Connector Configuration panel, enter the name you want to assign to the connector and a port number that is not already used by another application.

    The Start SharePoint Connector Service after installation checkbox determines whether the connector service starts automatically when the installation completes or must be started manually. It is checked by default.

  7. Click Next.
  8. On the Choose Java Development Kit panel, choose the correct JDK for the connector to use, or click Search for Others if the correct JDK is not in the list.

    The connector requires JDK 1.4.2.

  9. Click Next.
  10. On the Choose Install Folder panel, click Next to accept the default location or click Choose to navigate to a different folder, then click Next.
  11. On the Choose Shortcut Folder panel, indicate where you want icons created for the connector and click Next.
  12. Read the information on the Preinstallation Summary panel and click Install.

    An informational panel indicates that the connector installation is in progress. When the installation process is finished, a panel indicates that installation is complete.

  13. Click Done.

    Apache Tomcat starts and deploys the connector manager and connector.

  14. If the Start SharePoint Connector Service after installation checkbox was left unchecked in Step 6, start the connector service:
    • To start the connector as a Windows service, click Start > Programs > GoogleConnectors > connector_name > Start SharePoint Connector Service.

      You can choose the commands to stop the service or the console on the same menu.

    • To start the connector as a console on Windows, click Start > Programs > GoogleConnectors > connector_name > Start SharePoint Connector Console.
    • To start the connector as a console on Linux, open a terminal window and navigate to the installation location of the connector, then use the following command:

      ./Start_SharePoint_Connector_Console

    • To stop the connector as a console on Linux, use the following command:

      ./Stop_SharePoint_Connector_Console

  15. Use the instructions in Configuring the Connector to register the connector manager and add the connector on the Admin Console of the Google Search Appliance.

Installing the Connector Manually

You need to install the connector manually only if you have built and installed a customized connector manager or a customized version of the connector or if you are installing a patch release that is not packaged with an installer. Otherwise, Google strongly recommends that you use the installer.

Before installing the connector, ensure that the following tasks have been performed:

To install the connector manually on Apache Tomcat, follow the instructions given below:

  1. On the Tomcat host, shut down Tomcat if it is running.
  2. Navigate to the download site on code.google.com.
  3. Download the correct Binary Distribution compressed file for your platform to the Apache Tomcat host.
  4. Unzip or untar the compressed file.
  5. Copy the connector-sharepoint.jar file from the root directory to the $CATALINA_HOME/webapps/connector-manager/WEB-INF/lib directory.
  6. Copy the files in the /lib directory to the $CATALINA_HOME/shared/lib directory.
  7. Copy the catalina.jar file from the $CATALINA_HOME/server/lib/ directory to the $CATALINA_HOME/shared/lib directory.
  8. In the $CATALINA_HOME/webapps/connector-manager/WEB-INF folder, create a directory or folder called classes.
  9. Copy the logging.properties file from the /Config folder to the /classes folder.
  10. Open the logging.properties file in a text editor.
  11. Set the value of java.util.logging.FileHandler.pattern equal to the absolute path of the log file.
    • For example, on Windows:

      java.util.logging.FileHandler.pattern=C:/Program Files/Apache Software Foundation/Tomcat 5.5/logs/google-connectors.sharepoint%g.log

      Note that the forward slashes are the correct syntax.

    • For example, on Linux:

      java.util.logging.FileHandler.pattern = /root/Tomcat 5.5/logs/google-connectors.sharepoint%g.log

  12. On Windows, finish configuring logging with the following steps.
    1. Click Start > Programs > Apache Tomcat N > Configure Tomcat.
    2. On the Java tab, under Java Options, add the following:

      -Djava.util.logging.manager=java.util.logging.LogManager
      -Djava.util.logging.config.file=Catalina_home_path\webapps\connector-manager\WEB-INF\classes\logging.properties

    3. Click OK.
    4. Skip to Step 14.
  13. On Linux, finish configuring logging with the following steps.
    1. In a text editor, open the file $CATALINA_HOME/bin/catalina.sh.
    2. Locate the section in which logging is set, which reads as follows:

      if [ -r "$CATALINA_HOME"/bin/tomcat-juli.jar ]; then

      JAVA_OPTS="$JAVA_OPTS "-Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager" "-Djava.util.logging.config.file="$CATALINA_BASE/conf/logging.properties"

    3. Change the JAVA_OPTS value to the following:

      JAVA_OPTS="$JAVA_OPTS "-Djava.util.logging.manager=java.util.logging.LogManager" "-Djava.util.logging.config.file="$CATALINA_BASE/webapps/connector-manager/WEB-INF/classes/logging.properties"

  14. Restart the Tomcat server.
  15. To confirm whether the Tomcat server has restarted correctly and the connector is installed, navigate to the $CATALINA_HOME/webapps/connector-manager/WEB-INF/connectors directory, and verify that the $CATALINA_HOME/webapps/connector-manager/WEB-INF/connectors/sharepoint-connector directory exists.
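The final verification step can also be scripted. The sketch below checks for the sharepoint-connector directory under a Tomcat home directory that you pass in. The function name is hypothetical, and the check confirms only that the directory exists, not that the connector is running.

```python
import os


def connector_deployed(catalina_home):
    """Return True if the sharepoint-connector directory exists under catalina_home."""
    path = os.path.join(
        catalina_home, "webapps", "connector-manager",
        "WEB-INF", "connectors", "sharepoint-connector",
    )
    return os.path.isdir(path)
```

You might call it as connector_deployed(os.environ["CATALINA_HOME"]) on a host where that environment variable is set.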

Configuring the Connector and Crawl Patterns on the Admin Console

This section describes tasks you must perform on the Google Search Appliance Admin Console to configure the connector and the crawl patterns required by the connector.

Ensure that you complete all of the tasks described in the following sections:

Registering the Connector Manager

Use the instructions in Administering Connectors to register the newly-installed connector manager on the Admin Console.

Configuring the Crawl Patterns

The SharePoint connector uses a metadata and URL feed. After the connector traverses the SharePoint sites, the Google Search Appliance crawls and indexes the content on the SharePoint sites. See How SharePoint Sites are Indexed for a complete description of the process. You must therefore enter URL patterns in the Follow and Crawl only URLs with the Following Patterns and Do Not Crawl URLs with the Following Patterns fields on the Crawl and Index page on the Admin Console.

The SharePoint Site Alias Host feature affects the patterns that you enter. The Google Search Appliance uses the alias for crawling and indexing, so you must enter patterns for the alias host and alias port in the fields on the Crawl and Index page. If the alias port is 80, you must make a double entry in the Follow and Crawl only URLs with the Following Patterns field. For example, if the host is external.acme.com, you would enter both of the following URLs:

http://external.acme.com/
http://external.acme.com:80/
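The double-entry rule for port 80 is easy to overlook, so the following sketch generates the entries for a given alias host and port. The function name is hypothetical, and the handling of ports other than 80 (a single host:port entry) is one reading of the rule above rather than documented behavior.

```python
def crawl_patterns_for_alias(host, port=80, scheme="http"):
    """Return the Follow and Crawl patterns to enter for an alias host and port."""
    if port == 80:
        # Port 80 needs both forms: without and with the explicit port.
        return [f"{scheme}://{host}/", f"{scheme}://{host}:80/"]
    return [f"{scheme}://{host}:{port}/"]
```

Calling crawl_patterns_for_alias("external.acme.com") produces the two URLs shown above.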

To configure the crawl:

  1. Navigate to Crawl and Index > Crawl URLs.
  2. Enter URL patterns in the Follow and Crawl only URLs with the Following Patterns field.
  3. Enter URL patterns in the Do Not Crawl URLs with the Following Patterns field.
  4. Examine the file types listed in the Do Not Crawl URLs with the Following Patterns field and comment out any image or graphic types that you want indexed.

    For example, you might comment out .jpg$ and .gif$.
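The effect of commenting out a file-type pattern can be modeled with a small matcher. The sketch below assumes a simplified syntax in which a pattern such as .jpg$ matches URLs that end in .jpg and a line that starts with # is treated as commented out. It does not implement the appliance's full URL pattern language, and the .exe$ entry and the URLs are hypothetical.

```python
def is_excluded(url, pattern_lines):
    """Return True if url ends with any active suffix pattern.

    Assumes a simplified syntax: a pattern such as ".jpg$" matches URLs
    that end in ".jpg", and a line starting with "#" is commented out.
    """
    for line in pattern_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line.endswith("$") and url.endswith(line[:-1]):
            return True
    return False


# Default-style list that excludes images, and the same list with the
# image types commented out so that they can be crawled and indexed.
DEFAULT = [".jpg$", ".gif$", ".exe$"]
WITH_IMAGES = ["#.jpg$", "#.gif$", ".exe$"]
```

With the DEFAULT list a URL that ends in .jpg is excluded. With the WITH_IMAGES list the same URL is no longer excluded, while the .exe entry still applies.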

Providing User Credentials for the Crawler

The Google Search Appliance must have credentials for crawling the SharePoint content. The patterns you enter here must match what you entered on the connector configuration page. Enter all of the pattern URLs, using regular expression patterns that cover broad sets of hosts on your network. These patterns will probably be the same patterns you defined in the Follow and Crawl only URLs with the Following Patterns field on the Crawl and Index page. Do not enter only the top-level Crawl URL.

To provide user credentials for the crawler:

  1. On the Admin Console, navigate to the Crawl and Index > Crawler Access page.
  2. Type the SharePoint pattern URLs in the For URLs Matching Pattern field.
  3. Type in the User Name and Password required for accessing the URLs.
  4. Click Save Crawler Access Configuration.

Configuring a Connector Instance

You can define a connector instance for each SharePoint subsite, or the main top-level site. A Microsoft SharePoint connector instance traverses the site specified in its SharePoint URL, including any subsites that are located under that site.

To configure an instance of a Microsoft SharePoint connector:

  1. Open the Admin Console.
  2. Click Crawl and Index > Feeds.
  3. In the List of Trusted IP Addresses section, select Trust feeds from all IP addresses or Only trust feeds from these IP addresses.
  4. If you selected Only trust feeds from these IP addresses in step 3, type in the trusted IP addresses.
  5. Click Save Settings.
  6. Click Connector Administration > Connectors.
  7. Select the appropriate Connector Manager from the list.
  8. Click Add New Connector to create a new SharePoint Connector instance.
  9. Specify Connector Name and click Get Configuration Form.

The following list describes the fields that you must complete to configure a Microsoft SharePoint connector. Each entry gives the field name, followed by its description, values, and usage:

  • SharePoint Version

    The SharePoint server type that you want the connector to traverse.

    SharePoint 2007: Microsoft SharePoint Server 2007, applying to MOSS 2007 or WSS 3.0

    SharePoint 2003: Microsoft SharePoint Server 2003, applying to SPS 2003 or WSS 2.0

  • Crawl URL

    The URL for the SharePoint site that you want to traverse. The Google Search Appliance traverses this site and any subsites found under it. This is the Crawl URL you designate on the Connector Configuration page, not the Google Search Appliance Crawl Configuration page.

    The URL must contain a fully qualified domain name. The following URLs are acceptable:

      • The root URL of the site, for example, http://www.abc.com.
      • Top-level of site, for example, http://www.abc.com/sites/whatever.
      • URLs starting with https, for example, https://www.abc.com/sites/secret.

    We recommend that you do not have two connector instances accessing the same SharePoint Crawl URL.

  • SharePoint Site Alias Host Name

    A fully-qualified host name that will be used to replace the host name in the Crawl URL. See How Host Aliases are Supported.

  • SharePoint Site Alias Port Number

    The port number used with the Alias Host Name. Optional field. If the default http port is used, enter 80 in this field.

  • Windows Domain

    A valid domain name. The Windows domain where the user is authenticated. If you are using a local (machine) user, provide the machine name or IP address.

  • Username and password

    A valid username and password on the SharePoint Server's domain. The user must have Site Collection Administrator privileges in SharePoint and must be a member of the Windows local administrator group on the SharePoint host. For more information on user permissions, see How Security is Supported.

  • MySite URL (MOSS 2007 only)

    URL for the SharePoint MySite that you want to traverse with this connector instance. The Google Enterprise Connector for Microsoft SharePoint uses the MySite base URL and the credentials you provide to determine the complete MySite URL, then crawls MySite and feeds metadata and URLs to the Google Search Appliance for indexing.

    For example, if the MySite URL is: http://server.domain/personal/administrator/default.aspx, enter http://server.domain. This is an optional field. In SPS 2003, the personal site is deployed on the same port as the portal. Therefore, in connector version 1.1.0, the field is only available when SharePoint 2007 is selected in the SharePoint Version field.

  • Include URLs Matching the Following Patterns

    URL patterns that limit the sites that the connector traverses when it follows links and discovers SharePoint sites. Enter regular expressions. Under version 1.1.0, each URL must be on a new line. Under version 1.0, the URLs must be separated by commas.

    The patterns must include the Crawl URL and must include the MySite URL if you specified MySite. The connector uses these patterns as boundaries when it discovers and traverses SharePoint sites. The connector might discover other sites linked to the SharePoint site defined with the Crawl URL, so the URL patterns you enter here must be broad enough to include those other sites.

    Although these URL patterns can be regular expressions, the format is slightly different from the regular expression patterns used throughout the Google Search Appliance Admin Console. The regular expression patterns elsewhere, such as on the Crawl and Index page, are Perl-based, while the patterns on the SharePoint connector page are Java-based. For a complete reference, please see http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html.

  • Do Not Include URLs Matching the Following Patterns

    URL patterns that exclude particular parts of SharePoint sites that the connector discovers when it follows links during traversal. Optional field. If used, enter regular expressions. Under version 1.1.0, each URL must be on a new line. Under version 1.0, the URLs must be separated by commas.

    The connector uses these patterns to exclude particular sections of the SharePoint sites that are discovered during traversal. See How SharePoint Sites are Indexed for information on the complete process. Because the SharePoint connector relies on a metadata and URL feed, the Google Search Appliance crawls and indexes SharePoint sites after the URLs are retrieved during traversal.

    Although these URL patterns can be regular expressions, the format is slightly different from the regular expression patterns used throughout the Google Search Appliance Admin Console. The regular expression patterns elsewhere, such as on the Crawl and Index page, are Perl-based, while the patterns on the SharePoint connector page are Java-based. For a complete reference, please see http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html.
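Before you save the configuration, it can help to test candidate include and exclude patterns against a few known URLs. The sketch below is a rough offline check with several assumptions: the function names are hypothetical, Python's re module stands in for java.util.regex (the simple patterns used here are written the same way in both), and patterns are matched anywhere in the URL, which may not be how the connector applies them.

```python
import re


def parse_patterns(text, connector_version="1.1.0"):
    """Split a pattern field into individual regular expressions.

    Version 1.1.0 puts each pattern on its own line; version 1.0 separates
    patterns with commas.
    """
    separator = "," if connector_version == "1.0" else "\n"
    return [p.strip() for p in text.split(separator) if p.strip()]


def in_scope(url, include_text, exclude_text="", connector_version="1.1.0"):
    """Return True if url matches an include pattern and no exclude pattern."""
    includes = parse_patterns(include_text, connector_version)
    excludes = parse_patterns(exclude_text, connector_version)
    return (any(re.search(p, url) for p in includes)
            and not any(re.search(p, url) for p in excludes))


# Hypothetical field contents in the version 1.1.0 format (one per line).
INCLUDE = "^http://www\\.abc\\.com/\n^https://www\\.abc\\.com/sites/secret"
EXCLUDE = "/_layouts/"
```

Because the version 1.0 format is split on commas in this sketch, a pattern that itself contains a comma, such as a{1,3}, would be cut in two.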

Scheduling the Connector

To ensure that the content is traversed on the schedule you require, complete the connector schedule page.

Note that a connector scheduled to run from 12 a.m. to 12 a.m. always runs. Any other schedule with the same beginning and ending time never runs, either for a connector or for the Google Search Appliance's standard crawl function.
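The schedule rule can be written as a small function. This is a model of the rule as stated above, not the scheduler's code: hours are integers from 0 (12 a.m.) to 23, the function name is hypothetical, and the handling of a start time later than the end time is an assumption that the source does not cover.

```python
def runs_at(hour, start_hour, end_hour):
    """Return True if a schedule from start_hour to end_hour covers hour.

    Hours are integers from 0 (12 a.m.) to 23.
    """
    if start_hour == end_hour:
        # 12 a.m. to 12 a.m. means "always"; any other equal pair never runs.
        return start_hour == 0
    if start_hour < end_hour:
        return start_hour <= hour < end_hour
    # Assumption: a start later than the end wraps past midnight.
    return hour >= start_hour or hour < end_hour
```

For example, runs_at(3, 0, 0) is true for every hour, whereas runs_at(3, 5, 5) is false for every hour.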

Restarting the Connector

After you complete configuring the connector on the Admin Console, restart the connector.

Verifying that the Connector is Working

After you restart the connector, verify on the Admin Console that the Google Search Appliance is receiving feeds and verify on the Crawl Diagnostics page that there are indexed URLs.

Forcing a Recrawl of SharePoint Content

Under some circumstances, you might need to force a complete recrawl of content located in SharePoint. For example, you might see feeds fail because of a configuration issue. Use these instructions to force a full recrawl.

To force a recrawl of SharePoint content:

  1. On the connector host, navigate to the location of the connector state file.
    • On Windows, this is C:\Program Files\GoogleConnectors\SharePoint1\Tomcat\webapps\connector-manager\WEB-INF\connectors\sharepoint-connector\Sharepoint Connector Instance Name\.
    • On Linux, this is Tomcat/webapps/connector-manager/WEB-INF/connectors/sharepoint-connector/Sharepoint Connector Instance Name/ under the connector installation directory.
  2. Delete the Sharepoint_state.xml file.
  3. Restart the SharePoint connector.

    The connector traverses the content again and generates new feeds.
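If you force recrawls often, the file deletion step can be scripted. The sketch below removes the state file from a connector instance directory that you pass in. The function name is hypothetical, and the script does not stop or restart the connector, which you still do separately.

```python
import os


def remove_state_file(instance_dir, filename="Sharepoint_state.xml"):
    """Delete the connector state file if present; return True if it was removed."""
    path = os.path.join(instance_dir, filename)
    if os.path.isfile(path):
        os.remove(path)
        return True
    return False
```

A return value of False means that no state file was found in the directory, which usually indicates that the path does not point at the connector instance directory.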

Troubleshooting

This section provides information on troubleshooting the Google Enterprise Connector for Microsoft SharePoint Server 2007 or 2003. This section includes the following topics:

Logging

Logging is a useful technique for recording information about how your installation is operating. You can use the information logged for troubleshooting the operations of the connector, the Google Search Appliance, and Microsoft SharePoint Server.

The connector manager and connectors use the java.util.logging package for logging. The installer installs a logging mechanism for the connector and starts the logging process automatically. The default logging configuration is defined in the logging.properties file.

To customize the configuration, navigate to
connectors_root_dir/connector_name/Tomcat/webapps/connector-manager/WEB-INF/classes and edit the logging.properties file there.

The following line in the file sets the default logging level for the SharePoint connector:

com.google.enterprise.connector.sharepoint.level = INFO

This property is in the Global section of the logging.properties file. The logging level of INFO applies to all handlers, for example, to the File and Console handlers, unless you specify a higher level of logging for a particular handler. The value of the property also sets the logging level for all classes inside the package com.google.enterprise.connector.sharepoint.

The default logging level for most packages and output destinations (handlers) is INFO. To enable debugging at a finer level of granularity, you can change the package-specific settings to ALL or FINER. For example, you might change the logging level as follows:

com.google.enterprise.connector.sharepoint.level = ALL

The possible values of the level property are OFF, SEVERE, WARNING, INFO, CONFIG, FINE, FINER, FINEST, and ALL.

Handler settings work together with package-level settings. If you change the logging level for a package, you might also need to change the logging level of the handlers. A handler must be set to a level at least as detailed as the package level; otherwise the handler discards the finer-grained messages.

For example, if you set com.google.enterprise.connector.sharepoint.level to ALL and the FileHandler level is set to INFO, the FileHandler discards every message below INFO, so the additional detail never reaches the log file. In that situation, change the FileHandler logging level to ALL:

java.util.logging.FileHandler.level = ALL
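
The interaction between package and handler levels can be reproduced outside the connector. The sketch below uses only the standard java.util.logging API. The class LevelInterplayDemo and its in-memory handler are hypothetical stand-ins for the connector and its FileHandler, and the code assumes a Java 5 or later JDK on any workstation; it does not need to run on the connector host.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

final class LevelInterplayDemo {
    /** Collects messages in memory; stands in for the FileHandler. */
    static final class CollectingHandler extends Handler {
        final List<String> messages = new ArrayList<String>();

        @Override public void publish(LogRecord record) {
            // isLoggable applies the handler's own level, as the built-in handlers do.
            if (isLoggable(record)) {
                messages.add(record.getLevel() + ": " + record.getMessage());
            }
        }
        @Override public void flush() { }
        @Override public void close() { }
    }

    /** Logs one FINE and one INFO message and returns what the handler kept. */
    static List<String> logWith(Level packageLevel, Level handlerLevel) {
        Logger logger = Logger.getLogger("com.google.enterprise.connector.sharepoint");
        CollectingHandler handler = new CollectingHandler();
        handler.setLevel(handlerLevel);
        logger.setUseParentHandlers(false); // keep the demo output out of the console
        logger.setLevel(packageLevel);
        logger.addHandler(handler);
        try {
            logger.fine("fine-grained detail");
            logger.info("normal operation");
        } finally {
            logger.removeHandler(handler);
        }
        return handler.messages;
    }

    public static void main(String[] args) {
        // Package at ALL, handler at INFO: the FINE message is dropped.
        System.out.println(logWith(Level.ALL, Level.INFO)); // [INFO: normal operation]
        // Package at ALL, handler at ALL: both messages are kept.
        System.out.println(logWith(Level.ALL, Level.ALL));  // [FINE: fine-grained detail, INFO: normal operation]
    }
}
```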

The output from the ConsoleHandler appears in the connectors_root_dir/connector_name/Tomcat/logs directory. On Windows, the output appears in the stdout_date.log file, and on Unix the output appears in the catalina.out file.

The output from the FileHandler appears in the connectors_root_dir/connector_name/Tomcat/logs directory. The output appears in the google-connectors.connector_typesequence.log file, where sequence is a series of numbers starting with 0 and incremented by 1 on each occurrence (0, 1, 2, 3...n).

To log all HTTP communication between the connector and the SharePoint server, use the httpclient.wire log. Enable this log in the logging.properties file only while you debug a problem, because it records a very large amount of data, some of it in binary format.

The default level is set at SEVERE:

httpclient.wire.level = SEVERE

Change the level to ALL:

httpclient.wire.level = ALL

After editing the logging.properties file, restart Tomcat.

Error Messages

This section describes some commonly encountered error messages and their likely solutions.

Crawl URL does not match 'Include URL' patterns or matches 'Do Not Include URL' patterns.

You see this message when a user-provided Crawl URL does not match the patterns specified under "Include URLs Matching the Following Patterns" or matches the patterns specified under "Do Not Include URLs Matching the Following Patterns". Provide patterns that do not conflict, so that the Crawl URL matches an include pattern and no exclude pattern.

Required field not specified.

Fields marked with an asterisk (*) on the Configuring Connector Instances form are required. You must provide appropriate values for these fields.

The Crawl URL must contain a fully qualified domain name. Please check the Crawl URL value.

You must provide a SharePoint Site URL that contains a fully qualified domain name in the SharePoint Site URL field on the Configuring Connector Instances form.
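
A quick pre-check of a URL before you enter it might look like the following sketch. It is a heuristic, not the connector's own validation, which may differ: the class and method names are hypothetical, the host names are placeholders, and a host is treated as fully qualified simply because it contains an interior dot (so an IP address would also pass).

```java
import java.net.URI;

/** Hypothetical pre-check; the connector's own validation may differ. */
final class CrawlUrlCheck {
    /** Heuristic: a host with at least one interior dot is treated as fully qualified. */
    static boolean hasQualifiedHost(String crawlUrl) {
        String host;
        try {
            host = URI.create(crawlUrl).getHost();
        } catch (IllegalArgumentException e) {
            return false; // not a syntactically valid URL
        }
        if (host == null) {
            return false;
        }
        int dot = host.indexOf('.');
        return dot > 0 && dot < host.length() - 1;
    }

    public static void main(String[] args) {
        System.out.println(hasQualifiedHost("http://sharepoint.example.com/sites/team")); // true
        System.out.println(hasQualifiedHost("http://sharepoint/sites/team"));             // false
    }
}
```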

Cannot connect to the given SharePoint Site URL with the supplied Domain/Username/Password. Please re-enter.

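You must provide appropriate values for all the mandatory fields on the Configuring Connector Instances form. Check in particular that the SharePoint Site URL, domain, username, and password are correct.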

Note: All other error messages are recorded in the Tomcat log file.

Crawl Diagnostics Error Message

If there is no robots.txt file or if the robots.txt file is not correctly defined in SharePoint, you see an error message:

Retrying URL: Host unreachable while trying to fetch robots.txt.

To correct the error:

  1. Check whether the robots.txt file exists in the SharePoint root directory.
  2. If there is no robots.txt file there, create one.
  3. Ensure that the robots.txt file is correctly excluded from SharePoint's managed path.

    For instructions, see Using a robots.txt File to Enable or Restrict Crawl.

  4. Ensure that the path to the robots.txt file is defined correctly on the Crawler Access page on the Admin Console.
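
To check the first step from another machine, you can request the robots.txt file directly. The following is a minimal sketch with hypothetical class and method names and a placeholder host name. It assumes that robots.txt is served from the root of the SharePoint web application and is readable without authentication; if your server requires credentials, the status code reflects that instead.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

/** Hypothetical helper for checking that robots.txt is reachable. */
final class RobotsCheck {
    /** Resolves /robots.txt against the scheme, host, and port of the site URL. */
    static URI robotsUrlFor(String siteUrl) {
        return URI.create(siteUrl).resolve("/robots.txt");
    }

    /** Returns the HTTP status code for a GET request to the given URL. */
    static int fetchStatus(URI robots) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) robots.toURL().openConnection();
        conn.setRequestMethod("GET");
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(5000);
        try {
            return conn.getResponseCode();
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java RobotsCheck <sharepoint-site-url>");
            return;
        }
        URI robots = robotsUrlFor(args[0]);
        System.out.println(robots + " -> HTTP " + fetchStatus(robots));
    }
}
```

A 200 status suggests that the file exists and is reachable; a 404 suggests that it is missing or still handled by a SharePoint managed path.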

ProcessNode Error

You might see the following error message on the Crawl Diagnostics page in the Admin Console, where URL is the URL to a graphic file:

ProcessNode: Not match URL patterns, skipping record with URL: URL

Ensure that you have modified the crawl patterns to include graphic formats. For information on including graphic formats, see Configuring the Crawl Patterns.

State File Not Created or javax.xml.transform.TransformerFactoryConfigurationError Error in Log

You might see the following error in the stderr_date.log file:

javax.xml.transform.TransformerFactoryConfigurationError: Provider org.apache.xalan.processor.TransformerFactoryImpl not found

The error means that the connector manager is running with JDK 1.5. The SharePoint connector requires JDK 1.4.2. Under JDK 1.5, the SharePoint connector appears to function, but some functionality fails silently. Another symptom of using the wrong JDK is that the state file is not created in tomcat\webapps\connector-manager\web-inf\connectors\connector_name\connector_name\.

To correct the error:

  1. Delete the connector and connector manager on the Google Search Appliance Admin Console.
  2. Uninstall the connector and connector manager on the connector manager host.
  3. Install JDK 1.4.2 on the connector manager host.
  4. Use the installation instructions in this document to recreate the connector manager and connector.
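
Before you reinstall, or to confirm the fix afterward, you can check which Java version a given JVM reports. The following sketch uses the standard java.version system property; the class and method names are hypothetical. It is only meaningful when you run it with the same java executable that the connector manager's Tomcat uses.

```java
/** Hypothetical check of the running JVM's version against the required 1.4.2 line. */
final class JdkCheck {
    /** True when the version string belongs to the 1.4.2 line, for example "1.4.2_13". */
    static boolean isJdk142(String javaVersion) {
        return javaVersion != null && javaVersion.startsWith("1.4.2");
    }

    public static void main(String[] args) {
        String version = System.getProperty("java.version");
        System.out.println("java.version = " + version
            + (isJdk142(version) ? " (the version the connector requires)"
                                 : " (not 1.4.2; see the steps for correcting the error)"));
    }
}
```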

Related Documentation
