Google Search Appliance software version 5.0
Connector software versions 1.0 and 1.1.0
Posted October 2007
Revised December 2007: Added support for Microsoft SharePoint Portal Server 2003 (SPS 2003) and Microsoft Windows SharePoint Services 2.0 (WSS 2.0), added logging information, expanded connector configuration section.
Revised January 2008: Updates and corrections
Revised March 2008: Updates to manual installation section; correction to Indexing SharePoint Sites That Have Many Subsites; corrections to the section on upgrading
Revised June 2008: Clarified user privileges for user running the installer
This document is for Google Search Appliance administrators who want to set up and manage the Google Enterprise Connector for Microsoft SharePoint. Read this document and the following related documents:
This document gives an overview of how connectors work and describes the configuration steps common to all the connectors.
These pages describe connector independent configuration parameters available in the Connector Administration pages of the admin console.
This document is for SharePoint Server administrators and administrators who install and configure the Google Search Appliance. If you are working with the Google Enterprise Connector for Microsoft SharePoint and you are not familiar with SharePoint, work closely with a SharePoint system administrator to determine the correct values for installing and configuring the connector.
The Google Enterprise Connector for Microsoft SharePoint enables the Google Search Appliance to traverse documents and attachments on SharePoint sites. Instances of the connector fetch metadata and URLs for SharePoint documents and attachments using SharePoint Web services and direct the data to the Google Search Appliance as a metadata and URL feed.
This connector is supported on the following SharePoint versions:
The Google Enterprise Connector for Microsoft SharePoint Server 2003 and Microsoft SharePoint Server 2007 is supported on the following operating systems:
The SharePoint connector is based on metadata and URL feeds. The connector sends URLs and related metadata to the Google Search Appliance through the connector manager. Those URLs are then crawled and indexed by the Google Search Appliance.
When you configure a SharePoint connector, you designate a particular SharePoint site or subsite as a crawl URL. The connector uses that site or subsite URL as a starting point, then traverses sites under the instance and sites whose links are discovered in the SharePoint content. The links can be to sites on hosts other than the host on which the initial SharePoint instance is located.
The connector identifies sites on other hosts as SharePoint sites by calling the appropriate web services. When you configure the connector, you provide URL patterns that define locations the connector must traverse and locations the connector is prohibited from traversing. Use these patterns to include your company's domains and to exclude sites you do not control or do not want traversed.
Because of a known limitation in SharePoint, the connector cannot traverse SharePoint applications (root sites) that have more than one thousand (1,000) subsites. If you need to traverse a SharePoint application that has more than one thousand subsites, you must use version 1.1.2 of the SharePoint connector.
The Microsoft SharePoint connector can traverse different types of content, depending on the SharePoint version.
Regardless of which SharePoint version you use, the Google Search Appliance excludes images and graphics from indexing by default. When the Google Search Appliance tries to index SharePoint Picture Libraries, you see error messages on the Crawl and Index > Feeds page on the Admin Console. See Configuring the Crawl Patterns for instructions on including image and graphics files in the index.
Under Microsoft SharePoint Portal Server 2003 and Microsoft Windows SharePoint Services 2.0, the connector can traverse the following types of content:
Under Microsoft Office SharePoint Server 2007 and Microsoft Windows SharePoint Services 3.0, the connector can traverse the following types of content:
The two sections that follow describe how security is supported during the serve process and during the traversal and indexing processes.
At serve time, the Google Search Appliance supports document-level authorization of each search user. Content in a SharePoint repository can be served as secure or public content.
The value of the Make Public check box on the Admin Console determines whether content is secure or public.
You can provide single sign-on (SSO) capabilities using the Google SAML Bridge for Windows. For more information, see Enabling Windows Integrated Authentication.
The Microsoft SharePoint connector and the Google Search Appliance require user credentials for traversal and indexing. Google recommends that you use a single user account for both. Any account you use for traversal and indexing must have Site Collection Administrator privileges in SharePoint and must be a member of the Windows local administrator group on the SharePoint host.
You provide the credentials on the connector configuration page.
You must provide a domain name or host name in order to enable Windows NT LAN Manager (NTLM) or HTTP Basic authentication.
The SharePoint Site Alias Host name is similar to the Alternative Access Mapping in SharePoint. Both features allow multiple entry points to a particular web application, for example, a SharePoint instance used internally by one group of users and externally by partners and other trusted individuals. The entry points for the internal and external users are different URLs. In such a case, the connector uses the internal URL to traverse the SharePoint content, but the Google Search Appliance uses the external URL to crawl and serve the content. The host alias is defined for the appliance, for the crawl and serve processes, and the Crawl URL is defined for the connector to traverse.
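The host substitution described above can be illustrated with a small sketch. It is not the connector's own code: the class name, the method name, and the internal host internal.acme.com are hypothetical, and the sketch assumes the alias port is always written into the resulting URL.

```java
/** Illustrative sketch only; not taken from the connector source. */
public class HostAlias {

    /**
     * Replaces the host and port of an absolute URL with the alias host
     * and alias port, keeping the scheme and path unchanged.
     */
    public static String applyAlias(String url, String aliasHost, int aliasPort) {
        int schemeEnd = url.indexOf("://");
        if (schemeEnd < 0) {
            throw new IllegalArgumentException("Not an absolute URL: " + url);
        }
        // The authority (host and optional port) ends at the first slash after "://".
        int pathStart = url.indexOf('/', schemeEnd + 3);
        String path = (pathStart < 0) ? "/" : url.substring(pathStart);
        return url.substring(0, schemeEnd) + "://" + aliasHost + ":" + aliasPort + path;
    }

    public static void main(String[] args) {
        // Hypothetical internal Crawl URL and external alias.
        System.out.println(applyAlias(
                "http://internal.acme.com/sites/hr/default.aspx",
                "external.acme.com", 8080));
    }
}
```

In this sketch the connector would traverse the internal URL, while the URL produced by applyAlias is the one the Google Search Appliance would crawl and serve.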
This section contains instructions for configuring SharePoint for the connector.
The Google Search Appliance cannot crawl the content unless a robots.txt file is present in the SharePoint site's root directory. Ensure that you create a robots.txt file and ensure that the file is public.
If you are preventing the Google Search Appliance from crawling particular content, put the correct URLs in the robots.txt file.
If you are allowing the Google Search Appliance to crawl all content, ensure that the robots.txt file includes the following syntax:
User-agent: *
Disallow:
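To see why an empty Disallow line permits everything, the following sketch evaluates a robots.txt body against a path. It is a deliberately simplified reading of the format, assuming only User-agent and Disallow fields and plain prefix matching; it ignores Allow lines, wildcards, and agent-specific groups, and it is not how the Google Search Appliance itself parses the file. The class and method names are hypothetical.

```java
/** Simplified robots.txt check for the "User-agent: *" group only. */
public class RobotsCheck {

    /** Returns true if a non-empty Disallow prefix in the "*" group matches the path. */
    public static boolean isDisallowed(String robotsTxt, String path) {
        boolean inStarGroup = false;
        boolean lastWasAgent = false;
        String[] lines = robotsTxt.split("\r?\n");
        for (String raw : lines) {
            // Strip comments, then split "field: value".
            int hash = raw.indexOf('#');
            String line = (hash >= 0) ? raw.substring(0, hash) : raw;
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;
            }
            String field = line.substring(0, colon).trim();
            String value = line.substring(colon + 1).trim();
            if (field.equalsIgnoreCase("User-agent")) {
                // Consecutive User-agent lines belong to the same group.
                inStarGroup = value.equals("*") || (lastWasAgent && inStarGroup);
                lastWasAgent = true;
            } else {
                lastWasAgent = false;
                // An empty Disallow value blocks nothing.
                if (field.equalsIgnoreCase("Disallow") && inStarGroup
                        && value.length() > 0 && path.startsWith(value)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String allowAll = "User-agent: *\nDisallow:";
        System.out.println(isDisallowed(allowAll, "/sites/hr/default.aspx"));
    }
}
```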
When you create a robots.txt file, you must define a managed path for the file in SharePoint. The managed path is configured differently, depending on which SharePoint version you are using.
To define a managed path under SharePoint Portal Server 2003 and Windows SharePoint Services 2.0:
/robots.txt
A browser window appears and shows a 404 error. This is the correct behavior.
The robots.txt file is added to the excluded paths.
/robots.txt
The contents of the robots.txt file are now displayed.
If you see the error Retrying URL: Host unreachable while trying to fetch the robots.txt file, open the Google Search Appliance Admin Console and refresh the crawl from the Crawl and Index > Freshness Tuning page.
After some time passes, you see the SharePoint site content crawled. Note that the Freshness Tuning page is available only if the Google Search Appliance is in continuous crawl mode.
To define a managed path under SharePoint Server 2007 and Windows SharePoint Services 3.0:
/robots.txt
If you see the error Retrying URL: Host unreachable while trying to fetch the robots.txt file, open the Google Search Appliance Admin Console and refresh the crawl from the Crawl and Index > Freshness Tuning page.
After some time passes, you see the SharePoint site content crawled. Note that the Freshness Tuning page is available only if the Google Search Appliance is in continuous crawl mode.
This section applies only to Microsoft Office SharePoint Server 2007 and Microsoft Windows SharePoint Services 3.0.
The Google Search Appliance can crawl content only if URLs contain fully qualified host names. By default, Microsoft SharePoint Server 2007 uses short names for user access to SharePoint sites. If your SharePoint sites are configured with short names, URLs are sent to the Google Search Appliance with short names. The Google Search Appliance cannot process these short URLs. This section tells you how to configure MOSS to use fully-qualified host names.
To configure SharePoint sites to use fully-qualified domain names:
The Alternate Access Mappings dialog box displays several internal URLs for the SharePoint site and the admin site. The default settings are short URLs. If you type a fully qualified host name in the browser bar, you are redirected to a short name. For example, if you type http://moss_host1.yourdomain.com/, you are redirected to http://moss_host1/Default.aspx.
For example, change http://moss_host1/ to http://moss_host1.yourdomain.com/.
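A quick way to spot short-name URLs before they reach the Google Search Appliance is to test whether the host part contains a dot. The sketch below does this with plain string handling. The class and method names are hypothetical, and the dot test is an assumption that also accepts IP addresses and does not handle user information in the URL.

```java
/** Illustrative check for fully qualified host names in URLs. */
public class QualifiedHostCheck {

    /** Extracts the host between "://" and the next '/', ':', '?', or '#'. */
    public static String hostOf(String url) {
        int schemeEnd = url.indexOf("://");
        if (schemeEnd < 0) {
            return "";
        }
        int start = schemeEnd + 3;
        int end = start;
        while (end < url.length() && "/:?#".indexOf(url.charAt(end)) < 0) {
            end++;
        }
        return url.substring(start, end);
    }

    /** Assumption: a host is treated as fully qualified when it contains a dot. */
    public static boolean hasQualifiedHost(String url) {
        return hostOf(url).indexOf('.') > 0;
    }

    public static void main(String[] args) {
        System.out.println(hasQualifiedHost("http://moss_host1/Default.aspx"));
        System.out.println(hasQualifiedHost("http://moss_host1.yourdomain.com/"));
    }
}
```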
This section describes the installation prerequisites and installation process for the Google Enterprise Connector for Microsoft SharePoint Server 2007 and Microsoft SharePoint Server 2003.
You can install the connector using an installer that installs Apache Tomcat, a connector manager, and the connector, or you can install the three software components manually. Google recommends that you use the installer unless you are building the connector manager or connector from the source code or you are installing a patch release that is not packaged with an installer. Complete the prerequisites whether you use the installer or install the connector manually.
If you are running version 1.0 of the connector, you cannot upgrade directly to version 1.1.0 using the installer. Instead, use the instructions in Administering Connectors to uninstall the existing connector and install the new connector.
Before installing the connector using the installer, ensure that Java Development Kit (JDK) 1.4.2 is installed on the host where you are installing the connector.
The instructions that follow are in two parts. In the first part, you download and uncompress the installer package. In the second, you install the software on the connector host.
The user running the installer must have the following user privileges on the connector host:
To download and uncompress the installation package:
./GCI.bin LAX_VM/java_location_to_java
for example, ./GCI.bin LAX_VM /usr/java/j2sdk1.4.2_15/bin/java
-i console argument appended.
To install the connector and its supporting software:
The Licence Agreement panel appears.
The checkbox Start SharePoint Connector Service after installation determines whether the connector service starts automatically on completion of the installation or must be started manually. It is checked by default.
The connector requires JDK 1.4.2.
An informational panel indicates that the connector installation is in progress. When the installation process is finished, a panel indicates that installation is complete.
Apache Tomcat starts and deploys the connector manager and connector.
You can choose the commands to stop the service or the console on the same menu.
./Start_SharePoint_Connector_Console
./Stop_SharePoint_Connector_Console
You need to install the connector manually only if you have built and installed a customized connector manager or a customized version of the connector or if you are installing a patch release that is not packaged with an installer. Otherwise, Google strongly recommends that you use the installer.
Before installing the connector, ensure that the following tasks have been performed:
$CATALINA_HOME. Follow the installation instructions provided by Apache.
To install the connector manually on Apache Tomcat, follow the instructions given below:
connector-sharepoint.jar file from the root directory to the $CATALINA_HOME/webapps/connector-manager/WEB-INF/lib directory.
java.util.logging.FileHandler.pattern=C:/Program Files/Apache Software Foundation/Tomcat 5.5/logs/google-connectors.sharepoint%g.log
Note that the forward slashes are the correct syntax.
java.util.logging.FileHandler.pattern = /root/Tomcat 5.5/logs/google-connectors.sharepoint%g.log
-Djava.util.logging.manager=java.util.logging.LogManager
-Djava.util.logging.config.file=Catalina_home_path\webapps\connector-manager\WEB-INF\classes\logging.properties
if [ -r "$CATALINA_HOME"/bin/tomcat-juli.jar ]; then
JAVA_OPTS="$JAVA_OPTS "-Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager" "-Djava.util.logging.config.file="$CATALINA_BASE/conf/logging.properties"
JAVA_OPTS="$JAVA_OPTS "-Djava.util.logging.manager=java.util.logging.LogManager" "-Djava.util.logging.config.file="$CATALINA_BASE/webapps/connector-manager/WEB-INF/classes/logging.properties"
This section describes tasks you must perform on the Google Search Appliance Admin Console to configure the connector and the crawl patterns required by the connector.
Ensure that you complete all of the tasks described in the following sections:
Use the instructions in Administering Connectors to register the newly-installed connector manager on the Admin Console.
The SharePoint connector uses a metadata and URL feed. After the connector traverses the SharePoint sites, the Google Search Appliance crawls and indexes the content on the SharePoint sites. See How SharePoint Sites are Indexed for a complete description of the process. You must therefore enter URL patterns in the Follow and Crawl only URLs with the Following Patterns and Do Not Crawl URLs with the Following Patterns fields on the Crawl and Index page on the Admin Console.
The SharePoint Site Alias Host feature affects the patterns that you enter. The Google Search Appliance uses the alias for crawling and indexing, so you must enter patterns for the alias host and alias port in the fields on the Crawl and Index page. If the alias port is 80, you must make a double entry in the Follow and Crawl only URLs with the Following Patterns field. For example, if the host is external.acme.com, you would enter both of the following URLs:
http://external.acme.com/
http://external.acme.com:80/
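The double entry for port 80 can be generated mechanically. The following sketch builds the pattern list for an alias host and port. The class and method names are hypothetical, and the single-entry form returned for ports other than 80 is an assumption, since the text above only describes the port 80 case.

```java
/** Illustrative helper that lists crawl patterns for an alias host. */
public class AliasCrawlPatterns {

    /** Returns the URL patterns to enter for the given alias host and port. */
    public static String[] followPatterns(String aliasHost, int aliasPort) {
        if (aliasPort == 80) {
            // Port 80 needs both the implicit and the explicit form.
            return new String[] {
                "http://" + aliasHost + "/",
                "http://" + aliasHost + ":80/"
            };
        }
        // Assumption: other ports need only the explicit form.
        return new String[] { "http://" + aliasHost + ":" + aliasPort + "/" };
    }

    public static void main(String[] args) {
        String[] patterns = followPatterns("external.acme.com", 80);
        for (String pattern : patterns) {
            System.out.println(pattern);
        }
    }
}
```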
To configure the crawl:
For example, you might comment out .jpg$ and .gif$.
The Google Search Appliance must have credentials for crawling the SharePoint content. The patterns you enter here must match what you entered on the connector configuration page. Enter all of the pattern URLs, using regular expression patterns that cover broad sets of hosts on your network. These patterns will probably be the same patterns you defined in the Follow and Crawl only URLs with the Following Patterns field on the Crawl and Index page. Do not enter only the top-level Crawl URL.
To provide user credentials for the crawler:
You can define a connector instance for each SharePoint subsite or for the main top-level site. A Microsoft SharePoint connector instance traverses the site specified in its SharePoint URL, including any subsites that are located under that site.
To configure an instance of a Microsoft SharePoint connector:
The following table describes the fields that you must complete to configure a Microsoft SharePoint connector:
| Name | Description | Values and Usage |
|---|---|---|
| SharePoint Version | The SharePoint server type that you want the connector to traverse | SharePoint 2007: Microsoft SharePoint Server 2007, applying to MOSS 2007 or WSS 3.0. SharePoint 2003: Microsoft SharePoint Server 2003, applying to SPS 2003 or WSS 2.0. |
| Crawl URL | The URL for the SharePoint site that you want to traverse. | The Google Search Appliance traverses this site and any subsites found under it. This is the Crawl URL you designate on the Connector Configuration page, not the Google Search Appliance Crawl Configuration page. The URL must contain a fully qualified domain name. We recommend that you do not have two connector instances accessing the same SharePoint Crawl URL. |
| SharePoint Site Alias Host Name | A fully-qualified host name that will be used to replace the host name in the Crawl URL | See How Host Aliases are Supported. |
| SharePoint Site Alias Port Number | The port number used with the Alias Host Name. | Optional field. If the default HTTP port is used, enter 80 in this field. |
| Windows Domain | A valid domain name. | The Windows domain where the user is authenticated. If you are using a local (machine) user, provide the machine name or IP address. |
| Username and password | A valid username and password on the SharePoint Server's domain. | The user must have Site Collection Administrator privileges in SharePoint and must be a member of the Windows local administrator group on the SharePoint host. For more information on user permissions, see How Security is Supported. |
| MySite URL (MOSS 2007 only) | URL for the SharePoint MySite that you want to traverse with this connector instance. | The Google Enterprise Connector for Microsoft SharePoint uses the MySite base URL and the credentials you provide to determine the complete MySite URL, then crawls MySite and feeds metadata and URLs to the Google Search Appliance for indexing. For example, if the MySite URL is http://server.domain/personal/administrator/default.aspx, enter http://server.domain. This is an optional field. In SPS 2003, the personal site is deployed on the same port as the portal. Therefore, in connector version 1.1.0, the field is only available when SharePoint 2007 is selected in the SharePoint Version field. |
| Include URLs Matching the Following Patterns | URL patterns that limit the sites that the connector traverses when it follows links and discovers SharePoint sites | Enter regular expressions. Under version 1.1.0, each URL must be on a new line. Under version 1.0, the URLs must be separated by commas. The patterns must include the Crawl URL and must include the MySite URL if you specified MySite. The connector uses these patterns as boundaries when it discovers and traverses SharePoint sites. The connector might discover other sites linked to the SharePoint site defined with the Crawl URL, so the URL patterns you enter here must be broad enough to include those other sites. Although these URL patterns can be regular expressions, the format is slightly different from the regular expression patterns used throughout the Google Search Appliance Admin Console. The regular expression patterns elsewhere, such as on the Crawl and Index page, are Perl-based, while the patterns on the SharePoint connector page are Java-based. For a complete reference, please see http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html. |
| Do Not Include URLs Matching the Following Patterns | URL patterns that exclude particular parts of SharePoint sites that the connector discovers when it follows links during traversal | Optional field. If used, enter regular expressions. Under version 1.1.0, each URL must be on a new line. Under version 1.0, the URLs must be separated by commas. The connector uses these patterns to exclude particular sections of the SharePoint sites that are discovered during traversal. See How SharePoint Sites are Indexed for information on the complete process. Because the SharePoint connector relies on a metadata and URL feed, the Google Search Appliance crawls and indexes SharePoint sites after the URLs are retrieved during traversal. Although these URL patterns can be regular expressions, the format is slightly different from the regular expression patterns used throughout the Google Search Appliance Admin Console. The regular expression patterns elsewhere, such as on the Crawl and Index page, are Perl-based, while the patterns on the SharePoint connector page are Java-based. For a complete reference, please see http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html. |
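
Because the include and exclude fields take Java regular expressions, it can help to try candidate patterns with java.util.regex before entering them. The following is a minimal sketch of how include and exclude patterns might combine. The class name, the sample hosts, and the choice of a partial match (find) rather than a full match are assumptions; the connector's actual matching logic is not shown in this document.

```java
import java.util.regex.Pattern;

/** Illustrative only: not the connector's actual matching code. */
public class UrlPatternBoundary {

    /** Returns true if any pattern is found anywhere in the URL. */
    public static boolean matchesAny(String url, String[] patterns) {
        for (String pattern : patterns) {
            if (Pattern.compile(pattern).matcher(url).find()) {
                return true;
            }
        }
        return false;
    }

    /** A URL is in bounds when it matches an include pattern and no exclude pattern. */
    public static boolean inBounds(String url, String[] include, String[] exclude) {
        return matchesAny(url, include) && !matchesAny(url, exclude);
    }

    public static void main(String[] args) {
        // Hypothetical patterns: one company host included, its archive excluded.
        String[] include = { "^http://sharepoint\\.example\\.com/" };
        String[] exclude = { "/archive/" };
        System.out.println(inBounds("http://sharepoint.example.com/sites/hr/", include, exclude));
        System.out.println(inBounds("http://sharepoint.example.com/archive/2006/", include, exclude));
        System.out.println(inBounds("http://other.example.org/", include, exclude));
    }
}
```

Note that the dots in the host name are escaped; an unescaped dot in a Java regular expression matches any character.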
To ensure that the content is traversed on the schedule you require, complete the connector schedule page.
Note that a connector scheduled to run from 12 a.m. to 12 a.m. always runs. Any other schedule with the same beginning and ending time never runs, either for a connector or for the Google Search Appliance's standard crawl function.
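The scheduling rule in the note can be expressed as a small predicate. This is a sketch under stated assumptions, not the scheduler's implementation: hours are numbered 0 to 23 with 0 standing for 12 a.m., a window is treated as running from the start hour up to but not including the end hour, and windows that cross midnight are not modeled. The class and method names are hypothetical.

```java
/** Illustrative model of the schedule rule; not the scheduler's own code. */
public class ScheduleWindow {

    /** Returns true if a window from startHour to endHour covers the given hour. */
    public static boolean runsAt(int startHour, int endHour, int hour) {
        if (startHour == endHour) {
            // 12 a.m. to 12 a.m. always runs; any other identical pair never runs.
            return startHour == 0;
        }
        // An end time of 12 a.m. means the end of the day.
        int end = (endHour == 0) ? 24 : endHour;
        // Assumption: half-open window within a single day.
        return hour >= startHour && hour < end;
    }

    public static void main(String[] args) {
        System.out.println(runsAt(0, 0, 13));
        System.out.println(runsAt(5, 5, 5));
        System.out.println(runsAt(9, 17, 12));
    }
}
```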
After you complete configuring the connector on the Admin Console, restart the connector.
./Stop_SharePoint_Connector_Console
./Start_SharePoint_Connector_Console
After you restart the connector, verify on the Admin Console that the Google Search Appliance is receiving feeds and verify on the Crawl Diagnostics page that there are indexed URLs.
Under some circumstances, you might need to force a complete recrawl of content located in SharePoint. For example, you might see feeds fail because of a configuration issue. Use these instructions to force a full recrawl.
To force a recrawl of SharePoint content:
The connector traverses the content again and generates new feeds.
This section provides information on troubleshooting the Google Enterprise Connector for Microsoft SharePoint Server 2007 or 2003. This section includes the following topics:
Logging is a useful technique for recording information about how your installation is operating. You can use the information logged for troubleshooting the operations of the connector, the Google Search Appliance, and Microsoft SharePoint Server.
The connector manager and connectors use the java.util.logging package for logging. The installer installs a logging mechanism for the connector and starts the logging process automatically. The default logging configuration is defined in the logging.properties file.
To customize the configuration, navigate to
connectors_root_dir/connector_name/Tomcat/webapps/connector-manager/WEB-INF/classes
and edit the logging.properties file there.
The following line in the file sets the default logging level for the SharePoint connector:
com.google.enterprise.connector.sharepoint.level = INFO
This property is in the Global section of the logging.properties file. The logging
level of INFO applies to all handlers, for example, to the File and Console handlers,
unless you specify a higher level of logging for a particular handler. The value
of the property also sets the logging level for all classes inside the package
com.google.enterprise.connector.sharepoint.
The default logging level for most packages and output destinations (handlers)
is INFO. To enable debugging at a finer level of granularity, you
can change the package-specific settings to ALL or FINER. For example, you might
change the logging level as follows:
com.google.enterprise.connector.sharepoint.level = ALL
The possible values of the level property are OFF, SEVERE, WARNING, INFO, CONFIG, FINE, FINER, FINEST,
and ALL.
Particular handler settings work together with package-level settings. If you change the logging level for a package, you might need to change the logging levels at the handler level. A handler writes only messages at or above its own level, so the handler level must be at least as detailed as the package level.
For example, if you set com.google.enterprise.connector.sharepoint.level
to ALL while the FileHandler level is still set to INFO, messages finer
than INFO never reach the log file, because the handler discards any
message below its own level. In that situation, change the
FileHandler logging level to ALL:
java.util.logging.FileHandler.level = ALL
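The interaction between package level and handler level can be reproduced with java.util.logging directly. The sketch below is illustrative: CollectingHandler is a hypothetical in-memory stand-in for FileHandler, and the logged messages are invented. It sets the package logger to ALL and shows which messages a handler accepts at INFO and at ALL.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

public class HandlerLevelDemo {

    /** In-memory stand-in for FileHandler; keeps only the records it accepts. */
    static class CollectingHandler extends Handler {
        final List<String> messages = new ArrayList<String>();

        @Override
        public void publish(LogRecord record) {
            // isLoggable applies the handler's own level, as FileHandler does.
            if (isLoggable(record)) {
                messages.add(record.getLevel() + ": " + record.getMessage());
            }
        }

        @Override
        public void flush() { }

        @Override
        public void close() { }
    }

    /** Logs one FINE and one INFO message and returns what the handler kept. */
    public static List<String> logWithHandlerLevel(Level handlerLevel) {
        Logger logger = Logger.getLogger("com.google.enterprise.connector.sharepoint");
        logger.setLevel(Level.ALL);          // package level set to ALL
        logger.setUseParentHandlers(false);  // keep demo output off the console
        CollectingHandler handler = new CollectingHandler();
        handler.setLevel(handlerLevel);
        logger.addHandler(handler);
        try {
            logger.fine("traversal detail");
            logger.info("traversal summary");
        } finally {
            logger.removeHandler(handler);
        }
        return handler.messages;
    }

    public static void main(String[] args) {
        System.out.println(logWithHandlerLevel(Level.INFO)); // FINE message is dropped
        System.out.println(logWithHandlerLevel(Level.ALL));  // both messages are kept
    }
}
```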
The output from the ConsoleHandler appears in the connectors_root_dir/connector_name/Tomcat/logs
directory. On Windows, the output appears in the stdout_date.log file,
and on Unix the output appears in the catalina.out file.
The output from the FileHandler appears in the connectors_root_dir/connector_name/Tomcat/logs
directory. The output appears in the google-connectors.connector_typesequence.log
file, where sequence is a series of numbers starting with 0 and incremented
by 1 on each occurrence (0, 1, 2, 3...n).
To log all HTTP communications between the connector and the SharePoint server, use the httpclient.wire log. Enable this log in the logging.properties file only when you are debugging a problem, because it records a very large amount of data, some of it in binary format.
The default level is set at SEVERE:
httpclient.wire.level = SEVERE
Change the level to ALL:
httpclient.wire.level = ALL
After editing the logging.properties file, restart Tomcat.
This section describes some commonly encountered error messages and their likely solutions.
You see this message when a user-provided Crawl URL does not match patterns specified under "Include URLs Matching the Following Patterns" or matches patterns specified under "Do Not Include URLs Matching the Following Patterns". The administrator should provide non-conflicting patterns for Include URLs Matching the Following Patterns and Do Not Include URLs Matching the Following Patterns.
Fields marked with an asterisk (*) on the Configuring Connector Instances form are required. You must provide appropriate values for these fields.
You must provide the appropriate SharePoint site URL, with a fully qualified domain name, in the SharePoint Site URL field on the Configuring Connector Instances form.
You must provide appropriate values for all the mandatory fields on the Configuring Connector Instances form.
Note: All other error messages are available in the Tomcat log file.
If there is no robots.txt file or if the robots.txt file is not correctly defined in SharePoint, you see an error message:
Retrying URL: Host unreachable while trying to fetch robots.txt.
To correct the error:
For instructions, see Using a robots.txt File to Enable or Restrict Crawl.
You might see the following error message on the Crawl Diagnostics page in the Admin Console, where URL is the URL to a graphic file:
ProcessNode: Not match URL patterns, skipping record with URL: URL
Ensure that you have modified the crawl patterns to include graphic formats. For information on including graphic formats, see Configuring the Crawl Patterns.
You might see the following error in the stderr_date log:
javax.xml.transform.TransformerFactoryConfigurationError: Provider org.apache.xalan.processor.TransformerFactoryImpl not found
The error means that the connector manager is running with JDK 1.5. The SharePoint connector requires JDK 1.4.2. Under JDK 1.5, the SharePoint connector appears to function, but some functionality fails silently. Another symptom of using the wrong JDK is that the state file is not created in tomcat\webapps\connector-manager\web-inf\connectors\connector_name\connector_name\.
To correct the error: