Feeds Protocol Developer's Guide

Software Version 4.6.
Posted 2006

This document is for developers who use the Google Search Appliance Feeds Protocol to develop custom feed clients that push content and metadata to the Google Search Appliance for processing, indexing, and serving as search results. Feeds are available in software versions 4.2 and later.

To push content to the Google Search Appliance, you require a feed and a feed client:

This document explains how feeds work and shows you how to write a basic feed client.

Contents

  1. Overview
    1. Why Use Feeds
    2. Writing Your Own Feed Client
  2. Quickstart
  3. Designing an XML Feed
    1. Choosing a Name for the Feed Data Source
    2. Choosing the Feed Type
    3. Defining the XML Record for a Document
    4. Grouping Records Together
    5. Providing Content in the Feed
    6. Adding Metadata Information to a Record
    7. Including Protected Documents in Search Results
    8. Feeding Content from a Database
    9. Saving your XML Feed
  4. Pushing a Feed to the Google Search Appliance
    1. Designing a Feed Client
    2. Using a Web Form Feed Client
    3. How a Feed Client Pushes a Feed
  5. Turning Feed Contents into Search Results
    1. URL Patterns
    2. Trusted IP Lists
    3. Adding Feed Content
    4. Removing Feed Content From the Index
    5. Time Required to Process a Feed
    6. License Limits and Page Rank
  6. Troubleshooting
    1. Error Messages on the Feeds Status Page
    2. Fed Documents Aren't Appearing in Search Results
    3. Fed Documents Aren't Updated or Removed as Specified in the Feed XML
    4. Document Status is Stuck "In Progress"
    5. Feed Client TCP Error
  7. Example Feeds
    1. Web Feed
    2. Web Feed with Metadata
    3. Full Content Feed
    4. Incremental Content Feed
  8. The Google Search Appliance Feed DTD

Overview

You can use feeds to push data into the index on the Google Search Appliance. There are two types of feeds:

Web feeds and content feeds also behave differently when deleting content. See Removing Feed Content from the Index for a description of how content is deleted from each type of feed.

To see an example of a feed, follow the steps in the Quickstart section of this document.

Why Use Feeds

You should design a feed to ensure that your Google Search Appliance crawls any documents that require special handling. Consider whether your site includes content that cannot be found through links on crawled web pages, or content that is most useful when it is crawled at a specific time.

Examples of documents that are best pushed using feeds include:

Writing Your Own Feed Client

You push the XML to the Google Search Appliance using a feed client. You can use one of the feed clients described in this document or write your own. To write your own feed client, you should be familiar with these technologies:

Quickstart

Here are steps for pushing a content feed to the search appliance.

  1. Download sample_feed.xml to your local computer. This is a content feed for a document titled "Fed Document".
  2. In the Admin Console, go to Crawl and Index > Crawl URLs and add this pattern to "Follow and Crawl Only URLs with the Following Patterns":
    http://www.localhost.test.com
    This is the URL for the document defined in sample_feed.xml.
  3. Download pushfeed_client.py to your local computer. This is a feed client script. You must install Python 2.2 or later to run this script.
  4. Configure the Google Search Appliance to accept feeds from your computer. In the Admin Console, go to Google Search Appliance > Crawl and Index > Feeds, and scroll down to List of Trusted IP Addresses. Verify that the IP address of your local computer is trusted.
  5. Run the feed client script with the following arguments (you must change "APPLIANCE-HOSTNAME" to the hostname or IP address of your Google Search Appliance):
    % pushfeed_client.py --datasource="sample" --feedtype="full"
      --url="http://<APPLIANCE-HOSTNAME>:19900/xmlfeed" --xmlfilename="sample_feed.xml"
  6. In the Admin Console, go to Crawl and Index > Feeds. A data source named "sample" should appear within 5 minutes.
  7. The URL http://www.localhost.test.com/ should appear under Crawl Diagnostics within about 15 minutes.
  8. Enter the following as your search query to see the URL in the results:
    info:http://www.localhost.test.com/
    If your system is not busy, the URL should appear in your search results within 30 minutes.

Designing an XML Feed

The feed is an XML file that lists the URLs of the documents to index. It may also contain their contents, metadata, and additional information such as the last-modified date. The XML must conform to the schema defined by gsafeed.dtd. This file is also available on your appliance at http://<APPLIANCE-HOSTNAME>:7800/gsafeed.dtd. Although the DTD defines elements for the data source name and the feed type, these values are taken from the parameters that you supply when you push the feed to the Google Search Appliance; any datasource or feedtype values that you specify within the XML document are ignored.

The maximum size for a feed XML file is 1 GB. Files larger than 1 GB cannot be pushed to the Google Search Appliance.

Choosing a Name for the Feed Data Source

When you push a feed to the Google Search Appliance, the system associates the fed URLs with a data source name, specified by the datasource element in the feed DTD.

To view all of the feeds for your Google Search Appliance, log into the Admin Console and choose Crawl and Index > Feeds. The list shows the date of the most recent push for each data source name, along with whether the feed was successful and how many documents were pushed.

Note: Although you can specify the feed type and data source in the XML file, the values specified in the XML file are currently unused. Instead, the Google Search Appliance uses the data source and feed type that are specified during the feed upload step. However, we recommend that you include the data source name and feed type in the XML file for compatibility with future versions.
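
A feed client can check a data source name before pushing. The following is a minimal sketch that applies the naming rule stated in the comment of the gsafeed.dtd reproduced at the end of this document. The function name is_valid_datasource is hypothetical, and the sketch assumes that the rule in the DTD comment is the one the appliance applies.

```python
import re

# Pattern copied from the comment in gsafeed.dtd: the first character is a
# letter or underscore, the rest are alphanumeric, dash, or underscore.
DATASOURCE_PATTERN = re.compile(r"[a-zA-Z_][a-zA-Z0-9_-]*")


def is_valid_datasource(name: str) -> bool:
    """Return True if name matches the data source naming rule in the DTD."""
    return DATASOURCE_PATTERN.fullmatch(name) is not None
```

For example, is_valid_datasource("sample") is true, while a name that contains a space or starts with a digit is rejected.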

Choosing the Feed Type

The feed type determines how the Google Search Appliance handles URLs when a new content feed is pushed with an existing data source name.

Content feeds can be full or incremental; a web feed is always incremental. To support feeds that provide only URLs and metadata, in versions 4.6.2.S.12 and later you can also set the feed type to metadata-and-url. This is a special feed type that is treated as a web feed.

Full Feeds and Incremental Feeds

Incremental feeds generally require fewer system resources than full feeds. A large feed can often be crawled more efficiently if it is divided into smaller incremental feeds.
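
One way to divide a large feed is to split its records into fixed-size batches and push each batch as a separate incremental feed. The sketch below shows only the batching step. The function name batch_records and the batch size are assumptions; this document does not recommend a particular size.

```python
from typing import Iterator, List, Sequence


def batch_records(records: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of records, each at most batch_size long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(records), batch_size):
        # Each yielded list would become the records of one incremental feed.
        yield list(records[start:start + batch_size])
```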

Here is an example that illustrates the effect of a full feed:

Note: It can take up to 6 hours for documents that are not part of the latest full feed to be removed from the index.

Here is an example that mixes full and incremental feeds:

Defining the XML Record for a Document

You include documents in your feed by defining them inside a record element. All records must specify a URL which is used as the unique identifier for the document. If the original document doesn't have a URL, but has some other unique identifier, you must map the document to a unique URL in order to identify it in the feed.
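
As an illustration of mapping a non-URL identifier to a unique URL, the sketch below percent-encodes the identifier and appends it to a base URL. The base URL and the function name record_url_for are hypothetical; choose a base URL that matches the URL patterns configured on your appliance.

```python
from urllib.parse import quote

# Hypothetical base URL, used here for illustration only.
BASE_URL = "http://www.corp.enterprise.com/docs/"


def record_url_for(doc_id: str, base_url: str = BASE_URL) -> str:
    """Map a document identifier to a unique, URL-safe record URL."""
    # safe="" also percent-encodes "/", so distinct identifiers always
    # produce distinct URLs.
    return base_url + quote(doc_id, safe="")
```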

Each record element can specify the following attributes:

Grouping Records Together

The optional group element allows you to apply an action to many records at once. For example, this:

<group action="delete">
	<record url="http://www.corp.enterprise.com/hello01"/>
	<record url="http://www.corp.enterprise.com/hello02"/>
	<record url="http://www.corp.enterprise.com/hello03"/>
</group>

is equivalent to this:

	<record url="http://www.corp.enterprise.com/hello01" action="delete"/>
	<record url="http://www.corp.enterprise.com/hello02" action="delete"/>
	<record url="http://www.corp.enterprise.com/hello03" action="delete"/>

However, if a record inside a group defines its own action, the record's definition always overrides the group's definition. For example:

<group action="delete">
	<record url="http://www.corp.enterprise.com/hello01"/>
	<record url="http://www.corp.enterprise.com/hello02" action="add"/>
	<record url="http://www.corp.enterprise.com/hello03"/>  
</group>

In this example, hello01 and hello03 would be deleted, and hello02 would be updated.
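
The override rule can be modeled in a few lines. This is a sketch of the rule as described above, not the appliance's implementation; the function name effective_action is hypothetical, and the fallback to "add" follows the default declared for the group action in the DTD.

```python
from typing import Optional


def effective_action(group_action: Optional[str], record_action: Optional[str]) -> str:
    """Resolve the action that applies to a record inside a group."""
    if record_action is not None:
        return record_action  # the record's own attribute wins
    if group_action is not None:
        return group_action   # otherwise the record inherits from the group
    return "add"              # DTD default when neither is specified
```

Applied to the example above, hello01 and hello03 resolve to "delete" and hello02 resolves to "add".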

Providing Content in the Feed

You add document content by placing it inside the record definition for your content feed. For example, using text content:

<record url="..." mimetype="text/plain">
	<content>Hello world. Here is some page content. </content>
</record>

You can also define content as HTML:

<record url="..." mimetype="text/html">
	<content><![CDATA[<html> <title>hello world</title> 
<body> <p> Here is some page content. </p>
</body> </html>]]></content>
</record>

To include non-text documents such as .pdf or .doc files, you must encode the content using base64 encoding. Using base64 encoding ensures that the feed can be parsed as valid XML.

Here is a record definition that includes base64 encoded content:

<record url="..." mimetype="text/plain">
	<content encoding="base64binary">Zm9vIGJhcgo=</content>
</record>

Because base64 encoding increases the document size by one third, it is often more efficient to include non-text documents as URLs in a web feed. Only contents that are embedded in the XML feed must be encoded; this restriction does not apply to contents that are crawled.
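
A feed generator might produce such a record as follows. This is a minimal sketch using Python's standard base64 module; the function name base64_record is hypothetical. Note that the standard encoder pads its output with "=" characters where needed.

```python
import base64
from xml.sax.saxutils import quoteattr


def base64_record(url: str, mimetype: str, payload: bytes) -> str:
    """Return a record element whose content is the base64-encoded payload."""
    encoded = base64.b64encode(payload).decode("ascii")
    # quoteattr escapes XML special characters and adds the surrounding quotes.
    return (
        "<record url=%s mimetype=%s>\n"
        '\t<content encoding="base64binary">%s</content>\n'
        "</record>" % (quoteattr(url), quoteattr(mimetype), encoded)
    )
```

The size overhead is easy to confirm: every 3 bytes of input become 4 characters of output, so 300 bytes of content encode to 400 characters.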

Adding Metadata Information to a Record

Metadata can be included in record definitions for web, metadata-and-url, and content feeds, as described in the following table:

              data source name   feed type          push behavior   allows metadata?   allows content?
web feed      web                incremental        incremental     no                 no
              any                metadata-and-url   incremental     yes                no
content feed  any                incremental        incremental     yes                yes
              any                full               full            yes                yes

If the metadata is part of a feed, it must have the following format:

<record url="..." ...>
  <metadata>
    <meta name="..." content="..." />
    <meta name="..." content="..." />
  </metadata>
...
</record>

See the External Metadata Indexing Guide for more information about indexing external metadata and examples of metadata feeds.
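
The metadata format shown above can be generated with Python's standard xml.etree.ElementTree module, which takes care of escaping attribute values. This is a sketch only; the function name metadata_record is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def metadata_record(url: str, mimetype: str, meta: Dict[str, str]) -> str:
    """Build a record element with one meta element per name/content pair."""
    record = ET.Element("record", {"url": url, "mimetype": mimetype})
    metadata = ET.SubElement(record, "metadata")
    for name, content in meta.items():
        ET.SubElement(metadata, "meta", {"name": name, "content": content})
    # encoding="unicode" makes tostring return a str instead of bytes.
    return ET.tostring(record, encoding="unicode")
```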

Including Protected Documents in Search Results

Feeds can push protected contents to the Google Search Appliance. If your feed contains URLs that are protected by NTLM, Basic Authentication, or Single Sign-on, the URL record in the feed must specify the correct type of authentication. You must also configure settings in the Admin Console to allow the Google Search Appliance to crawl the secured pages.

The authmethod attribute for the record defines the type of authentication. By default, authmethod is set to "none". To enable secure search from a feed, set the authmethod attribute for the record to ntlm, httpbasic, or httpsso. For example, to enable authentication for protected files on localhost.test.com via Single Sign-on, you would define the record as:

<record url="http://www.localhost.test.com/" authmethod="httpsso">

To grant the Google Search Appliance access to the protected pages in your feed, log into the Admin Console.

For URLs that are protected by NTLM and Basic Authentication, follow these steps:

  1. Open Crawl and Index > Crawler Access.
  2. Define a pattern that matches the protected URLs in the feed.
  3. Enter a username and password that will allow the crawler access to the protected contents. For contents on a Microsoft IIS server, you may also need to specify a domain.
  4. The Make Public check box controls whether the Google Search Appliance checks for valid authentication credentials before including protected contents in the search results. If you select the Make Public check box, the record is displayed in search results for all users. If you leave it cleared, the record is shown only when the user has valid authentication credentials; users who do not have access to the protected content will not see it in their search results. By default, search results are protected.

For URLs that are protected by Single Sign-on, follow these steps:

  1. Open Crawl and Index > Forms Authentication.
  2. Under Sample Forms Authentication protected URL, enter the URL of a page in the protected site that will redirect the user to a login form. The login form must not contain JavaScript or frames. If you have more than one login page, create a Forms Authentication rule for each login.
  3. Under URL pattern for this rule, enter a pattern that matches the protected URLs in the feed.
  4. Click Create a New Forms Authentication Rule. In the browser page that opens, use the login form to enter a valid username and password. These credentials allow the crawler access to the protected contents. If the login information is accepted, you should see the protected page that you specified. If you can see the protected URL contents, click the Save Forms Authentication Rule and Close Window button. The Forms Authentication page now displays your rule.
  5. Make any changes to the rule. For example, the Make Public check box controls whether the Google Search Appliance checks for valid authentication credentials before including protected contents in the search results. If you select the Make Public check box, the record is displayed in search results for all users. If you leave it cleared, the record is shown only when the user has valid authentication credentials; users who do not have access to the protected content will not see it in their search results. By default, search results are protected.
  6. When you have finished making changes to the rule, click the Save Forms Authentication Rule Configuration button.

This is one way of providing access to protected documents. For more information on authentication, please refer to the online help that is available in the Google Search Appliance's Admin Console, and the support document Crawling and Serving Secure Content.

Feeding Content from a Database

To push records from a database into the Google Search Appliance's index, you use a special content feed that is generated by the Google Search Appliance based on parameters that you set in the Admin Console. To set up a feed for database content, log into the Admin Console and choose Crawl and Index > Database. You can find more information on how to define a database-driven data source in the online help that is available in the Admin Console, and in the support document Database Crawling and Serving.

Saving your XML Feed

You should save a backup copy of your XML Feed in case you need to push it again. For example, if you perform a version update that requires you to rebuild the index, you must push all your feeds again to restore them to the Google Search Appliance. The Google Search Appliance does not archive copies of your feeds.

Pushing a Feed to the Google Search Appliance

This section describes how to design a feed client. If you don't want to design your own feed client script, you can use one of the following methods to push your feed:

The IP address of the computer that hosts the feed client must be in the List of Trusted IP Addresses. In the Admin Console, go to Google Search Appliance > Crawl and Index > Feeds, and scroll down to List of Trusted IP Addresses. Verify that the IP address for your feed client appears in this list.

Designing a Feed Client

You upload an XML feed using an HTTP POST to the feedergate server located on port 19900 of your Google Search Appliance. Feeds cannot be uploaded using HTTPS. The XML feed must not exceed 1 GB in size. If your feed is larger than 1 GB, break it into smaller feeds, which can also be pushed more efficiently.

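The feedergate server requires three input parameters from the POST operation: datasource (the name of the data source), feedtype (full, incremental, or metadata-and-url), and data (the XML feed to push).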

The URL that you should use is:

http://<APPLIANCE-HOSTNAME>:19900/xmlfeed

You should post the feed using enctype="multipart/form-data". Although the Google Search Appliance supports uploads using enctype="application/x-www-form-urlencoded", this encoding type is not recommended for large amounts of data.
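
The following sketch shows one way a scripted client could assemble such a request with Python's standard library. It is not the pushfeed_client.py script referred to in the Quickstart. The function names encode_multipart and build_feed_request are hypothetical, and the sketch assumes that feedergate accepts plain form-data parts named datasource, feedtype, and data; this document does not state whether the data part also needs a filename parameter.

```python
import urllib.request
import uuid
from typing import Dict, Tuple


def encode_multipart(fields: Dict[str, bytes]) -> Tuple[str, bytes]:
    """Encode form fields as multipart/form-data.

    Returns the Content-Type header value and the request body.
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        header = '--%s\r\nContent-Disposition: form-data; name="%s"\r\n\r\n' % (
            boundary, name)
        parts.append(header.encode("ascii") + value + b"\r\n")
    # The closing delimiter carries a trailing "--" after the boundary.
    parts.append(("--%s--\r\n" % boundary).encode("ascii"))
    return "multipart/form-data; boundary=" + boundary, b"".join(parts)


def build_feed_request(host: str, datasource: str, feedtype: str,
                       feed_xml: bytes) -> urllib.request.Request:
    """Build (but do not send) the POST request for the feedergate server."""
    content_type, body = encode_multipart({
        "datasource": datasource.encode("ascii"),
        "feedtype": feedtype.encode("ascii"),
        "data": feed_xml,  # sent as bytes so the feed's own encoding is kept
    })
    url = "http://%s:19900/xmlfeed" % host
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST")
```

To send the request, pass the returned object to urllib.request.urlopen. The feed is handled as bytes so that the encoding declared in the XML prolog is not altered in transit.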

Using a Web Form Feed Client

Here is an example of a simple HTML form for pushing a feed to the appliance. Because the web form requires user input, this method cannot be automated.

To adapt this form for your Google Search Appliance, replace APPLIANCE-HOSTNAME with the fully qualified domain name of your appliance.

<html>
  <head>
    <title>Simple form for pushing a feed</title>
  </head>
  <body>
   <h1>Simple form for pushing a feed</h1>
    <form enctype="multipart/form-data" method=POST
         action="http://<APPLIANCE-HOSTNAME>:19900/xmlfeed">
      <p>Name of datasource:
        <input type="text" name="datasource">
        <br>
        (No spaces or non alphanumeric characters)
      </p>
      <p>Type of feed:
        <input type="radio" name="feedtype" value="full" checked>
        Full
        <input type="radio" name="feedtype" value="incremental">
        Incremental
        <input type="radio" name="feedtype" value="metadata-and-url">
        Metadata and URL
      </p>
      <p>
        XML file to push:
        <input type="file" name="data">
      </p>
      <p>
        <input type="submit" value=">Submit<">
      </p>
    </form>
  </body>
</html>

How a Feed Client Pushes a Feed

When pushing a feed, the feed client sends the POST data to the Google Search Appliance. Here is what a typical POST from a scripted feed client looks like:

POST /xmlfeed HTTP/1.0
Content-type: multipart/form-data
Content-length: 855
Host: myserver.domain.com:19900
User-agent: Python-urllib/1.15

feedtype=full&datasource=sample&data=%3C%3Fxml+version%3D%221.0%22+encoding%3D%22ISO-8859-1%22%3F%3E%0A%3C%21DOCTYPE+gsafeed+SYSTEM+..

And here is the response from the appliance:

HTTP/1.0 200 OK
Content-Type: text/plain
Date: Thu, 17 Feb 2005 23:16:10 GMT
Server: feedergate_1.0
Connection: Close
Content-Length: 7

Success

The success message indicates that the feedergate process has received the XML file successfully. It does not mean that the content will be added to the index, as this is handled asynchronously by a separate process known as the "feeder". The data source will appear in the Feeds page in the Admin Console once the feeder process has run.
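
A scripted client can at least confirm that feedergate received the file. The sketch below treats only the documented response (status 200 with the body "Success") as receipt; treating every other response as a failure is an assumption, and the function name feed_received is hypothetical.

```python
def feed_received(status_code: int, body: str) -> bool:
    """Return True if the response matches the documented success reply."""
    # This confirms receipt by feedergate only, not that the content
    # was indexed; indexing is handled later by the feeder.
    return status_code == 200 and body.strip() == "Success"
```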

The feeder does not provide automatic notification of a feed error. To check for errors, you must log into the Admin Console and check the status on the Crawl and Index > Feeds page. This page shows the last five feeds that have been uploaded for each data source. The timestamp shown is the time at which the feedergate server successfully received the XML file.

You can automate the process of uploading a feed by running your feed client script with a cron job.

Turning Feed Contents Into Search Results

URL Patterns and Trusted IP lists defined in the Admin Console ensure that your index only lists content from desirable sources. When pushing URLs with a feed, you must verify that the Admin Console will accept the feed and allow your content through to the index. For a feed to succeed, it must be fed from a trusted IP address and at least one URL in the feed must pass the rules defined on the Admin Console.

URL Patterns

URLs specified in the feed will only be crawled if they pass through the patterns specified on the Crawl and Index > Crawl URLs page in the Admin Console.

Patterns affect URLs in your feed as follows:

Entries in duplicate hosts also affect your URL patterns. For example, suppose you have a canonical host of foo.mycompany.com with a duplicate host of bar.mycompany.com. If you exclude bar.mycompany.com from your crawl using patterns, then URLs on both foo.mycompany.com and bar.mycompany.com are removed from the index.

Trusted IP Lists

To prevent unauthorized additions to your index, feeds are only accepted from machines that are included in the List of Trusted IP Addresses. To view the list of trusted IP addresses, log into the Admin Console and open the Crawl and Index > Feeds page.

If your Google Search Appliance is on a trusted network, you can disable IP address verification by selecting Trust all IP addresses.

Adding Feed Content

For web feeds, the feeder passes the URLs to the crawl manager. The crawl manager adds the URLs to the crawl schedule. URLs are crawled on the schedule specified by the documentation on the continuous crawler.

For content feeds, the content is provided as part of the XML and does not need to be fetched by the crawler. URLs are passed to the server that maintains Crawl Diagnostics in the Admin Console. This will happen within 15 minutes if your system is not busy. The feeder also passes the URLs and their contents to the indexing process. The URLs will appear in your search results within 30 minutes if your system is not busy.

Removing Feed Content From the Index

There are several ways of removing content from your index using a feed. The method used to delete content depends on the kind of feed that has ownership.

For content feeds, remove content by performing one of these actions:

For web feeds, remove content by performing one of these actions:

Note: If a URL is referenced by more than one feed, you will have to remove it from the feed that owns it. See the Troubleshooting entry Fed Documents Aren't Updated or Removed as Specified in the Feed XML for more information.

Time Required to Process a Feed

The following factors can cause the feeder to be slow to add URLs to the index:

In general, the Google Search Appliance can process documents that are pushed as content feeds more quickly than it can crawl and index the same set of documents as a web feed.

License Limits and Page Rank

If your index already contains the maximum number of URLs, or your license limit has been exceeded, then the index is full. When the index is full, the system will try to reduce the number of indexed documents as follows:

Page Rank is assigned to fed pages as follows:

Increasing the Maximum Number of URLs to Crawl

To increase the maximum number of URLs in your index, log into the Admin Console and choose Crawl and Index > Host Loads > Maximum Number of URLs to Crawl. This number must be smaller than the license limit for your Google Search Appliance. To increase the license limit, contact Sales.

Troubleshooting

Here are some things to check if a URL from your feed does not appear in the index. To see a list of known and fixed issues, see the latest release notes for each version on the support site.

Error Messages on the Feeds Status Page

If the Feeds status page shows "Failed in error", you can click the link to view the log file.

ProcessFeed: parsing error

This message means that your XML file could not be understood. Here are some possible causes of this error:

If none of the above are the cause of the error, run xmllint against your XML file to check for errors in the XML. The xmllint program is included in Linux distributions as part of the libxml2 package.

Here is an example that shows how you would use xmllint to test a feed named full-feed.xml.

$ xmllint -noout -valid full-feed.xml; echo $?
0

The return code of zero indicates that the document is both valid and well-formed.
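
If xmllint is not available, a feed client can run a weaker check before pushing. The sketch below uses Python's standard xml.etree.ElementTree parser and tests well-formedness only; unlike the xmllint command above, it does not validate the feed against gsafeed.dtd. The function name well_formedness_error is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def well_formedness_error(feed_xml: bytes) -> Optional[str]:
    """Return a parser error message, or None if the XML is well-formed."""
    try:
        ET.fromstring(feed_xml)
    except ET.ParseError as err:
        return str(err)
    return None
```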

Fed Documents Aren't Appearing in Search Results

Some common reasons why the URLs in your feed might not be found in your search results include:

Fed Documents Aren't Updated or Removed as Specified in the Feed XML

All feeds, including database feeds, share the same name space and assume that URLs are unique. If a fed document doesn't seem to behave as directed in your feed XML, check to make sure that the URL isn't duplicated in your other feeds.

When the same URL is fed into the system by more than one data source, the system uses the following rules to determine how that content should be handled:

Document Status is Stuck "In Progress"

If a document feed gives a status of "In Progress" for more than one hour, this could mean that an internal error has occurred. Please contact Google to resolve this problem, or you can reset your index by going to Administration > Reset Index.

Feed Client TCP Error

If you are using Java to develop your feed client, you may encounter the following exception when pushing a feed:

java.net.ConnectException: Connection refused: connect

Although it looks like a TCP error, this error may reveal a problem in parsing the MIME boundary parameter syntax, for example, a missing '--' before the argument. MIME syntax is discussed in more detail here: http://www.w3.org/Protocols/rfc1341/7_2_Multipart.html

Example Feeds

Here are some examples that demonstrate how feeds are structured:

Web Feed

<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE gsafeed PUBLIC "-//Google//DTD GSA Feeds//EN" "">
<gsafeed>
<header>
<datasource>web</datasource>
<feedtype>incremental</feedtype>
</header>
<group>
<record url="http://www.corp.enterprise.com/hello02" mimetype="text/plain"/>
</group>
</gsafeed>

Web Feed with Metadata

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE gsafeed PUBLIC "-//Google//DTD GSA Feeds//EN" "">
<gsafeed>
<header>
<datasource>example3</datasource>
<feedtype>metadata-and-url</feedtype>
</header>
<group>
<record url="http://www.corp.enterprise.com/search/employeesearch.php?q=jwong" action="add" mimetype="text/html" lock="true">
<metadata>
<meta name="Name" content="Jenny Wong"/>
<meta name="Title" content="Metadata Developer"/>
<meta name="Phone" content="x12345"/>
<meta name="Floor" content="3"/>
<meta name="PhotoURL" content="http://www/employeedir/engineering/jwong.jpg"/>
<meta name="URL" content="http://www.corp.enterprise.com/search/employeesearch.php?q=jwong"/>
</metadata>
</record>
</group>
</gsafeed>

Full Content Feed

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE gsafeed PUBLIC "-//Google//DTD GSA Feeds//EN" "">
<gsafeed>
<header>
<datasource>sample</datasource>
<feedtype>full</feedtype>
</header>
<group>
<record url="http://www.corp.enterprise.com/hello01" mimetype="text/plain"
last-modified="Tue, 15 Nov 1994 12:45:26 GMT">
<content>This is hello01</content>
</record>
<record url="http://www.corp.enterprise.com/hello02" mimetype="text/plain"
lock="true">
<content>This is hello02</content>
</record>
<record url="http://www.corp.enterprise.com/hello03" mimetype="text/html">
<content><![CDATA[
<html>
<title>namaste</title>
<body>
This is hello03
</body>
</html>
]]></content>
</record>
<record url="http://www.corp.enterprise.com/hello04.html" mimetype="text/html">
<content encoding="base64binary">Zm9vIGJhcgo=</content>
</record>
</group>
</gsafeed>
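
The last-modified attribute in the first record above uses the date format that the DTD describes as RFC 822. The sketch below produces a string in that form with Python's standard email.utils module. The function name last_modified_value is hypothetical, and the input is assumed to be a timezone-aware datetime.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def last_modified_value(moment: datetime) -> str:
    """Format a timezone-aware datetime as an RFC 822 style GMT date string."""
    # format_datetime accepts usegmt=True only for UTC datetimes,
    # so convert to UTC first.
    return format_datetime(moment.astimezone(timezone.utc), usegmt=True)
```

For example, the UTC datetime 1994-11-15 12:45:26 is formatted as "Tue, 15 Nov 1994 12:45:26 GMT", which matches the value used for hello01.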

Incremental Content Feed

<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE gsafeed PUBLIC "-//Google//DTD GSA Feeds//EN" "">
<gsafeed>
<header>
<datasource>hello</datasource>
<feedtype>incremental</feedtype>
</header>
<group action="delete">
<record url="http://www.corp.enterprise.com/hello01"/>
</group>
<group>
<record url="http://www.corp.enterprise.com/hello02" mimetype="text/plain">
<content>UPDATED - This is hello02</content>
</record>
<record url="http://www.corp.enterprise.com/hello03" action="delete"/>
<record url="http://www.corp.enterprise.com/hello04" mimetype="text/plain">
<content>UPDATED - This is hello04</content>
</record>
</group>
</gsafeed>

The Google Search Appliance Feed DTD

The gsafeed.dtd reproduced below was current when this document was last updated. Please refer to http://<APPLIANCE-HOSTNAME>:7800/gsafeed.dtd for the version used by your Google Search Appliance.

<?xml version="1.0" encoding="UTF-8"?>
<!ELEMENT gsafeed (header, group+)>
<!ELEMENT header (datasource, feedtype)>
<!-- datasource name should match the regex [a-zA-Z_][a-zA-Z0-9_-]*,
     the first character must be a letter or underscore,
     the rest of the characters can be alphanumeric, dash, or underscore. -->
<!ELEMENT datasource (#PCDATA)>
<!-- feedtype must be either 'full', 'incremental', or 'metadata-and-url' -->
<!ELEMENT feedtype (#PCDATA)>
<!-- group element lets you group records together and
     specify a common action for them -->
<!ELEMENT group (record*)>
<!-- record element can have attribute that overrides group's element-->
<!ELEMENT record (metadata*,content*)>
<!ELEMENT metadata (meta*)>
<!ELEMENT meta EMPTY>
<!ELEMENT content (#PCDATA)>
<!-- default is 'add' -->
<!-- last-modified date as per RFC822 -->
<!ATTLIST group action (add|delete) "add">
<!ATTLIST record
     url CDATA #REQUIRED
     action (add|delete) #IMPLIED
     mimetype CDATA #REQUIRED
     last-modified CDATA #IMPLIED
     lock (true|false) "false"
     authmethod (none|httpbasic|ntlm|httpsso) "none">
<!ATTLIST meta
     name CDATA #REQUIRED
     content CDATA #REQUIRED>
<!-- if encoding is specified it must be base64binary as that is the only
     binary encoding that is supported -->
<!ATTLIST content encoding (base64binary) #IMPLIED>
 
