Google Search Appliance software version 6.0
Posted June 2009
This document is for developers who use the Google Search Appliance Feeds Protocol to develop custom feed clients that push content and metadata to the search appliance for processing, indexing, and serving as search results.
To push content to the search appliance, you require a feed and a feed client:
This document explains how feeds work and shows you how to write a basic feed client.
You can use feeds to push data into the index on the search appliance. There are two types of feeds:
Web feeds and content feeds also behave differently when deleting content. See Removing Feed Content from the Index for a description of how content is deleted from each type of feed.
To see an example of a feed, follow the steps in the Quickstart section of this document.
You should design a feed to ensure that your search appliance crawls any documents that require special handling. Consider whether your site includes content that cannot be found through links on crawled web pages, or content that is most useful when it is crawled at a specific time.
Examples of documents that are best pushed using feeds include:
You push the XML to the search appliance using a feed client. You can use one of the feed clients described in this document or write your own. To write your own feed client, you should be familiar with these technologies:
Here are steps for pushing a content feed to the search appliance.
http://www.localhost.test.com
This is the URL for the document defined in sample_feed.xml.
% pushfeed_client.py --datasource="sample" --feedtype="full" --url="http://<APPLIANCE-HOSTNAME>:19900/xmlfeed" --xmlfilename="sample_feed.xml"
http://www.localhost.test.com/ should appear under Crawl Diagnostics within about 15 minutes.
info:http://www.localhost.test.com/
If your system is not busy, the URL should appear in your search results within 30 minutes.
The feed is an XML file that contains the URLs. It may also contain their contents, metadata, and additional information such as the last-modified date. The XML must conform to the schema defined by gsafeed.dtd. This file is also available on your search appliance at http://<APPLIANCE-HOSTNAME>:7800/gsafeed.dtd. Although the Document Type Definition (DTD) defines elements for the data source name and the feed type, these elements are populated when you push the feed to the search appliance. Any datasource or feedtype values that you specify within the XML document are ignored.
An XML feed must be less than 1 GB in size. If your feed is larger than 1 GB, consider breaking the feed into smaller feeds that can be pushed more efficiently.
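The splitting suggested above can be sketched in Python. This is a minimal illustration, not a tool shipped with the search appliance: the function name split_feed and its parameters are hypothetical, and the records are assumed to be already serialized record elements. The pieces default to incremental feeds because a full feed replaces everything previously pushed for the data source.

```python
from xml.sax.saxutils import escape

HEADER = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<!DOCTYPE gsafeed PUBLIC "-//Google//DTD GSA Feeds//EN" "">\n'
    "<gsafeed>\n<header>\n<datasource>{ds}</datasource>\n"
    "<feedtype>{ft}</feedtype>\n</header>\n<group>\n"
)
FOOTER = "</group>\n</gsafeed>\n"


def split_feed(records, datasource, feedtype="incremental",
               max_bytes=1_000_000_000):
    """Yield complete feed documents of at most max_bytes (UTF-8 encoded).

    records is an iterable of already serialized <record> strings.
    """
    header = HEADER.format(ds=escape(datasource), ft=feedtype)
    overhead = len(header.encode("utf-8")) + len(FOOTER.encode("utf-8"))
    batch, size = [], overhead
    for record in records:
        record_size = len(record.encode("utf-8")) + 1  # +1 for the newline
        if overhead + record_size > max_bytes:
            raise ValueError("a single record exceeds the size limit")
        if batch and size + record_size > max_bytes:
            # The current feed is full: emit it and start a new one.
            yield header + "\n".join(batch) + "\n" + FOOTER
            batch, size = [], overhead
        batch.append(record)
        size += record_size
    if batch:
        yield header + "\n".join(batch) + "\n" + FOOTER
```

Each yielded string is a complete feed document that can be pushed on its own, one after another, under the same data source name.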
When you push a feed to the search appliance, the system associates the fed URLs with a data source name, specified by the datasource element in the feed DTD.
If the feed type is metadata-and-url, the system treats the feed as a web feed.
If the data source name is not "web" and the feed type is not metadata-and-url, the system treats the feed as a content feed.
To view all of the feeds for your search appliance, log into the Admin Console and choose Crawl and Index > Feeds. The list shows the date of the most recent push for each data source name, along with whether the feed was successful and how many documents were pushed.
Note: Although you can specify the feed type and data source in the XML file, the values specified in the XML file are currently unused. Instead, the search appliance uses the data source and feed type that are specified during the feed upload step. However, we recommend that you include the data source name and feed type in the XML file for compatibility with future versions.
The feed type determines how the search appliance handles URLs when a new content feed is pushed with an existing data source name.
Content feeds can be full or incremental; a web feed is always incremental. To support feeds that provide only URLs and metadata, you can also set the feed type to metadata-and-url. This is a special feed type that is treated as a web feed.
When the feedtype element is set to full for a content feed, the system deletes all the prior URLs that were associated with the data source. The new feed contents completely replace the prior feed contents. If the feed contains metadata, you must also provide content for each record; a full feed cannot push metadata alone. You can delete all documents in a data source by pushing an empty full feed.
When the feedtype element is set to incremental, the system modifies the URLs that exist in the new feed as specified by the action attribute for the record. URLs from previous feeds remain associated with the content data source. If the record contains metadata, you can incrementally update either the content or the metadata. For example, if your incremental feed specifies metadata for the record without any content information, the metadata will update but the content will remain unchanged; if your incremental feed only specifies content for the record, the content will update but the metadata will remain unchanged. You can only update metadata without content for records that had content in an earlier push.
When the feedtype element is set to metadata-and-url, the system modifies the URLs and metadata that exist in the new feed as specified by the action attribute for the record. URLs and metadata from previous feeds remain associated with the content data source. You can use this feed type even if you do not define any metadata in the feed. The system treats any data source with this feed type as a special kind of web feed and updates the feed incrementally.
Documents that have been fed by using content feeds are specially marked so that the crawler will not attempt to crawl them. To update the document, you need to feed the updated document to the search appliance. Documents fed with web feeds, including metadata-and-url feeds, are recrawled periodically, based on the crawl settings for the search appliance.
Note: The metadata-and-url feed type is one way to provide metadata to the search appliance. A connector can also provide metadata to the search appliance. See the External Metadata Indexing Guide for more information about external metadata. See also Content Feed and Metadata-and-URL Feed in the Connector Developer's Guide.
Incremental feeds generally require fewer system resources than full feeds. A large feed can often be crawled more efficiently if it is divided into smaller incremental feeds.
Here is an example that illustrates the effect of a full feed:
Here is an example that mixes full and incremental feeds:
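As a rough illustration of how full and incremental content feeds interact, the following Python sketch models one content data source as a dictionary that maps URLs to content. It is a toy model of the rules described above, not the search appliance's implementation; the function name apply_feed is hypothetical, and metadata-only updates are ignored.

```python
def apply_feed(index, feedtype, records):
    """Return a new {url: content} mapping after applying one content feed.

    records is a list of (url, action, content) tuples, where action is
    "add" or "delete".
    """
    if feedtype == "full":
        updated = {}            # a full feed replaces all prior URLs
    elif feedtype == "incremental":
        updated = dict(index)   # URLs from earlier feeds are kept
    else:
        raise ValueError("this model covers content feeds only")
    for url, action, content in records:
        if action == "delete":
            updated.pop(url, None)
        else:
            updated[url] = content
    return updated


base = "http://www.corp.enterprise.com/"
index = apply_feed({}, "full", [(base + "hello01", "add", "v1"),
                                (base + "hello02", "add", "v1")])
# An incremental feed updates hello02 and deletes hello01.
index = apply_feed(index, "incremental", [(base + "hello02", "add", "v2"),
                                          (base + "hello01", "delete", None)])
# A later full feed leaves only the URLs that it contains.
index = apply_feed(index, "full", [(base + "hello03", "add", "v1")])
```

After the last call, the model holds only hello03, because the final full feed replaced everything pushed earlier for the data source.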
You include documents in your feed by defining them inside a record element. All records must specify a URL which is used as the unique identifier for the document. If the original document doesn't have a URL, but has some other unique identifier, you must map the document to a unique URL in order to identify it in the feed.
Each record element can specify the following attributes:
url (required) - The URL is the unique identifier for the document. This is the URL used by the search appliance when crawling and indexing the document. All URLs must contain a FQDN (fully qualified domain name) in the host part of the URL. Because the URL is provided as part of an XML document, you must escape any special characters that are reserved in XML. For example, the URL http://www.mydomain.com/bar?a=1&b2 contains an ampersand character and should be rewritten to http://www.mydomain.com/bar?a=1&amp;b2.
displayurl - The URL that should be provided in search results for a document. The search appliance enables use of the display URL only for content feeds and not for web feeds. This attribute is useful for web-enabled content systems where a user expects to obtain a URL with full navigation context and other application-specific data, but where a page does not give the search appliance easy access to the indexable content.
action - Set action to "add" when you want the feed to overwrite and update the contents of a URL. If you don't specify an action, the system performs an add. For content feeds only, set action to "delete" to remove a URL from the index. For more information on how to delete content from a web feed, see Removing Feed Content From the Index.
lock - The lock attribute can be set to true or false (the default is false). When the search appliance reaches its license limit, unlocked documents are deleted to make room for more documents. After all other remedies are tried and if the license is still at its limit, then locked documents are deleted. For more information, see License Limits.
mimetype (required) - This attribute tells the system what kind of content to expect from the content element. All MIME types that can be indexed by the search appliance are supported.
Note: Even though the feeds DTD marks mimetype as required, mimetype is required only for content feeds and is ignored for web and metadata-and-url feeds (even though you are required to specify a value). The search appliance ignores the MIME type in web and metadata-and-URL feeds because the search appliance determines the MIME type when it crawls and indexes a URL.
last-modified - Populate this attribute with the date time format specified in RFC822 (Mon, 15 Nov 2004 04:58:08 GMT). If you do not specify a last-modified date, then the implied value is blank. The system uses the rules specified in the Admin Console under Crawl and Index > Document Dates to choose which date from a document to use in the search results. The document date extraction process runs periodically so there may be a delay between the time a document appears in the results and the time that its date appears.
authmethod - This attribute tells the system how to crawl URLs that are protected by NTLM, HTTP Basic, or Single Sign-on. The authmethod attribute can be set to none, httpbasic, ntlm, or httpsso. By default, it is set to none. If you want to enable crawling for protected documents, see Including Protected Documents in Search Results.
pagerank - This attribute is not yet supported.
The optional group element allows you to apply an action to many records at once. For example, this:
<group action="delete">
<record url="http://www.corp.enterprise.com/hello01"/>
<record url="http://www.corp.enterprise.com/hello02"/>
<record url="http://www.corp.enterprise.com/hello03"/>
</group>
Is equivalent to this:
<record url="http://www.corp.enterprise.com/hello01" action="delete"/>
<record url="http://www.corp.enterprise.com/hello02" action="delete"/>
<record url="http://www.corp.enterprise.com/hello03" action="delete"/>
However, if a record inside a group defines its own action, the record's definition always overrides the group's definition. For example:
<group action="delete">
<record url="http://www.corp.enterprise.com/hello01"/>
<record url="http://www.corp.enterprise.com/hello02" action="add"/>
<record url="http://www.corp.enterprise.com/hello03"/>
</group>
In this example, hello01 and hello03 would be deleted, and hello02 would be updated.
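The override rule can be checked with a short Python sketch that resolves the action that applies to each record in a group. The function name effective_actions is hypothetical; the sketch only reads the action attributes and does not validate the records against the DTD.

```python
import xml.etree.ElementTree as ET


def effective_actions(group_xml):
    """Map each record URL in a <group> to the action that applies to it."""
    group = ET.fromstring(group_xml)
    # The DTD gives the group action a default of "add".
    group_action = group.get("action", "add")
    return {
        record.get("url"): record.get("action", group_action)
        for record in group.iter("record")
    }


GROUP = """
<group action="delete">
<record url="http://www.corp.enterprise.com/hello01"/>
<record url="http://www.corp.enterprise.com/hello02" action="add"/>
<record url="http://www.corp.enterprise.com/hello03"/>
</group>
"""
actions = effective_actions(GROUP)
```

For the group shown above, actions maps hello01 and hello03 to "delete" and hello02 to "add".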
You add document content by placing it inside the record definition for your content feed. For example, using text content:
<record url="..." mimetype="text/plain">
<content>Hello world. Here is some page content.</content>
</record>
You can also define content as HTML:
<record url="..." mimetype="text/html">
<content><![CDATA[<html>
<title>hello world</title>
<body>
<p>Here is some page content.</p>
</body>
</html>]]></content>
</record>
To include non-text documents such as .pdf or .doc files, you must encode the content using base64 encoding. Using base64 encoding ensures that the feed can be parsed as valid XML.
Here is a record definition that includes base64 encoded content:
<record url="..." mimetype="text/plain">
<content encoding="base64binary">Zm9vIGJhcgo</content>
</record>
Because base64 encoding increases the document size by one third, it is often more efficient to include non-text documents as URLs in a web feed. Only contents that are embedded in the XML feed must be encoded; this restriction does not apply to contents that are crawled.
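A record with base64 encoded content could be generated along the following lines. This is a sketch under stated assumptions: make_record is a hypothetical helper, not part of any feed client described in this document. It relies on Python's xml.etree.ElementTree to escape reserved characters in the url attribute, and on email.utils.formatdate to produce the RFC 822 style date used by last-modified.

```python
import base64
import xml.etree.ElementTree as ET
from email.utils import formatdate


def make_record(url, mimetype, data, last_modified=None):
    """Build a <record> element whose content is base64 encoded.

    data is the raw document as bytes; last_modified is an optional
    POSIX timestamp in seconds.
    """
    attrs = {"url": url, "mimetype": mimetype}
    if last_modified is not None:
        attrs["last-modified"] = formatdate(last_modified, usegmt=True)
    record = ET.Element("record", attrs)
    content = ET.SubElement(record, "content", {"encoding": "base64binary"})
    content.text = base64.b64encode(data).decode("ascii")
    return record


record = make_record("http://www.mydomain.com/bar?a=1&b2",
                     "application/pdf", b"%PDF-1.4 example bytes",
                     last_modified=1100494688)
# Serializing escapes the ampersand in the url attribute as &amp;.
xml_text = ET.tostring(record, encoding="unicode")
```

Building the element with an XML library rather than string concatenation means that the ampersand in the URL is escaped during serialization, so the feed stays parseable.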
Metadata can be included in record definitions for different types of feeds.
The following table provides information about incremental web feeds and metadata-and-URL feeds.
| Data Source Name | Feed Type | Push Behavior | Allows Metadata? | Allows Content? |
|---|---|---|---|---|
| web | incremental | incremental | no | no |
| any | metadata-and-url | incremental | yes | no |
The following table provides information about incremental and full content feeds.
| Data Source Name | Feed Type | Push Behavior | Allows Metadata? | Allows Content? |
|---|---|---|---|---|
| any | incremental | incremental | yes | yes |
| any | full | full | yes | yes |
If the metadata is part of a feed, it must have the following format:
<record url="..." ...>
<metadata>
<meta name="..." content="..." />
<meta name="..." content="..." />
</metadata>
...
</record>
Note: The content= attribute cannot be an empty string (""). For more information, see Document Feeds Successfully But Then Fails.
See the External Metadata Indexing Guide for more information about indexing external metadata and examples of metadata feeds.
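Because an empty content attribute is not accepted, a feed generator may want to filter metadata before writing it. The following Python sketch shows one way to do that; make_metadata and its placeholder parameter are hypothetical names chosen for illustration.

```python
import xml.etree.ElementTree as ET


def make_metadata(pairs, placeholder=None):
    """Build a <metadata> element from (name, value) pairs.

    Pairs with an empty value are dropped, or are written with the
    placeholder value when one is given.
    """
    metadata = ET.Element("metadata")
    for name, value in pairs:
        if value == "":
            if placeholder is None:
                continue  # skip the meta tag entirely
            value = placeholder
        ET.SubElement(metadata, "meta", {"name": name, "content": value})
    return metadata


metadata = make_metadata([("Name", "Jenny Wong"), ("Tags", "")])
names = [meta.get("name") for meta in metadata]
```

Here names contains only "Name", because the empty Tags value was dropped instead of being written as content="".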
Feeds can push protected contents to the search appliance. If your feed contains URLs that are protected by NTLM, Basic Authentication, or Forms Authentication (Single Sign-on), the URL record in the feed must specify the correct type of authentication. You must also configure settings in the Admin Console to allow the search appliance to crawl the secured pages.
The authmethod attribute for the record defines the type of authentication. By default, authmethod is set to "none". To enable secure search from a feed, set the authentication attribute for the record to ntlm, httpbasic, or httpsso. For example, to enable authentication for protected files on localhost.test.com via Forms Authentication, you would define the record as:
<record url="http://www.localhost.test.com/" authmethod="httpsso">
To grant the search appliance access to the protected pages in your feed, log into the Admin Console.
For URLs that are protected by NTLM and Basic Authentication, follow these steps:
For URLs that are protected by Single Sign-on, follow these steps:
This is one way of providing access to protected documents. For more information on authentication, refer to the online help that is available in the search appliance's Admin Console, and Managing Search for Controlled-Access Content.
To push records from a database into the search appliance's index, you use a special content feed that is generated by the search appliance based on parameters that you set in the Admin Console. To set up a feed for database content, log into the Admin Console and choose Crawl and Index > Database. You can find more information on how to define a database-driven data source in the online help that is available in the Admin Console, and in the support document Database Crawling and Serving.
Records from a database cannot be served as secure content.
You should save a backup copy of your XML Feed in case you need to push it again. For example, if you perform a version update that requires you to rebuild the index, you must push all your feeds again to restore them to the search appliance. The search appliance does not archive copies of your feeds.
The following file size limitations apply to fed files:
This section describes how to design a feed client. If you don't want to design your own feed client script, you can use one of the following methods to push your feed:
cron job to automate feeds.
Important: The IP address of the computer that hosts the feed client must be in the List of Trusted IP Addresses. In the Admin Console, go to Crawl and Index > Feeds, and scroll down to List of Trusted IP Addresses. Verify that the IP address for your feed client appears in this list.
You upload an XML feed using an HTTP POST to the feedergate server located on port 19900 of your search appliance. Feeds cannot be uploaded using HTTPS. An XML feed must be less than 1 GB in size. If your feed is larger than 1 GB, consider breaking the feed into smaller feeds that can be pushed more efficiently.
The feedergate server requires three input parameters from the POST operation:
datasource specifies the name of the data source. Your choice of data source name also implies the type of feed: for a web feed, the datasource name must be "web".
feedtype specifies how the system pushes the feed. The feedtype value must be "full", "incremental", or "metadata-and-url".
data specifies the XML feed to push with this data source and feed type. Note that although the data parameter may contain a data source and a feed type definition as part of the XML, these will be ignored by the search appliance. Only the data source and feed type provided as POST input parameters are used.
The URL that you should use is:
http://<APPLIANCE-HOSTNAME>:19900/xmlfeed
You should post the feed using enctype="multipart/form-data".
Although the search appliance supports uploads using enctype="application/x-www-form-urlencoded", this encoding type is not recommended for large amounts of data.
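For a scripted client, the multipart/form-data body can be assembled by hand. The following Python sketch is one possible approach, not the sample client referenced in this document: encode_multipart and build_feed_request are hypothetical helpers, and sending each parameter as a plain form field without a filename is an assumption. Note the "--" that precedes every boundary line and that also follows the final boundary.

```python
import uuid
import urllib.request


def encode_multipart(fields, boundary=None):
    """Return (content_type, body) for a multipart/form-data POST.

    fields maps parameter names to str or bytes values.
    """
    boundary = boundary or uuid.uuid4().hex
    marker = b"--" + boundary.encode("ascii")
    lines = []
    for name, value in fields.items():
        if isinstance(value, str):
            value = value.encode("utf-8")
        lines.append(marker)
        lines.append(b'Content-Disposition: form-data; name="'
                     + name.encode("ascii") + b'"')
        lines.append(b"")          # blank line separates headers from value
        lines.append(value)
    lines.append(marker + b"--")   # closing boundary
    lines.append(b"")
    return "multipart/form-data; boundary=" + boundary, b"\r\n".join(lines)


def build_feed_request(host, datasource, feedtype, feed_xml):
    """Build (but do not send) the POST request for the feedergate server."""
    content_type, body = encode_multipart(
        {"datasource": datasource, "feedtype": feedtype, "data": feed_xml})
    return urllib.request.Request(
        "http://%s:19900/xmlfeed" % host, data=body,
        headers={"Content-Type": content_type}, method="POST")
```

The request returned by build_feed_request can be passed to urllib.request.urlopen to perform the push. Keeping the encoding step separate from the network call makes the body easy to inspect when a push fails.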
Here is an example of a simple HTML form for pushing a feed to the search appliance. Because the web form requires user input, this method cannot be automated.
To adapt this form for your search appliance, replace APPLIANCE-HOSTNAME with the fully qualified domain name of your search appliance.
<html>
<head>
<title>Simple form for pushing a feed</title>
</head>
<body>
<h1>Simple form for pushing a feed</h1>
<form enctype="multipart/form-data" method=POST
action="http://<APPLIANCE-HOSTNAME>:19900/xmlfeed">
<p>Name of datasource:
<input type="text" name="datasource">
<br>
(No spaces or non alphanumeric characters)
</p>
<p>Type of feed:
<input type="radio" name="feedtype" value="full" checked>
Full
<input type="radio" name="feedtype" value="incremental">
Incremental
<input type="radio" name="feedtype" value="metadata-and-url">
Metadata and URL
</p>
<p>
XML file to push:
<input type="file" name="data">
</p>
<p>
<input type="submit" value=">Submit<">
</p>
</form>
</body>
</html>
When pushing a feed, the feed client sends the POST data to a search appliance. A typical POST from a scripted feed client appears as follows:
POST /xmlfeed HTTP/1.0
Content-type: multipart/form-data
Content-length: 855
Host: myserver.domain.com:19900
User-agent: Python-urllib/1.15

feedtype=full&datasource=sample&data=%3C%3Fxml+version%3D%221.0%22+encoding%3D%22ISO-8859-1%22%3F%3E%0A%3C%21DOCTYPE+gsafeed+SYSTEM+..
The response from the search appliance is as follows:
HTTP/1.0 200 OK
Content-Type: text/plain
Date: Thu, 30 Apr 2009 23:16:10 GMT
Server: feedergate_1.0
Connection: Close
Content-Length: 7

Success
The success message indicates that the feedergate process has received the XML file successfully. It does not mean that the content will be added to the index, as this is handled asynchronously by a separate process known as the "feeder". The data source will appear in the Feeds page in the Admin Console after the feeder process runs.
The feeder does not provide automatic notification of a feed error. To check for errors, you must log into the Admin Console and check the status on the Crawl and Index > Feeds page. This page shows the last five feeds that have been uploaded for each data source. The timestamp shown is the time at which the feedergate server successfully received the XML file.
You can automate the process of uploading a feed by running your feed client script with a cron job.
URL Patterns and Trusted IP lists defined in the Admin Console ensure that your index only lists content from desirable sources. When pushing URLs with a feed, you must verify that the Admin Console will accept the feed and allow your content through to the index. For a feed to succeed, it must be fed from a trusted IP address and at least one URL in the feed must pass the rules defined on the Admin Console.
URLs specified in the feed will only be crawled if they pass through the patterns specified on the Crawl and Index > Crawl URLs page in the Admin Console.
Patterns affect URLs in your feed as follows:
Entries in duplicate hosts also affect your URL patterns. For example, suppose you have a canonical host of foo.mycompany.com with a duplicate host of bar.mycompany.com. If you exclude bar.mycompany.com from your crawl using patterns, then URLs on both foo.mycompany.com and bar.mycompany.com are removed from the index.
To prevent unauthorized additions to your index, feeds are only accepted from machines that are included in the List of Trusted IP Addresses. To view the list of trusted IP addresses, log into the Admin Console and open the Crawl and Index > Feeds page.
If your search appliance is on a trusted network, you can disable IP address verification by selecting Trust all IP addresses.
For web feeds, the feeder passes the URLs to the crawl manager. The crawl manager adds the URLs to the crawl schedule. URLs are crawled on the schedule specified by the documentation on the continuous crawler.
For content feeds, the content is provided as part of the XML and does not need to be fetched by the crawler. URLs are passed to the server that maintains Crawl Diagnostics in the Admin Console. This will happen within 15 minutes if your system is not busy. The feeder also passes the URLs and their contents to the indexing process. The URLs will appear in your search results within 30 minutes if your system is not busy.
There are several ways of removing content from your index using a feed. The method used to delete content depends on the kind of feed that has ownership.
For content feeds, remove content by performing one of these actions:
For web feeds, remove content by performing one of these actions:
Note: If a URL is referenced by more than one feed, you will have to remove it from the feed that owns it. See the Troubleshooting entry Fed Documents Aren't Updated or Removed as Specified in the Feed XML for more information.
The following factors can cause the feeder to be slow to add URLs to the index:
In general, the search appliance can process documents that are pushed as content feeds more quickly than it can crawl and index the same set of documents as a web feed.
To view a count of how many feed files remain for the search appliance to process into its index, add /getbacklogcount to a search appliance URL at port 19900. The count that this feature provides can be used to regulate the feed submission rate. The count also includes connector feed files.
The syntax for /getbacklogcount is as follows:
http://SearchApplianceHostname:19900/getbacklogcount
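One way to use the backlog count to regulate submissions is to poll it before each push and wait while it is high. The Python sketch below assumes that the response body is a bare integer; that format, the threshold of 50, and the helper names are assumptions made for illustration, not documented behavior.

```python
import time
import urllib.request


def parse_backlog_count(response_text):
    """Parse the response body, assumed here to be a bare integer."""
    return int(response_text.strip())


def fetch_backlog_count(host):
    """Request /getbacklogcount from the search appliance on port 19900."""
    url = "http://%s:19900/getbacklogcount" % host
    with urllib.request.urlopen(url) as response:
        return parse_backlog_count(response.read().decode("ascii"))


def wait_for_capacity(fetch_count, max_backlog=50, poll_seconds=60,
                      sleep=time.sleep):
    """Block until the backlog drops below max_backlog, then return it.

    fetch_count is any callable that returns the current backlog count,
    which keeps the waiting logic testable without a search appliance.
    """
    while True:
        count = fetch_count()
        if count < max_backlog:
            return count
        sleep(poll_seconds)
```

A feed client could call wait_for_capacity(lambda: fetch_backlog_count(host)) before each push, where host is the search appliance hostname.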
You can change the display URL on search results by pushing a metadata-and-url feed with the displayurl attribute set.
Use this feature when the search appliance crawls one URL, but you want the display URL in the search results to appear as a different URL.
The following example changes the display URL from http://me.example.com/myfile.html to http://The_new_URL/newurlpage.html. The example shows a metadata entry on the feed due to its type, metadata-and-url. Any metadata works with this feature.
Without the <metadata> entry, the feed is discarded.
The link also changes to this new display URL.
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE gsafeed PUBLIC "-//Google//DTD GSA Feeds//EN" "">
<gsafeed>
<header>
<datasource>replace</datasource>
<feedtype>metadata-and-url</feedtype>
</header>
<group>
<record url="http://me.example.com/myfile.html?xx=x"
displayurl="http://The_new_URL/newurlpage.html"
action="add" mimetype="text/html" lock="true">
<metadata>
<meta name="displayurl"
content="http://The_new_URL/newurlpage.html"/>
</metadata>
</record>
</group>
</gsafeed>
If your index already contains the maximum number of URLs, or your license limit has been exceeded, then the index is full.
When the index is full, the system reduces the number of indexed documents as follows:
Documents with the lock attribute set to true are deleted last.
To increase the maximum number of URLs in your index, log into the Admin Console and choose Crawl and Index > Host Loads > Maximum Number of URLs to Crawl. This number must be smaller than the license limit for your search appliance. To increase the license limit, contact Sales.
Here are some things to check if a URL from your feed does not appear in the index. To see a list of known and fixed issues, see the latest release notes for each version.
If the feeds status page shows "Failed in error" you can click the link to view the log file.
This message means that your XML file could not be understood. The following are some possible causes of this error:
<!DOCTYPE gsafeed PUBLIC "-//Google//DTD GSA Feeds//EN" "">
If none of the above are the cause of the error, run xmllint against your XML file to check for errors in the XML. The xmllint program is included in Linux distributions as part of the libxml2 package.
The following is an example that shows how you would use xmllint to test a feed named full-feed.xml.
$ xmllint -noout -valid full-feed.xml; echo $?
0
The return code of zero indicates that the document is both valid and well-formed.
If the xmllint command fails and displays the parsing error message, ensure that you have the correct DTD file, or you can remove the -valid flag from the xmllint command line so that the xmllint command doesn't try to validate the XML file's elements. For more information on the DTD, see Google Search Appliance Feed DTD.
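Where xmllint is not available, a rough pre-flight check can be sketched in Python. The check_feed function below is a hypothetical helper that tests well-formedness and a few rules stated in this document (the root element, the allowed feed types, the url attribute, and non-empty meta content). It does not validate against gsafeed.dtd, so it complements xmllint and does not replace it.

```python
import xml.etree.ElementTree as ET

FEED_TYPES = {"full", "incremental", "metadata-and-url"}


def check_feed(xml_text):
    """Return a list of problems; an empty list means none were detected."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as error:
        return ["not well-formed: %s" % error]
    problems = []
    if root.tag != "gsafeed":
        problems.append("root element is <%s>, expected <gsafeed>" % root.tag)
    feedtype = root.findtext("header/feedtype")
    if feedtype not in FEED_TYPES:
        problems.append("unexpected feedtype: %r" % feedtype)
    for record in root.iter("record"):
        if not record.get("url"):
            problems.append("record without a url attribute")
    for meta in root.iter("meta"):
        if not meta.get("content"):
            problems.append("meta %r has an empty content attribute"
                            % meta.get("name"))
    return problems
```

An empty list only means that none of these specific problems were found; the feed can still be rejected for other reasons.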
Before a search appliance can start processing a feed, you need to successfully push a feed to port 19900.
If the feed push is not successful, check the following:
tracepath applianceIP/19900 from the Linux command line.
Some common reasons why the URLs in your feed might not be found in your search results include:
feedtype element set to incremental or full. Incremental can only be used on a content feed. If this is the case, the feed is treated as a content feed and not crawled. Once a URL is part of a content feed, the feed is not recrawled even if you later send a web or metadata feed. If you run into this issue, remove the URL from the URL pattern (or click the Delete link on the feeds page) and after the feed URLs have been deleted, put the URL patterns back, and send a proper metadata-and-url feed.
info:[url] where [url] is the full URL to a document fed into the search appliance. Or use url:[path] where [path] is part of the URL to documents fed into the search appliance.
&access=a" is somewhere in the query URL that you are sending to the search appliance. See Including Protected Documents in Search Results.
A content feed reports success at the feedergate, but thereafter, reports the following document feed error:
Failed in error
documents included: 0
documents in error: 1
error details: Skipping the record, Line number: nn, Error: Element record content does not follow the DTD, Misplaced metadata
This error occurs when a metadata element contains a content attribute with an empty string, for example:
<meta name="Tags" content=""/>
If the content attribute value is an empty string:
Remove the meta tag from the metadata element, or:
Provide a value for the content attribute to show that no value is assigned. Choose a value that is not used in the metadata element, for example, _noname_:
<meta name="Tags" content="_noname_"/>
You can then use the inmeta search keyword to find the attribute value in the fed content, for example:
inmeta:tags~_noname_
All feeds, including database feeds, share the same name space and assume that URLs are unique. If a fed document doesn't seem to behave as directed in your feed XML, check to make sure that the URL isn't duplicated in your other feeds.
When the same URL is fed into the system by more than one data source, the system uses the following rules to determine how that content should be handled:
If a document feed gives a status of "In Progress" for more than one hour, this could mean that an internal error has occurred. Please contact Google to resolve this problem, or you can reset your index by going to Administration > Reset Index.
If you are using Java to develop your feed client, you may encounter the following exception when pushing a feed:
java.net.ConnectException: Connection refused: connect
Although it looks like a TCP error, this error may reveal a problem in parsing the MIME boundary parameter syntax, for example, a missing '--' before the argument. MIME syntax is discussed in more detail here: http://www.w3.org/Protocols/rfc1341/7_2_Multipart.html
Here are some examples that demonstrate how feeds are structured:
<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE gsafeed PUBLIC "-//Google//DTD GSA Feeds//EN" "">
<gsafeed>
<header>
<datasource>web</datasource>
<feedtype>incremental</feedtype>
</header>
<group>
<record url="http://www.corp.enterprise.com/hello02" mimetype="text/plain"/>
</group>
</gsafeed>
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE gsafeed PUBLIC "-//Google//DTD GSA Feeds//EN" "">
<gsafeed>
<header>
<datasource>example3</datasource>
<feedtype>metadata-and-url</feedtype>
</header>
<group>
<record url="http://www.corp.enterprise.com/search/employeesearch.php?q=jwong"
action="add" mimetype="text/html" lock="true">
<metadata>
<meta name="Name" content="Jenny Wong"/>
<meta name="Title" content="Metadata Developer"/>
<meta name="Phone" content="x12345"/>
<meta name="Floor" content="3"/>
<meta name="PhotoURL"
content="http://www/employeedir/engineering/jwong.jpg"/>
<meta name="URL"
content="http://www.corp.enterprise.com/search/employeesearch.php?q=jwong"/>
</metadata>
</record>
</group>
</gsafeed>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE gsafeed PUBLIC "-//Google//DTD GSA Feeds//EN" "">
<gsafeed>
<header>
<datasource>sample</datasource>
<feedtype>full</feedtype>
</header>
<group>
<record url="http://www.corp.enterprise.com/hello01" mimetype="text/plain"
last-modified="Tue, 6 Nov 2007 12:45:26 GMT">
<content>This is hello01</content>
</record>
<record url="http://www.corp.enterprise.com/hello02" mimetype="text/plain"
lock="true">
<content>This is hello02</content>
</record>
<record url="http://www.corp.enterprise.com/hello03" mimetype="text/html">
<content><![CDATA[
<html>
<title>namaste</title>
<body>
This is hello03
</body>
</html>
]]></content>
</record>
<record url="http://www.corp.enterprise.com/hello04" mimetype="text/html">
<content encoding="base64binary">Zm9vIGJhcgo</content>
</record>
</group>
</gsafeed>
<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE gsafeed PUBLIC "-//Google//DTD GSA Feeds//EN" "">
<gsafeed>
<header>
<datasource>hello</datasource>
<feedtype>incremental</feedtype>
</header>
<group action="delete">
<record url="http://www.corp.enterprise.com/hello01"/>
</group>
<group>
<record url="http://www.corp.enterprise.com/hello02" mimetype="text/plain">
<content>UPDATED - This is hello02</content>
</record>
<record url="http://www.corp.enterprise.com/hello03" action="delete"/>
<record url="http://www.corp.enterprise.com/hello04" mimetype="text/plain">
<content>UPDATED - This is hello04</content>
</record>
</group>
</gsafeed>
The gsafeed.dtd file follows. You can view the DTD on your search appliance
by browsing to the http://<APPLIANCE-HOSTNAME>:7800/gsafeed.dtd URL.
<?xml version="1.0" encoding="UTF-8"?>
<!ELEMENT gsafeed (header, group+)>
<!ELEMENT header (datasource, feedtype)>
<!-- datasource name should match the regex [a-zA-Z_][a-zA-Z0-9_-]*,
the first character must be a letter or underscore,
the rest of the characters can be alphanumeric, dash, or underscore. -->
<!ELEMENT datasource (#PCDATA)>
<!-- feedtype must be either 'full', 'incremental', or 'metadata-and-url' -->
<!ELEMENT feedtype (#PCDATA)>
<!-- group element lets you group records together and
specify a common action for them -->
<!ELEMENT group (record*)>
<!-- record element can have attribute that overrides group's element-->
<!ELEMENT record (metadata*,content*)>
<!ELEMENT metadata (meta*)>
<!ELEMENT meta EMPTY>
<!ELEMENT content (#PCDATA)>
<!-- default is 'add' -->
<!-- last-modified date as per RFC822 -->
<!ATTLIST group
action (add|delete) "add"
pagerank CDATA #IMPLIED>
<!ATTLIST record
url CDATA #REQUIRED
displayurl CDATA #IMPLIED
action (add|delete) #IMPLIED
mimetype CDATA #REQUIRED
last-modified CDATA #IMPLIED
lock (true|false) "false"
authmethod (none|httpbasic|ntlm|httpsso) "none"
pagerank CDATA #IMPLIED>
<!ATTLIST meta
name CDATA #REQUIRED
content CDATA #REQUIRED>
<!-- if encoding is specified it must be base64binary as that is the only
binary encoding that is supported -->
<!ATTLIST content encoding (base64binary) #IMPLIED>