Google Search Appliance software version 5.0
Google Mini software version 5.0
Posted: October 2005
Revised: April 2006
Issue Fixes: February 2009
This guide is for developers and administrators of the Google Search Appliance who have documents with metadata that is not stored directly in the primary document. A primary document is a record, file, or web page that the Google Search Appliance treats as a document to index or serve. The guide explains how to use the external metadata indexing capabilities of the Google Search Appliance, either through the Feeds system or the Database Crawler. You should be familiar with the Feeds system and the Database Crawler before you read this guide.
The Google Search Appliance indexes metadata stored in documents and makes that data available for retrieval at search time. Metadata is data that describes other data. It can provide information that improves the quality of your search results. For example, an HTML document can hold metadata in the <meta> tag to describe the author or keywords for the document. Similarly, Microsoft Office files such as Word documents or Excel spreadsheets often contain metadata fields, such as Title, Subject, Author, Date, and many others.
From the perspective of the Google Search Appliance, there are two primary types of metadata:
Metadata that is stored directly in the primary document, for example in an HTML <meta> tag.
External metadata, which is not stored in the primary document. You can configure the Google Search Appliance to index this external metadata and the primary document as a single record.
Because the Google Search Appliance automatically indexes metadata that is stored directly in a primary document, this guide describes how to index metadata that is not stored in the primary document. In this guide, external metadata refers to metadata that is not stored directly in the primary document. The primary document is defined as a record, web page, or any of the over 200 different file types that can be indexed by the Google Search Appliance and are acquired through the web crawler, database crawler, file system crawler, or feeder system. For more information, see Indexable File Formats.
When you index external metadata, it is searchable in the same way that other metadata is searchable. For example, you can use the partialfields and requiredfields query operators to search for documents with particular metadata. For more information about metadata queries and query operators, see the Search Protocol Guide.
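As a rough illustration of a metadata query, the following Python sketch builds a search URL that carries the requiredfields or partialfields parameter. The host name search.example.com and the helper build_search_url are hypothetical, and the name:value form of the parameter is an assumption to check against the Search Protocol Guide.

```python
from urllib.parse import urlencode

# Hypothetical appliance host; replace with your own search appliance.
SEARCH_BASE = "http://search.example.com/search"


def build_search_url(query, requiredfields=None, partialfields=None):
    """Return a search URL that optionally filters results on metadata."""
    params = {"q": query}
    if requiredfields:
        params["requiredfields"] = requiredfields
    if partialfields:
        params["partialfields"] = partialfields
    # urlencode percent-encodes reserved characters such as ':'.
    return SEARCH_BASE + "?" + urlencode(params)


# Assumed name:value syntax; see the Search Protocol Guide for the full grammar.
url = build_search_url("budget", requiredfields="department:engineering")
print(url)
```

A real deployment would typically add further search parameters, which this sketch leaves out.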
To implement external metadata indexing, you need to identify where your metadata is and describe to the Google Search Appliance how it relates to a primary document. Essentially, you need to answer the following questions:
Your answers to these questions will determine which method of external metadata indexing you should use. The methods for external metadata indexing can be grouped into two main categories:
Within each of these categories, there are a variety of indexing scenarios. The scenario that you should use depends on how your primary document is referenced and stored.
There are three scenarios for indexing external metadata that is stored in a database, depending on how your primary document is referenced and stored. For each of these scenarios, the Google Search Appliance indexes a meta name for each field in your crawl query and meta content for the value in that field.
If you want to use an alias for a field name, you can use the SQL keyword AS in the crawl query to give the field a more meaningful name. For example, if your database has an auth field, you might prefer to give the field an alias of author, because author is a more common search term. Your users could then search for this document by adding &requiredfields=author to their search query URL. Creating aliases for obscurely named fields is especially useful if you want to collect values for the requiredfields or partialfields parameters from your end users.
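The effect of an alias can be seen outside the appliance with a small Python sketch. It uses an in-memory SQLite database as a stand-in for the real one, so the docs table, its columns, and the row_to_meta helper are all hypothetical. It only mimics the described mapping of each field in the crawl query to a meta name and its value to meta content; it is not the appliance's own implementation.

```python
import sqlite3

# In-memory stand-in for the real database; table and values are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (url TEXT, auth TEXT)")
conn.execute(
    "INSERT INTO docs VALUES (?, ?)",
    ("http://www.corp.enterprise.com/hello01", "Jones"),
)

# The crawl query renames the obscure 'auth' column to 'author'.
cursor = conn.execute("SELECT url, auth AS author FROM docs")
columns = [description[0] for description in cursor.description]


def row_to_meta(columns, row):
    """Pair each result column name with its value, as meta name/content."""
    return [{"name": c, "content": str(v)} for c, v in zip(columns, row)]


meta_rows = [row_to_meta(columns, row) for row in cursor.fetchall()]
print(meta_rows[0])
conn.close()
```

Because the result set reports the alias rather than the original column name, the metadata name that users search on is author, not auth.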
When using the SQL keyword AS to create an alias for a field, use the alias (instead of the original field name) for the following fields on the Crawl and Index > Databases page:
Metadata: Stored in a database.
Primary Document: A valid URL stored in a single field in the same database that references the primary document.
If your external metadata is stored in a relational database and one of the fields contains fully qualified URLs that reference primary documents, use the following steps to enable external metadata indexing:
In this scenario, the Google Search Appliance queries the database for data, then submits a feed with the resulting rows. The search appliance crawls and indexes the set of records that is defined by the crawl query. The URLs extracted from each external metadata record (as defined by the URL field) are added to the crawl queue and crawled by either the web or file system crawler, following the normal crawl policy. When the primary document is crawled, the contents of the primary document and the external metadata are merged into a single record, which is identified by the URL of the primary document in the search appliance index. After the primary document and the external metadata are indexed, the primary document is returned as a search result when search users query for terms in the external metadata or the primary document.
When you re-index your primary document and external metadata, the replacement behavior for this scenario is the same as for a metadata-and-URL feed. For information about metadata-and-URL feeds, see the Feeds Protocol Developer's Guide.
Metadata: Stored in a database.
Primary Document: A pointer to the primary document needs to be constructed from a base URL and a database value.
This scenario is very similar to the first one, except the URL is constructed from a base URL and a document ID.
If your external metadata is stored in a relational database and the URLs that reference primary documents can be constructed by combining a base URL string and a database field, use the following steps to enable external metadata indexing. The database field usually represents a unique document ID number that, when inserted into a base URL string, references a specific document on a web server or file system. For example, suppose that your primary documents are accessible from a URL of the following form:
http://cmsystem.acme.corp.com:6502/getdoc?action=get&docid=4662118437
In the example, the number 4662118437 (the value of the docid parameter) represents a unique document ID stored as a field in the database. You can configure the Google Search Appliance to crawl the metadata and construct URLs that reference primary documents by inserting values from one of the database fields into the base URL.
In the base URL string, the {docid} tag marks where the document ID is inserted. If the document ID in the preceding example were stored in a field called uniqueID, the Document ID Field would be uniqueID and the base URL string would be: http://cmsystem.acme.corp.com:6502/getdoc?action=get&docid={docid}
In this scenario, the Google Search Appliance queries the database for data, then submits a feed with the resulting rows. The search appliance extracts and indexes the record set that is defined by the crawl query. The URLs constructed from each external metadata record (as defined in the Document ID Field and the Base URL field) are added to the crawl queue and crawled by either the web or file system crawler, following the normal crawl policy. When the primary document is crawled, the contents of the primary document and the external metadata are merged into a single record identified by the URL of the primary document in the search appliance index. After the primary document and the external metadata are indexed, the primary document is returned as a search result when search users query for terms or keywords in the metadata or the primary document.
When you re-index your primary document and external metadata, the replacement behavior for this scenario is the same as for a metadata-and-URL feed. For information about metadata-and-URL feeds, see the Feeds Protocol Developer's Guide.
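The URL construction in this scenario amounts to a string substitution. The Python sketch below shows the idea, assuming a base URL that follows the form of the example document URL. The build_document_url helper and the sample row (including its author value) are hypothetical, and the appliance performs this step internally.

```python
# Base URL modeled on the example document URL, with the {docid} tag
# marking where the document ID is inserted.
BASE_URL = "http://cmsystem.acme.corp.com:6502/getdoc?action=get&docid={docid}"


def build_document_url(base_url, doc_id):
    """Insert a document ID into a base URL that contains the {docid} tag."""
    if "{docid}" not in base_url:
        raise ValueError("base URL must contain the {docid} tag")
    return base_url.replace("{docid}", str(doc_id))


# Hypothetical database rows; uniqueID plays the role of the Document ID Field.
rows = [{"uniqueID": 4662118437, "author": "Jones"}]
urls = [build_document_url(BASE_URL, row["uniqueID"]) for row in rows]
print(urls[0])
```

Each constructed URL is what would then be queued for the web or file system crawler.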
Metadata: Stored in a database.
Primary Document: The primary document is also stored in the database, as a BLOB.
If your external metadata is stored in a relational database and the primary document is also stored in the database as a BLOB (Binary Large OBject), do the following:
SELECT employee_id, dept FROM employee WHERE employee_id = ? and dept = ?
employee_id, dept
In Scenario 3, the Google Search Appliance queries the database for data, then submits a feed with the resulting rows. The search appliance crawls and indexes the set of records that is defined by the crawl query. The specified BLOBs are pushed in a full content feed and are not crawled.
When you re-index your primary document and external metadata, the replacement behavior for this scenario is the same as for a full feed. For information about full feeds, see the Feeds Protocol Developer's Guide.
The remaining scenarios use feeds. Feeds work well when the external metadata is not stored in a relational database, the primary document is not accessible by the search appliance's crawlers, or the reference between the external metadata and the primary document is not easily expressed. You can use a feeds-based solution in any of these cases or any case where you prefer using feeds to implementing the database scenarios.
Feeds are described in general in the Feeds Protocol Developer's Guide. You should be familiar with those concepts before you index external metadata by using feeds.
There are two types of feeds:
The primary document can be pushed in the feed as a content feed or referenced as a web feed. The following scenarios detail how to index external metadata for primary documents that are pushed as content feeds or web feeds.
Metadata: Inserted into the feed XML file.
Primary Document: Inserted into the feed XML file (content feed).
In this scenario, you need to write a script or code that generates the feed XML file and then push the feed XML file to the Google Search Appliance. Use the following steps:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE gsafeed PUBLIC "-//Google//DTD GSA Feeds//EN" "">
<gsafeed>
<header>
<datasource>sample</datasource>
<feedtype>full</feedtype>
</header>
Add a <record> element for each primary document. In the <metadata> element, insert one or more <meta> elements, as shown in the following example:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE gsafeed PUBLIC "-//Google//DTD GSA Feeds//EN" "">
<gsafeed>
<header>
<datasource>sample</datasource>
<feedtype>full</feedtype>
</header>
<group>
<record url="http://www.corp.enterprise.com/hello01" mimetype="text/plain"
last-modified="Tue, 15 Nov 1994 12:45:26 GMT">
<metadata>
<meta name="author" content="Jones"/>
<meta name="project" content="hello01"/>
<meta name="department" content="engineering"/>
</metadata>
Insert the content of the primary document in the <content> element of the record, as shown in the following example:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE gsafeed PUBLIC "-//Google//DTD GSA Feeds//EN" "">
<gsafeed>
<header>
<datasource>sample</datasource>
<feedtype>full</feedtype>
</header>
<group>
<record url="http://www.corp.enterprise.com/hello01" mimetype="text/plain"
last-modified="Tue, 15 Nov 1994 12:45:26 GMT">
<metadata>
<meta name="author" content="Jones"/>
<meta name="project" content="hello01"/>
<meta name="department" content="engineering"/>
</metadata>
<content> This is hello01 content. </content>
</record>
</group>
</gsafeed>
Note: The previous example uses a full feed (<feedtype>full</feedtype>). With full feeds, fed documents are removed from the index within six hours. For more information, see Removing Feed Content From the Index in the Feeds Protocol Developer's Guide. To keep fed documents from being removed from the index, use an incremental feed by replacing the <feedtype>full</feedtype> element with the <feedtype>incremental</feedtype> element.
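A script that generates such a feed file could look like the following Python sketch. It is a minimal illustration rather than a supported tool: the build_content_feed function and its parameters are hypothetical, it handles a single text record, and it uses an incremental feed as described in the note above.

```python
import xml.etree.ElementTree as ET

DOCTYPE = '<!DOCTYPE gsafeed PUBLIC "-//Google//DTD GSA Feeds//EN" "">'


def build_content_feed(datasource, feedtype, url, metadata, content):
    """Return feed XML for one text document with external metadata."""
    feed = ET.Element("gsafeed")
    header = ET.SubElement(feed, "header")
    ET.SubElement(header, "datasource").text = datasource
    ET.SubElement(header, "feedtype").text = feedtype
    group = ET.SubElement(feed, "group")
    record = ET.SubElement(group, "record", url=url, mimetype="text/plain")
    meta_parent = ET.SubElement(record, "metadata")
    for name, value in metadata.items():
        ET.SubElement(meta_parent, "meta", name=name, content=value)
    # ElementTree escapes characters such as '<' and '&' in the text.
    ET.SubElement(record, "content").text = content
    body = ET.tostring(feed, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + DOCTYPE + "\n" + body


feed_xml = build_content_feed(
    "sample",
    "incremental",
    "http://www.corp.enterprise.com/hello01",
    {"author": "Jones", "project": "hello01", "department": "engineering"},
    "This is hello01 content.",
)
print(feed_xml)
```

When writing the result to a file, encode it as UTF-8 so that it matches the XML declaration. Pushing the file to the appliance is covered in the Feeds Protocol Developer's Guide.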
If the content is text-based content, it can be inserted directly into the feed XML file. If it is non-text content (.pdf, .doc, and other file types), you need to base64-encode the content and set the record's encoding attribute to encoding="base64binary", as described in the Feeds Protocol Developer's Guide.
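For non-text content, the encoding step might be sketched in Python as follows. The binary_content_element helper is hypothetical, and placing the encoding attribute on the <content> element is an assumption about the feed format that should be confirmed in the Feeds Protocol Developer's Guide.

```python
import base64
import xml.etree.ElementTree as ET


def binary_content_element(data):
    """Return a <content> element holding base64-encoded bytes."""
    # Assumption: the encoding attribute belongs on the <content> element.
    content = ET.Element("content", encoding="base64binary")
    content.text = base64.b64encode(data).decode("ascii")
    return content


pdf_bytes = b"%PDF-1.4 sample bytes"  # stand-in for the bytes of a real file
element = binary_content_element(pdf_bytes)
print(ET.tostring(element, encoding="unicode"))
```

In practice the bytes would be read from the document file in binary mode, and the record's mimetype would be set to match the file type.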
When you update a content feed for a primary document, the Google Search Appliance does not automatically update the associated metadata feed in its index unless the corresponding record has a <metadata> section. Similarly, if you update a metadata feed, the Google Search Appliance does not automatically update the associated primary document feed in its index. To update both a content feed and a metadata feed, explicitly push both feeds to the index.
Metadata: Inserted into the feed XML file.
Primary Document: Referenced by the URL in the feed XML file (web feed).
This scenario is similar to the previous scenario, except that the primary document is referenced by URL only (instead of the contents of the primary document being fed to the Google Search Appliance).
The feed file therefore contains the <header> information and, for each <record> element, the URL of the record and the <metadata> elements.
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE gsafeed PUBLIC "-//Google//DTD GSA Feeds//EN" "">
<gsafeed>
<header>
<datasource>sample2</datasource>
<feedtype>metadata-and-url</feedtype>
</header>
Note that the <feedtype> element is metadata-and-url. This tells the web or file system crawler to pick up the URLs for the primary documents and index them accordingly.
Add a <record> element for each primary document. In the <metadata> element, insert one or more <meta> elements, as shown in the following example:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE gsafeed PUBLIC "-//Google//DTD GSA Feeds//EN" "">
<gsafeed>
<header>
<datasource>sample2</datasource>
<feedtype>metadata-and-url</feedtype>
</header>
<group>
<record url="http://www.corp.enterprise.com/hello02" mimetype="text/plain"
last-modified="Tue, 15 Nov 1994 12:45:26 GMT">
<metadata>
<meta name="author" content="Stevens"/>
<meta name="project" content="hello02"/>
<meta name="department" content="HR"/>
</metadata>
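For the URL-only variant, a generating script needs no <content> element at all. The Python sketch below builds a metadata-and-url feed from a list of (URL, metadata) pairs. The build_metadata_and_url_feed function and the shape of its input are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_metadata_and_url_feed(datasource, records):
    """Return a <gsafeed> element with one URL-only record per entry."""
    feed = ET.Element("gsafeed")
    header = ET.SubElement(feed, "header")
    ET.SubElement(header, "datasource").text = datasource
    ET.SubElement(header, "feedtype").text = "metadata-and-url"
    group = ET.SubElement(feed, "group")
    for url, metadata in records:
        # No <content> child: the crawler fetches the document by its URL.
        record = ET.SubElement(group, "record", url=url, mimetype="text/plain")
        meta_parent = ET.SubElement(record, "metadata")
        for name, value in metadata.items():
            ET.SubElement(meta_parent, "meta", name=name, content=value)
    return feed


records = [
    (
        "http://www.corp.enterprise.com/hello02",
        {"author": "Stevens", "project": "hello02", "department": "HR"},
    ),
]
feed = build_metadata_and_url_feed("sample2", records)
print(ET.tostring(feed, encoding="unicode"))
```

The XML declaration and DOCTYPE line shown in the examples above would still need to be prepended before the feed is pushed.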
The Google Search Appliance provides document-level access control to secure the search content. When you index external metadata, the Google Search Appliance applies the following access controls: