Crawl and Index > Databases

The search appliance can crawl your databases and return matching database records in search results. To allow crawl access, you must supply connection information for each database. You enter this information on the Crawl and Index > Databases page.

When you set up a database crawl, you need to include entries in the Follow and Crawl Only URLs with the Following Patterns field on the Crawl and Index > Crawl URLs page.

For example:

Follow and Crawl Only URLs with the Following Patterns:

To include all database feeds, use this crawl pattern:

  ^googledb://

To include a specific database feed:

  ^googledb://<database_hostname>/<database_source_name>/

For software versions 4.2.2 through 4.4.94 only:

  • To include all database feeds: http://<appliance_ip_address>/db/
  • To include a specific database feed: http://<appliance_ip_address>/db/<database_source_name>/

Note: These URLs and URL patterns are case sensitive. If you use uppercase for the database source name, you must use the same uppercase in the crawl start URLs and crawl patterns.
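As an illustration of the patterns above, the following Python sketch assembles them from their parts. The function names, the hostname db.example.com, the source name Employees, and the IP address 10.0.0.5 are placeholders chosen for this example, not values defined by the search appliance. The sketch deliberately leaves the case of the source name untouched, because the patterns are case sensitive.

```python
from typing import Optional


def all_feeds_pattern() -> str:
    """Crawl pattern that includes every database feed."""
    return "^googledb://"


def feed_pattern(database_hostname: str, source_name: str) -> str:
    """Crawl pattern for one database feed.

    The source name is inserted exactly as given: no lowercasing,
    because the patterns are case sensitive.
    """
    return f"^googledb://{database_hostname}/{source_name}/"


def legacy_feed_url(appliance_ip: str, source_name: Optional[str] = None) -> str:
    """URL form used by software versions 4.2.2 through 4.4.94 only."""
    base = f"http://{appliance_ip}/db/"
    return base if source_name is None else f"{base}{source_name}/"


# Placeholder values for illustration only.
print(all_feeds_pattern())
print(feed_pattern("db.example.com", "Employees"))
print(legacy_feed_url("10.0.0.5", "Employees"))
```

A source created as Employees and a pattern written with employees would not match, so it can help to generate both the data source name and the crawl pattern from the same string.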

If your data source contains a URL column with URLs that point to your own website, add those URL patterns under Follow and Crawl Only URLs with the Following Patterns on the Crawl and Index > Crawl URLs page.

Here is the information about your databases that you need to have ready. The first seven entries are used by the system to talk to the external database server.

  • Source Name - a name for the data source. The database entry name must match the [a-zA-Z_][a-zA-Z0-9_-]* pattern, that is, you can use letters or an underscore for the first character, and alphanumeric characters, underscores, and dashes in the rest of the name.
  • Database Type - choose from DB2, Oracle, MySQL, MS SQL Server, or Sybase.
  • Hostname - name of the server where the database resides.
  • Port - the port number on which the database server accepts JDBC connections.
  • Database Name - the name given to the database. The database name must consist of alphanumeric characters.
  • Username - user name to access the database.
  • Password - password for the database.
  • Crawl Query - a SQL statement accepted by the targeted database software that returns all rows to be indexed. See example.
  • Usage - Choose the stylesheet for displaying database results and configure the search appliance to index external metadata.
    • Data Display - choose from a default stylesheet for displaying results or upload a stylesheet from your network. (To view the default stylesheet, log on to the Google Support site. You can download it from there and make changes to it, then upload it on the Crawl and Index > Databases page.)
    • Metadata - select if you need to index metadata that is stored in a database, but not stored directly in the primary document that it describes.
      • Document URL Field - if your database contains a column with complete URLs that point to primary documents, enter the name of the database column that holds the URLs that point to the primary documents.
      • Document ID Field and Base URL - if your database contains a column with document IDs that need to be combined with a base URL to point to primary documents:
        • In the Document ID Field, enter the name of the database column that holds values that are used to construct primary document URLs.
        • In the BASE URL field, enter the base URL that is used to construct the complete URLs of primary documents. The base URL should have the format http://www.baseurl/docnum={docID} where {docID} represents the values in the column specified in the Document ID Field.
      • BLOB - select if your database contains primary documents (stored as BLOBs) and related external metadata.
      • BLOB MIME Type Field - enter the database column that specifies the standard Internet MIME type of the BLOB.
      • BLOB Content Field - enter the database column that contains the BLOB data.
  • Serving Interface - choose either Serve Query or Serve URL Field.
    • Serve Query - a SQL statement that returns the database row for a document that matches a search query. See example.
      Primary Key Fields - column names, separated by commas, such as Last_Name,First_Name,SSN,Birth_Date.
    • Serve URL Field - If your database records already have URLs that display them, you should specify the database column that contains the URL. For example, in a company directory, if an HTML page exists for each record, and the links are always in the same format (such as http://corp.company.com/hr/Joe_Employee.html), then the appliance displays that link when it serves results. Specify the name of the field that contains the URL, such as "Employee_name".
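Two of the fields in the list above have formats that can be checked before you enter them: the Source Name pattern and the {docID} placeholder in the base URL. The following Python sketch shows one way to do that. The function names are hypothetical, and the plain text substitution of {docID} is an assumption made for illustration; this page does not say whether the appliance encodes document ID values before inserting them.

```python
import re

# Pattern quoted in the Source Name entry: a letter or underscore first,
# then letters, digits, underscores, and dashes.
SOURCE_NAME_RE = re.compile(r"[a-zA-Z_][a-zA-Z0-9_-]*")


def is_valid_source_name(name: str) -> bool:
    """Return True if the whole name matches the Source Name pattern."""
    return SOURCE_NAME_RE.fullmatch(name) is not None


def build_document_url(base_url: str, doc_id) -> str:
    """Substitute a Document ID Field value for {docID} in the base URL.

    Assumption: plain substitution, with no URL encoding of the value.
    """
    if "{docID}" not in base_url:
        raise ValueError("base URL must contain the {docID} placeholder")
    return base_url.replace("{docID}", str(doc_id))


print(is_valid_source_name("hr_db-2"))   # True
print(is_valid_source_name("2hr_db"))    # False: starts with a digit
print(build_document_url("http://www.baseurl/docnum={docID}", 42))
```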

The Advanced Settings section lets you define additional database information for the appliance to crawl.
  • Incremental Crawl Query - a SQL statement that targets insertions, updates, and deletions in the database
    Action Field - the name of the column that lists the modification type; valid values for the Action field are "add" or "delete".
  • BLOB MIME Type field - the name of the column that contains the standard Internet MIME type values of Binary Large Objects, such as text/plain and text/html.
  • BLOB Content field - the name of the column that contains the types of BLOB content, such as documents.
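To show how an Action Field can drive incremental changes, the following Python sketch runs a query against an in-memory SQLite database and applies each returned row as an "add" or a "delete". Everything in it is hypothetical: the employee_changes table, the action column name, and the dictionary standing in for an index. It also assumes that an updated row is reported with the value "add", since "add" and "delete" are the only valid values. It is not the appliance's own processing, only an illustration of what the Action Field values mean.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical change table; the "action" column plays the Action Field role.
conn.execute(
    "CREATE TABLE employee_changes "
    "(employee_id INTEGER, first_name TEXT, action TEXT)"
)
conn.executemany(
    "INSERT INTO employee_changes VALUES (?, ?, ?)",
    [(1, "Ada", "add"), (2, "Bob", "delete"), (3, "Cy", "add")],
)

INCREMENTAL_QUERY = (
    "SELECT employee_id, first_name, action "
    "FROM employee_changes ORDER BY employee_id"
)


def apply_changes(index: dict, rows) -> dict:
    """Apply add/delete rows to a dictionary standing in for an index."""
    for employee_id, first_name, action in rows:
        if action == "add":
            index[employee_id] = first_name   # insert, or overwrite on update
        elif action == "delete":
            index.pop(employee_id, None)      # ignore rows already absent
        else:
            raise ValueError(f"unsupported action: {action!r}")
    return index


index = {2: "Bob"}  # state left by an earlier full crawl
apply_changes(index, conn.execute(INCREMENTAL_QUERY))
print(index)  # {1: 'Ada', 3: 'Cy'}
```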

The creation of a database source results in the automatic entry of the source in the Crawl and Index > Feeds page.

The search appliance usually transforms data from crawled pages, which protects against security vulnerabilities. If you cause the search appliance to crawl BLOB content by filling in these advanced settings, certain conditions could open a vulnerability. The vulnerability exists only if both of these conditions are true:

  • A perpetrator has access to the database table.
  • You are using secure search, which causes the search appliance to request usernames and passwords or other credentials.

Examples of Serve Queries

Note: The primary key as entered in the field called Primary Key Fields needs to be part of the database query (either by "select * ..." or by "select <primary_key> ...").

Suppose an "employee" database has these fields:

    employee_id, first_name, last_name, email, dept

The following are possible crawl and serve queries.

Crawl query:

    SELECT employee_id, first_name, last_name, email, dept
    FROM employee

Serve query:

    SELECT employee_id, first_name, last_name, email, dept
    FROM employee
    WHERE employee_id = ?

    The Primary Key Field for this example is: employee_id

For a database with a multiple-column primary key, for example where the combination of employee_id and dept is unique:

The crawl query can be the same as the one above.

Serve query:

    SELECT employee_id, first_name, last_name, email, dept
    FROM employee
    WHERE employee_id = ? and dept = ?
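The crawl and serve queries above can be tried outside the appliance. The following Python sketch runs them against an in-memory SQLite database, which accepts the same ? placeholder syntax. The sample rows are invented for this example, and binding the values in the order employee_id, dept is an assumption that mirrors the order of the columns in the WHERE clause; SQLite is not one of the supported database types, so treat this only as a way to check the shape of the queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee "
    "(employee_id INTEGER, first_name TEXT, last_name TEXT, email TEXT, dept TEXT)"
)
# Invented sample data: employee_id 1 appears in two departments,
# so only the pair (employee_id, dept) is unique.
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?, ?, ?)",
    [
        (1, "Ada", "Lovelace", "ada@example.com", "eng"),
        (1, "Alan", "Turing", "alan@example.com", "research"),
        (2, "Grace", "Hopper", "grace@example.com", "eng"),
    ],
)

CRAWL_QUERY = (
    "SELECT employee_id, first_name, last_name, email, dept FROM employee"
)
SERVE_QUERY = CRAWL_QUERY + " WHERE employee_id = ?"
SERVE_QUERY_COMPOSITE = CRAWL_QUERY + " WHERE employee_id = ? and dept = ?"

# The crawl query returns every row to be indexed.
all_rows = conn.execute(CRAWL_QUERY).fetchall()

# The serve query binds one value per primary key column.
one_row = conn.execute(SERVE_QUERY, (2,)).fetchone()
composite_row = conn.execute(SERVE_QUERY_COMPOSITE, (1, "eng")).fetchone()

print(len(all_rows))
print(one_row)
print(composite_row)
```

With the composite key, the single-column serve query would match two rows for employee_id 1, which is why both columns are needed to identify one record.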

To configure crawling a database:

  1. Click Crawl and Index and then click Databases.
  2. Enter your database information in the fields. All fields down to Advanced Settings are required. Refer to the section above for definitions.
  3. Click the Create Database Data Source button.
  4. Click the Sync link.

Note: If you see any issues with data sources, such as getting 404 errors using the "View Log" link, you can usually resolve them by clicking the Sync link next to a database entry again.

To edit an existing database configuration:

  1. Click the Edit link next to the database you want to edit.
  2. Enter your changes in the form.
  3. Click the Save Database Configuration button.
  4. Click the Sync link.

To delete a database configuration:

  1. Click the Delete link to the right of the database name.
  2. Click Yes to confirm the deletion.

 
© Google Inc. 2007