Administering Crawl for Web and File Share Content: Preparing for a Crawl

Google Search Appliance software version 4.6
Google Mini software version 4.6
Posted July 2007

This document provides an overview of how the Google Search Appliance and the Google Mini crawl and index enterprise content.

For the Google Search Appliance, information about continuous crawl applies to software version 4.2, and information about full crawl and file system crawl applies to software version 4.6 and later.

For the Google Mini, all information applies to software version 4.4 and later.

Contents

  1. Preparing Data for a Crawl
    1. Using robots.txt to Control Access to a Content Server
      1. Using the Allow Directive
    2. Using Robots META Tags to Control Access to a Web Page
    3. Excluding Unwanted Text from the Index
    4. Using no_crawl Directories to Control Access to Files and Subdirectories
    5. Preparing Shared Folders in File Systems
    6. Ensuring that Unlinked URLs Are Crawled
  2. Configuring a Crawl
    1. Start Crawling from the Following URLs
    2. Follow and Crawl Only URLs with the Following Patterns
    3. Do Not Crawl URLs with the Following Patterns
    4. Testing Your URL Patterns
  3. About SMB URLs
    1. Unsupported SMB URLs
    2. SMB URLs for Non-file Objects
    3. Hostname Resolution
  4. Setting Up the Crawler's Access to Secure Content
  5. Configuring Searchable Dates
  6. Defining Document Date Rules

Preparing Data for a Crawl

Before the Google Search Appliance or Google Mini crawls your enterprise content, people in various roles may want to prepare the content to meet the objectives described in the following table.

Objective Role
Control access to a content server Content server administrator, webmaster
Control access to a Web page Search appliance administrator, webmaster, content owner, and/or content server administrator
Control indexing of parts of a Web page
Control access to files and subdirectories
Ensure that the search appliance can crawl a file system

Using robots.txt to Control Access to a Content Server

The Google Search Appliance and Google Mini always obey the rules in robots.txt, and it is not possible to override this behavior. However, a robots.txt file is not mandatory. When a robots.txt file is present, it is located in the Web server's root directory.

Before the search appliance crawls any content servers in your environment, check with the content server administrator or webmaster to ensure that robots.txt allows the search appliance user agent access to the appropriate content.

If any hosts require authentication before serving robots.txt, you must configure authentication credentials using the Crawl and Index > Crawler Access page in the Admin Console.

Using the Allow Directive

In Google Search Appliance software versions 4.6.4.G.44 and later, the search appliance user agent (gsa-crawler) obeys an extension to the robots.txt standard called "Allow." Not all search engine crawlers recognize this extension, so check the documentation for any other search engines that matter to you. The Allow directive uses the same syntax as the Disallow directive: simply list a directory or page you want to allow.

You may want to use Disallow and Allow together. For example, to block access to all pages in a subdirectory except one, use the following entries:

User-Agent: gsa-crawler
    Disallow: /folder1/
    Allow: /folder1/myfile.html

This blocks all pages inside the folder1 directory except for myfile.html.
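
The combined effect of these two entries can be sketched in Python. This is an illustration only: the text above does not say how gsa-crawler resolves a conflict between Allow and Disallow, so the sketch assumes that the longest matching path prefix decides the outcome. The names RULES and is_allowed are hypothetical.

```python
# Sketch only: assumes the longest matching path prefix decides the outcome.
# The precedence rule, RULES, and is_allowed are illustrative assumptions.

RULES = [
    ("Disallow", "/folder1/"),
    ("Allow", "/folder1/myfile.html"),
]


def is_allowed(path, rules=RULES):
    """Return True if `path` may be crawled under the given rules."""
    verdict, best_len = "allow", 0  # a path that matches no rule is allowed
    for directive, prefix in rules:
        if prefix and path.startswith(prefix) and len(prefix) > best_len:
            verdict, best_len = directive.lower(), len(prefix)
    return verdict == "allow"


if __name__ == "__main__":
    for path in ("/folder1/myfile.html", "/folder1/other.html", "/index.html"):
        print(path, is_allowed(path))
```

Under this assumption, only myfile.html inside folder1 is reported as crawlable, which matches the behavior described above.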

Using Robots META Tags to Control Access to a Web Page

To prevent the search appliance crawler (as well as other crawlers) from indexing or following links in a specific HTML document, embed a Robots META tag in the head of the document. The search appliance crawler obeys the noindex, nofollow, and noarchive META tags. Refer to the following table for details about Robots META tags, including examples.

Tag Description Example
noindex The search appliance crawler retrieves and archives the document in the search appliance cache, but does not index it. The document is counted as part of the license limit. <META NAME="robots" CONTENT="noindex">
nofollow The search appliance crawler retrieves and archives the document in the search appliance cache, but does not follow links on the Web page to other documents. The document is counted as part of the license limit. <META NAME="robots" CONTENT="nofollow">
noarchive The search appliance crawler retrieves and indexes the document, but does not archive it in its cache. The document is counted as part of the license limit. <META NAME="robots" CONTENT="noarchive">

You can combine any or all of the Robots META tags into a single META tag, for example:

<META NAME="robots" CONTENT="noarchive, nofollow">

Currently, it is not possible to set NAME="gsa-crawler" to limit these restrictions to the search appliance.
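
To check which Robots META directives a page carries before it is crawled, a content owner could use something like the following Python sketch. It uses only the standard library html.parser module; the class name RobotsMetaParser and the function robots_directives are hypothetical, and the sketch makes no claim about how the search appliance itself parses pages.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <META NAME="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)  # attribute names arrive lowercased
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        for token in content.split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_directives(html):
    """Return the set of robots directives declared in `html`."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    page = '<html><head><META NAME="robots" CONTENT="noarchive, nofollow"></head></html>'
    print(sorted(robots_directives(page)))
```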

If the search appliance encounters a Robots META tag when fetching a URL, it schedules a retry after a certain time interval. For URLs excluded by Robots META tags, the maximum retry interval is one month.

Excluding Unwanted Text from the Index

There may be Web pages that you want to suppress from search results when users search on certain words or phrases. For example, if a Web page consists of the text "the user conference page will be completed as soon as Jim returns from medical leave," you might not want this page to appear in the results of a search on the terms "user conference."

You can prevent this text from being indexed by using googleoff/googleon tags. By embedding googleoff/googleon tags with their flags in HTML documents, you can disable indexing of words on a page, indexing of anchor text, and use of text in snippets.

For details about each googleon/googleoff flag, refer to the following table.

Flag: index
Description: Words between the tags are not indexed as occurring on the current page.
Example: fish <!--googleoff: index-->shark <!--googleon: index-->mackerel
Results: The words fish and mackerel are indexed for this page, but the occurrence of shark is not indexed. This page could appear in search results for the term shark only if the word appears elsewhere on the page or in anchor text for links to the page. Hyperlinks that appear within these tags are followed.

Flag: anchor
Description: Anchor text that appears between the tags and in links to other pages is not indexed. This prevents the index from using the hyperlink to associate the link text with the target page in search results.
Example: <!--googleoff: anchor--><A href=sharks_rugby.html> shark </A> <!--googleon: anchor-->
Results: The word shark is not associated with the page sharks_rugby.html. Otherwise this hyperlink would cause the page sharks_rugby.html to appear in the search results for the term shark.

Flag: snippet
Description: Text between the tags is not used to create snippets for search results.
Example: <!--googleoff: snippet-->Come to the fair! <!--googleon: snippet-->
Results: The text Come to the fair! does not appear in snippets with the search results.

Flag: all
Description: Turns on all the attributes. Text between the tags is not indexed, followed to another linked-to page, or used for a snippet.
Example: <!--googleoff: all-->Come to the fair! <!--googleon: all-->
Results: The text Come to the fair! is not indexed, is not associated with anchor text, and does not appear in snippets with the search results.

If a URL appears within googleoff and googleon tags, the search appliance crawls the URL.
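
To preview the effect of the index flag on a text fragment, a rough Python approximation is to drop everything between a googleoff tag and the matching googleon tag and list the words that remain. The function name indexable_words is hypothetical, the sketch handles only the index and all flags, and it does not reproduce the appliance's real tokenization or HTML handling.

```python
import re

# Matches a googleoff tag, the text it hides, and the matching googleon tag.
_OFF_ON = re.compile(
    r"<!--\s*googleoff:\s*(index|all)\s*-->.*?<!--\s*googleon:\s*\1\s*-->",
    re.DOTALL | re.IGNORECASE,
)


def indexable_words(fragment):
    """Return the words left after removing googleoff/googleon regions."""
    visible = _OFF_ON.sub(" ", fragment)
    return visible.split()


if __name__ == "__main__":
    sample = "fish <!--googleoff: index-->shark\n<!--googleon: index-->mackerel"
    print(indexable_words(sample))
```

For the example in the index row of the table, the remaining words are fish and mackerel.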

Using no_crawl Directories to Control Access to Files and Subdirectories

The search appliance does not crawl any directories named "no_crawl." You can prevent the search appliance from crawling files and directories by:

  1. Creating a directory called "no_crawl."
  2. Putting the files and subdirectories you do not want crawled under the no_crawl directory.

This method blocks the search appliance from crawling everything in the no_crawl directory, but it does not provide directory security or block people from accessing the directory.

End users can also use no_crawl directories on their local computers to prevent personal files and directories from being crawled.
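
To audit a directory tree before a crawl, you could flag paths that sit under a no_crawl directory. The following Python sketch assumes that the name must match exactly (including case) and that any path component named no_crawl counts; the function name is_under_no_crawl is hypothetical.

```python
from pathlib import PurePosixPath


def is_under_no_crawl(path):
    """Return True if any component of `path` is named exactly no_crawl."""
    # Assumption: exact, case-sensitive match on a whole path component.
    return "no_crawl" in PurePosixPath(path).parts


if __name__ == "__main__":
    for candidate in ("/docs/no_crawl/secret.txt", "/docs/public/readme.txt"):
        print(candidate, is_under_no_crawl(candidate))
```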

Preparing Shared Folders in File Systems

In a Windows network file system, folders and drives can be shared. A shared folder or drive is available for any person, device, or process on the network to use. To enable the search appliance to crawl your file system, do the following:

  1. Set the properties of appropriate folders and drives to "Share this folder."
  2. Check that the content to be crawled is in the appropriate folders and drives.

Ensuring that Unlinked URLs Are Crawled

The search appliance crawls content by following newly discovered links in pages that it crawls. If your enterprise content includes unlinked URLs that are not listed in the follow and crawl patterns, the search appliance crawler will not find them on its own. In addition to adding unlinked URLs to follow and crawl patterns, you can force unlinked URLs into a crawl using one or both of the following types of pages: a jump page or a site map.

Both of these types of pages allow users or crawlers to navigate all the pages within a Web site. To include a jump page or site map in the crawl, add the URL for the page or map to the crawl path.
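
A jump page can be as simple as an HTML list of links. The following Python sketch generates one from a list of otherwise unlinked URLs; the function name build_jump_page and the example URLs are hypothetical, and the markup is only one of many possible layouts.

```python
from html import escape


def build_jump_page(urls, title="Jump page"):
    """Return a minimal HTML page that links to every URL in `urls`."""
    items = "\n".join(
        '      <li><a href="{0}">{0}</a></li>'.format(escape(url, quote=True))
        for url in urls
    )
    return (
        "<html>\n"
        "  <head><title>{0}</title></head>\n"
        "  <body>\n"
        "    <ul>\n"
        "{1}\n"
        "    </ul>\n"
        "  </body>\n"
        "</html>\n"
    ).format(escape(title), items)


if __name__ == "__main__":
    print(build_jump_page([
        "http://www.example.com/orphan.html",
        "http://www.example.com/report?id=1&v=2",
    ]))
```

Save the generated page on a crawlable server and add its URL to the crawl path as described above.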

Configuring a Crawl

Before starting a crawl, you must configure the crawl path so that it only includes information that your organization wants to make available in search results. To configure the crawl, use the Crawl and Index > Crawl URLs page in the Admin Console to enter URLs and URL patterns in the following boxes: Start Crawling from the Following URLs, Follow and Crawl Only URLs with the Following Patterns, and Do Not Crawl URLs with the Following Patterns.

Note: URLs are case-sensitive.

For complete information about the Crawl URLs page, click Help Center > Crawl and Index > Crawl URLs in the Admin Console.

Start Crawling from the Following URLs

Start URLs control where the search appliance begins crawling your content. The search appliance should be able to reach all content that you want to include in a particular crawl by following the links from one or more of the start URLs. Start URLs are required.

Start URLs must be fully qualified URLs in the following format:

<protocol>://<host>{:port}/{path}

The information in the curly brackets is optional.

Typically, start URLs include your company's home site, as shown in the following example:

http://mycompany.com/

Enter start URLs in the Start Crawling from the Following URLs section on the Crawl and Index > Crawl URLs page in the Admin Console. To crawl content from multiple Web sites, add a start URL for each site.
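
Before entering start URLs, you can sanity-check them against the format shown above. The following Python sketch uses the standard urllib.parse module and reads the format literally: a protocol, a host, an optional port, and a path that begins with a slash. The function name is_valid_start_url is hypothetical, and the Admin Console may accept or reject URLs differently.

```python
from urllib.parse import urlsplit


def is_valid_start_url(url):
    """Check `url` against <protocol>://<host>{:port}/{path}."""
    try:
        parts = urlsplit(url)
        parts.port  # accessing .port raises ValueError for a malformed port
    except ValueError:
        return False
    # Literal reading of the format: the slash after the host is required.
    return bool(parts.scheme) and bool(parts.hostname) and parts.path.startswith("/")


if __name__ == "__main__":
    for candidate in ("http://mycompany.com/", "mycompany.com/", "http://mycompany.com"):
        print(candidate, is_valid_start_url(candidate))
```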

Follow and Crawl Only URLs with the Following Patterns

Follow and crawl URL patterns control which URLs are crawled and included in the index. Before crawling any URLs, the search appliance checks them against follow and crawl URL patterns. Only URLs that match these URL patterns are crawled and indexed. You must include all start URLs in follow and crawl URL patterns.

The following example shows a follow and crawl URL pattern:

http://www.example.com/help/

Given this follow and crawl URL pattern, the search appliance crawls the following URLs because each one matches it:

http://www.example.com/help/two.html
http://www.example.com/help/three.html

However, the search appliance does not crawl the following URL because it does not match the follow and crawl pattern:

http://www.example.com/us/three.html
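
The matching in this example can be approximated in Python. The sketch below assumes that a simple pattern matches when it occurs anywhere in the URL; it ignores any special pattern syntax described in Constructing URL Patterns, and the function name matches_follow_pattern is hypothetical.

```python
def matches_follow_pattern(url, patterns):
    """Return True if `url` contains at least one of the simple patterns."""
    # Assumption: a plain pattern matches as a substring of the URL.
    return any(pattern in url for pattern in patterns)


if __name__ == "__main__":
    patterns = ["http://www.example.com/help/"]
    for url in (
        "http://www.example.com/help/two.html",
        "http://www.example.com/help/three.html",
        "http://www.example.com/us/three.html",
    ):
        print(url, matches_follow_pattern(url, patterns))
```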

The following table provides examples of how to use follow and crawl URL patterns to match sites, directories, and specific URLs.

To Match Expression Format Example
A site <site>/ www.mycompany.com/
URLs from all sites in the same domain <domain>/ mycompany.com/
URLs that are in a specific directory or in one of its subdirectories <site>/<directory>/ sales.mycompany.com/products/
A specific file <site>/<directory>/<file> www.mycompany.com/products/index.html

For more information about writing URL patterns, see Constructing URL Patterns.

Enter follow and crawl URL patterns in the Follow and Crawl Only URLs with the Following Patterns section on the Crawl and Index > Crawl URLs page in the Admin Console.

Do Not Crawl URLs with the Following Patterns

Do not crawl URL patterns exclude URLs from being crawled and included in the index. If a URL contains a do not crawl pattern, the search appliance does not crawl it. Do not crawl patterns are optional.

Enter do not crawl URL patterns in the Do Not Crawl URLs with the Following Patterns section on the Crawl and Index > Crawl URLs page in the Admin Console.

To prevent specific file types, directories, or other sets of pages from being crawled, enter the appropriate URLs in this section. Using this section, you can:

For your convenience, this section is prepopulated with many URL patterns and file types, some of which you may not want the search appliance to index. To make a pattern or file type unavailable to the search appliance crawler, remove the # (comment) mark in the line containing the file type. For example, to make Excel files on your servers unavailable to the crawler, change the line

#.xls$
to
.xls$
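
The effect of removing the comment mark can be sketched in Python. The sketch assumes that lines beginning with # are ignored, that a trailing $ anchors a pattern to the end of the URL, and that any other pattern matches anywhere in the URL. The function names active_patterns and is_excluded are hypothetical.

```python
def active_patterns(config_text):
    """Return the patterns that are not commented out with a leading #."""
    patterns = []
    for line in config_text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            patterns.append(line)
    return patterns


def is_excluded(url, patterns):
    """Return True if `url` matches any active do not crawl pattern."""
    for pattern in patterns:
        if pattern.endswith("$"):
            # Assumption: a trailing $ means the URL must end with the text.
            if url.endswith(pattern[:-1]):
                return True
        elif pattern in url:
            return True
    return False


if __name__ == "__main__":
    url = "http://www.example.com/budget.xls"
    print(is_excluded(url, active_patterns("#.xls$")))  # pattern commented out
    print(is_excluded(url, active_patterns(".xls$")))   # pattern active
```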

Testing Your URL Patterns

To confirm that URLs can be crawled, you can use the Pattern Tester Utility page. This page finds which URLs will be matched by the patterns you have entered for:

To use the Pattern Tester Utility page, click Test these patterns on the Crawl and Index > Crawl URLs page. For complete information about the Pattern Tester Utility page, click Help Center > Crawl and Index > Crawl URLs in the Admin Console.

About SMB URLs

As with HTTP or HTTPS web-based content, the search appliance uses URLs to refer to individual objects that are available on SMB-based file systems, including files, directories, shares, and hosts.

Use the following format for an SMB URL:

smb://string1/string2/...

When the crawler sees a URL in this format, it treats string1 as the hostname and string2 as the share name, with the remainder as the path within the share. Do not enter a workgroup in an SMB URL.

The following example shows a valid SMB URL for crawl:

smb://fileserver.mycompany.com/myshare/mydir/mydoc.txt

The following table describes all of the required parts of a URL that are used to identify an SMB-based document.

URL Component Description Example
Protocol Indicates the network protocol that is used to access the object. smb://
Hostname Specifies the DNS host name or WINS name of the SMB server. A hostname can be one of the following:  
A fully qualified domain name fileserver.mycompany.com
An unqualified hostname fileserver
An IP Address 10.0.0.100
Share name Specifies the name of the share to use. A share is tied to a particular host, so two shares with the same name on different hosts do not necessarily contain the same content. myshare
File path Specifies the path to the document, relative to the root share. If myshare on myhost.mycompany.com shares all the documents under the C:\myshare directory, the file C:\myshare\mydir\mydoc.txt is retrieved by the following: smb://myhost.mycompany.com/myshare/mydir/mydoc.txt
Forward slash SMB URLs use forward slashes only. Some environments, such as Microsoft Windows systems, use backslashes ("\") to separate file path components. Even if you are referring to documents in such an environment, use forward slashes for this purpose. Microsoft Windows style: C:\myshare\
SMB URL: smb://myhost.mycompany.com/myshare/
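
The components in this table can be pulled out of an SMB URL with the Python standard library. The following sketch is a convenience for checking URLs before you enter them; the function name parse_smb_url and the returned dictionary keys are hypothetical, and the sketch does not contact any server.

```python
from urllib.parse import urlsplit


def parse_smb_url(url):
    """Split an SMB URL into host, share name, and file path."""
    if "\\" in url:
        raise ValueError("SMB URLs use forward slashes only")
    parts = urlsplit(url)
    if parts.scheme != "smb" or not parts.hostname:
        raise ValueError("expected smb://<host>/<share>/<path>")
    segments = [segment for segment in parts.path.split("/") if segment]
    return {
        "host": parts.hostname,
        "share": segments[0] if segments else None,
        "path": "/".join(segments[1:]) or None,
    }


if __name__ == "__main__":
    print(parse_smb_url("smb://myhost.mycompany.com/myshare/mydir/mydoc.txt"))
```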

Unsupported SMB URLs

Some SMB file share implementations allow:

The file system crawler does not support these URL schemes.

SMB URLs for Non-file Objects

SMB URLs can refer to objects other than files, including directories, shares, and hosts. The file system gateway, which interacts with the network file shares, treats these non-document objects like documents that do not have any content, but do have links to certain other objects. The following table describes the correspondence between objects that the URLs can refer to and what they actually link to.

URL Refers To URL Links To Example
Directory Files and subdirectories contained within the directory smb://fileserver.mycompany.com/myshare/mydir/
Share Files and subdirectories contained within the share's top-level directory smb://fileserver.mycompany.com/myshare/
Host Each share on the host. See also "Share name" in the previous table. smb://fileserver.mycompany.com/
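
The correspondence in this table can be sketched as a small Python classifier. It assumes, based only on the examples above, that a URL ending in a slash below the share level refers to a directory and that any other URL below the share level refers to a file. The function name smb_object_kind is hypothetical.

```python
def smb_object_kind(url):
    """Classify an SMB URL as 'host', 'share', 'directory', or 'file'."""
    prefix = "smb://"
    if not url.startswith(prefix):
        raise ValueError("not an SMB URL")
    segments = [s for s in url[len(prefix):].split("/") if s]
    if not segments:
        raise ValueError("SMB URL has no hostname")
    if len(segments) == 1:
        return "host"
    if len(segments) == 2:
        return "share"
    # Assumption: a trailing slash marks a directory, as in the examples.
    return "directory" if url.endswith("/") else "file"


if __name__ == "__main__":
    for url in (
        "smb://fileserver.mycompany.com/",
        "smb://fileserver.mycompany.com/myshare/",
        "smb://fileserver.mycompany.com/myshare/mydir/",
    ):
        print(url, smb_object_kind(url))
```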

Hostname Resolution

Hostname resolution is the process of associating a symbolic hostname with a numeric address that is used for network routing. For example, the symbolic hostname fileserver.mycompany.com might resolve to the numeric address 10.0.0.100.

File system crawling supports two methods of resolving hostnames: DNS (Domain Name System) and WINS (Windows Internet Name Service).

During setup, the search appliance requires that at least one DNS server be specified. If a WINS server is available, you may specify it using the Administration > Network Settings page in the Admin Console.

If both DNS and WINS are configured, the file system gateway first attempts to resolve hostnames used in SMB file shares using WINS. If WINS cannot resolve the hostname (or a WINS server is not configured), the appliance attempts to use DNS to look up the hostname.

WINS is not used to resolve hostnames for non-file-share content. Do not specify a WINS server if the search appliance will not crawl an SMB file share or if your network does not have a WINS server.
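
The lookup order described above can be sketched in Python with stand-in resolvers. The function name resolve_smb_host is hypothetical, and the two lookup callables are placeholders for real WINS and DNS queries, which this sketch does not perform.

```python
def resolve_smb_host(hostname, wins_lookup=None, dns_lookup=None):
    """Try WINS first (if configured), then DNS. Return (address, method)."""
    if wins_lookup is not None:
        address = wins_lookup(hostname)
        if address is not None:
            return address, "WINS"
    if dns_lookup is not None:
        address = dns_lookup(hostname)
        if address is not None:
            return address, "DNS"
    return None, None


if __name__ == "__main__":
    # Dictionaries stand in for the name servers; addresses are examples.
    wins = {"fileserver": "10.0.0.100"}.get
    dns = {"fileserver.mycompany.com": "10.0.0.100"}.get
    print(resolve_smb_host("fileserver", wins, dns))
    print(resolve_smb_host("fileserver.mycompany.com", wins, dns))
```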

Setting Up the Crawler's Access to Secure Content

The information in this document describes crawling public content.

Configuring Searchable Dates

For dates to be properly indexed and searchable by date range, they must be in ISO 8601 format:

YYYY-MM-DD

The following example shows a date in ISO 8601 format:

2007-07-11

For a date in a META tag to be indexed, not only must it be in ISO 8601 format, it must also be the only value in the content. For example, the date in the following META tag can be indexed:

<META name="date" content="2007-07-11">

The date in the following meta tag cannot be indexed because there is additional content:

<META name="date" content="2007-07-11 is a date">
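
To check META tag content before publishing a page, you can test whether the value is exactly one YYYY-MM-DD date. The following Python sketch treats any extra characters, including surrounding spaces, as disqualifying, which is a strict reading of the rule above. The function name is_indexable_date is hypothetical.

```python
import re
from datetime import date

_ISO_DATE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def is_indexable_date(content):
    """Return True if `content` is exactly one valid YYYY-MM-DD date."""
    if not _ISO_DATE.fullmatch(content):
        return False
    try:
        date.fromisoformat(content)  # rejects impossible dates such as month 13
    except ValueError:
        return False
    return True


if __name__ == "__main__":
    for value in ("2007-07-11", "2007-07-11 is a date"):
        print(repr(value), is_indexable_date(value))
```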

Defining Document Date Rules

Documents can have dates explicitly stated in these places:

To define a rule that the search appliance crawler should use to locate document dates in documents for a particular URL, use the Crawl and Index > Document Dates page in the Admin Console. If you define more than one document date rule for a URL, the search appliance applies the rules in the order in which you enter them.

To configure document dates:

  1. Choose Crawl and Index > Document Dates. The Document Dates page appears.
  2. In the Host or URL Pattern box, enter the host or URL pattern for which you want to set the rule.
  3. Use the Locate Date In drop-down list to select the location of the date for the document in the specified URL pattern.
  4. If you select Meta Tag, specify the name of the tag in the Meta Tag Name box. Make sure that the META tag you specify actually appears in your HTML. For example, for the tag <META name="publication_date">, enter "publication_date" in the Meta Tag Name box.
  5. To add another date rule, click Add More Lines, and add the rule.
  6. Click Save Changes. This triggers the Document Dates process to run.

For complete information about the Document Dates page, click Help Center > Crawl and Index > Document Dates in the Admin Console.
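
The ordered application of document date rules can be illustrated with a short Python sketch. It assumes that a rule applies when its pattern occurs in the URL and that the first applicable rule that yields a date wins; neither detail is confirmed above. The rule tuples, the meta tag name "date", and the function name find_document_date are hypothetical, and only Meta Tag rules are modeled.

```python
def find_document_date(url, meta_tags, rules):
    """Apply rules in the order entered; return the first date found."""
    for pattern, location, meta_name in rules:
        if pattern not in url:
            continue  # the rule does not apply to this URL
        if location == "Meta Tag" and meta_name in meta_tags:
            return meta_tags[meta_name]
    return None


if __name__ == "__main__":
    # Rules are listed in the order they would be entered on the page.
    rules = [
        ("www.mycompany.com/news/", "Meta Tag", "publication_date"),
        ("www.mycompany.com/", "Meta Tag", "date"),
    ]
    tags = {"publication_date": "2007-07-11", "date": "2007-01-01"}
    print(find_document_date("http://www.mycompany.com/news/a.html", tags, rules))
```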
