Google Search Appliance

Administering Crawl for Web and File Share Content: Preparing for a Crawl

Google Search Appliance software version 6.0
Posted June 2009

Crawling is the process where the Google Search Appliance discovers enterprise content to index. This chapter tells search appliance administrators and content owners how to prepare enterprise content for crawling.

Contents

  1. Preparing Data for Crawl
    1. Using robots.txt to Control Access to a Content Server
      1. Using the Allow Directive
    2. Using Robots META Tags to Control Access to a Web Page
    3. Excluding Unwanted Text from the Index
    4. Using no_crawl Directories to Control Access to Files and Subdirectories
    5. Preparing Shared Folders in File Systems
    6. Ensuring that Unlinked URLs Are Crawled
  2. Configuring a Crawl
    1. Start Crawling from the Following URLs
    2. Follow and Crawl Only URLs with the Following Patterns
    3. Do Not Crawl URLs with the Following Patterns
    4. Testing Your URL Patterns
    5. Using Google Regular Expressions as Crawl Patterns
    6. Configuring Database Crawl
  3. About SMB URLs
    1. Unsupported SMB URLs
    2. SMB URLs for Non-file Objects
    3. Hostname Resolution
  4. Setting Up the Crawler's Access to Secure Content
  5. Configuring Searchable Dates
  6. Defining Document Date Rules

Preparing Data for a Crawl

Before the Google Search Appliance crawls your enterprise content, people in various roles may want to prepare the content to meet the objectives described in the following table.

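Objective | Role
Control access to a content server | Content server administrator, webmaster
Control access to a Web page | Search appliance administrator, webmaster, content owner, and/or content server administrator
Control indexing of parts of a Web page | Search appliance administrator, webmaster, content owner, and/or content server administrator
Control access to files and subdirectories | Search appliance administrator, webmaster, content owner, and/or content server administrator
Ensure that the search appliance can crawl a file system | Search appliance administrator, webmaster, content owner, and/or content server administrator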

Using robots.txt to Control Access to a Content Server

The Google Search Appliance always obeys the rules in robots.txt, and it is not possible to override this feature. However, this type of file is not mandatory. When a robots.txt file is present, it is located in the Web server's root directory. For the search appliance to be able to access the robots.txt file, the file must be public.

Before the search appliance crawls any content servers in your environment, check with the content server administrator or webmaster to ensure that robots.txt allows the search appliance user agent access to the appropriate content.

If any hosts require authentication before serving robots.txt, you must configure authentication credentials using the Crawl and Index > Crawler Access page in the Admin Console.

Using the Allow Directive

In Google Search Appliance software versions 4.6.4.G.44 and later, the search appliance user agent (gsa-crawler) obeys an extension to the robots.txt standard called "Allow." This extension may not be recognized by all other search engine crawlers, so check the documentation for any other search engines you are interested in. The Allow directive works exactly like the Disallow directive. Simply list a directory or page you want to allow.

You may want to use Disallow and Allow together. For example, to block access to all pages in a subdirectory except one, use the following entries:

User-Agent: gsa-crawler
Disallow: /folder1/
Allow: /folder1/myfile.html

This blocks all pages inside the folder1 directory except for myfile.html.
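
As a rough illustration of how a Disallow and Allow pair can interact, the following Python sketch evaluates a path against the two rules above. It assumes that the most specific (longest) matching rule wins and that a path matching no rule is allowed; this chapter does not spell out the search appliance's exact precedence rules, and the names RULES and is_allowed are hypothetical.

```python
# Rules from the example above, as (directive, path-prefix) pairs.
RULES = [
    ("disallow", "/folder1/"),
    ("allow", "/folder1/myfile.html"),
]


def is_allowed(path, rules=RULES):
    """Return True if `path` may be crawled under `rules`.

    Assumption: the longest matching prefix wins, and a path that
    matches no rule is allowed.
    """
    best_directive = "allow"
    best_length = -1
    for directive, prefix in rules:
        if path.startswith(prefix) and len(prefix) > best_length:
            best_directive = directive
            best_length = len(prefix)
    return best_directive == "allow"


print(is_allowed("/folder1/myfile.html"))  # True
print(is_allowed("/folder1/other.html"))   # False
print(is_allowed("/folder2/index.html"))   # True
```

A check like this can help a webmaster review a robots.txt draft, but the crawler's own behavior remains the final authority.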

Using Robots META Tags to Control Access to a Web Page

To prevent the search appliance crawler (as well as other crawlers) from indexing or following links in a specific HTML document, embed a Robots META tag in the head of the document. The search appliance crawler obeys the noindex, nofollow, and noarchive META tags. Refer to the following table for details about Robots META tags, including examples.

Tag | Description | Example
noindex | The search appliance crawler does not archive the document in the search appliance cache or index it. The document is not counted as part of the license limit. | <META NAME="robots" CONTENT="noindex">
nofollow | The search appliance crawler retrieves and archives the document in the search appliance cache, but does not follow links on the Web page to other documents. The document is counted as part of the license limit. | <META NAME="robots" CONTENT="nofollow">
noarchive | The search appliance crawler retrieves and indexes the document, but does not archive it in its cache. The document is counted as part of the license limit. | <META NAME="robots" CONTENT="noarchive">

You can combine any or all of the Robots META tags into a single META tag, for example:

<META NAME="robots" CONTENT="noarchive, nofollow">

Currently, it is not possible to set NAME="gsa-crawler" to limit these restrictions to the search appliance.

If the search appliance encounters a robots META tag when fetching a URL, it schedules a retry after a certain time interval. For URLs excluded by robots META tags, the maximum retry interval is one month.
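
Before a crawl, a content owner may want to check which Robots META directives a page actually declares. The following Python sketch uses the standard library HTML parser to collect them; the class and function names are hypothetical, and the sketch only reads the tags, it does not reproduce how the crawler acts on them.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <META NAME="robots" CONTENT="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, but not values.
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() == "robots":
            for item in attributes.get("content", "").split(","):
                if item.strip():
                    self.directives.add(item.strip().lower())


def robots_directives(html_text):
    """Return the set of robots directives declared in `html_text`."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    return parser.directives


page = '<html><head><META NAME="robots" CONTENT="noarchive, nofollow"></head></html>'
print(sorted(robots_directives(page)))  # ['noarchive', 'nofollow']
```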

Excluding Unwanted Text from the Index

There may be Web pages that you want to suppress from search results when users search on certain words or phrases. For example, if a Web page consists of the text "the user conference page will be completed as soon as Jim returns from medical leave," you might not want this page to appear in the results of a search on the terms "user conference."

You can prevent this text from being indexed by using googleoff/googleon tags. By embedding googleoff/googleon tags with their flags in HTML documents, you can disable:

  • The indexing of a word or portion of a Web page
  • The indexing of anchor text
  • The use of text to create a snippet in search results

For details about each googleon/googleoff flag, refer to the following table.

Flag: index
Description: Words between the tags are not indexed as occurring on the current page.
Example:
fish <!--googleoff: index-->shark
<!--googleon: index-->mackerel
Results: The words fish and mackerel are indexed for this page, but the occurrence of shark is not indexed. This page could appear in search results for the term shark only if the word appears elsewhere on the page or in anchor text for links to the page. Hyperlinks that appear within these tags are followed.

Flag: anchor
Description: Anchor text that appears between the tags and in links to other pages is not indexed. This prevents the index from using the hyperlink to associate the link text with the target page in search results.
Example:
<!--googleoff: anchor--><A href=sharks_rugby.html>
shark </A> <!--googleon: anchor-->
Results: The word shark is not associated with the page sharks_rugby.html. Otherwise this hyperlink would cause the page sharks_rugby.html to appear in the search results for the term shark. Hyperlinks that appear within these tags are followed, so sharks_rugby.html is still crawled and indexed.

Flag: snippet
Description: Text between the tags is not used to create snippets for search results.
Example:
<!--googleoff: snippet-->Come to the fair!
<!--googleon: snippet-->
Results: The text Come to the fair! does not appear in snippets with the search results.

Flag: all
Description: Turns on all the attributes. Text between the tags is not indexed, followed to another linked-to page, or used for a snippet.
Example:
<!--googleoff: all-->Come to the fair!
<!--googleon: all-->
Results: The text Come to the fair! is not indexed, is not associated with anchor text, and does not appear in snippets with the search results.

There must be a space or newline before the googleon tag. If a URL appears within googleoff and googleon tags, the search appliance crawls the URL.
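
To preview which text a googleoff/googleon pair would hide, a content owner could strip the tagged regions from a page before reviewing it. The following Python sketch does this with a regular expression for a single flag; the function name strip_googleoff is hypothetical, and the sketch is only an approximation of the tag pairing, not the search appliance's own parser.

```python
import re


def strip_googleoff(html_text, flag="index"):
    """Remove everything between googleoff and googleon tags for `flag`."""
    name = re.escape(flag)
    pattern = re.compile(
        r"<!--\s*googleoff:\s*" + name + r"\s*-->"
        r".*?"
        r"<!--\s*googleon:\s*" + name + r"\s*-->",
        re.DOTALL,  # let the hidden region span line breaks
    )
    return pattern.sub("", html_text)


sample = "fish <!--googleoff: index-->shark\n<!--googleon: index-->mackerel"
print(strip_googleoff(sample))  # fish mackerel
```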

Using no_crawl Directories to Control Access to Files and Subdirectories

The Google Search Appliance does not crawl any directories named "no_crawl." You can prevent the search appliance from crawling files and directories by:

  1. Creating a directory called "no_crawl."
  2. Putting the files and subdirectories you do not want crawled under the no_crawl directory.

This method blocks the search appliance from crawling everything in the no_crawl directory, but it does not provide directory security or block people from accessing the directory.

End users can also use no_crawl directories on their local computers to prevent personal files and directories from being crawled.
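
To audit which files the rule above would keep out of the index, you could test each path for a directory component named no_crawl. The following Python sketch shows one way to do that; the function name and the sample paths are hypothetical.

```python
from pathlib import PurePosixPath


def is_under_no_crawl(path):
    """Return True if any component of `path` is exactly 'no_crawl'."""
    return "no_crawl" in PurePosixPath(path).parts


print(is_under_no_crawl("/docs/no_crawl/drafts/plan.txt"))  # True
print(is_under_no_crawl("/docs/no_crawl_notes/plan.txt"))   # False
```

The check compares whole path components, so a directory such as no_crawl_notes is not treated as a no_crawl directory.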

Preparing Shared Folders in File Systems

In a Windows network file system, folders and drives can be shared. A shared folder or drive is available for any person, device, or process on the network to use. To enable the Google Search Appliance to crawl your file system, do the following:

  1. Set the properties of appropriate folders and drives to "Share this folder."
  2. Check that the content to be crawled is in the appropriate folders and drives.

Ensuring that Unlinked URLs Are Crawled

The Google Search Appliance crawls content by following newly discovered links in pages that it crawls. If your enterprise content includes unlinked URLs that are not listed in the follow and crawl patterns, the search appliance crawler will not find them on its own. In addition to adding unlinked URLs to follow and crawl patterns, you can force unlinked URLs into a crawl using one or both of the following types of pages:

  • Jump page - A jump page lists any URLs and links that you want the search appliance crawler to discover.
  • Site map - A site map lists the pages on a Web site, typically organized in hierarchical fashion.

Both of these types of pages allow users or crawlers to navigate all the pages within a Web site. To include a jump page or site map in the crawl, add the URL for the page or map to the crawl path.
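
A jump page can be as simple as a list of links. The following Python sketch generates one from a list of otherwise unlinked URLs; the function name and the example URLs are hypothetical, and the markup is only one reasonable layout.

```python
from html import escape


def build_jump_page(urls, title="Jump page"):
    """Return a minimal HTML page that links to every URL in `urls`."""
    links = "\n".join(
        '    <li><a href="{0}">{0}</a></li>'.format(escape(url, quote=True))
        for url in urls
    )
    return (
        "<html>\n<head><title>{}</title></head>\n<body>\n  <ul>\n{}\n  </ul>\n"
        "</body>\n</html>\n"
    ).format(escape(title), links)


print(build_jump_page([
    "http://www.example.com/help/orphan1.html",
    "http://www.example.com/help/orphan2.html",
]))
```

Publish the generated page on a server that the search appliance can reach, then add its URL to the crawl path as described above.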

Configuring a Crawl

Before starting a crawl, you must configure the crawl path so that it only includes information that your organization wants to make available in search results. To configure the crawl, use the Crawl and Index > Crawl URLs page in the Admin Console to enter URLs and URL patterns in the following boxes:

  • Start Crawling from the Following URLs
  • Follow and Crawl Only URLs with the Following Patterns
  • Do Not Crawl URLs with the Following Patterns

Note: URLs are case-sensitive.

If the search appliance should never crawl outside of your intranet site, then Google recommends that you take one or more of the following actions:

  • Configure your network to disallow search appliance connectivity outside of your intranet.

    If you want to make sure that the search appliance never crawls outside of your intranet, then a person in your IT/IS group needs to specifically block the search appliance IP addresses from leaving your intranet. The GB-5005 and GB-8008 use three IP addresses, and these IP addresses are in your DNS entries as: googleswitch, googleweb, and googlecrawl. The GB-1001 uses only googleweb. Your IT/IS group needs to configure either an Access Control List (ACL) on your external routers or a set of rules on your firewall to disallow any communication between these IP addresses and the outside world.

  • Make sure all patterns in the field Follow and Crawl Only URLs with the Following Patterns specify yourcompany.com as the domain name.
For complete information about the Crawl URLs page, click Help Center > Crawl and Index > Crawl URLs in the Admin Console.

    Start Crawling from the Following URLs

    Start URLs control where the Google Search Appliance begins crawling your content. The search appliance should be able to reach all content that you want to include in a particular crawl by following the links from one or more of the start URLs. Start URLs are required.

    Start URLs must be fully qualified URLs in the following format:

    <protocol>://<host>{:port}/{path}

    The information in the curly brackets is optional. The forward slash "/" after <host>{:port} is required.

    Typically, start URLs include your company's home site, as shown in the following example:

    http://mycompany.com/

    The following example shows a valid start URL:

    http://www.example.com/help/

    The following table contains examples of invalid start URLs.

    Invalid example | Reason
    http://www/ | Invalid because the hostname is not fully qualified. A fully qualified hostname includes the local hostname and the full domain name. For example: mail.corp.company.com.
    www.example.com/ | Invalid because the protocol information is missing.
    http://www.example.com | Invalid because the "/" after <host>{:port} is required.
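
The format rules above can be checked before you paste URLs into the Admin Console. The following Python sketch is one possible pre-check; the function name check_start_url is hypothetical, and treating "contains a dot" as "fully qualified" is a simplifying assumption, not the search appliance's own validation.

```python
from urllib.parse import urlsplit


def check_start_url(url):
    """Return None if `url` looks like a valid start URL, else a reason."""
    parts = urlsplit(url)
    if not parts.scheme or not parts.netloc:
        return "the protocol information is missing"
    # Assumption: a fully qualified hostname (or IP address) contains a dot.
    if "." not in (parts.hostname or ""):
        return "the hostname is not fully qualified"
    if not parts.path.startswith("/"):
        return 'the "/" after <host>{:port} is required'
    return None


for candidate in [
    "http://www.example.com/help/",
    "http://www/",
    "www.example.com/",
    "http://www.example.com",
]:
    print(candidate, "->", check_start_url(candidate) or "looks valid")
```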

    The search appliance attempts to resolve incomplete path information entered, using the information entered on the Administration > Network Settings page in the DNS Suffix (DNS Search Path) section. However, if it cannot be successfully resolved, the following error message displays in red on the page:
    You have entered one or more invalid start URLs. Please check your edits.

    The crawler will retry several times to crawl URLs that are temporarily unreachable.

    These URLs are only the starting points for the crawl. They tell the crawler where to begin crawling. However, links from the start URLs are followed and indexed only if they match a pattern in Follow and Crawl Only URLs with the Following Patterns. For example, if you specify a start URL of http://www.mycompany.com/ in this section and a pattern www.mycompany.com/ in the Follow and Crawl Only URLs with the Following Patterns section, the crawler discovers links in the http://www.mycompany.com/ web page, but crawls and indexes only URLs that match the pattern www.mycompany.com/.

    Enter start URLs in the Start Crawling from the Following URLs section on the Crawl and Index > Crawl URLs page in the Admin Console. To crawl content from multiple Web sites, add start URLs for them.

    Follow and Crawl Only URLs with the Following Patterns

    Follow and crawl URL patterns control which URLs are crawled and included in the index. Before crawling any URLs, the Google Search Appliance checks them against follow and crawl URL patterns. Only URLs that match these URL patterns are crawled and indexed. You must include all start URLs in follow and crawl URL patterns.

    The following example shows a follow and crawl URL pattern:

    http://www.example.com/help/

    Given this follow and crawl URL pattern, the search appliance crawls the following URLs because each one matches it:

    http://www.example.com/help/two.html
    http://www.example.com/help/three.html

    However, the search appliance does not crawl the following URL because it does not match the follow and crawl pattern:

    http://www.example.com/us/three.html

    The following table provides examples of how to use follow and crawl URL patterns to match sites, directories, and specific URLs.

    To Match | Expression Format | Example
    A site | <site>/ | www.mycompany.com/
    URLs from all sites in the same domain | <domain>/ | mycompany.com/
    URLs that are in a specific directory or in one of its subdirectories | <site>/<directory>/ | sales.mycompany.com/products/
    A specific file | <site>/<directory>/<file> | www.mycompany.com/products/index.html

    For more information about writing URL patterns, see Constructing URL Patterns.
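
For the simple patterns shown in this section, matching can be approximated as a substring test. The following Python sketch makes that assumption explicit; it ignores the richer syntax covered in Constructing URL Patterns, and the function name and the sales.mycompany.com URL are hypothetical.

```python
def matches_any(url, patterns):
    """Simplified check: a URL matches when it contains one of the patterns."""
    return any(pattern in url for pattern in patterns)


follow_patterns = ["http://www.example.com/help/"]

print(matches_any("http://www.example.com/help/two.html", follow_patterns))  # True
print(matches_any("http://www.example.com/us/three.html", follow_patterns))  # False

# A domain pattern also covers sites within that domain.
print(matches_any("http://sales.mycompany.com/products/a.html", ["mycompany.com/"]))  # True
```

Use the Pattern Tester Utility page in the Admin Console for the authoritative answer on whether a URL matches.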

    Enter follow and crawl URL patterns in the Follow and Crawl Only URLs with the Following Patterns section on the Crawl and Index > Crawl URLs page in the Admin Console.

    Do Not Crawl URLs with the Following Patterns

    Do not crawl URL patterns exclude URLs from being crawled and included in the index. If a URL contains a do not crawl pattern, the Google Search Appliance does not crawl it. Do not crawl patterns are optional.

    Enter do not crawl URL patterns in the Do Not Crawl URLs with the Following Patterns section on the Crawl and Index > Crawl URLs page in the Admin Console.

    To prevent specific file types, directories, or other sets of pages from being crawled, enter the appropriate URLs in this section. Using this section, you can:

    • Prevent certain URLs, such as email links, from consuming your license limit.
    • Protect files that you do not want people to see.
    • Save time while crawling by eliminating searches for objects such as MP3 files.

    For your convenience, this section is prepopulated with many URL patterns and file types, some of which you may not want the search appliance to index. To make a pattern or file type unavailable to the search appliance crawler, remove the # (comment) mark in the line containing the file type. For example, to make Excel files on your servers unavailable to the crawler, change the line

    #.xls$ 

    to

    .xls$ 
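
The effect of removing the comment mark can be sketched in a few lines of Python. The example below reads a pattern list, skips lines that start with #, and treats a trailing $ as "the URL ends with this text." The function names, the .mp3$ line, and the sample URL are hypothetical, and the matching is a simplification of the search appliance's pattern syntax.

```python
def active_patterns(box_text):
    """Return the patterns that are not commented out with '#'."""
    patterns = []
    for line in box_text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            patterns.append(line)
    return patterns


def is_excluded(url, patterns):
    """Simplified matching: 'text$' matches URLs that end in 'text';
    any other pattern matches as a substring."""
    for pattern in patterns:
        if pattern.endswith("$"):
            if url.endswith(pattern[:-1]):
                return True
        elif pattern in url:
            return True
    return False


box = "#.xls$\n.mp3$\n"
url = "http://www.example.com/reports/q1.xls"

print(is_excluded(url, active_patterns(box)))                             # False
print(is_excluded(url, active_patterns(box.replace("#.xls$", ".xls$"))))  # True
```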

    Testing Your URL Patterns

    To confirm that URLs can be crawled, you can use the Pattern Tester Utility page. This page finds which URLs will be matched by the patterns you have entered for:

    • Follow and Crawl Only URLs with the Following Patterns
    • Do Not Crawl URLs with the Following Patterns

    To use the Pattern Tester Utility page, click Test these patterns on the Crawl and Index > Crawl URLs page. For complete information about the Pattern Tester Utility page, click Help Center > Crawl and Index > Crawl URLs in the Admin Console.

    Using Google Regular Expressions as Crawl Patterns

    The search appliance's Admin Console accepts Google regular expressions (similar to GNU regular expressions) as crawl patterns, but not all of these are valid in the Robots Exclusion Protocol, so Google or GNU regular expressions cannot be used in robots.txt unless they are also valid under the Robots Exclusion Protocol. Conversely, the Admin Console does not accept Robots Exclusion Protocol patterns that are not valid Google regular expressions.

    Here are some examples:

    • The asterisk (*) is a valid wildcard character in both GNU regular expressions and the Robots Exclusion Protocol, and can be used in the Admin Console or in robots.txt.
    • The $ and ^ characters indicate the end or beginning of a string, respectively, in GNU regular expressions, and can be used in the Admin Console. They are not valid delimiters for a string in the Robots Exclusion Protocol, however, and cannot be used as anchors in robots.txt.
    • The "Disallow" directive is used in robots.txt to indicate that a resource should not be visited by web crawlers. However, "Disallow" is not a valid directive in Google or GNU regular expressions, and cannot be used in the Admin Console.

    Configuring Database Crawl

    To configure a database crawl, provide database data source information by using the Create New Database Source section on the Crawl and Index > Databases page in the Admin Console. To navigate to this page, click Crawl and Index > Databases.

    For information about configuring a database crawl, refer to Providing Database Data Source Information in Database Crawling and Serving.

    About SMB URLs

    As when crawling HTTP or HTTPS web-based content, the Google Search Appliance uses URLs to refer to individual objects that are available on SMB-based file systems, including files, directories, shares, and hosts.

    Use the following format for an SMB URL:

    smb://string1/string2/...

    When the crawler sees a URL in this format, it treats string1 as the hostname and string2 as the share name, with the remainder as the path within the share. Do not enter a workgroup in an SMB URL.

    The following example shows a valid SMB URL for crawl:

    smb://fileserver.mycompany.com/myshare/mydir/mydoc.txt

    The following table describes all of the required parts of a URL that are used to identify an SMB-based document.

    URL Component | Description | Example
    Protocol | Indicates the network protocol that is used to access the object. | smb://
    Hostname | Specifies the DNS host name. A hostname can be a fully qualified domain name, an unqualified hostname, or an IP address. | fileserver.mycompany.com, fileserver, or 10.0.0.100
    Share name | Specifies the name of the share to use. A share is tied to a particular host, so two shares with the same name on different hosts do not necessarily contain the same content. | myshare
    File path | Specifies the path to the document, relative to the root share. If myshare on myhost.mycompany.com shares all the documents under the C:\myshare directory, the file C:\myshare\mydir\mydoc.txt is retrieved by the URL in the Example column. | smb://myhost.mycompany.com/myshare/mydir/mydoc.txt
    Forward slash | SMB URLs use forward slashes only. Some environments, such as Microsoft Windows systems, use backslashes ("\") to separate file path components. Even if you are referring to documents in such an environment, use forward slashes for this purpose. | Microsoft Windows style: C:\myshare\ becomes SMB URL: smb://myhost.mycompany.com/myshare/
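
The way an SMB URL breaks down into hostname, share name, and file path can be sketched in Python as follows. Both function names are hypothetical, and the sketch only mirrors the description in this section; it is not the file system crawler's own parser.

```python
def parse_smb_url(url):
    """Split an SMB URL into (hostname, share name, path within the share)."""
    prefix = "smb://"
    if not url.startswith(prefix):
        raise ValueError("not an SMB URL: " + url)
    segments = url[len(prefix):].split("/")
    host = segments[0]
    if not host:
        raise ValueError("SMB URLs that omit the hostname are not supported")
    share = segments[1] if len(segments) > 1 else ""
    path = "/".join(segments[2:])
    return host, share, path


def windows_path_to_smb_url(host, share, relative_path):
    """Build an SMB URL, replacing backslashes with forward slashes."""
    return "smb://{}/{}/{}".format(host, share, relative_path.replace("\\", "/"))


print(parse_smb_url("smb://fileserver.mycompany.com/myshare/mydir/mydoc.txt"))
# ('fileserver.mycompany.com', 'myshare', 'mydir/mydoc.txt')

print(windows_path_to_smb_url("myhost.mycompany.com", "myshare", "mydir\\mydoc.txt"))
# smb://myhost.mycompany.com/myshare/mydir/mydoc.txt
```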

    Unsupported SMB URLs

    Some SMB file share implementations allow:

    • URLs that omit the hostname
    • URLs with workgroup identifiers in place of hostnames

    The file system crawler does not support these URL schemes.

    SMB URLs for Non-file Objects

    SMB URLs can refer to objects other than files, including directories, shares, and hosts. The file system gateway, which interacts with the network file shares, treats these non-document objects like documents that do not have any content, but do have links to certain other objects. The following table describes the correspondence between objects that the URLs can refer to and what they actually link to.

    URL Refers To | URL Links To | Example
    Directory | Files and subdirectories contained within the directory | smb://fileserver.mycompany.com/myshare/mydir/
    Share | Files and subdirectories contained within the share's top-level directory | smb://fileserver.mycompany.com/myshare/

    Hostname Resolution

    Hostname resolution is the process of associating a symbolic hostname with a numeric address that is used for network routing. For example, the symbolic hostname www.google.com resolves to the numeric address 10.0.0.100.

    File system crawling supports Domain Name Services (DNS), the standard name resolution method used by the Internet; it may not cover an internal network. During setup, the search appliance requires that at least one DNS server be specified. When crawling a host, the search appliance performs a DNS request if 30 minutes have passed since the previous request.
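
The 30-minute rule described above behaves like a time-limited cache. The following Python sketch illustrates the idea with a resolver and clock that can be swapped out for testing; the class name CachingResolver and its parameters are hypothetical, and nothing here reflects how the search appliance is actually implemented.

```python
import socket
import time


class CachingResolver:
    """Re-resolves a hostname only after `max_age` seconds have passed."""

    def __init__(self, max_age=30 * 60, resolve=socket.gethostbyname,
                 clock=time.monotonic):
        self.max_age = max_age
        self.resolve = resolve
        self.clock = clock
        self.cache = {}  # hostname -> (address, time of lookup)

    def lookup(self, hostname):
        now = self.clock()
        cached = self.cache.get(hostname)
        if cached is not None and now - cached[1] < self.max_age:
            return cached[0]
        address = self.resolve(hostname)
        self.cache[hostname] = (address, now)
        return address


# Demonstration with a fake resolver and a fake clock (no network access).
calls = []
fake_time = [0.0]


def fake_resolve(hostname):
    calls.append(hostname)
    return "10.0.0.100"


resolver = CachingResolver(resolve=fake_resolve, clock=lambda: fake_time[0])
resolver.lookup("fileserver.mycompany.com")   # first request: resolves
fake_time[0] = 29 * 60
resolver.lookup("fileserver.mycompany.com")   # within 30 minutes: cached
fake_time[0] = 31 * 60
resolver.lookup("fileserver.mycompany.com")   # after 30 minutes: resolves again
print(len(calls))  # 2
```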

    Setting Up the Crawler's Access to Secure Content

    The information in this document describes crawling public content. For information about setting up the crawler's access to secure content, see Managing Search for Controlled-Access Content.

    Configuring Searchable Dates

    For dates to be properly indexed and searchable by date range, they must be in ISO 8601 format:

    YYYY-MM-DD

    The following example shows a date in ISO 8601 format:

    2007-07-11

    For a date in a META tag to be indexed, not only must it be in ISO 8601 format, it must also be the only value in the content. For example, the date in the following META tag can be indexed:

    <META name="date" content="2007-07-11">

    The date in the following meta tag cannot be indexed because there is additional content:

    <META name="date" content="2007-07-11 is a date">
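
A quick way to pre-check META tag content against the rule above is to require that the whole value is a valid YYYY-MM-DD date. The following Python sketch does this; the function name indexable_date is hypothetical, and it checks only the format described in this section.

```python
import re
from datetime import datetime


def indexable_date(content):
    """Return a date if `content` is exactly YYYY-MM-DD, else None."""
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", content):
        return None
    try:
        return datetime.strptime(content, "%Y-%m-%d").date()
    except ValueError:
        # Right shape, but not a real calendar date (for example, month 13).
        return None


print(indexable_date("2007-07-11"))            # 2007-07-11
print(indexable_date("2007-07-11 is a date"))  # None
```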

    Defining Document Date Rules

    Documents can have dates explicitly stated in these places:

    • URL
    • Title
    • Body of the document
    • META tags of the document
    • Last-modified date from the HTTP response

    To define a rule that the search appliance crawler should use to locate document dates in documents for a particular URL, use the Crawl and Index > Document Dates page in the Admin Console. If you define more than one document date rule for a URL, the search appliance applies the rules in the order in which you enter them.

    To configure document dates:

    1. Choose Crawl and Index > Document Dates. The Document Dates page appears.
    2. In the Host or URL Pattern box, enter the host or URL pattern for which you want to set the rule.
    3. Use the Locate Date In drop-down list to select the location of the date for the document in the specified URL pattern.
    4. If you select Meta Tag, specify the name of the tag in the Meta Tag Name box. Make sure that the META tag exists in your HTML. For example, for the tag <META name="publication_date">, enter "publication_date" in the Meta Tag Name box.
    5. To add another date rule, click Add More Lines, and add the rule.
    6. Click Save Changes. This triggers the Document Dates process to run.

    For complete information about the Document Dates page, click Help Center > Crawl and Index > Document Dates in the Admin Console.
