Help Center

Crawl and Index > Crawler Access

On the Crawl and Index > Crawler Access page, you configure how the crawler accesses content servers that require authentication before granting access to confidential content.

Crawl and Serve Secure Content

You can index and serve results for content that is protected by an authentication mechanism (HTTP Basic Authentication or NTLM), whether the content resides on a protected web server or a protected file share. You must create an authentication rule that tells the crawler how to authenticate when crawling the protected content. An authentication rule consists of a URL pattern matching the protected files, a username, a domain (if using NTLM), and a password. Using the Make Public checkbox, you can allow users to get results on both the public content (normally available to everyone) and the secure (confidential) content.

To set options for crawling secure content:

Have ready the URLs (or matching patterns), the domain used by the web server, and the user names and passwords.

  1. Click Crawl and Index, and then click the Crawler Access link.
  2. Under Users and Passwords for Crawling, enter the URLs Matching Pattern, the username, the domain (if using NTLM), and the password and confirmation in the text boxes.
  3. If you need more rows for additional patterns, click the Add More Rows button.
  4. Click the Save Crawler Access Configuration button.

Example:

This example shows how to configure the crawler to authenticate to various servers:

  • as a member of a Windows domain via NTLM (the first three entries, which specify a domain), or
  • using a simple username/password combination for Basic Authentication (the last two entries, which leave the domain blank).
For URLs Matching Pattern, Use:        Username:   In Domain:   Password:   Confirm Password:
https://www.mycompany.com/secure/      crawler     mycompany    ******      ******
https://www.mycompany.com/robots.txt   crawler     mycompany    ******      ******
smb://fileshare.mycompany.com/         crawler     mycompany    ******      ******
https://designdocs.mycompany.com/      JohnDoe     (blank)      ******      ******
smb://engserver.mycompany.com/         JohnDoe     (blank)      ******      ******

The usernames listed in the table above are created by an authentication administrator, not by the search appliance.

Important: The entries you make in the Users and Passwords for Crawling section are sequential rules. Always enter more specific rules before general rules. For example, first enter

    http://corp.mycompany.com/secure/

followed by

    http://corp.mycompany.com/
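
The effect of rule order can be illustrated with a minimal Python sketch. It assumes that "sequential" means the first matching rule wins and that a pattern matches any URL it is a prefix of; the appliance's actual pattern syntax and matching logic may differ. The `AuthRule` type, the `first_matching_rule` function, and the usernames are hypothetical names used only for illustration.

```python
from typing import NamedTuple, Optional, Sequence


class AuthRule(NamedTuple):
    pattern: str   # URL prefix the rule applies to (assumed prefix matching)
    username: str  # hypothetical crawler account name


def first_matching_rule(url: str, rules: Sequence[AuthRule]) -> Optional[AuthRule]:
    """Return the first rule in list order whose pattern prefixes the URL."""
    for rule in rules:
        if url.startswith(rule.pattern):
            return rule
    return None


# Specific rule first, general rule second, as recommended above.
good_order = [
    AuthRule("http://corp.mycompany.com/secure/", "secure-crawler"),
    AuthRule("http://corp.mycompany.com/", "crawler"),
]

# Reversed order: the general rule shadows the specific one.
bad_order = list(reversed(good_order))

url = "http://corp.mycompany.com/secure/plan.html"
chosen_good = first_matching_rule(url, good_order)
chosen_bad = first_matching_rule(url, bad_order)
```

With the specific rule listed first, the URL under /secure/ picks up the credentials intended for it. With the order reversed, the general rule matches first and the specific rule is never reached.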

If you enter incorrect access information or credentials here, retrieval or exclusion errors appear on the Status and Reports > Crawl Diagnostics page of the Admin Console.

Access to 'robots.txt' File

If a web server is configured to require authentication for all HTTP or HTTPS requests, be sure to create an authentication rule with a pattern that matches the '/robots.txt' file.

To obey the Robots Exclusion Protocol, the crawler attempts to retrieve /robots.txt before crawling a site. If the attempt results in an HTTP 401 (authentication required) response code, the crawler cannot crawl any other URLs on the site. If the attempt results in an HTTP 200 (success) or HTTP 404 (not found) response code, the crawl can proceed to the content of that HTTP or HTTPS site.

Therefore, if a site requires authentication for all requests and no authentication rule matches /robots.txt, the crawler receives an HTTP 401 response code and cannot crawl any other URLs on the site.

In the above example, we've created a second authentication rule matching /robots.txt for the www.mycompany.com web server, since the first rule matches only URLs in the /secure/ directory.
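
The /robots.txt behavior described in this section can be summarized as a small decision function. This is only a sketch of the three response codes named in the text; the function name `robots_gate` and the returned labels are hypothetical, and the handling of any other response code is not described here, so the sketch reports it as unspecified rather than guessing.

```python
def robots_gate(status_code: int) -> str:
    """Map the /robots.txt response code to the crawl outcome described above."""
    if status_code == 401:
        # Authentication required: no other URLs on the site can be crawled.
        return "blocked"
    if status_code in (200, 404):
        # Success or not found: the crawl proceeds to the site's content.
        return "proceed"
    # Other codes are not covered by this page (assumption: left undefined).
    return "unspecified"


outcomes = {code: robots_gate(code) for code in (200, 404, 401)}
```

A "blocked" outcome for a site that requires authentication on every request usually means that no authentication rule matches its /robots.txt URL.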

File System Crawling and Serving

Documents located on SMB (Server Message Block) file shares are indexed and served in public search results. When a search result contains a document from a file share, the document will be served through the search appliance and be available to all search users.

If access to the file share requires authentication, be sure to include the file share's URL pattern in the crawler access configuration. You must convert URL pattern paths to the supported format. Windows uses the UNC (Universal Naming Convention) format to express SMB file share locations, but this syntax is not supported by the search appliance. For example, this UNC location cannot be entered as a Start URL:

     \\file-server\folder\file.txt

To convert a UNC path to a URI suitable for use as an authentication pattern, prepend "smb:" to the UNC path, then convert all backslashes to forward slashes. For example, the UNC path listed above becomes:

     smb://file-server/folder/file.txt
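
If you have many UNC paths to convert, the two steps above are easy to script. The following Python sketch applies exactly those steps (prepend "smb:", then replace backslashes with forward slashes); the function name `unc_to_smb` is hypothetical, and the check for a leading double backslash is an added assumption about what counts as a UNC path.

```python
def unc_to_smb(unc_path: str) -> str:
    """Convert a UNC path such as \\\\server\\share to an smb:// URI."""
    if not unc_path.startswith("\\\\"):
        # Assumption: a UNC path always begins with two backslashes.
        raise ValueError("not a UNC path: " + unc_path)
    # Prepend "smb:" and convert every backslash to a forward slash.
    return "smb:" + unc_path.replace("\\", "/")


converted = unc_to_smb(r"\\file-server\folder\file.txt")
```

The sketch does not escape spaces or other special characters in file names; whether the appliance requires such escaping is not covered here.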

By default, if no authentication is specified on the Crawler Access page, the appliance crawls the file shares that are listed in the Start and Follow URLs as a "Guest" user. To successfully crawl the files as a Guest, the SMB server must be configured to allow access to "Guest" and/or "Everyone." Set up these account names in English, even if you normally use another language.

Note: If your environment uses a WINS server to look up hostnames, you must also configure the crawler to use this WINS server on the Administration > Network Settings page.

About Secure Search Results

The search appliance can serve results over both plaintext HTTP and encrypted HTTPS.

When secure content results are displayed, the total number of results and the number of pages returned are hidden to avoid exposing information about secure documents to users who do not have access.

Although crawling does not overload secure servers, each search request adds some load to the servers that contain secure content.

In the Search Box section of Page Layout, you can add option buttons to your search page that let your users decide to search on public content or on the complete index (both public and secure content) at the time of their search.

A query against public and secure content requires that the user be authenticated by entering the username and password for the secure area. If your servers require a domain name for authentication, users should enter it like this: domain/username. If a user enters an incorrect username or password, no secure results will be included in the search results.


 

 
© Google Inc. 2007