Crawl and Index > Crawl URLs

Before you begin crawling your web content, you must specify one or more starting locations. You can control and refine the breadth of the crawl by specifying URL patterns to follow and others to avoid. For a given URL to be crawled, it must match at least one URL pattern in the Follow and Crawl Only URLs with the Following Patterns box and none of the URL patterns in the Do Not Crawl URLs with the Following Patterns box.

Note: If a URL is matched by patterns from both Follow and Crawl Only URLs with the Following Patterns and Do Not Crawl URLs with the Following Patterns, the URL will not be crawled.
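The include/exclude rule above can be sketched in a few lines of Python. This is only an illustration, assuming that a pattern matches when it appears as a substring of the URL; the appliance's real pattern syntax is richer, and the function name `should_crawl` and the sample patterns are hypothetical.

```python
def should_crawl(url, follow_patterns, do_not_crawl_patterns):
    """Return True if url matches at least one follow pattern and no exclude pattern.

    Assumption: a pattern "matches" when it is a substring of the URL.
    """
    matches_follow = any(p in url for p in follow_patterns)
    matches_exclude = any(p in url for p in do_not_crawl_patterns)
    # A URL matched by both lists is not crawled: the exclude list wins.
    return matches_follow and not matches_exclude


follow = ["www.example.com/"]
exclude = ["www.example.com/private/"]

print(should_crawl("http://www.example.com/help/", follow, exclude))      # True
print(should_crawl("http://www.example.com/private/a", follow, exclude))  # False
print(should_crawl("http://other.example.org/", follow, exclude))         # False
```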

URLs are case sensitive. If you want case-insensitive URLs, use the operator regexpIgnoreCase.

The crawler can access content over HTTP, HTTPS, and SMB protocols. More information about file system crawling using SMB appears below.

The following options let you control and refine your crawls.

Start Crawling from the Following URLs

Starting URLs, entered one per line, control where the crawl begins. All content that you wish to include in all of the collections should be reachable by following links from one or more documents listed in the starting URLs.

These URLs are only the starting point(s) for the crawl. They tell the crawler where to begin crawling. However, links from the start URLs will be followed and indexed only if they match a pattern in Follow and Crawl Only URLs with the Following Patterns. For example, if you specify a starting URL of http://mycompany.com/ in this section and a pattern www.mycompany.com/ in the Follow and Crawl Only URLs with the Following Patterns section, the crawler will discover links in the http://www.mycompany.com/ web page, but will only crawl and index URLs that match the pattern www.mycompany.com/.

All entries in this window must be fully qualified URLs, using the following format:

<protocol>://<host>[:port]/[path]

In this format, the protocol can be HTTP, HTTPS (for secure content), or SMB (for fileshares).
The information contained in square brackets [ ] is optional. The forward slash "/" after <host>[:port] is required.

Valid examples:
https://www.example.com/secure/
http://www.example.com/help/
smb://fileshare.mycompany.com/your-sharename/

Invalid examples:

http://www/
Reason: Invalid because the hostname is not fully qualified. A fully qualified hostname includes the local hostname and the full domain name. For example: mail.corp.company.com.

www.example.com/
Reason: Invalid because the protocol information is missing.

http://www.example.com
Reason: Invalid because the "/" after <host>[:port] is required.
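As a rough pre-check before pasting start URLs into the window, the format rules above could be approximated as follows. This is a sketch, not the appliance's own validation: the function name `check_start_url` is hypothetical, and treating a hostname without a dot as "not fully qualified" is a simplifying assumption that ignores any DNS suffix resolution the appliance may perform.

```python
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https", "smb"}


def check_start_url(url):
    """Return None if url fits <protocol>://<host>[:port]/[path], else a reason string."""
    parts = urlsplit(url)
    if parts.scheme.lower() not in ALLOWED_SCHEMES:
        return "protocol missing or unsupported"
    host = parts.hostname or ""
    # Assumption: a fully qualified hostname contains at least one dot.
    if "." not in host:
        return "hostname is not fully qualified"
    if not parts.path.startswith("/"):
        return 'missing "/" after <host>[:port]'
    return None


for candidate in [
    "https://www.example.com/secure/",
    "smb://fileshare.mycompany.com/your-sharename/",
    "http://www/",
    "www.example.com/",
    "http://www.example.com",
]:
    print(candidate, "->", check_start_url(candidate) or "ok")
```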

To enter a new URL, type a valid entry into the window. Press Enter to add additional URLs, one per line.

Note: This window must contain at least one start URL. The search appliance will attempt to resolve incomplete path information entered, using the information entered on the Administration > Network Settings page in the DNS Suffix (DNS Search Path) section. However, if it cannot be successfully resolved, the following error message displays in red on the page:
You have entered one or more invalid start URLs. Please check your edits.

The crawler will retry several times to crawl URLs that are temporarily unreachable.

File System Crawling
SMB (Server Message Block) file shares, sometimes known as "Windows File Shares," are a commonly used network file system. To crawl documents stored in an SMB file share, enter a URI with the smb: protocol in the following format:

    smb://file-server/your-sharename/folder/

Do not start the crawl at the top-level SMB URL. For example, this is an invalid URL:

    smb://

These files will be indexed and served in public search results. Results links for documents located on SMB fileshares will be served through the search appliance and be available to all search users.

Note: If your environment uses a WINS server to look up hostnames, you must also configure the crawler to use this WINS server in the Administration > Network Settings page.

Follow and Crawl Only URLs with the Following Patterns

Every entry in the Start Crawling from the Following URLs box requires a corresponding entry in the Follow and Crawl Only URLs with the Following Patterns box; otherwise an error message is displayed.

Only URLs matching the patterns you specify (one per line) in this window will be followed and crawled. This allows you to control which files will be crawled on your server.

Example:

https://www.example.com/secure/
http://www.example.com:80/help/
smb://fileshare.mycompany.com/my-sharename/

These entries limit the crawl to URLs containing the above strings. For instance, all of the following would be crawled (presuming they are not matched by a pattern in Do Not Crawl URLs with the Following Patterns):

https://www.example.com/secure/file.txt
http://www.example.com:80/help/projectA
smb://fileshare.mycompany.com/my-sharename/folder1

The URLs that are discovered are checked against these patterns for inclusion in the index. Only URLs that match these patterns are crawled and indexed. In order for a URL to be crawled and indexed, there must be a sequence of links matching the Follow patterns from one of the Starting URLs. If there is no valid link path, you should add the URL to the Start Crawling from the Following URLs section.
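The link-path requirement can be illustrated with a small breadth-first walk. This is a sketch under stated assumptions: `links` is a hypothetical dictionary standing in for the links found on each page, and patterns are treated as plain substrings.

```python
from collections import deque


def reachable_urls(start_urls, links, follow_patterns):
    """Walk outward from start_urls, following only URLs that match a follow pattern."""
    def matches(url):
        return any(p in url for p in follow_patterns)

    seen = {u for u in start_urls if matches(u)}
    queue = deque(seen)
    while queue:
        url = queue.popleft()
        for target in links.get(url, []):
            # Links on pages that were never crawled are never discovered.
            if target not in seen and matches(target):
                seen.add(target)
                queue.append(target)
    return seen


# Hypothetical link graph: the only link to /orphan is on a page outside the patterns.
links = {
    "http://www.example.com/": ["http://www.example.com/a", "http://other.example.org/x"],
    "http://other.example.org/x": ["http://www.example.com/orphan"],
}
crawled = reachable_urls(["http://www.example.com/"], links, ["www.example.com/"])
print(sorted(crawled))
# ['http://www.example.com/', 'http://www.example.com/a']
```

In this example http://www.example.com/orphan matches the pattern but is never reached, which is the case where the URL would need to be added as a start URL.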

The URL patterns you list in this window must conform to the rules for valid URL patterns. To enter a URL pattern, type a valid pattern into the window. Press Enter to add additional patterns. Empty lines and comments (starting with #) are permitted.

URLs on the Crawl URLs page are case sensitive. If you want case-insensitive URL pattern matching, use the operator regexpIgnoreCase. For example, suppose you enter the following pattern:

regexpIgnoreCase:http://www.mycompany.com/documents/

That pattern would also match the following URLs:

http://www.mycompany.com/Documents/
http://www.mycompany.com/DOCUMENTS/
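The effect of the operator can be approximated with Python's re module. This is only a sketch: the helper `matches_pattern` is hypothetical, plain patterns are assumed to match as case-sensitive substrings, and Python's regular expression dialect may differ from the one the appliance uses.

```python
import re

PREFIX = "regexpIgnoreCase:"


def matches_pattern(pattern, url):
    """Match url against a plain (case-sensitive) or regexpIgnoreCase pattern."""
    if pattern.startswith(PREFIX):
        expression = pattern[len(PREFIX):]
        return re.search(expression, url, flags=re.IGNORECASE) is not None
    # Assumption: a plain pattern matches as a case-sensitive substring.
    return pattern in url


url = "http://www.mycompany.com/DOCUMENTS/"
print(matches_pattern("http://www.mycompany.com/documents/", url))                   # False
print(matches_pattern("regexpIgnoreCase:http://www.mycompany.com/documents/", url))  # True
```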

Test These Patterns

To test which URLs will be matched by one of the patterns you have entered in this field, click either of the Test these patterns links to open the Pattern Tester Utility. The utility lets you specify a list of URLs on the left and a set of patterns on the right. It tells you whether each URL is matched by one of the patterns in the set.

When it opens, the Pattern Tester Utility is initialized with your saved entries from the Crawl and Index > Crawl URLs page. You can enter more URLs and patterns into the tester utility to best analyze your pattern sets. However, your modifications will not be saved; you have to explicitly enter and save them in the Crawl and Index > Crawl URLs page.

After you click the Test These Patterns button, the results appear on the same page. The green background indicates that at least one of the patterns does match the URLs you want to crawl. It also shows the first pattern that matched. The red background shows that none of the patterns matched this URL.

Click the Back to Crawl and Index > Crawl URLs link to return to the Crawl and Index > Crawl URLs page.

Do Not Crawl URLs with the Following Patterns

The plain text in a document is extracted and indexed. Graphics, diagrams, and formatting information are not indexed. You can exclude a particular file format from being crawled and indexed by defining URL patterns that match it. URLs matching the patterns you specify (one per line) in this window will not be crawled.

This option allows you to prevent specific file types, directories, or other sets of pages from being crawled. For example, entering the pattern contains:? in this box will prevent many Common Gateway Interface (CGI) scripts from being crawled.

The URL patterns you list here must conform to the rules for valid URL patterns. To enter a URL pattern, type a valid pattern into the window. Press Enter to add additional patterns on new lines. Empty lines and comments (starting with #) are permitted.

For your convenience, this box is prepopulated with many URL patterns and file types, some of which you may not want the crawler to index. We do not recommend deleting any of the default patterns unless you detect parts of your site that are currently being excluded by these rules.

To make a pattern or file type unavailable to the crawler, remove the # mark in the line containing the file type. For example, to make Excel files on your servers unavailable to the crawler, change the line

#.xls$
to
.xls$
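To see which patterns in the box are actually in effect, the comment rule can be modeled like this. It is a minimal sketch assuming only what the text states, namely that empty lines and lines starting with # are ignored; the function name `active_patterns` is hypothetical.

```python
def active_patterns(box_text):
    """Return the patterns in effect, skipping empty lines and # comments."""
    patterns = []
    for line in box_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        patterns.append(line)
    return patterns


before = """
contains:?
#.xls$
"""
after = before.replace("#.xls$", ".xls$")

print(active_patterns(before))  # ['contains:?']
print(active_patterns(after))   # ['contains:?', '.xls$']
```

Removing the # turns the line into an active exclusion, so matching URLs stop being crawled.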

Test These Patterns

To test the patterns you have entered, click one of the Test these patterns links. When it opens, the Pattern Tester Utility is initialized with your saved entries from the Crawl and Index > Crawl URLs page. You can enter more URLs and patterns into the tester utility to best analyze your pattern sets. However, your modifications will not be saved; you have to explicitly enter and save them in the Crawl and Index > Crawl URLs page.

After you click the Test These Patterns button, the results appear on the same page. The green background indicates that at least one of the patterns does match the URLs you want to crawl. It also shows the first pattern that matched. The red background shows that none of the patterns matched this URL.

Click the Back to Crawl and Index > Crawl URLs link to return to the Crawl and Index > Crawl URLs page.

Note: If the search appliance should never crawl outside of your intranet site, then we recommend that you do one or more of the following:

  • Configure your network to disallow search appliance connectivity outside of your intranet.

    If you want to make sure that the search appliance never crawls outside of your intranet, then a person in your IT/IS group needs to specifically block the search appliance IP addresses from leaving your intranet. The GB-5005 and GB-8008 use three IP addresses, and these IP addresses are in your DNS entries as: googleswitch, googleweb, and googlecrawl. The GB-1001 uses only googleweb. Your IT/IS group needs to configure either an Access Control List (ACL) on your external routers or a set of rules on your firewall to disallow any communication between these IP addresses and the outside world.

  • Make sure all patterns in the field Follow and Crawl Only URLs with the Following Patterns specify yourcompany.com as the domain name.

 
© Google Inc. 2007