Google Search Appliance software version 6.0
Posted June 2009
A URL pattern is a set of ordered characters to which the Google Search Appliance or Google Mini matches actual URLs that the crawler discovers. You can specify URL patterns for which your index should include matching URLs and URL patterns for which your index should exclude matching URLs. This document explains how to construct a URL pattern.
A URL pattern is a set of ordered characters that is modeled after an actual URL. The URL pattern is used to match one or more specific URLs. An exception pattern starts with a hyphen (-).
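URL patterns specified in the Crawl URLs page control the URLs that the search appliance includes in the index. To configure the crawl, use the Crawl and Index > Crawl URLs page in the Admin Console to enter URLs and URL patterns in the following boxes: Start Crawling from the Following URLs, Follow and Crawl Only URLs with the Following Patterns, and Do Not Crawl URLs with the Following Patterns.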
The search appliance starts crawling from the URLs listed in the Start Crawling from the Following URLs text box. Each URL that the search appliance encounters is compared with URL patterns listed in the Follow and Crawl Only URLs with the Following Patterns and Do Not Crawl URLs with the Following Patterns text boxes.
A URL is included in the index when all of the following are true:
Alternatively, URLs can be excluded from an index through the use of a robots.txt file
or robots meta tags.
For complete information about the Crawl URLs page, in the Admin Console, click Help Center > Crawl and Index > Crawl URLs.
When specifying the URLs that should or should not be crawled on your site or when building URL-based collections, your URLs must conform to the valid patterns listed in the table that follows.
| Valid URL Patterns | Examples | Explanation |
|---|---|---|
| Any substring of a URL that includes the host/path separating slash | http://www.google.com/ | Any page on www.google.com using the HTTP protocol. |
| | www.google.com/ | Any page on www.google.com using any supported protocol. |
| | google.com/ | Any page in the google.com domain. |
| Any suffix of a string. You specify the suffix with the $ at the end of the string. | home.html$ | All pages ending with home.html. |
| | .pdf$ | All pages with the extension .pdf. |
| Any prefix of a string. You specify the prefix with the ^ at the beginning of the string. A prefix can be used in combination with the suffix for exact string matches. For example, ^candy cane$ matches the exact string for "candy cane." | ^http:// | Any page using the HTTP protocol. |
| | ^https:// | Any page using the HTTPS protocol. |
| | ^http://www.google.com/ | Only the specified page. |
| An arbitrary substring of a URL. These patterns are specified using the prefix "contains:". | contains:coffee | Any URL that contains "coffee." |
| | contains:beans.com | Any URL that contains "beans.com." |
| Exceptions denoted by - (minus) sign. | candy.com/ and -www.candy.com/ | Means that "www.chocolate.candy.com" is a match, but "www.candy.com" is not a match. |
| Regular expressions from the GNU Regular Expression library. In the search appliance, regular expressions: (1) Are case sensitive unless you specify regexpIgnoreCase: (2) Must use two escape characters (a double backslash "\\") when reserved characters are added to the regular expression. Note: regexp: and regexpCase: are equivalent. | regexp:-sid=[0-9A-Z]+/ | See the GNU Regular Expression library. |
| Comments | #this is a comment | Empty lines and comments starting with # are permissible. These comments are removed from the URL pattern and ignored. |
A line that starts with a # (pound) character is treated as a comment, as shown in the following example.
#This is a comment.
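The comment rule can be illustrated with a short Python sketch that cleans a block of pattern text before it is used. The function name `load_patterns` and the sample text are hypothetical, and trimming surrounding whitespace is an assumption; this models the behavior described above and is not the search appliance's own parser.

```python
def load_patterns(text):
    """Return the patterns in text, skipping empty lines and # comments."""
    patterns = []
    for line in text.splitlines():
        line = line.strip()  # assumption: surrounding whitespace is ignored
        # Skip blank lines and lines that start with the pound character.
        if not line or line.startswith("#"):
            continue
        patterns.append(line)
    return patterns


sample = """
#This is a comment.
www.example.com/

.pdf$
"""
print(load_patterns(sample))  # ['www.example.com/', '.pdf$']
```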
URL patterns are case sensitive. The following table uses www.example.com/ to illustrate an example that does not
match the URL pattern, and another example that does match the pattern.
| URL Pattern | www.example.com |
|---|---|
| Invalid URL | http://www.EXAMPLE.com/mypage.html |
| Matching URL | http://www.example.com/mypage.html |
The Google Search Appliance and Google Mini treat URLs as case-sensitive, because URLs that differ only by case can legitimately be different pages. To capture URLs with variable case, use a regular expression. For more information about regular expressions, see the Google Regular Expressions section of this document.
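The difference between a case-sensitive pattern and a case-insensitive regular expression can be sketched in Python. This is an approximation only: the helper `substring_match` is hypothetical, and Python's `re` module with `re.IGNORECASE` stands in for the regexpIgnoreCase: prefix, which the appliance evaluates with its own regular expression library.

```python
import re


def substring_match(pattern, url):
    # Plain patterns are compared character for character, so case matters.
    return pattern in url


urls = [
    "http://www.EXAMPLE.com/mypage.html",
    "http://www.example.com/mypage.html",
]

for url in urls:
    print(url, substring_match("www.example.com", url))

# A case-insensitive expression, standing in for regexpIgnoreCase:
ignore_case = re.compile(r"www\.example\.com", re.IGNORECASE)
for url in urls:
    print(url, bool(ignore_case.search(url)))
```

In this model only the lowercase URL passes the plain pattern, while both URLs pass the case-insensitive expression.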
The following notation is used throughout this document:
| Format | <site>/ |
|---|---|
| Example | www.example.com/ |
To match URLs from all sites in the same domain, specify the domain name. The
following example matches all sites in the domain example.com.
| Format | <domain>/ |
|---|---|
| Example | example.com/ |
| Matching URLs | www.example.com |
To describe URLs that are in a specific directory or in one of its sub-directories, specify the directory and any sub-directory in the pattern.
The following example matches all URLs in the products directory
and all sub-directories under products on the site sales.example.com.
| Format | <site>/<directory>/ |
|---|---|
| Example | sales.example.com/products/ |
| Matching URLs | sales.example.com/products/about.html |
The following example matches all URLs in the products directory
and all sub-directories under products on all sites in the example.com domain.
| Format | <domain>/<directory>/ |
|---|---|
| Example | example.com/products/ |
| Matching URLs | accounting.example.com/products/prices.htm |
The following example matches all URLs in an images directory or
sub-directory, on any site.
Note: If one of the pages on a site links to
another external site or domain, this example would also match the /images/ directories
of those external sites.
| Format | /<directory>/ |
|---|---|
| Example | /images/ |
| Matching URLs | www.example1423.com/images/myVacation/ |
To match a specific file, specify its name in the pattern and add the dollar ($) character to the end of the pattern. Each of the following examples will only match one page.
| Format | <site>/<directory>/<file>$ |
|---|---|
| Example | www.example.com/products/foo.html$ |
| Format | <domain>/<directory>/<file>$ |
|---|---|
| Example | example.com/products/foo.html$ |
| Format | /<directory>/<file>$ |
|---|---|
| Example | /products/foo.html$ |
| Format | /<file>$ |
|---|---|
| Example | /mypage.html$ |
Without the dollar ($) character at the end of the pattern, the URL pattern may match more than one page.
| Format | /<directory>/<file> |
|---|---|
| Example | /products/mypage.html |
| Matching URLs | /products/mypage.html |
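The effect of the trailing dollar can be sketched with plain string operations. This assumes a simplified model in which a pattern without anchors is a substring test and a pattern ending in $ is a suffix test. The helper `matches_simple` is hypothetical and is not part of the search appliance.

```python
def matches_simple(pattern, url):
    """Simplified model: trailing $ means suffix match, otherwise substring."""
    if pattern.endswith("$"):
        return url.endswith(pattern[:-1])
    return pattern in url


urls = [
    "http://www.example.com/products/mypage.html",
    "http://www.example.com/products/mypage.html;jsessionid=HDUENB2947WSSJ23",
]

for url in urls:
    without_dollar = matches_simple("/products/mypage.html", url)
    with_dollar = matches_simple("/products/mypage.html$", url)
    print(url, without_dollar, with_dollar)
```

Under this model the pattern without the dollar accepts both URLs, and the pattern with the dollar accepts only the first.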
To match URLs that are accessible by a specific protocol, specify the
protocol in the pattern. The following example matches HTTPS URLs that contain
the products directory.
| Format | <protocol>://<site>/<path>/ |
|---|---|
| Example | https://www.example.com/products/mydir/mydoc.txt/ |
To match URLs that are accessible by means of a specific port, specify the port number in the pattern. If you don't specify a port, the search appliance uses the default port, which is 80 for HTTP and 443 for HTTPS.
www.example.com:*/
www.example.com:8888/
www.example.com/
Note: If you explicitly include a port number, the pattern matches only URLs
that explicitly include the port number, even if you use the default port. For
example, a URL pattern that includes www.example.com:80/products/ does
not match www.example.com/products/.
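The behavior in the note follows from comparing the pattern with the URL as literal text. A minimal sketch, assuming substring semantics (the wildcard port form is not modeled, and `substring_match` is a hypothetical helper):

```python
def substring_match(pattern, url):
    # Literal comparison: the port must appear in the URL exactly as written.
    return pattern in url


pattern = "www.example.com:80/products/"

# The URL spells out the port, so the pattern is found.
print(substring_match(pattern, "http://www.example.com:80/products/"))  # True

# The URL relies on the default port, so ":80" never appears in it.
print(substring_match(pattern, "http://www.example.com/products/"))     # False
```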
To match the beginning of a URL, add the caret (^) character to the start of the pattern. Do not use a caret followed only by a protocol, because such a pattern could match most of the Internet.
| Format | ^<protocol>://<site>/<directory>/ |
|---|---|
| Example | ^http://www.example.com/products/ |
| Format | ^<protocol>://<site>/ |
|---|---|
| Example | ^http://www.example.com/ |
| Format | ^<protocol> |
|---|---|
| Example | ^https |
| Format | ^<protocol>://<partial_site> |
|---|---|
| Example | ^http://www.example |
| Matching URLs | http://www.example.com/ |
To match the end of a URL, add the dollar ($) character to the end of the pattern.
The following example
matches http://www.example.com/mypage.jhtml, but not http://www.example.com/mypage.jhtml;jsessionid=HDUENB2947WSSJ23.
| Format | <protocol>://<site>/<directory>/<file>$ |
|---|---|
| Example | http://www.example.com/mypage.jhtml$ |
| Format | <site>/<directory>/<file>$ |
|---|---|
| Example | www.example.com/products/mypage.html$ |
| Format | <domain>/<directory>/<file>$ |
|---|---|
| Example | example.com/products/mypage.html$ |
| Format | /<directory>/<file>$ |
|---|---|
| Example | /products/mypage.html$ |
The following example matches mypage.htm, but does not match mypage.html.
| Format | <file>$ |
|---|---|
| Example | mypage.htm$ |
The following example is useful for specifying all files of a certain
type, including .html, .doc, .ppt, and .gif.
| Format | <partial_file_name>$ |
|---|---|
| Example | .doc$ |
To exactly match a single URL, use both caret
(^) and dollar ($). The following example matches only the URL: http://www.example.com/mypage.jhtml
| Format | ^<exact url>$ |
|---|---|
| Example | ^http://www.example.com/mypage.jhtml$ |
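The prefix, suffix, and exact matching rules can be combined in one small Python sketch. The function `matches` is a hypothetical model built from the descriptions in this document, assuming that an unanchored pattern is a substring test; it does not handle contains:, regexp:, or exception patterns.

```python
def matches(pattern, url):
    """Model ^ (prefix), $ (suffix), both (exact), or neither (substring)."""
    anchored_start = pattern.startswith("^")
    anchored_end = pattern.endswith("$")
    body = pattern
    if anchored_start:
        body = body[1:]
    if anchored_end:
        body = body[:-1]

    if anchored_start and anchored_end:
        return url == body
    if anchored_start:
        return url.startswith(body)
    if anchored_end:
        return url.endswith(body)
    return body in url


exact = "^http://www.example.com/mypage.jhtml$"
print(matches(exact, "http://www.example.com/mypage.jhtml"))
print(matches(exact, "http://www.example.com/mypage.jhtml;jsessionid=HDUENB2947WSSJ23"))
print(matches("^http://www.example", "http://www.example.com/"))
print(matches("mypage.htm$", "http://www.example.com/mypage.html"))
```

In this model the exact pattern accepts only the first URL, the prefix pattern accepts http://www.example.com/, and mypage.htm$ rejects a URL that ends in mypage.html.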
To match URLs with a specified string use the contains: prefix.
The following example matches any URL containing the string "product."
| Format | contains:<string> |
|---|---|
| Example | contains:product |
| Matching URLs | http://www.example.com/products/mypage.html |
To match SMB (Server Message Block) URLs, the pattern must have a fully-qualified domain name and begin with the smb: protocol.
SMB URLs refer to objects that are available
on SMB-based file systems, including files, directories, shares, and hosts. SMB
URLs use only forward slashes. Some environments, such as Microsoft Windows,
use backslashes ("\") to separate file path components. However, for
these URL patterns, you must use forward slashes.
The following example shows the correct structure of an SMB URL.
| Format | smb://<fully-qualified-domain-name>/<share>/<directory>/<file> |
|---|---|
| Example | smb://fileserver.domain/myshare/mydir/mydoc.txt |
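File paths copied from Microsoft Windows usually use backslashes, so they need converting before they can be used as SMB URLs. The sketch below shows one way to do that in Python. The function name `unc_to_smb_url` is hypothetical, and the input is assumed to be a UNC path that already uses a fully-qualified host name.

```python
def unc_to_smb_url(unc_path):
    """Convert a Windows UNC path to smb:// form with forward slashes."""
    if not unc_path.startswith("\\\\"):
        raise ValueError("expected a UNC path starting with two backslashes")
    # Drop the two leading backslashes, then switch separators.
    return "smb://" + unc_path[2:].replace("\\", "/")


print(unc_to_smb_url(r"\\fileserver.domain\myshare\mydir\mydoc.txt"))
# smb://fileserver.domain/myshare/mydir/mydoc.txt
```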
The following SMB URL patterns are not supported:
smb://
smb://myshare/mydir/
smb://workgroupID/myshare/
The exception patterns below cannot be used with any version of the Google Connector for Microsoft SharePoint.
To specify exception
patterns, prefix the expression with a hyphen (-).
The following example includes sites in the example.com domain,
but excludes secret.example.com.
| Format | -<expression> |
|---|---|
| Example | example.com/ |
| | -secret.example.com/ |
The following example excludes any URL that contains content_type=calendar.
| Example | -contains:content_type=calendar |
|---|---|
You can override the exception interpretation of the hyphen (-) character by preceding the hyphen (-) with a plus (+).
| Example | +-products.xls$ |
|---|---|
| Matching URLs | http://www.example.com/products/new-products.xls |
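The interaction between ordinary patterns, exception patterns, and the plus override can be sketched as follows. The evaluation rule used here is an assumption made for illustration: a URL is accepted when at least one ordinary pattern matches and no exception pattern matches. The functions `parse`, `body_matches`, and `accepted` are hypothetical, and matching is reduced to contains:, a trailing $, or a plain substring.

```python
def parse(pattern):
    """Split a pattern into (is_exception, body)."""
    if pattern.startswith("+"):
        # Assumed: a leading plus keeps what follows literal, hyphen included.
        return False, pattern[1:]
    if pattern.startswith("-"):
        return True, pattern[1:]
    return False, pattern


def body_matches(body, url):
    # Simplified matching: contains: prefix, trailing $, or plain substring.
    if body.startswith("contains:"):
        return body[len("contains:"):] in url
    if body.endswith("$"):
        return url.endswith(body[:-1])
    return body in url


def accepted(patterns, url):
    """Assumed rule: some ordinary pattern matches and no exception matches."""
    included = False
    excluded = False
    for pattern in patterns:
        is_exception, body = parse(pattern)
        if body_matches(body, url):
            if is_exception:
                excluded = True
            else:
                included = True
    return included and not excluded


patterns = ["example.com/", "-secret.example.com/"]
print(accepted(patterns, "http://www.example.com/index.html"))     # True
print(accepted(patterns, "http://secret.example.com/index.html"))  # False
print(accepted(["+-products.xls$"],
               "http://www.example.com/products/new-products.xls"))  # True
```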
A Google regular expression describes a complex set of URLs. For more information on GNU regular expressions, search Google for "gnu regular expression tutorial". Google regular expressions are similar to GNU regular expressions, with the following differences:
The regexpIgnoreCase:, regexpCase:, and regexp: prefixes can be used to specify case sensitivity.
Metacharacters are special characters or special character combinations that are used in a regular expression to match a specific portion of a pattern. Metacharacters are not used as literals in regular expressions. The following list describes available metacharacters and metacharacter combinations:
^ . [ $ ( ) | * + ? { \
The following example
matches any URL that references an images directory on www.example.com using
the HTTP protocol.
| Example | regexp:http://www\\.example\\.com.*/images/ |
|---|---|
| Matching URLs | http://www.example.com/images/logo.gif |
The following example
matches any URL in which the server name starts with auth and
the URL contains .com.
| Example | regexpCase:http://auth.*\\.com/ |
|---|---|
| Matching URLs | http://auth.www.example.com/mypage.html |
This example does not match http://AUTH.engineering.example.com/mypage.html because
the expression is case sensitive.
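To experiment with patterns like these outside the search appliance, they can be translated for Python's `re` module. This is only an approximation, because `re` is not the GNU regular expression library, although simple expressions such as the ones in this document behave the same way in both. The function `compile_pattern` is hypothetical. It collapses each double backslash into the single backslash that Python expects.

```python
import re

# Prefixes described in this document, mapped to Python regex flags.
PREFIXES = {
    "regexp:": 0,
    "regexpCase:": 0,
    "regexpIgnoreCase:": re.IGNORECASE,
}


def compile_pattern(pattern):
    """Compile a regexp-style URL pattern into a Python regular expression."""
    for prefix, flags in PREFIXES.items():
        if pattern.startswith(prefix):
            body = pattern[len(prefix):]
            # Collapse the doubled escape character into a single backslash.
            return re.compile(body.replace("\\\\", "\\"), flags)
    raise ValueError("not a regular expression pattern: " + pattern)


case_sensitive = compile_pattern(r"regexpCase:http://auth.*\\.com/")
case_insensitive = compile_pattern(r"regexpIgnoreCase:http://auth.*\\.com/")

lower = "http://auth.www.example.com/mypage.html"
upper = "http://AUTH.engineering.example.com/mypage.html"

print(bool(case_sensitive.search(lower)))    # True
print(bool(case_sensitive.search(upper)))    # False
print(bool(case_insensitive.search(upper)))  # True
```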
The following pattern
matches JHTML pages from site www.example.com. These pages
have the jsessionid, type=content, and id parameters.
| Example | regexp:^http://www\\.example\\.com/page\\.jhtml;jsessionid= |
|---|---|
| Matching URLs | http://www.example.com/page.jhtml;jsessionid= |
Note: Do not begin or end a
URL pattern with period+asterisk (.*) if you are using the regexp: prefix,
as this pattern is ineffective and may cause performance problems.
Note: Invalid regular expression patterns entered on the Crawl and Index > Crawl URLs page in the Admin Console can cause search appliance crawling to fail.
For proxy servers, regular expressions are also case sensitive, but must use a single escape character (backslash "\") when reserved characters are added to the regular expression.
Google recommends crawling to the maximum depth, allowing the Google algorithm to present the user with the best search results. You can use URL patterns to control how many levels of subdirectories are included in the index.
For example, the following URL patterns cause the search appliance to crawl the top three subdirectories on the site www.mysite.com:
regexp:www\\.mysite\\.com/[^/]*$
regexp:www\\.mysite\\.com/[^/]*/[^/]*$
regexp:www\\.mysite\\.com/[^/]*/[^/]*/[^/]*$
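The depth patterns above can be checked against sample URLs with Python's `re` module. This is an approximation, because `re` is not the GNU regular expression library. The expressions are written with single backslashes, as Python expects. The helper `within_depth` and the sample URLs are hypothetical.

```python
import re

# The three patterns from the text, without the regexp: prefix and with
# single backslashes for Python.
depth_patterns = [
    r"www\.mysite\.com/[^/]*$",
    r"www\.mysite\.com/[^/]*/[^/]*$",
    r"www\.mysite\.com/[^/]*/[^/]*/[^/]*$",
]
compiled = [re.compile(p) for p in depth_patterns]


def within_depth(url):
    """Return True if any of the depth patterns matches the URL."""
    return any(rx.search(url) for rx in compiled)


print(within_depth("http://www.mysite.com/index.html"))       # True
print(within_depth("http://www.mysite.com/a/b/page.html"))    # True
print(within_depth("http://www.mysite.com/a/b/c/page.html"))  # False
```

Each pattern permits one more slash after the host name than the previous one, so a URL with more than two further slashes after the one that follows the host name is matched by none of them.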