Appendix A: Rules for Valid URL Patterns

When you specify the URLs that should or should not be crawled on your site, or when you build URL-based collections, your URL patterns must conform to the valid patterns listed below. For more information, see the Constructing URL Patterns documentation on the Google Support site.

Each entry below describes a valid URL pattern type, followed by examples and an explanation of each example.

Any substring of a URL that includes the host/path separating slash

http://www.google.com/

Any page on www.google.com using the HTTP protocol.

www.google.com/

Any page on www.google.com using any supported protocol.

google.com/

Any page in the google.com domain.

Any suffix of a string. You specify the suffix with the $ at the end of the string.

home.html$

All pages ending with home.html.

.pdf$

All pages with the extension .pdf.
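
The suffix rule can be illustrated with a minimal Python sketch. It is not the appliance's implementation: the function name `matches_suffix` and the sample URLs are assumptions made for illustration, and the comparison is assumed to be a plain, case-sensitive string test.

```python
def matches_suffix(pattern: str, url: str) -> bool:
    """Return True if `url` ends with the text that precedes the trailing '$'."""
    if not pattern.endswith("$"):
        raise ValueError("a suffix pattern must end with '$'")
    # Drop the '$' marker and compare against the end of the URL.
    return url.endswith(pattern[:-1])


print(matches_suffix("home.html$", "http://www.google.com/home.html"))   # True
print(matches_suffix(".pdf$", "http://www.google.com/docs/guide.pdf"))   # True
print(matches_suffix(".pdf$", "http://www.google.com/guide.pdf?x=1"))    # False
```

The last call shows that a query string after the extension prevents a suffix match under this reading of the rule.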

Any prefix of a string. You specify the prefix with the ^ at the beginning of the string. A prefix can be combined with a suffix for exact string matches. For example, ^candy cane$ matches only the exact string "candy cane".

^http://

Any page using the HTTP protocol.

^https://

Any page using the HTTPS protocol.

^http://www.google.com/page.html$

Only the specified page.
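
As a rough sketch of how prefix and exact-match patterns behave, the following Python function interprets a leading ^ and an optional trailing $. The name `matches_prefix` and the test URLs are hypothetical, and the appliance may apply additional normalization that is not described here.

```python
def matches_prefix(pattern: str, url: str) -> bool:
    """Interpret '^text' as a prefix test and '^text$' as an exact match."""
    if not pattern.startswith("^"):
        raise ValueError("a prefix pattern must start with '^'")
    body = pattern[1:]
    if body.endswith("$"):
        # Both anchors present: the whole URL must equal the enclosed text.
        return url == body[:-1]
    return url.startswith(body)


print(matches_prefix("^http://", "http://www.google.com/"))    # True
print(matches_prefix("^https://", "http://www.google.com/"))   # False
print(matches_prefix("^http://www.google.com/page.html$",
                     "http://www.google.com/page.html"))       # True
print(matches_prefix("^http://www.google.com/page.html$",
                     "http://www.google.com/page.html?x=1"))   # False
```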

An arbitrary substring of a URL. These patterns are specified using the prefix "contains:".

contains:coffee

Any URL that contains "coffee".

contains:beans

Any URL that contains "beans".
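
A "contains:" pattern amounts to a substring test on the URL. The sketch below assumes a case-sensitive comparison, which this appendix does not state explicitly; `matches_contains` and the sample URL are illustrative names only.

```python
CONTAINS_PREFIX = "contains:"


def matches_contains(pattern: str, url: str) -> bool:
    """Return True if the text after 'contains:' appears anywhere in `url`."""
    if not pattern.startswith(CONTAINS_PREFIX):
        raise ValueError("pattern must start with 'contains:'")
    needle = pattern[len(CONTAINS_PREFIX):]
    # Assumption: a plain, case-sensitive substring comparison.
    return needle in url


url = "http://www.example.com/coffee/menu.html"
print(matches_contains("contains:coffee", url))  # True
print(matches_contains("contains:beans", url))   # False
```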

Exceptions, denoted by a leading - (minus) sign.

candy.com/
-www.candy.com/

With these two patterns, "www.chocolate.candy.com" is a match, but "www.candy.com" is not.
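
One way to picture how an exception interacts with a regular pattern is the sketch below. It makes two assumptions that this appendix does not confirm: patterns are compared as plain substrings, and an exception always overrides an inclusion regardless of the order in which the patterns are listed. The function name `is_allowed` is hypothetical.

```python
def is_allowed(patterns: list[str], url: str) -> bool:
    """Return True if `url` matches an inclusion pattern and no exception."""
    includes = [p for p in patterns if not p.startswith("-")]
    # Strip the leading minus sign to recover the excluded pattern text.
    excludes = [p[1:] for p in patterns if p.startswith("-")]
    included = any(p in url for p in includes)
    excluded = any(p in url for p in excludes)
    return included and not excluded


patterns = ["candy.com/", "-www.candy.com/"]
print(is_allowed(patterns, "http://www.chocolate.candy.com/"))  # True
print(is_allowed(patterns, "http://www.candy.com/"))            # False
```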

Regular expressions from the GNU Regular Expression library.
In the appliance, regular expressions:
(1) are case-sensitive unless you specify "regexpIgnoreCase:"
(2) must use two escape characters (backslashes, "\\") to escape reserved characters in the regular expression.

Note: regexp: and regexpCase: are equivalent.

regexp:-sid=[0-9A-Z]+/

regexp:http://www\\.example\\.google\\.com/.*/images/

regexpCase:http://www\\.example\\.google\\.com/.*/images/

regexpIgnoreCase:http://www\\.Example\\.Google\\.com/.*/IMAGES/

For syntax details, see the GNU Regular Expression library documentation.
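
To experiment with these patterns outside the appliance, the doubled backslashes and the prefix can be translated into an ordinary regular expression. The sketch below uses Python's `re` module, which is not the GNU Regular Expression library, so syntax details may differ. The function name `compile_appliance_regex` and the sample URL are assumptions made for illustration.

```python
import re


def compile_appliance_regex(pattern: str) -> re.Pattern:
    """Translate a 'regexp:'-style pattern into a compiled Python regex."""
    if pattern.startswith("regexpIgnoreCase:"):
        body, flags = pattern[len("regexpIgnoreCase:"):], re.IGNORECASE
    elif pattern.startswith("regexpCase:"):
        body, flags = pattern[len("regexpCase:"):], 0
    elif pattern.startswith("regexp:"):
        body, flags = pattern[len("regexp:"):], 0
    else:
        raise ValueError("pattern must start with a regexp prefix")
    # Collapse each doubled escape character into a single backslash.
    return re.compile(body.replace("\\\\", "\\"), flags)


url = "http://www.example.google.com/a/images/logo.png"
strict = compile_appliance_regex(
    r"regexp:http://www\\.example\\.google\\.com/.*/IMAGES/")
relaxed = compile_appliance_regex(
    r"regexpIgnoreCase:http://www\\.Example\\.Google\\.com/.*/IMAGES/")
print(strict.search(url) is not None)   # False
print(relaxed.search(url) is not None)  # True
```

The first pattern fails only because of the uppercase IMAGES segment, which shows the default case-sensitive behavior.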

Comments

#this is a comment

Empty lines and comments starting with # are permissible. These comments are removed from the URL pattern and ignored.
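
The handling of comments and empty lines can be sketched as a small preprocessing step. This is an illustrative reading rather than the appliance's parser: it treats only whole-line comments, because this appendix does not say whether a # after a pattern on the same line is supported. The name `load_patterns` is hypothetical.

```python
def load_patterns(text: str) -> list[str]:
    """Return the patterns in `text`, skipping empty lines and # comments."""
    patterns = []
    for line in text.splitlines():
        stripped = line.strip()
        # Ignore blank lines and lines whose first character is '#'.
        if not stripped or stripped.startswith("#"):
            continue
        patterns.append(stripped)
    return patterns


sample = """
#this is a comment
http://www.google.com/

.pdf$
"""
print(load_patterns(sample))  # ['http://www.google.com/', '.pdf$']
```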



 
© Google Inc. 2007