Before you begin crawling your web content, you must specify
one or more starting locations. You can control and refine
the breadth of the crawl by specifying URL patterns to follow and others
to avoid. For a given URL to be crawled, it must match at least one
URL pattern in the Follow and Crawl Only URLs with
the Following Patterns box and none of the URL patterns in the Do
Not Crawl URLs with the Following Patterns box.
Note: If a URL is matched by patterns from both Follow and Crawl Only URLs with the Following Patterns and Do Not Crawl URLs with the Following Patterns, the URL will not be crawled.
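The rule above can be sketched in a few lines of Python. This is only an illustration, not the appliance's implementation: the function name should_crawl is hypothetical, and every pattern is treated as a plain substring of the URL, which ignores the other pattern operators.

```python
def should_crawl(url, follow_patterns, do_not_crawl_patterns):
    """Return True only if url matches a follow pattern and no exclusion pattern."""
    # Simplifying assumption: a pattern matches when it occurs anywhere in the URL.
    followed = any(p in url for p in follow_patterns)
    excluded = any(p in url for p in do_not_crawl_patterns)
    # A URL matched by both lists is not crawled.
    return followed and not excluded


follow = ["www.example.com/"]
exclude = ["www.example.com/private/"]

print(should_crawl("http://www.example.com/help/", follow, exclude))      # True
print(should_crawl("http://www.example.com/private/a", follow, exclude))  # False
print(should_crawl("http://other.example.org/", follow, exclude))         # False
```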
URLs are case sensitive. If you want case-insensitive URL pattern matching, use the regexpIgnoreCase operator.
The crawler can access content over HTTP, HTTPS, and SMB protocols. More information about file system crawling using SMB appears below.
The following options let you control and refine your crawls.
Start Crawling from the Following URLs
Starting URLs, entered one per line, control
where the crawl begins. All content that you wish to include in all of the
collections should be reachable by following links from one or more
documents listed in the starting URLs.
These URLs are only the starting points for the crawl; they tell the
crawler where to begin. Links from the start URLs are followed and
indexed only if they match a pattern in
Follow and Crawl Only URLs with the Following Patterns. For example, if you specify a starting URL
of http://www.mycompany.com/ in this section and a pattern
www.mycompany.com/ in the Follow and Crawl Only URLs with the
Following Patterns
section, the crawler will discover links in the http://www.mycompany.com/
web page, but will only crawl and index URLs that match the pattern www.mycompany.com/.
All entries in this window must be fully qualified URLs, using the following format:
<protocol>://<host>[:port]/[path]
In this format, the protocol can be HTTP, HTTPS (for secure content), or SMB (for file shares).
The information contained in square brackets [ ] is optional. The forward slash "/" after <host>[:port] is required.
Valid examples:
https://www.example.com/secure/
http://www.example.com/help/
smb://fileshare.mycompany.com/your-sharename/
Invalid examples:
http://www/
Reason: Invalid because the hostname is not fully qualified. A fully qualified hostname includes the local hostname and the full domain name. For example: mail.corp.company.com.
www.example.com/
Reason: Invalid because the protocol information is missing.
http://www.example.com
Reason: Invalid because the "/" after <host>[:port] is required.
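As a rough illustration of the format rules above, the following Python sketch checks a start URL before you enter it. The function name check_start_url is hypothetical, and the test for a fully qualified hostname (the host contains at least one dot) is a simplifying assumption; the appliance itself may resolve short names through its DNS suffix settings.

```python
from urllib.parse import urlsplit

ALLOWED_PROTOCOLS = {"http", "https", "smb"}


def check_start_url(url):
    """Return None if url fits <protocol>://<host>[:port]/[path], else a reason."""
    parts = urlsplit(url)
    if parts.scheme.lower() not in ALLOWED_PROTOCOLS:
        return "protocol missing or not HTTP, HTTPS, or SMB"
    host = parts.hostname or ""
    # Assumption: a fully qualified hostname contains at least one dot.
    if "." not in host:
        return "hostname is not fully qualified"
    # The "/" after <host>[:port] is required, so the path cannot be empty.
    if not parts.path.startswith("/"):
        return 'the "/" after <host>[:port] is required'
    return None


for candidate in ["https://www.example.com/secure/",
                  "http://www/",
                  "www.example.com/",
                  "http://www.example.com"]:
    print(candidate, "->", check_start_url(candidate) or "valid")
```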
To enter a new URL, type a valid entry into the window. Press Enter
to add additional URLs, one per line.
Note: This window must contain at least one start URL. The search appliance
will attempt to resolve incomplete path information entered, using the information
entered on the Administration > Network Settings page in
the DNS Suffix (DNS Search Path) section. However, if it cannot
be successfully resolved, the following error message displays in red on the
page:
You have entered one or more invalid start URLs. Please check your edits.
The crawler will retry several times to crawl URLs that are temporarily unreachable.
File System Crawling
SMB (Server Message Block) file shares, sometimes known as "Windows File Shares," are a commonly used network file system. To crawl documents stored in an SMB file share, enter a URI with the smb: protocol in the following format:
smb://file-server/your-sharename/folder/
Do not start the crawl at the top-level SMB URL. For example, this is an invalid URL:
Files crawled from SMB file shares are indexed and served in public search results. Result links
for documents located on SMB file shares are served through the search appliance
and are available to all search users.
Note: If your environment uses a WINS server to look up hostnames, you must also configure the crawler to use this WINS server in the Administration > Network Settings page.
Follow and Crawl Only URLs with the Following Patterns
Every entry in the Start Crawling from the Following URLs box requires a corresponding entry in the Follow and Crawl Only URLs with the Following Patterns box; otherwise, an error message is displayed.
Only URLs matching the patterns you specify (one per line) in this window
will be followed and crawled. This allows you to control which files will be
crawled on your server.
Example:
https://www.example.com/secure/
http://www.example.com:80/help/
smb://fileshare.mycompany.com/my-sharename/
These entries limit the crawl to URLs containing the above strings. For instance, all of the following would be crawled (presuming they are not matched by a pattern in Do Not Crawl URLs with the Following Patterns):
https://www.example.com/secure/file.txt
http://www.example.com:80/help/projectA
smb://fileshare.mycompany.com/my-sharename/folder1
The URLs that are discovered are checked against these patterns for inclusion
in the index. Only URLs that match these patterns are crawled and indexed. In order for
a URL to be crawled and indexed, there must be a sequence of links matching
the Follow patterns from one of the Starting URLs. If there is no valid
link path, you should add the URL to the Start Crawling from the Following
URLs section.
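The link-path requirement can be pictured as a breadth-first walk over the link graph. The sketch below is a simplified model, not the crawler's actual algorithm: the links dictionary, the function name reachable_urls, and the substring-only pattern matching are all assumptions made for illustration.

```python
from collections import deque


def reachable_urls(start_urls, links, follow_patterns):
    """Walk outward from the start URLs, following only links that match a pattern."""
    def matches(url):
        # Assumption: a pattern matches when it occurs anywhere in the URL.
        return any(p in url for p in follow_patterns)

    seen = set()
    queue = deque(u for u in start_urls if matches(u))
    while queue:
        url = queue.popleft()
        if url in seen:
            continue
        seen.add(url)
        for target in links.get(url, []):
            if matches(target) and target not in seen:
                queue.append(target)
    return seen


# Hypothetical link graph: page -> pages it links to.
links = {
    "http://www.example.com/help/": ["http://www.example.com/help/a",
                                     "http://www.example.com/other/"],
    "http://www.example.com/other/": ["http://www.example.com/help/orphan"],
}
found = reachable_urls(["http://www.example.com/help/"], links,
                       ["www.example.com/help/"])
print(sorted(found))
```

In this model, http://www.example.com/help/orphan matches the pattern but is linked only from a page that does not, so it is never reached. Adding it to the start URLs is the remedy the text describes.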
The URL patterns you list in this window must conform to the rules
for valid URL patterns. To enter a URL pattern, type a valid pattern into
the window. Press Enter to add additional patterns. Empty
lines and comments (starting with #) are permitted.
URLs on the Crawl URLs page are case sensitive. If you want case-insensitive URL pattern matching, use the operator regexpIgnoreCase. For example, suppose you enter the following pattern:
regexpIgnoreCase:http://www.mycompany.com/documents/
That pattern would also match the following URLs:
http://www.mycompany.com/Documents/
http://www.mycompany.com/DOCUMENTS/
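The effect of regexpIgnoreCase can be approximated in Python as follows. This is a sketch under assumptions: Python's re module stands in for the appliance's regular expression engine, the helper name matches is hypothetical, and a pattern without the prefix is treated as a case-sensitive substring.

```python
import re

PREFIX = "regexpIgnoreCase:"


def matches(pattern, url):
    """Return True if url matches pattern under the simplified rules above."""
    if pattern.startswith(PREFIX):
        # Strip the operator and search the URL without regard to case.
        return re.search(pattern[len(PREFIX):], url, re.IGNORECASE) is not None
    # Plain patterns are case sensitive.
    return pattern in url


plain = "http://www.mycompany.com/documents/"
insensitive = "regexpIgnoreCase:http://www.mycompany.com/documents/"

print(matches(plain, "http://www.mycompany.com/Documents/"))        # False
print(matches(insensitive, "http://www.mycompany.com/Documents/"))  # True
print(matches(insensitive, "http://www.mycompany.com/DOCUMENTS/"))  # True
```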
Test These Patterns
To test which URLs will be matched by the patterns you have
entered in this field, click either of the Test these patterns links to open the Pattern
Tester Utility. The utility lets you specify a list of URLs on the left
and a set of patterns on the right. It tells you whether each URL
is matched by one of the patterns in the set.
When it opens, the Pattern Tester Utility is initialized with your saved
entries from the Crawl and Index > Crawl URLs page. You can enter more URLs and
patterns into the tester utility to best analyze your pattern
sets. However, your modifications will not be saved; you have to explicitly
enter and save them in the Crawl and Index > Crawl URLs page.
After you click the Test These Patterns button, the
results appear on the same page. A green background indicates that at least one of the patterns matches the URL;
the result also shows the first pattern that matched. A red background
indicates that none of the patterns matched the URL.
Click the Back to Crawl and Index > Crawl URLs
link to return to the Crawl and Index > Crawl URLs page.
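The kind of report the tester produces can be mimicked offline with a short Python sketch. It is not the Pattern Tester Utility itself: the function name first_matches is hypothetical, and patterns are again treated as plain substrings.

```python
def first_matches(urls, patterns):
    """Map each URL to the first pattern that matches it, or None if none does."""
    results = {}
    for url in urls:
        # Assumption: a pattern matches when it occurs anywhere in the URL.
        results[url] = next((p for p in patterns if p in url), None)
    return results


report = first_matches(
    ["http://www.example.com/help/projectA", "http://www.example.com/private/"],
    ["www.example.com/help/", "www.example.com/secure/"],
)
for url, pattern in report.items():
    # "matched" corresponds to a green row, "no match" to a red row.
    print(url, "->", pattern if pattern else "no match")
```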
Do Not Crawl URLs with the Following Patterns
Any pure text in a document is extracted and
indexed by a file type search. Graphics, diagrams, and formatting
information are not indexed. You can exclude any particular file format from being
crawled and indexed by defining URL pattern exceptions to prevent crawling
from occurring on those pages. URLs matching the patterns you specify (one
per line) in this window will not be crawled.
This option allows you to prevent specific file types, directories, or other
sets of pages from being crawled. For example, entering the pattern contains:?
in this box will prevent many Common Gateway Interface (CGI) scripts
from being crawled.
The URL patterns you list here must conform to the rules
for valid URL patterns. To enter a URL pattern, type a valid pattern into
the window. Press Enter to add additional patterns on new lines.
Empty lines and comments (starting with #) are permitted.
For your convenience, this box is prepopulated
with many URL patterns and file types, some of which you may
not want the crawler to index. We do not recommend deleting any of the
default patterns unless you detect parts of your site that are
currently being excluded by these rules.
To make a pattern or file type unavailable to the crawler,
remove the # mark at the start of the line containing the file type, which
uncomments the pattern so that it takes effect. For example, to make
Excel files on your servers unavailable to the crawler, change the line
#.xls$
to
.xls$
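The effect of removing the # mark can be sketched in Python. The sketch assumes a simplified reading of the pattern syntax: lines starting with # and empty lines are ignored, a trailing $ means the URL must end with the preceding text, and anything else is a substring match. The function names are hypothetical.

```python
def active_patterns(text):
    """Return the patterns that take effect: non-empty lines not starting with #."""
    patterns = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            patterns.append(line)
    return patterns


def is_excluded(url, patterns):
    """Return True if any active pattern matches the URL."""
    for p in patterns:
        if p.endswith("$"):
            # Assumption: a trailing $ anchors the pattern to the end of the URL.
            if url.endswith(p[:-1]):
                return True
        elif p in url:
            return True
    return False


url = "http://www.example.com/report.xls"
print(is_excluded(url, active_patterns("#.xls$")))  # False: pattern is commented out
print(is_excluded(url, active_patterns(".xls$")))   # True: pattern is active
```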
Test These Patterns
To test the patterns you have entered, click one of
the Test these patterns links. When it opens, the Pattern Tester Utility is initialized with
your saved entries from the Crawl and Index > Crawl URLs page. You can enter more URLs
and patterns into the tester utility to best analyze your pattern sets.
However, your modifications will not be saved; you have to explicitly
enter and save them in the Crawl and Index > Crawl URLs page. After you click the
Test These Patterns button, the results appear on the same page. A
green background indicates that at least one of the patterns
matches the URL; the result also shows the first pattern
that matched. A red background indicates that none of the patterns matched the URL.
Click the Back to Crawl and Index > Crawl URLs link to return to the
Crawl and Index > Crawl URLs page.
Note: If the search appliance
should never crawl outside of your intranet site, then we recommend
that you do one or more of the following:
- Configure your network to disallow search appliance
connectivity outside of your intranet.
If you want to make sure that the search appliance never crawls outside
of your intranet, then a person in your IT/IS group needs to specifically
block the search appliance IP addresses from leaving your intranet.
The GB-5005 and GB-8008 use three IP addresses, and these IP addresses
are in your DNS entries as:
googleswitch, googleweb, and googlecrawl. The GB-1001
uses only googleweb.
Your IT/IS group needs to configure either an Access Control List (ACL)
on your external routers or a set of rules on your firewall to
disallow any communication between these IP addresses and the outside
world.
- Make sure all patterns in the field Follow and Crawl Only URLs
with the Following Patterns specify yourcompany.com as the domain name.
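The second recommendation can be checked mechanically before you save the page. The following Python sketch is one possible way to do so; the function name patterns_outside_domain is hypothetical, and the check is a plain substring test on each non-comment line.

```python
def patterns_outside_domain(pattern_text, domain):
    """Return the follow patterns that do not mention the given domain name."""
    offenders = []
    for line in pattern_text.splitlines():
        line = line.strip()
        # Skip empty lines and comments, which are permitted in the box.
        if not line or line.startswith("#"):
            continue
        if domain not in line:
            offenders.append(line)
    return offenders


follow_box = """http://www.yourcompany.com/
# partner documentation
http://partner.example.org/docs/
"""
print(patterns_outside_domain(follow_box, "yourcompany.com"))
```

Any pattern the function returns would allow the crawl to leave the yourcompany.com domain and deserves a second look.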