Crawl and Index > Freshness Tuning
The Freshness Tuning page lets you fine-tune how often different URLs are crawled. You can make crawling more frequent, as for news documents, or less frequent, as for archived documents. You can also force a recrawl of URLs that would not normally be recrawled, for example when documents are on a server that does not respond correctly to If-Modified-Since headers in GET requests.
Crawl Frequently
You may have content that changes frequently, as often as once an hour or even every few minutes. On the Crawl and Index > Freshness Tuning page, you can specify the URL patterns of pages that change frequently, so that they are crawled often, keeping your serving index fresh.
Overloading the Crawl Frequently section can slow the system down.
Keep the number of URLs fairly small to avoid reduced performance.
To set options for crawling frequently changing content:
- Under Crawl Frequently, enter URL patterns for
content that changes often.
- Click the Save Changes button.
- In the left-side menu, click Crawl and Index, then click the Crawl URLs link.
- Check the URLs in the Start Crawling from the Following URLs box to make
sure the documents can be reached.
- Check the URLs in the Follow and Crawl Only URLs with the
Following Patterns box to make sure the patterns you entered in the
Crawl Frequently section are included.
Crawl Infrequently
To index documents that are never updated or modified, such as a stable database,
or that are only added to incrementally, such as a mail or
news archive, you can have the crawler reuse URLs that have already been crawled.
This reusing of URLs reduces the load on your web servers. Make sure that the
archival URL patterns you specify can be
reached from the Start URLs and are in
the Follow and Crawl Only URLs with the Following Patterns box.
Example:
For a Lotus Domino database that is never modified, with a
URL of http://myhost.com/mydb.nsf, you
would add this pattern under Crawl Infrequently on the Freshness Tuning page:
http://myhost.com/mydb.nsf
After the initial indexing of that URL, the crawler would fetch
all pages in mydb.nsf from the local cache.
If the database is append-only, that is, new documents are
added but old ones are not modified, use these patterns:
regexp:http://myhost\\.com/mydb\\.nsf/.*\\?OpenDocument.*
regexp:http://myhost\\.com/mydb\\.nsf/.*\\$FILE.*
The crawler first tries to fetch documents and
newly added attachments in mydb.nsf from the local cache when
possible. It still fetches views
(?OpenView URLs) from the remote Domino server if the database has
actually changed, that is, when new documents have been added.
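As a rough illustration of how the two append-only patterns behave, the following Python sketch tests them against a few sample URLs. It is not the appliance's own matcher: it assumes the regexp: prefix is dropped, that each doubled backslash in the patterns stands for a single escape, and that a pattern may match anywhere in the URL. The function name and the sample document, attachment, and view URLs are hypothetical.

```python
import re

# Python translations of the two append-only patterns (assumption: the
# "regexp:" prefix is stripped and each doubled backslash is one escape).
ARCHIVE_PATTERNS = [
    re.compile(r"http://myhost\.com/mydb\.nsf/.*\?OpenDocument.*"),
    re.compile(r"http://myhost\.com/mydb\.nsf/.*\$FILE.*"),
]


def is_archived(url: str) -> bool:
    """Return True if the URL matches at least one archive pattern."""
    return any(p.search(url) is not None for p in ARCHIVE_PATTERNS)


# Hypothetical URLs for a document, an attachment, and a view.
document_url = "http://myhost.com/mydb.nsf/0/ABC123?OpenDocument"
attachment_url = "http://myhost.com/mydb.nsf/0/ABC123/$FILE/report.pdf"
view_url = "http://myhost.com/mydb.nsf/AllDocs?OpenView"

for url in (document_url, attachment_url, view_url):
    print(url, is_archived(url))
```

Under these assumptions the document and attachment URLs match, while the ?OpenView URL does not, which is consistent with views still being fetched from the remote server.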
To set options for crawling archival servers:
- Under Crawl Infrequently, enter URL patterns for
rarely changing or archived documents.
- Click the Save Changes button.
- In the left-side menu, click Crawl and Index, then click the Crawl URLs link.
- Check the URLs in the Start Crawling from the Following URLs box to make
sure the archived documents can be reached.
- Check the URLs in the Follow and Crawl Only URLs with the
Following Patterns box to make sure the patterns you entered in the
Crawl Infrequently section are included.
Always Force Recrawl
The first time URLs are crawled, the data is indexed and stored on disk.
Subsequently, to allow for faster crawls and less load on the servers, only
files modified after the date in the search appliance's If-Modified-Since request
header will be recrawled. These updates are added to the index.
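The exchange described above follows standard HTTP conditional GET behavior. The Python sketch below models the server side of that exchange, assuming a correctly configured server; it does not reproduce the search appliance's crawler. The function name and all dates are illustrative.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_get_status(if_modified_since: str, last_modified: datetime) -> int:
    """Return the status a well-behaved server would send for a conditional GET."""
    since = parsedate_to_datetime(if_modified_since)
    # HTTP dates have one-second resolution, so drop microseconds first.
    if last_modified.replace(microsecond=0) <= since:
        return 304  # Not Modified: the stored copy is still current
    return 200  # Modified: the full document is returned


# Hypothetical time of the previous crawl, sent as the If-Modified-Since value.
last_crawl = datetime(2024, 1, 10, 12, 0, 0, tzinfo=timezone.utc)
header_value = format_datetime(last_crawl, usegmt=True)

# Hypothetical modification times for two documents on the server.
unchanged = datetime(2024, 1, 5, 9, 30, 0, tzinfo=timezone.utc)
changed = datetime(2024, 1, 12, 8, 0, 0, tzinfo=timezone.utc)

print(header_value)
print(conditional_get_status(header_value, unchanged))
print(conditional_get_status(header_value, changed))
```

A server with an incorrect clock, or one that ignores the header, breaks this comparison, which is the situation the Always Force Recrawl section is meant to handle.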
Enter URL patterns in the Always Force Recrawl section
only if out-of-date pages are displayed in your index. Although the crawler tries
to detect servers that report incorrect dates and to adjust automatically, other types
of misconfiguration may be present.
Make sure that your servers maintain the correct time. If you think one or
more of your web servers does not support the If-Modified-Since option or
is misconfigured, use this section to enter URL patterns to recrawl. Refer
problems with your web servers to your webmaster.
To force recrawling certain URL patterns, regardless of your web server's response to
If-Modified-Since:
- Under Crawl and Index, click the Freshness Tuning link.
- Under Always Force Recrawl, enter URL patterns for pages to always recrawl regardless of last-modified date.
- Click the Save Changes button.
Recrawl These URL Patterns
If you discover that a set of URLs you want to have in the search index is not being
crawled (usually because of changes made to the web pages or because of a temporary
error or misconfiguration present when the crawler last tried to crawl the URLs),
you can enter the pattern here to inject it quickly into the queue of URLs the
search appliance is crawling.
Enter the URL pattern and click the Recrawl These URL Patterns button. The URL pattern is placed in the queue, where it will be crawled soon, unless there are higher priority URLs in the queue.