Crawl and Index > Freshness Tuning

The Freshness Tuning page lets you fine-tune how often different URLs are crawled. You can have URLs crawled more frequently, as for news documents, or less frequently, as for archived documents. You can also force a recrawl of URLs that would not normally be recrawled, for example when documents are on a server that does not respond correctly to If-Modified-Since headers in GET requests.

Crawl Frequently

You may have content that changes frequently, as often as once an hour or even every few minutes. On the Crawl and Index > Freshness Tuning page, you can specify the URL patterns of pages that change frequently, so that they are crawled often, keeping your serving index fresh.

Entering too many URL patterns in the Crawl Frequently section can slow the system down. Keep the number of URLs fairly small to avoid reduced performance.

To set options for crawling frequently changing content:

  1. Under Crawl Frequently, enter URL patterns for content that changes often.
  2. Click the Save Changes button.
  3. In the left-side menu, click Crawl and Index, then click the Crawl URLs link.
  4. Check the URLs in the Start Crawling from the Following URLs box to make sure the documents can be reached.
  5. Check the URLs in the Follow and Crawl Only URLs with the Following Patterns box to make sure the patterns you entered in the Crawl Frequently section are included.
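Steps 4 and 5 can be spot-checked offline before you save. The following is a minimal sketch, not an appliance tool: it treats each follow pattern as a plain URL prefix, which is a simplification of the appliance's pattern syntax, and the host names and the uncovered_urls helper are hypothetical.

```python
def uncovered_urls(sample_urls, follow_prefixes):
    """Return the sample URLs that none of the follow prefixes covers."""
    return [
        url
        for url in sample_urls
        if not any(url.startswith(prefix) for prefix in follow_prefixes)
    ]


# Hypothetical values: one follow pattern, two frequently changing pages.
follow_prefixes = ["http://news.example.com/"]
frequent_samples = [
    "http://news.example.com/headlines/today.html",
    "http://intranet.example.com/ticker.html",
]

# Any URL listed here would never be crawled, however often it changes.
print(uncovered_urls(frequent_samples, follow_prefixes))
```

A non-empty result suggests that a pattern entered under Crawl Frequently is missing from the Follow and Crawl Only URLs with the Following Patterns box.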

Crawl Infrequently

To index documents that are never updated or modified, such as a stable database, or that are only added to incrementally, such as a mail or news archive, you can have the crawler reuse URLs that have already been crawled instead of fetching them again. Reusing these URLs reduces the load on your web servers. Make sure that the archival URL patterns you specify can be reached from the Start URLs and are included in the Follow and Crawl Only URLs with the Following Patterns box.

Example:

For a Lotus Domino database that is never modified, with a URL of http://myhost.com/mydb.nsf, add this pattern to the Archives URL Patterns on the Freshness Tuning page:

http://myhost.com/mydb.nsf

After the initial indexing of that URL, the crawler would fetch all pages in mydb.nsf from the local cache.

If the database is append-only, that is, new documents are added but old ones are not modified, use these patterns:

regexp:http://myhost\\.com/mydb\\.nsf/.*\\?OpenDocument.*
regexp:http://myhost\\.com/mydb\\.nsf/.*\\$FILE.*

The crawler first tries to fetch documents and newly added attachments in mydb.nsf from the local cache when possible. It still fetches views (?OpenView URLs) from the remote Domino server if the database has actually changed, that is, when new documents have been added.
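To see which URLs the two regexp: patterns above would cover, you can try them against sample URLs. This is a rough sketch under two assumptions that the text above does not confirm: that each doubled backslash in the appliance syntax stands for a single regular-expression escape, and that a pattern may match anywhere in the URL. The sample document paths and the helper names are hypothetical.

```python
import re

# The two append-only patterns, copied as written above (raw strings keep
# both backslashes of each doubled pair).
APPLIANCE_PATTERNS = [
    r"regexp:http://myhost\\.com/mydb\\.nsf/.*\\?OpenDocument.*",
    r"regexp:http://myhost\\.com/mydb\\.nsf/.*\\$FILE.*",
]


def to_python_regex(pattern):
    """Strip the regexp: prefix and collapse each doubled backslash (assumption)."""
    prefix = "regexp:"
    body = pattern[len(prefix):] if pattern.startswith(prefix) else pattern
    return re.compile(body.replace("\\\\", "\\"))


COMPILED = [to_python_regex(p) for p in APPLIANCE_PATTERNS]


def matches_archive_pattern(url):
    """True if any of the archive patterns matches the URL."""
    return any(regex.search(url) for regex in COMPILED)


# Hypothetical sample URLs: a document, an attachment, and a view.
for url in [
    "http://myhost.com/mydb.nsf/0/abc123?OpenDocument",
    "http://myhost.com/mydb.nsf/0/abc123/$FILE/report.pdf",
    "http://myhost.com/mydb.nsf/ByDate?OpenView",
]:
    print(url, matches_archive_pattern(url))
```

Under these assumptions, the document and attachment URLs match and the ?OpenView URL does not, which is consistent with views still being fetched from the Domino server.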

To set options for crawling archival servers:

  1. Under Crawl Infrequently, enter URL patterns for rarely changing or archived documents.
  2. Click the Save Changes button.
  3. In the left-side menu, click Crawl and Index, then click the Crawl URLs link.
  4. Check the URLs in the Start Crawling from the Following URLs box to make sure the archived documents can be reached.
  5. Check the URLs in the Follow and Crawl Only URLs with the Following Patterns box to make sure the patterns you entered in the Crawl Infrequently section are included.

Always Force Recrawl

The first time URLs are crawled, the data is indexed and stored on disk. On subsequent crawls, to keep crawls fast and reduce the load on your servers, only files modified after the date in the search appliance's If-Modified-Since request header are recrawled. These updates are added to the index.

Enter URL patterns in the Always Force Recrawl section only if out-of-date pages are displayed in your index. Although the crawler tries to detect servers that report wrong dates and to adjust automatically, other types of misconfiguration may be present.

Make sure that your servers maintain the correct time. If you think one or more of your web servers does not support the If-Modified-Since option or is misconfigured, use this section to enter URL patterns to recrawl. Refer problems with your web servers to your webmaster.
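If you suspect a server, you can check how it answers a conditional GET before adding patterns here. The sketch below is an illustration under stated assumptions, not the appliance's own logic: should_send_full_response shows the comparison a correctly configured server is expected to make, and conditional_get_status sends a real request with Python's standard library. Both function names are hypothetical.

```python
from email.utils import parsedate_to_datetime
import urllib.error
import urllib.request


def should_send_full_response(last_modified, if_modified_since):
    """Decision a correctly configured server is expected to make.

    True  -> the document changed after the given date: 200 with the body.
    False -> the document is unchanged: 304 Not Modified, nothing to recrawl.
    Both arguments are HTTP dates such as "Mon, 01 Jan 2007 10:00:00 GMT".
    """
    modified = parsedate_to_datetime(last_modified)
    since = parsedate_to_datetime(if_modified_since)
    return modified > since


def conditional_get_status(url, if_modified_since):
    """Send a GET with an If-Modified-Since header and return the status code."""
    request = urllib.request.Request(
        url, headers={"If-Modified-Since": if_modified_since}
    )
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib reports 304 Not Modified (and other non-2xx codes) as HTTPError.
        return err.code
```

A 304 status for a document that you know changed after the date you sent suggests that the server's clock or modification dates are wrong, which is the situation Always Force Recrawl is meant to work around.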

To force recrawling certain URL patterns, regardless of your web server's response to If-Modified-Since:

  1. Under Crawl and Index, click the Freshness Tuning link.
  2. Under Always Force Recrawl, enter URL patterns for pages to always recrawl regardless of last-modified date.
  3. Click the Save Changes button.

Recrawl These URL Patterns

If you discover that a set of URLs that you want in the search index is not being crawled (usually because of changes made to the web pages, or because of a temporary error or misconfiguration present when the crawler last tried to crawl the URLs), you can enter the pattern here to inject it quickly into the queue of URLs that the search appliance is crawling.

Enter the URL pattern and click the Recrawl These URL Patterns button. The URL pattern is placed in the queue, where it will be crawled soon, unless there are higher priority URLs in the queue.


 
© Google Inc. 2007