Google Search Appliance software version 6.0
Posted June 2009
This section provides a more detailed explanation of how to set up crawl for controlled-access content using HTTP Basic and Windows Authentication (NTLM HTTP), and how to enable serve for public and secure documents.
The ABC Company wants to make its controlled-access content discoverable using intranet search. The content is stored on these internal servers:

- events.abc.int is a simple web server that uses HTTP Basic authentication. This server contains information about internal company events.
- announce.abc.int is a Microsoft IIS web server that uses Integrated Windows Authentication over NTLM HTTP. This server contains announcements for employees.
- directory.abc.int is another Microsoft IIS server. This server provides phone and office location information about employees. For the purpose of this example, let's suppose that content from this server is best provided by a web feed.

All these servers are located on the same domain, abc_corp. Although authentication is required by each of these servers, this information isn't sensitive. ABC Company wants to serve the snippet results as public content, viewable by any employee. There is no reason to require the search appliance to perform document-level authentication when serving results.
ABC Company has these people who interact with this content:

- Adam, the system administrator
- Sandra, the search appliance administrator
- Eric, an employee
First, the system administrator creates a user account for the search appliance, called ABCsearch, and sets up access policies that ensure that the ABCsearch user account is authorized to view all files on events.abc.int and announce.abc.int. The feed process on directory.abc.int has its own account with similar permissions, called ABCfeeder.
Next, the search appliance administrator logs into the Admin Console and performs these actions:
To provide the search appliance with credentials for crawl and index, Sandra opens Crawl and Index > Crawler Access, and adds rows using the account names and passwords given to her by the system administrator:
| For URLs Matching Pattern, Use: | Username: | In Domain: | Password: | Confirm Password: | Make Public: |
|---|---|---|---|---|---|
| https://events.abc.int/ | ABCsearch | | ****** | ****** | X |
| https://announce.abc.int/ | ABCsearch | abc_corp | ****** | ****** | X |
| https://directory.abc.int/ | ABCfeeder | abc_corp | ****** | ****** | X |
Here, omitting the domain for events.abc.int instructs the search appliance to authenticate using HTTP Basic. For all other servers in this example, the domain entry tells the search appliance to authenticate against a Microsoft IIS Server using NTLM HTTP.
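The rule described above (no domain means HTTP Basic, a domain means NTLM HTTP) can be sketched as a small lookup. This is only an illustration: the `CrawlerAccessRule` class, the `find_rule` and `auth_method` helpers, and the simple prefix matching are assumptions made for the example, not the search appliance's actual implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CrawlerAccessRule:
    """One row of the Crawler Access table (hypothetical representation)."""
    url_pattern: str
    username: str
    domain: Optional[str]  # None when the In Domain field is left empty
    make_public: bool


# Rows copied from the table above; passwords are omitted on purpose.
RULES = [
    CrawlerAccessRule("https://events.abc.int/", "ABCsearch", None, True),
    CrawlerAccessRule("https://announce.abc.int/", "ABCsearch", "abc_corp", True),
    CrawlerAccessRule("https://directory.abc.int/", "ABCfeeder", "abc_corp", True),
]


def find_rule(url: str) -> Optional[CrawlerAccessRule]:
    """Return the first rule whose pattern is a prefix of the URL (assumed matching)."""
    for rule in RULES:
        if url.startswith(rule.url_pattern):
            return rule
    return None


def auth_method(rule: CrawlerAccessRule) -> str:
    """No domain means HTTP Basic; a domain means NTLM HTTP."""
    return "ntlm" if rule.domain else "basic"
```

For example, `auth_method(find_rule("https://events.abc.int/picnic.html"))` selects Basic authentication under these assumptions, while any URL under announce.abc.int selects NTLM.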
Because Basic Authentication sends credentials as base-64 encoded clear text, the patterns for events.abc.int all use HTTPS, which protects user names and passwords. Although the use of HTTPS is recommended for Basic Authentication, the search appliance can also authenticate over HTTP.
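To see why HTTPS matters here, the following minimal sketch builds a Basic `Authorization` header value and then decodes it again without any secret. The function names and the sample credentials are made up for the illustration; only the encoding scheme (base-64 of `username:password`) comes from the HTTP Basic mechanism itself.

```python
import base64
from typing import Tuple


def basic_auth_header(username: str, password: str) -> str:
    """Build the value of an HTTP Basic Authorization header."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def decode_basic_auth_header(header: str) -> Tuple[str, str]:
    """Recover the username and password from a Basic header value."""
    scheme, _, token = header.partition(" ")
    if scheme != "Basic":
        raise ValueError("not a Basic authentication header")
    username, _, password = base64.b64decode(token).decode("utf-8").partition(":")
    return username, password


# Hypothetical credentials, for illustration only.
header = basic_auth_header("ABCsearch", "example-password")
recovered = decode_basic_auth_header(header)
```

Anyone who can read the header on the wire can run the decoding step, which is why the crawl patterns for events.abc.int use HTTPS.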
- To tell the search appliance where to start crawling, Sandra opens Crawl and Index > Crawl URLs and enters "https://events.abc.int/" and "https://announce.abc.int/" under Start Crawling from the Following URLs.
- She enters "https://events.abc.int/", "https://announce.abc.int/", and "https://directory.abc.int/" under Follow and Crawl Only URLs with the Following Patterns.
- She sets up a feed for directory.abc.int, using the following syntax:

    <record url="http://directory.abc.int/" authmethod="ntlm">

  Because the record specifies authmethod=ntlm, the search appliance attempts to authenticate using NTLM HTTP when crawling this content.

Now that the search appliance has access to all of ABC Company's press releases, the search appliance administrator starts the crawl and waits for the controlled-access content to appear in the index.
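As a rough sketch of what a feed carrying such a record might look like, the snippet below generates a small feed document with Python's standard library. Only the `record` element with its `url` and `authmethod` attributes comes from the text above. The surrounding `gsafeed`, `header`, `datasource`, `feedtype`, and `group` elements, the `mimetype` attribute, and the data source name `directory_feed` are assumptions about the feed format and should be checked against the feeds documentation for your software version.

```python
import xml.etree.ElementTree as ET
from typing import Iterable


def build_feed(datasource: str, urls: Iterable[str], authmethod: str = "ntlm") -> str:
    """Return a feed document with one record per URL (assumed layout)."""
    feed = ET.Element("gsafeed")
    header = ET.SubElement(feed, "header")
    ET.SubElement(header, "datasource").text = datasource      # hypothetical name
    ET.SubElement(header, "feedtype").text = "metadata-and-url"  # assumed feed type
    group = ET.SubElement(feed, "group")
    for url in urls:
        # authmethod tells the search appliance how to authenticate to this URL.
        ET.SubElement(
            group,
            "record",
            {"url": url, "mimetype": "text/html", "authmethod": authmethod},
        )
    return ET.tostring(feed, encoding="unicode")


feed_xml = build_feed("directory_feed", ["http://directory.abc.int/"])
```

Generating the XML from a list of URLs keeps the `authmethod` value consistent across records, so every document from directory.abc.int is fetched with NTLM HTTP.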
During crawl, the search appliance goes through each of the content sources that have been configured, and uses the credentials under Crawler Access to obtain the controlled-access content.

Figure 2: The search appliance can use multiple protocols to crawl and index controlled-access content.
- The search appliance connects to events.abc.int over HTTPS. The web server asks for credentials using HTTP Basic Authentication: the search appliance provides the username "ABCsearch" and the password entered in the Admin Console. The web server verifies that ABCsearch has access to view documents on events.abc.int. The search appliance crawls through all documents on events.abc.int and adds them to the index.
- The search appliance connects to announce.abc.int over HTTPS. The Microsoft IIS server asks for credentials using Windows Authentication: the search appliance provides an NTLM HTTP message that contains the username "ABCsearch" and a response based on the password entered in the Admin Console. The IIS server verifies that ABCsearch has access to view documents on announce.abc.int. The search appliance crawls through all documents on announce.abc.int and adds them to the index.
- The search appliance processes the feed for directory.abc.int with authmethod=ntlm. It connects to directory.abc.int over HTTPS. The Microsoft IIS server asks for credentials using Windows Authentication: the search appliance provides an NTLM HTTP message that contains the username "ABCfeeder" and a response based on the password entered in the Admin Console. The IIS server verifies that ABCfeeder has access to view documents on directory.abc.int. The search appliance crawls through all documents on directory.abc.int and adds them to the index.

ABC Company has decided to make the search results public: the events, announce, and directory servers control access to their content, but employees can discover the information they need by performing a search query.
Eric is an employee of ABC Company. He wants to find an announcement about a colleague's recent promotion to Director. Eric opens the search page in a web browser and enters a query about "Maria Jones director". The search appliance performs the following steps before sending Eric to the search results page:
- The search appliance finds documents from events.abc.int, announce.abc.int, and directory.abc.int that match the query "Maria Jones director".
- Because all of these results are public, the search appliance returns them without performing document-level authentication.

For instance, Eric finds an all-hands meeting that Maria scheduled from events, a notice about her promotion from announce, and her office phone number and location from directory.

When Eric clicks on one of the links in the search results page, the server that hosts the page requests a response that includes an authentication header. If Eric hasn't logged in elsewhere, he'll have to enter a username and password. Although the search appliance indexed the content as "public", the server still requires credentials before it displays the full document.
The next time that Eric clicks a link on his search results page, however, his browser forwards an authentication header based on his user name and password to the server. If all the servers in this example are on the same domain and accept the same credentials, Eric shouldn't have to log in again for as long as he keeps the browser open.
The ABC Company from Use Case 1 now wants to make its sales collateral materials discoverable using search. The sales documents are stored on internal servers as shown in the following table.
| Server | Description |
|---|---|
| sales.abc.int | A simple web server that uses HTTP Basic authentication. This server stores general information shared by everyone on the sales team. |
| customers.abc.int | A Microsoft IIS server that stores customer directory information, such as phone numbers and addresses. For the purpose of this example, the content from this server is provided by a web feed. |
As before, all the servers are located on the same domain, abc_corp. Although ABC Company's press releases are available to anyone, ABC Company wants to ensure that only members of the sales team see snippet results for sales collateral materials. Sales documents should not be discoverable by anyone else.
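ABC Company has these people who interact with this content:

- Adam, the system administrator
- Sandra, the search appliance administrator
- Salim, a sales manager and a member of the sales team
- Eric, an employee who isn't a member of the sales team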
ABC Company's search appliance already has some content sources that are defined in Use Case 1. Now ABC Company wants to add content from their sales servers. Because the company wants to limit access to this content by restricting content to the sales team, the system administrator must create a system-wide policy for members of this group.
- Adam sets up an access policy that allows members of a group (sales) to view the content on sales.abc.int and customers.abc.int.
- He adds ABCsearch and ABCfeeder to this group as well.

The sales group has the following members { ABCsearch, ABCfeeder, salimb, ... }. Salim B. is a sales manager; although we haven't described Salim's team, the members of his team are members of sales as well.
Now that we have a security policy that allows access to the servers, Sandra, the search appliance administrator, can configure crawl and index. Sandra logs into the Admin Console and performs the following actions:
To provide the search appliance with credentials for crawl and index, Sandra opens Crawl and Index > Crawler Access, and adds the following rows, using the account names and passwords for ABCsearch and ABCfeeder that were given to her by Adam, the system administrator:
| For URLs Matching Pattern, Use: | Username: | In Domain: | Password: | Confirm Password: | Make Public: |
|---|---|---|---|---|---|
| https://sales.abc.int/ | ABCsearch | | ****** | ****** | |
| https://customers.abc.int/ | ABCfeeder | abc_corp | ****** | ****** | |
Note that unlike Use Case 1, the Make Public checkbox is cleared. The search appliance has full access to these servers, but labels any results from them as "secure" and requires authentication and authorization checks before displaying secure content in the search results.
- She adds URL patterns for the new content sources, https://sales.abc.int/ and https://customers.abc.int/, to the crawl configuration.
- She sets up a feed for customers.abc.int, using the following syntax:

    <record url="http://customers.abc.int/" authmethod="ntlm">

  Because the record specifies authmethod=ntlm, the search appliance attempts to authenticate using NTLM HTTP when crawling and serving this content.

Now that the search appliance has access to all of ABC Company's sales collateral, the search appliance administrator starts the crawl and waits for additional controlled-access content to appear in the index.
Note: Because HTTP Basic passes user credentials as clear text, Google recommends that you use HTTPS for all requests for controlled-access content. In these examples, we'll assume that Sandra has configured the search appliance to always perform crawl, index, and serve over HTTPS. For more on this topic, see Protecting the User's Credentials for Serve with HTTP Basic and NTLM HTTP.
During crawl, the search appliance goes through each of the new controlled-access content sources in the same way that it did for the content sources in Use Case 1. When the crawl completes, the index contains content from the following controlled-access sources:
| Content Source | Content Acquired by | Serve Method |
|---|---|---|
| https://events.abc.int/ | Crawl HTTPS using Basic Authentication | public |
| https://announce.abc.int/ | Crawl HTTPS using Windows Authentication (NTLM) | public |
| https://directory.abc.int/ | Web feed using Windows Authentication (NTLM) | public |
| https://sales.abc.int/ | Crawl HTTPS using Basic Authentication | secure |
| https://customers.abc.int/ | Web feed using Windows Authentication (NTLM) | secure |
ABC Company now has public and secure search results available on the search appliance: events, announcements, and the employee directory are available to anyone, while sales information and the customer directory are available only to members of the sales team.
Salim is the sales manager at ABC Company. He wants to find a presentation that discusses pricing for the new "ABC Product" release. Salim opens the search page in a web browser and enters a query for public and secure results about "ABC Product". The search appliance performs the following steps before sending Salim's browser to the search results page:
- The search appliance prompts Salim for credentials. He enters his username, salimb, and a password.
- Because Salim is a member of sales, the search appliance should be authorized to request all of the secure sales collateral materials when passing his credentials.
- Salim sees results from events.abc.int, announce.abc.int, directory.abc.int, sales.abc.int, and customers.abc.int.

When Salim clicks on one of the links in his search results page, the browser provides his credentials in the authentication header. If all the servers in this example are on the same domain and accept the same credentials, Salim shouldn't have to log in again for as long as he keeps the browser open.

The search results page doesn't tell Salim how many search results match his query or display "Goooooogle" links, since that would reveal how many secure documents exist in the index.
Eric isn't a member of the sales team, but he's also interested in the new "ABC Product" release. Eric opens the search page in a web browser and enters the same query for "ABC Product". The search appliance performs the following steps before sending Eric's browser to the search results page:
- The search appliance prompts Eric for credentials. He enters his username, ericp, and a password.
- Because Eric isn't a member of sales, the search appliance fails its authorization check using Eric's credentials. It removes all of the secure sales collateral materials from the list of potential results.
- Eric sees results from events.abc.int, announce.abc.int, and directory.abc.int, but unlike Salim, he doesn't see any results from sales.abc.int or customers.abc.int.

The search results page doesn't tell Eric how many search results match his query or display "Goooooogle" links, since that would reveal how many secure documents exist in the index.
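The difference between what Salim and Eric see can be sketched as a filter over the indexed results. This is a simplified model rather than the search appliance's actual logic: the `IndexedResult` class and the `server_allows` and `visible_results` functions are hypothetical, and the group lookup stands in for the authorization check that the search appliance performs against the content servers with the user's credentials.

```python
from dataclasses import dataclass
from typing import Callable, List
from urllib.parse import urlparse


@dataclass(frozen=True)
class IndexedResult:
    url: str
    secure: bool  # True when Make Public was cleared for this source


# Group membership from the example; the "..." members are not modeled.
SALES_GROUP = {"ABCsearch", "ABCfeeder", "salimb"}
SECURE_HOSTS = {"sales.abc.int", "customers.abc.int"}

INDEX = [
    IndexedResult("https://events.abc.int/", secure=False),
    IndexedResult("https://announce.abc.int/", secure=False),
    IndexedResult("https://directory.abc.int/", secure=False),
    IndexedResult("https://sales.abc.int/", secure=True),
    IndexedResult("https://customers.abc.int/", secure=True),
]


def server_allows(username: str, url: str) -> bool:
    """Stand-in for the content server's answer to an authorization check."""
    host = urlparse(url).hostname
    return host not in SECURE_HOSTS or username in SALES_GROUP


def visible_results(
    username: str,
    results: List[IndexedResult],
    allows: Callable[[str, str], bool] = server_allows,
) -> List[IndexedResult]:
    """Keep public results; keep secure results only if the check succeeds."""
    return [r for r in results if not r.secure or allows(username, r.url)]
```

Under these assumptions, `visible_results("salimb", INDEX)` returns all five sources, while `visible_results("ericp", INDEX)` returns only the three public ones.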
The ABC Company from Use Case 1 upgrades the older servers sales.abc.int and events.abc.int and implements a new security policy that uses Integrated Windows Authentication (IWA) on all machines throughout its internal domain. The domain controller is a Windows server named hal.abc.com.
Our search appliance administrator, Sandra, wants to use Kerberos authentication to enable the search appliance to silently authenticate the user without requiring an HTTP Basic login box.
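Once again, ABC Company has these people who interact with this content:

- Adam, the system administrator
- Sandra, the search appliance administrator
- Salim, a sales manager and a member of the sales team
- Eric, an employee who isn't a member of the sales team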
Before she starts, Sandra has two things in place: a Kerberos key table file for the search appliance, searchappliance.keytab, and working crawl and index for secure content such as sales.abc.int. Now she wants to configure the search appliance to check for a user's session ticket during serve.

- Sandra opens Crawl and Index > Crawler Access.
- She enters the hostname of the Kerberos Key Distribution Center (KDC), hal.abc.com, and clicks Save Kerberos KDC Hostname to save the change.
- She browses to the key table file, searchappliance.keytab, and clicks OK to upload the Kerberos key table file to the search appliance.
- She clicks Import Kerberos Keytab File to save the change, and exits the Admin Console.

Now that the search appliance is configured to use Kerberos authentication, any time a user requests secure content, the search appliance attempts to authenticate with the user's Kerberos session key. No additional setup is needed for secure serve.
ABC Company now has public and secure search results available on the search appliance, and the search appliance is able to authenticate users against a Windows Domain Controller.
Salim is looking for a detailed report that discusses sales figures for the new "ABC Product" release. Salim opens the search page in a web browser and enters a query for "ABC Product fall sales report".
The search appliance performs the following steps before sending Salim's browser to the search results page:
- Because Salim is a member of sales, the search appliance should be authorized to request all of the secure sales collateral materials when passing his credentials.
- Salim sees results from events.abc.int, announce.abc.int, directory.abc.int, sales.abc.int, and customers.abc.int.

When Salim clicks on one of the links in his search results page, the browser provides his Kerberos ticket in the authentication header. The next time that Salim performs a search, the search appliance recognizes his session cookie and skips directly to the HTTP HEAD request in step 8. The session cookie set by the search appliance remains valid as long as he keeps the browser open.

The search results page doesn't tell Salim how many search results match his query or display "Goooooogle" links, since that would reveal how many secure documents exist in the index.
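The HTTP HEAD request mentioned above can be illustrated with a short sketch that builds such a request without sending it. The helper names, the sample document path, the placeholder token, and the rule for interpreting the status code are all assumptions for the example; the search appliance's own authorization logic may differ.

```python
import urllib.request


def build_authorization_check(url: str, authorization_value: str) -> urllib.request.Request:
    """Build (but do not send) a HEAD request that carries the user's credentials."""
    return urllib.request.Request(
        url,
        method="HEAD",  # headers only: enough to learn whether access is allowed
        headers={"Authorization": authorization_value},
    )


def is_authorized(status_code: int) -> bool:
    """Assumed rule: a 2xx response means the user may see the result."""
    return 200 <= status_code < 300


# Hypothetical document URL and placeholder token, for illustration only.
check = build_authorization_check(
    "https://sales.abc.int/pricing.html", "Negotiate PLACEHOLDER-TOKEN"
)
```

A HEAD request asks the content server for the response headers only, so the check can learn whether the user is allowed to see a document without downloading its body.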
Eric isn't a member of the sales team, but he's also interested in the new "ABC Product" release and wants to know when the sales figures will be posted. Eric opens the search page in a web browser and enters the same query for "ABC Product fall sales report". The search appliance performs the following steps before sending Eric's browser to the search results page:
- Because Eric isn't a member of sales, the search appliance fails its authorization check using Eric's credentials. It removes all of the secure sales collateral materials from the list of potential results.
- Eric sees results from events.abc.int, announce.abc.int, and directory.abc.int, but unlike Salim, he doesn't see any results from sales.abc.int or customers.abc.int.

The search results page doesn't tell Eric how many search results match his query or display "Goooooogle" links, since that would reveal how many secure documents exist in the index.