Google Search Appliance software version 6.0
Posted June 2009
This section provides a more detailed explanation of how to set up crawl for controlled-access content using HTML forms authentication, and how to enable serve for public and secure documents. HTML forms authentication permits integration with an existing single sign-on system or login server.
Span Reports sells reports on the top 500 companies in its field, and wants to make short excerpts from its business reports available through search. Customers who view the excerpts can then decide whether to purchase access to view the full article.
Span Reports uses a login server to manage customer access to business reports. A web proxy server placed between the search appliance and the Internet acts as a gateway to the search appliance, allowing Span Reports to control and track searches on their site.
http://spanreports.com/login/login.html is the login form for Span Reports' single sign-on login server.
www.spanreports.com is a web server that hosts business reports that are available for purchase. This server uses persistent cookies that never expire. Public content on www.spanreports.com is located in the root directory, while reports are in the subdirectory www.spanreports.com/reports/.
it.spanreports.com is another web server that hosts business reports that are available for purchase, but this server uses cookies that expire.
All these servers are located on the same domain. Although authentication is required to access the full text of a report, Span Reports wants to serve the snippet results as public content, viewable by anyone.
Span Reports has these people who interact with this content:
Caution: When controlled-access content is served as "public" by a search appliance (as shown in this use case), it is available to any user who is able to perform a search query. If you make controlled-access content available to unknown users for public search, you should devise additional protective measures to ensure security. The search appliance does not provide security for documents that are labeled as "public" in the index.
First, the system administrator creates a user account for the search appliance, called crawler, and sets up access policies that ensure that the crawler user account is authorized to view all files on www.spanreports.com and it.spanreports.com.
Next, the search appliance administrator, Steve, logs into the Admin Console and performs these actions:
http://www.spanreports.com/ and http://it.spanreports.com/IT_reports/.
http://spanreports.com/login/login.html, and under URL pattern for this rule, enters http://www.spanreports.com/reports/, and then clicks Create a New Forms Authentication Rule.
crawler user account, and saves the forms authentication rule. The search appliance stores the rule for use in crawl for all content under http://www.spanreports.com/reports/. The content can be public or secure for use with a forms authentication rule. For this example, we assume that the content is public. When a cookie expires, the search appliance uses the stored crawler account credentials to request a new session cookie.
http://www.spanreports.com/reports/, Steve selects the Make Public checkbox and clicks Save Forms Authentication Rule Configuration to apply the change. Content from this directory is labeled as "public" in the index.
Now that the search appliance has access to all of the business and IT reports created by Span Reports, the search appliance administrator schedules a crawl and waits for the controlled-access content to appear in the index.
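The effect of a forms authentication rule on index labels can be pictured with a small model. The following is a minimal sketch, not the search appliance's implementation: the names FormsAuthRule, find_rule, and index_label are hypothetical, the report path q1.html is invented for illustration, and URL patterns are assumed to match by simple prefix.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FormsAuthRule:
    """Hypothetical stand-in for one forms authentication rule."""
    login_url: str     # URL of the login form
    url_pattern: str   # content covered by the rule (prefix match assumed)
    make_public: bool  # state of the Make Public checkbox


def find_rule(url: str, rules: list[FormsAuthRule]) -> Optional[FormsAuthRule]:
    """Return the rule with the longest matching pattern, or None."""
    matches = [r for r in rules if url.startswith(r.url_pattern)]
    return max(matches, key=lambda r: len(r.url_pattern), default=None)


def index_label(url: str, rules: list[FormsAuthRule]) -> str:
    """Label a crawled URL as 'public' or 'secure'."""
    rule = find_rule(url, rules)
    # In this scenario, content outside any rule is ordinary public content.
    if rule is None or rule.make_public:
        return "public"
    return "secure"


rules = [
    FormsAuthRule(
        login_url="http://spanreports.com/login/login.html",
        url_pattern="http://www.spanreports.com/reports/",
        make_public=True,
    ),
]
report_url = "http://www.spanreports.com/reports/q1.html"  # hypothetical path
label = index_label(report_url, rules)
```

With Make Public selected, the report is labeled "public"; the same rule with make_public=False would yield "secure".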
During crawl, the search appliance goes through each of the content sources that have been configured:
http://www.spanreports.com/. The web server allows the search appliance to crawl and index all the public content.
http://www.spanreports.com/reports/. The web server asks for a session cookie: the search appliance recognizes the URL pattern and provides the cookie that was set by the forms authentication rule for http://www.spanreports.com/reports/. The web server verifies that crawler has access to view documents in the controlled access directory. The search appliance crawls through all documents on http://www.spanreports.com/reports and adds the documents to the index.
http://it.spanreports.com/IT_reports/ and provides the cookie that was set for that URL in the Admin Console under Crawl and Index > Forms Authentication. The web server verifies that crawler has access to view documents in the controlled access directory. The search appliance crawls through all documents on http://it.spanreports.com/IT_reports/ and adds them to the index.
Because these documents were accessed through a forms authentication rule with Make Public selected, they are labeled as "public" in the index. Span Reports has decided to make the search results public: although users must purchase the reports in order to view the full text, anyone can discover which reports are relevant by performing a search query.
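The cookie exchange during crawl resembles what any HTTP client does with a forms login: submit credentials once, keep the session cookie, and replay it on later requests. The sketch below uses only the Python standard library. It assumes that the login form accepts a POST at the login URL itself and that its fields are named "username" and "password"; a real login form defines its own action URL and field names.

```python
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar

LOGIN_URL = "http://spanreports.com/login/login.html"


def build_login_request(username: str, password: str) -> urllib.request.Request:
    """Build the POST that submits the login form.

    The field names "username" and "password" are assumptions.
    """
    body = urllib.parse.urlencode({"username": username, "password": password})
    return urllib.request.Request(
        LOGIN_URL, data=body.encode("ascii"), method="POST"
    )


def open_session(username: str, password: str) -> urllib.request.OpenerDirector:
    """Log in once and return an opener that replays the session cookie."""
    jar = CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    # The server's Set-Cookie response header is stored in the jar here.
    opener.open(build_login_request(username, password))
    return opener


# Building the request does not touch the network.
request = build_login_request("crawler", "example-password")
```

Calling open_session again after the cookie expires corresponds to the search appliance requesting a new session cookie with the stored crawler credentials.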
Carlos is an investor who wants to know whether the site offers a report on ABC Company's presence in Japan. Carlos opens the search page in a web browser and enters a query for "ABC Company Japan".
The search appliance performs the following steps before sending Carlos to the search results page:
www.spanreports.com, www.spanreports.com/reports/, and it.spanreports.com/IT_reports/. Content on www.spanreports.com doesn't require a login. For any links that point to files in the top-level directory, Carlos doesn't have to enter his credentials to view the content. However, when Carlos clicks a link to a controlled access report, the server that hosts the page asks for authentication. If Carlos hasn't logged in, he has to enter a username and password. Although the search appliance indexed the content as "public", the server still requires credentials before it displays a full document.
The next time that Carlos clicks a link on his search results page, however, his web browser provides the session cookie that was set when he logged in. If all the servers in this example are on the same domain and accept the same credentials, Carlos shouldn't have to log in again for as long as he keeps the browser open.
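Whether a browser sends one login cookie to several servers depends on cookie domain matching. The following sketch implements the domain-match rule from RFC 6265 and assumes, for illustration only, that the login server sets its cookie with the Domain attribute spanreports.com; the function names are hypothetical.

```python
import ipaddress


def is_ip_address(host: str) -> bool:
    """Return True if host is an IPv4 or IPv6 literal."""
    try:
        ipaddress.ip_address(host)
        return True
    except ValueError:
        return False


def domain_matches(host: str, cookie_domain: str) -> bool:
    """RFC 6265 domain-match: exact match, or host is a subdomain."""
    host = host.lower()
    # A leading dot in the Domain attribute is ignored.
    cookie_domain = cookie_domain.lower().lstrip(".")
    if host == cookie_domain:
        return True
    # Suffix matching applies to host names only, never to IP addresses.
    return host.endswith("." + cookie_domain) and not is_ip_address(host)


hosts = ["www.spanreports.com", "it.spanreports.com", "notspanreports.com"]
sent_to = [h for h in hosts if domain_matches(h, "spanreports.com")]
```

Under that assumption, the cookie is sent to both www.spanreports.com and it.spanreports.com, which is why a single login can cover both servers.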
After a brief promotional period, the Span Reports company from Use Case 4 wants to change its access policy so that IT reports are discoverable only to registered members. IT reports are in the controlled access directory: it.spanreports.com/IT_reports/. The search appliance administrator, Steve, has some work to do.
it.spanreports.com. He opens Crawl and Index > Forms Authentication.
http://it.spanreports.com/IT_reports/, Steve clears the Make Public checkbox and clicks Save Forms Authentication Rule Configuration to apply the change. Content from this directory is labeled as "secure" in the index.
Now that the search appliance has access to all of the business and IT reports created by Span Reports, and the IT reports are no longer made public, the search appliance administrator schedules a crawl and waits for the change to appear in the index.
Now that the search appliance has a rule that creates secure content, the search appliance administrator must define rules for how that content is served to users.
http://spanreports.com/login/login.html.
http://it.spanreports.com/IT_reports/index.html. This is a landing page that redirects unregistered viewers to a login form, and that all registered users can view, once logged in.
During crawl, the search appliance goes through each of the content sources that have been configured:
http://www.spanreports.com/. The search appliance crawls and indexes content and labels it "public" as before.
http://it.spanreports.com/IT_reports/. The web server asks for a session cookie: the search appliance recognizes the URL pattern and provides the cookie that was set in the Admin Console under Crawl and Index > Forms Authentication. The web server verifies that crawler has access to view documents in the controlled access directory. The search appliance crawls through all documents on http://it.spanreports.com/IT_reports/ and adds them to the index. However, because these documents were accessed through a forms authentication rule with Make Public cleared, this time, they are labeled as "secure" in the index.
Span Reports now has public and secure search results available on the search appliance: general reports are available to anyone, while IT reports are only available to authorized users who have purchased a subscription.
Carlos is an investor who is interested in viewing an IT report about another company, "XYZ Corp". Carlos opens the search page in a web browser and enters a query for public and secure content about "XYZ Corp IT Evaluation". The search appliance performs the following steps before sending Carlos to the search results page:
, the search appliance should be authorized to request all of the secure IT reports when passing his cookie.
www.spanreports.com, www.spanreports.com/reports/, and it.spanreports.com/IT_reports/. Because the search appliance creates a session cookie on Carlos' computer, he doesn't have to enter his credentials again. When he clicks on a link in his search results, his browser includes both Carlos' search appliance session cookie and the cookie that it saved when it originally proxied Carlos' login form. By re-using the saved cookie from the login form, the search appliance makes it possible for Carlos to view the document immediately.
The search results page doesn't tell Carlos how many search results match his query or display "Goooooogle" links, since that reveals how many secure documents exist in the index. Similarly, the search appliance requests authentication any time Carlos searches for public and secure content, regardless of whether any secure content is applicable to this query.
Jenny isn't a subscriber, but she's also interested in finding an IT report on XYZ Corp. She opens the search page in a web browser and enters the same query for public and secure content, "XYZ Corp IT Evaluation". The search appliance performs the following steps before sending Jenny's browser to the search results page:
it.spanreports.com/IT_reports/. Any IT reports that match Jenny's search are removed from the list of potential results.
www.spanreports.com and www.spanreports.com/reports/, but nothing from it.spanreports.com/IT_reports/.
The search results page doesn't tell Jenny how many search results match her query or display "Goooooogle" links, since that reveals how many secure documents exist in the index.
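The difference between what Carlos and Jenny see can be sketched as a serve-time filter: public results are always kept, while secure results are kept only if an authorization check succeeds for that user. This is an illustrative model, not the search appliance's code. Result, visible_results, and the report paths are hypothetical, and is_authorized stands in for the request that the search appliance makes with the user's cookie.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Result:
    url: str
    label: str  # "public" or "secure", as recorded in the index


def visible_results(
    candidates: list[Result],
    is_authorized: Callable[[str], bool],
) -> list[Result]:
    """Keep public results and the secure results the user may view."""
    return [
        r for r in candidates
        if r.label == "public" or is_authorized(r.url)
    ]


candidates = [
    Result("http://www.spanreports.com/reports/xyz.html", "public"),
    Result("http://it.spanreports.com/IT_reports/xyz.html", "secure"),
]

# A subscriber's cookie passes the check; a non-subscriber's does not.
carlos_results = visible_results(candidates, lambda url: True)
jenny_results = visible_results(candidates, lambda url: False)
```

Because the filter removes secure results silently, a total match count on the results page would expose how many documents were dropped, which is consistent with the count being omitted.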
The Span Reports company from Use Case 5 wants to change its site design to add frames to the login form for the IT reports that are only available to registered members. IT reports are in the controlled access directory: it.spanreports.com/IT_reports/.
Everything stays the same for crawling and indexing, but the search appliance administrator, Steve, needs to change the serving configuration.
Now that the search appliance has a rule that creates secure content, the search appliance administrator must define rules for how that content is served to users.
http://spanreports.com/cgi-bin/login.php.
it.spanreports.com share the same cookie domain.
http://spanreports.com/cgi-bin/login.php. This web service handles authentication requests and supports a URL redirect back to the search appliance.
Span Reports now has public and secure search results available on the search appliance: general reports are available to anyone, while IT reports are only available to authorized users who have purchased a subscription.
Carlos is an investor who is interested in viewing an IT report about another company, "XYZ Corp". Carlos opens the search page in a web browser and enters a query for public and secure content about "XYZ Corp IT Evaluation". The search appliance performs the following steps before sending Carlos to the search results page:
http://spanreports.com/cgi-bin/login.php, and includes a return path URL that points back to the search appliance.
, the search appliance should be authorized to request all of the secure IT reports when passing his session cookie.
www.spanreports.com, www.spanreports.com/reports/, and it.spanreports.com/IT_reports/. Because Carlos' browser has a session cookie, he doesn't have to enter his credentials again. When Carlos clicks a link in his search results, his browser sends the same cookie that it used to determine authorization during serve, and Carlos is able to view the document immediately.
The search results page doesn't tell Carlos how many search results match his query or display "Goooooogle" links, since that reveals how many secure documents exist in the index.
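The redirect to the login service with a return path can be sketched as simple URL construction. In the example below, the query parameter name return_path and the search appliance host search.spanreports.com are assumptions made for illustration; the actual parameter name depends on the login service.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

LOGIN_SERVICE = "http://spanreports.com/cgi-bin/login.php"


def login_redirect_url(return_url: str, param: str = "return_path") -> str:
    """Build the URL that sends the browser to the login service."""
    # urlencode percent-encodes the return URL so it survives as one value.
    return LOGIN_SERVICE + "?" + urlencode({param: return_url})


def extract_return_url(redirect_url: str, param: str = "return_path") -> str:
    """Recover the return path, as the login service would."""
    query = parse_qs(urlsplit(redirect_url).query)
    return query[param][0]


# Hypothetical search URL on the search appliance.
search_url = "http://search.spanreports.com/search?q=XYZ+Corp+IT+Evaluation"
redirect = login_redirect_url(search_url)
returned_to = extract_return_url(redirect)
```

After a successful login, the service redirects the browser to the recovered return path, and the browser arrives back at the search appliance carrying the session cookie.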
Jenny isn't a subscriber, but she's also interested in finding an IT report on XYZ Corp. She opens the search page in a web browser and enters the same query for public and secure results about "XYZ Corp IT Evaluation". The search appliance performs the following steps before sending Jenny's browser to the search results page:
http://spanreports.com/cgi-bin/login.php, and includes a return path URL that points back to the search appliance.
it.spanreports.com/IT_reports/. Any IT reports that match Jenny's search are removed from the list of potential results.
www.spanreports.com and www.spanreports.com/reports/, but nothing from it.spanreports.com/IT_reports/.
The search results page doesn't tell Jenny how many search results match her query or display "Goooooogle" links, since that reveals how many secure documents exist in the index.