Google Search Appliance

Managing Search for Controlled-Access Content: Use Cases with Cookies and HTML Forms-based Authentication

Google Search Appliance software version 6.0
Posted June 2009

This section explains in more detail how to set up crawl for controlled-access content using HTML forms authentication, and how to enable serve for public and secure documents. HTML forms authentication permits integration with an existing single sign-on system or login server.

Contents

  1. Use Case 4: Cookies or Forms Authentication with Public Serve
    1. Setting up Crawl and Index
    2. Populating the Index for Controlled-Access Content
    3. Serving Controlled-Access Content to the User as Public Content
  2. Use Case 5: Forms Authentication against a Sample Protected URL for Secure Serve
    1. Setting up Crawl and Index
    2. Setting up Serve for Forms Authentication with a Sample Protected URL
    3. Populating the Index for Controlled-Access Content
    4. Serving Controlled-Access Content to the User as Secure Content
      1. Search by an Authorized User
      2. Search by an Unauthorized User
  3. Use Case 6: Forms Authentication with External Login for Secure Serve
    1. Setting up Serve for Forms Authentication with an External Login Server
    2. Serving Controlled-Access Content to the User as Secure Content
      1. Search by an Authorized User
      2. Search by an Unauthorized User

Use Case 4: Cookies or Forms Authentication with Public Serve

Span Reports sells reports on the top 500 companies in its field, and wants to make short excerpts from its business reports available through search. Customers who view the excerpts can then decide whether to purchase access to view the full article.

Span Reports uses a login server to manage customer access to business reports. A web proxy server placed between the search appliance and the Internet acts as a gateway to the search appliance, allowing Span Reports to control and track searches on their site.

  • http://spanreports.com/login/login.html is the login form for Span Reports' single sign-on login server.
  • www.spanreports.com is a web server that hosts business reports that are available for purchase. This server uses persistent cookies that never expire.
  • Public web site content for www.spanreports.com is located in the root directory, while reports are in the subdirectory www.spanreports.com/reports/.
  • it.spanreports.com is another web server that hosts business reports that are available for purchase, but this server uses cookies that expire.

All these servers are located on the same domain. Although authentication is required to access the full text of a report, Span Reports wants to serve the snippet results as public content, viewable by anyone.

Span Reports has these people who interact with this content:

  • Andrée, the system administrator
  • Steve, the search appliance administrator
  • Carlos, a customer who may want to purchase a business report

Caution: When controlled-access content is served as "public" by a search appliance (as shown in this use case), it is available to any user who is able to perform a search query. If you make controlled-access content available to unknown users for public search, you should devise additional protective measures to ensure security. The search appliance does not provide security for documents that are labeled as "public" in the index.

Setting up Crawl and Index

First, the system administrator creates a user account for the search appliance, called crawler, and sets up access policies that ensure that the crawler user account is authorized to view all files on www.spanreports.com and it.spanreports.com.

Next, the search appliance administrator, Steve, logs into the Admin Console and performs these actions:

  1. First, Steve opens Google Search Appliance > Crawl and Index > Crawl URLs and makes sure that the controlled access pages are included in the Crawl URL patterns that have been defined. The search appliance's Start Crawling from the following URLs list contains http://www.spanreports.com/ and http://it.spanreports.com/IT_reports/.
  2. Next, to provide the search appliance with credentials for crawl and index for the server that uses persistent cookies, Steve opens Crawl and Index > Forms Authentication.
  3. Under URL of the login page, Steve enters the URL http://spanreports.com/login/login.html, and under URL pattern for this rule, enters http://www.spanreports.com/reports/, and then clicks Create a New Forms Authentication Rule.
  4. The search appliance proxies the login form. Steve enters the credentials for the crawler user account, and saves the forms authentication rule. The search appliance stores the rule for use in crawl for all content under http://www.spanreports.com/reports/. Content that is crawled through a forms authentication rule can be labeled as either public or secure. For this example, we assume that the content is public. When a cookie expires, the search appliance uses the stored crawler account credentials to request a new session cookie.
  5. Next to the URL Pattern for http://www.spanreports.com/reports/, Steve selects the Make Public checkbox and clicks Save Forms Authentication Rule Configuration to apply the change. Content from this directory is labeled as "public" in the index.
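The cookie renewal mentioned in step 4 can be pictured with a short Python sketch. It is only an illustration under stated assumptions, not the search appliance's implementation: the `CrawlSession` class, the `login` callback that returns a cookie together with its lifetime in seconds, and the injectable clock are all hypothetical names introduced for this example.

```python
import time
from typing import Callable, Optional, Tuple


class CrawlSession:
    """Hypothetical holder for the crawler account's session cookie."""

    def __init__(
        self,
        login: Callable[[], Tuple[str, float]],
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._login = login      # submits the stored crawler credentials
        self._clock = clock      # injectable so the sketch can be tested
        self._cookie: Optional[str] = None
        self._expires_at = 0.0

    def cookie(self) -> str:
        """Return a usable cookie, logging in again if the old one expired."""
        if self._cookie is None or self._clock() >= self._expires_at:
            self._cookie, lifetime = self._login()
            self._expires_at = self._clock() + lifetime
        return self._cookie
```

With a server that issues persistent cookies, the login callback would run only once in this sketch; with a server whose cookies expire, it runs again each time the stored lifetime has passed.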

Now that the search appliance has access to all of the business and IT reports created by Span Reports, the search appliance administrator schedules a crawl and waits for the controlled-access content to appear in the index.

Populating the Index for Controlled-Access Content

During crawl, the search appliance goes through each of the content sources that have been configured:

  • The search appliance connects to http://www.spanreports.com/. The web server allows the search appliance to crawl and index all the public content.
  • Following links, the search appliance requests a document in the controlled access directory http://www.spanreports.com/reports/. The web server asks for a session cookie: the search appliance recognizes the URL pattern and provides the cookie that was set by the forms authentication rule for http://www.spanreports.com/reports/. The web server verifies that crawler has access to view documents in the controlled access directory. The search appliance crawls through all documents on http://www.spanreports.com/reports and adds the documents to the index.
  • The search appliance connects to http://it.spanreports.com/IT_reports/ and provides the cookie that was set for that URL in the Admin Console under Crawl and Index > Forms Authentication. The web server verifies that crawler has access to view documents in the controlled access directory. The search appliance crawls through all documents on http://it.spanreports.com/IT_reports/ and adds them to the index. Because these documents were accessed through a forms authentication rule with Make Public selected, they are labeled as "public" in the index.
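The way a URL is matched to a forms authentication rule during crawl, and then labeled in the index, can be sketched in Python. This is a simplified model rather than the product's actual logic: the `FormsAuthRule` class, the `label_for` function, the cookie values, and the use of plain prefix matching for URL patterns are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class FormsAuthRule:
    """Hypothetical stand-in for one forms authentication rule."""
    url_pattern: str   # pattern the rule covers (treated as a prefix here)
    cookie: str        # cookie obtained when the rule was saved
    make_public: bool  # state of the Make Public checkbox


# Rules as configured in Use Case 4 (cookie values are invented).
RULES = [
    FormsAuthRule("http://www.spanreports.com/reports/", "session=crawler-www", True),
    FormsAuthRule("http://it.spanreports.com/IT_reports/", "session=crawler-it", True),
]


def label_for(url: str, rules: List[FormsAuthRule]) -> Tuple[Optional[str], str]:
    """Return (cookie to send, index label) for a URL that is being crawled."""
    for rule in rules:
        if url.startswith(rule.url_pattern):
            return rule.cookie, "public" if rule.make_public else "secure"
    # No rule matches: crawl without a cookie and index as ordinary public content.
    return None, "public"
```

In this model, a document under http://www.spanreports.com/reports/ is fetched with the crawler cookie and labeled "public", while a page in the root directory needs no cookie at all. Clearing `make_public` on a rule would change the label for its documents to "secure".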

Serving Controlled-Access Content to the User as Public Content

Span Reports has decided to make the search results public: although users must purchase the reports in order to view the full text, anyone can discover which reports are relevant by performing a search query.

Carlos is an investor who wants to know whether the site offers a report on ABC Company's presence in Japan. Carlos opens the search page in a web browser and enters a query for "ABC Company Japan".

The search appliance performs the following steps before sending Carlos to the search results page:

  1. The search appliance queries the index and obtains a list of relevant results for Carlos' query.
  2. The search appliance filters the list of results as specified by the front end that applies to Carlos' search. It applies the filters defined in Serving > Front Ends > Filters and excludes all URLs listed in Serving > Front Ends > Remove URLs.
  3. The search appliance checks the list to see whether any of the results require authorization. Although the search appliance had to provide credentials to index the content, the Make Public checkbox is selected for all of Span Reports' content sources. All content in the index is labeled as public: no authorization check is required.
  4. The search appliance directs Carlos' browser to a search results page that contains all reports that match the query "ABC Company Japan". Carlos should see results from www.spanreports.com, www.spanreports.com/reports/, and it.spanreports.com/IT_reports/.
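The serve-time steps above can be sketched as a small Python function. It is a rough model under stated assumptions, not the search appliance's code: the `serve_public` function, the example document URLs, and the `/drafts/` Remove URLs pattern are hypothetical, and Remove URLs patterns are treated as simple prefixes.

```python
from typing import Dict, List, Tuple


def serve_public(
    matches: List[str],
    remove_url_patterns: List[str],
    labels: Dict[str, str],
) -> Tuple[List[str], bool]:
    """Apply Remove URLs patterns, then report whether any result is secure."""
    kept = [
        url for url in matches
        if not any(url.startswith(p) for p in remove_url_patterns)
    ]
    needs_authorization = any(labels.get(url) == "secure" for url in kept)
    return kept, needs_authorization


# Hypothetical index matches for the query "ABC Company Japan".
MATCHES = [
    "http://www.spanreports.com/about/japan.html",
    "http://www.spanreports.com/reports/abc-japan.html",
    "http://it.spanreports.com/IT_reports/abc-japan-it.html",
    "http://www.spanreports.com/drafts/abc-japan-draft.html",
]
LABELS = {url: "public" for url in MATCHES}           # Make Public is selected everywhere
REMOVE_URLS = ["http://www.spanreports.com/drafts/"]  # hypothetical pattern

results, needs_authorization = serve_public(MATCHES, REMOVE_URLS, LABELS)
```

Because every label is "public", `needs_authorization` is false and the first three URLs are returned without any login prompt.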

Content on www.spanreports.com doesn't require a login. For any links that point to files in the top-level directory, Carlos doesn't have to enter his credentials to view the content. However, when Carlos clicks a link to a controlled access report, the server that hosts the page asks for authentication. If Carlos hasn't logged in, he has to enter a username and password. Although the search appliance indexed the content as "public", the server still requires credentials before it displays a full document.

The next time that Carlos clicks a link on his search results page, however, his web browser provides the session cookie that was set when he logged in. If all the servers in this example are on the same domain and accept the same credentials, Carlos shouldn't have to log in again for as long as he keeps the browser open.

Use Case 5: Forms Authentication against a Sample Protected URL for Secure Serve

After a brief promotional period, the Span Reports company from Use Case 4 wants to change its access policy so that IT reports are discoverable only to registered members. IT reports are in the controlled access directory: it.spanreports.com/IT_reports/. The search appliance administrator, Steve, has some work to do.

Setting up Crawl and Index

  1. First, Steve checks to make sure that forms authentication is applicable for this situation:
    • Span Reports uses a single sign-on server to manage account login.
    • The IT Report content must be served as "secure content".
  2. Next, Steve must make sure that the search appliance has credentials for crawl and index on it.spanreports.com. He opens Crawl and Index > Forms Authentication.
  3. The search appliance already has a forms authentication rule for IT reports. Under Crawl and Index > Forms Authentication, next to the URL Pattern for http://it.spanreports.com/IT_reports/, Steve clears the Make Public checkbox and clicks Save Forms Authentication Rule Configuration to apply the change. Content from this directory is labeled as "secure" in the index.

Now that the search appliance has access to all of the business and IT reports created by Span Reports, and the IT reports are no longer made public, the search appliance administrator schedules a crawl and waits for the change to appear in the index.

Setting up Serve for Forms Authentication with a Sample Protected URL

Now that the search appliance has a rule that creates secure content, the search appliance administrator must define rules for how that content is served to users.

  1. Steve logs into the Admin Console and chooses Serving > Forms Authentication.
  2. First, Steve checks to make sure that forms authentication with a Sample Protected URL is applicable for this situation:
    • When you try to open the Sample Protected URL, the server presents unauthenticated users with a simple login form.
    • The login form is http://spanreports.com/login/login.html.
    • The form uses HTML (it can contain JavaScript, but no frames).
    • The form can redirect the user back to the search appliance, with an added parameter that indicates that this is a redirect back and not a new request.
  3. To enable Forms Authentication, he selects Login against a sample protected URL, and under Sample URL, he enters http://it.spanreports.com/IT_reports/index.html. This is a landing page that redirects unregistered viewers to a login form, and that all registered users can view, once logged in.
  4. He clicks Save Forms Authentication Serving Configuration to save his changes.

Populating the Index for Controlled-Access Content

During crawl, the search appliance goes through each of the content sources that have been configured:

  • For content on http://www.spanreports.com/, the search appliance crawls and indexes the content and labels it "public", as before.
  • The search appliance connects to http://it.spanreports.com/IT_reports/. The web server asks for a session cookie: the search appliance recognizes the URL pattern and provides the cookie that was set in the Admin Console under Crawl and Index > Forms Authentication. The web server verifies that crawler has access to view documents in the controlled access directory. The search appliance crawls through all documents on http://it.spanreports.com/IT_reports/ and adds them to the index. However, because these documents were accessed through a forms authentication rule with Make Public cleared, this time, they are labeled as "secure" in the index.

Serving Controlled-Access Content to the User as Secure Content

Span Reports now has public and secure search results available on the search appliance: general reports are available to anyone, while IT reports are only available to authorized users who have purchased a subscription.

Search by an Authorized User

Carlos is an investor who is interested in viewing an IT report about another company, "XYZ Corp". Carlos opens the search page in a web browser and enters a query for public and secure content about "XYZ Corp IT Evaluation". The search appliance performs the following steps before sending Carlos to the search results page:

  1. The search appliance queries the index and obtains a list of relevant results for Carlos' query.
  2. The search appliance filters the list of results as specified by the front end that applies to Carlos' search. It applies the filters defined in Serving > Front Ends > Filters and excludes all URLs listed in Serving > Front Ends > Remove URLs.
  3. Because Carlos has searched for public and secure content, the search appliance must authenticate Carlos' identity. Carlos has not logged into the corporate single sign-on system yet, so his browser does not have search appliance or SSO session cookies to send along with his request. This triggers the search appliance to request authentication.
  4. The search appliance proxies the login form and asks Carlos to enter his credentials.
  5. Carlos logs into the search appliance's login form. The search appliance forwards Carlos' login request to the single sign-on server and saves a copy of the cookie returned by the server. If the search appliance receives cookies from the single sign-on server that Carlos did not already have, such as a corporate-level SSO cookie, the search appliance sends those cookies on to Carlos' browser. This generally means that Carlos does not need to log in again when he visits another single sign-on protected application on the same domain.
  6. In each secure search, the search appliance monitors both Carlos's cookies and its single sign-on server cookies for changes, and copies safe changes from one to the other.
  7. Using the cookie retrieved from the single sign-on server, the search appliance performs an HTTP GET request for 0 bytes for each of the secure documents in the list of results. If the server returns "HTTP status 401" (not authorized) for a document, or the authorization attempt is inconclusive, the document is removed from the list of potential results. Because Carlos is a paid subscriber, the search appliance should be authorized to request all of the secure IT reports when passing his cookie.
  8. The search appliance creates a list of search result snippets and URLs that meet all of the following criteria:
    • URLs match Carlos' search query.
    • URLs are not excluded by a filter in Carlos' front end.
    • URLs are not excluded by a Remove URL in Carlos' front end.
    • The URL is public or Carlos has authorization to view the URL.
  9. The search appliance directs Carlos' browser to the search results page that contains all reports that match the query "XYZ Corp IT Evaluation". Carlos should see results from www.spanreports.com, www.spanreports.com/reports/, and it.spanreports.com/IT_reports/.
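The authorization check in the steps above can be sketched in Python. This is a minimal model, not the search appliance's implementation: the `authorized_results` function, the `fake_server` stand-in for the content server, the cookie value, and the example URLs are hypothetical, and the sketch assumes that only an HTTP 200 response counts as a conclusive "authorized" answer.

```python
from typing import Callable, Dict, List, Optional


def authorized_results(
    candidates: List[str],
    labels: Dict[str, str],
    cookie: Optional[str],
    check_status: Callable[[str, Optional[str]], int],
) -> List[str]:
    """Keep public URLs, plus secure URLs the user's cookie can fetch."""
    visible = []
    for url in candidates:
        if labels.get(url) == "public":
            visible.append(url)
        elif check_status(url, cookie) == 200:
            visible.append(url)
        # A 401 or any other (inconclusive) status drops the document.
    return visible


def fake_server(url: str, cookie: Optional[str]) -> int:
    """Stand-in for the content server: only one invented cookie is accepted."""
    return 200 if cookie == "sso=carlos" else 401


CANDIDATES = [
    "http://www.spanreports.com/reports/xyz-overview.html",
    "http://it.spanreports.com/IT_reports/xyz-it-evaluation.html",
]
LABELS = {CANDIDATES[0]: "public", CANDIDATES[1]: "secure"}

carlos_results = authorized_results(CANDIDATES, LABELS, "sso=carlos", fake_server)
```

With the accepted cookie, both documents stay in the list. Calling the same function with no cookie would leave only the public report, since every secure document fails the check.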

Because the search appliance creates a session cookie on Carlos' computer, he doesn't have to enter his credentials again. When he clicks a link in his search results, his browser sends both the search appliance session cookie and the single sign-on cookie that the search appliance passed on when it proxied the login form. Because that cookie is re-used, Carlos can view the document immediately.

The search results page doesn't tell Carlos how many search results match his query or display "Goooooogle" links, since that reveals how many secure documents exist in the index. Similarly, the search appliance requests authentication any time Carlos searches for public and secure content, regardless of whether any secure content is applicable to this query.

Search by an Unauthorized User

Jenny isn't a subscriber, but she's also interested in finding an IT report on XYZ Corp. She opens the search page in a web browser and enters the same query for public and secure content about "XYZ Corp IT Evaluation". The search appliance performs the following steps before sending Jenny's browser to the search results page:

  1. The search appliance queries the index and obtains a list of relevant results for Jenny's query.
  2. The search appliance filters the list of results as specified by the front end that applies to Jenny's search. It applies the filters defined in Serving > Front Ends > Filters and excludes all URLs listed in Serving > Front Ends > Remove URLs.
  3. Because Jenny searched for public and secure content, the search appliance needs more information before it can serve results. Jenny hasn't logged in, so her browser doesn't have a session cookie to send to the search appliance for authorization.
  4. The search appliance proxies the login form and asks Jenny to enter her credentials.
  5. Jenny isn't a subscriber, so she clicks Cancel to exit the login form. The search appliance can't set a session cookie.
  6. The search appliance performs an HTTP GET request of 0 bytes for each of the secure documents in the list of results. Because the search appliance doesn't have a cookie to use in its request, the server returns "HTTP status 401" (not authorized) for all documents in it.spanreports.com/IT_reports/. Any IT reports that match Jenny's search are removed from the list of potential results.
  7. The search appliance creates a list of search result snippets and URLs that meet all of the following criteria:
    • URLs match Jenny's search query.
    • URLs are not excluded by a filter in Jenny's front end.
    • URLs are not excluded by a Remove URL in Jenny's front end.
    • The URL is public or Jenny has authorization to view the URL.
  8. The search appliance directs Jenny's browser to the search results page that contains all reports that match the query "XYZ Corp IT Evaluation". Jenny should see results from www.spanreports.com and www.spanreports.com/reports/, but nothing from it.spanreports.com/IT_reports/.

The search results page doesn't tell Jenny how many search results match her query or display "Goooooogle" links, since that reveals how many secure documents exist in the index.

Use Case 6: Forms Authentication with External Login for Secure Serve

The Span Reports company from Use Case 5 wants to change its site design to add frames to the login form for the IT reports that are only available to registered members. IT reports are in the controlled access directory: it.spanreports.com/IT_reports/.

Everything stays the same for crawling and indexing, but the search appliance administrator, Steve, needs to change the serving configuration.

Setting up Serve for Forms Authentication with an External Login Server

Now that the search appliance has a rule that creates secure content, the search appliance administrator must define rules for how that content is served to users.

  1. Steve logs into the Admin Console and chooses Serving > Forms Authentication.
  2. First, Steve checks to make sure that Forms Authentication with an external login server is applicable for this situation:
    • Span Reports uses an external login server to check a user's credentials. The web service that handles authentication requests for the external login server is http://spanreports.com/cgi-bin/login.php.
    • The login form isn't simple HTML: the Sample Protected URL method doesn't work.
    • The search appliance and it.spanreports.com share the same cookie domain.
    • The session cookie set by the login form doesn't check for an IP address and can be proxied.
  3. To enable Forms Authentication with an external login server, Steve selects Always redirect to external login server, and under Redirect URL, he enters http://spanreports.com/cgi-bin/login.php. This web service handles authentication requests and supports a URL redirect back to the search appliance.
  4. Steve clicks Save Forms Authentication Serving Configuration to save his changes.

Serving Controlled-Access Content to the User as Secure Content

Span Reports now has public and secure search results available on the search appliance: general reports are available to anyone, while IT reports are only available to authorized users who have purchased a subscription.

Search by an Authorized User

Carlos is an investor who is interested in viewing an IT report about another company, "XYZ Corp". Carlos opens the search page in a web browser and enters a query for public and secure content about "XYZ Corp IT Evaluation". The search appliance performs the following steps before sending Carlos to the search results page:

  1. The search appliance queries the index and obtains a list of relevant results for Carlos' query.
  2. The search appliance filters the list of results as specified by the front end that applies to Carlos' search. It applies the filters defined in Serving > Front Ends > Filters and excludes all URLs listed in Serving > Front Ends > Remove URLs.
  3. Because Carlos has requested public and secure results, the search appliance needs more information before it can serve results.
  4. The search appliance redirects Carlos to the external login server's web service http://spanreports.com/cgi-bin/login.php, and includes a return path URL that points back to the search appliance.
  5. Carlos hasn't logged in, so the external login server redirects Carlos to a login page. Carlos provides his credentials.
  6. The login page creates a session cookie for Carlos' browser, and redirects his browser back to the return path URL specified by the search appliance.
  7. Using the session cookie from Carlos' browser, the search appliance performs an HTTP GET request of 0 bytes for each of the secure documents in the list of results. If the server returns "HTTP status 401" (not authorized) for a document, or the authorization attempt is inconclusive, the document is removed from the list of potential results. Because Carlos is a paid subscriber, the search appliance should be authorized to request all of the secure IT Reports when passing his session cookie.
  8. The search appliance creates a list of search result snippets and URLs that meet all of the following criteria:
    • URLs match Carlos' search query.
    • URLs are not excluded by a filter in Carlos' front end.
    • URLs are not excluded by a Remove URL in Carlos' front end.
    • The URL is public or Carlos has authorization to view the URL.
  9. The search appliance directs Carlos' browser to the search results page that contains all reports that match the query "XYZ Corp IT Evaluation". Carlos should see results from www.spanreports.com, www.spanreports.com/reports/, and it.spanreports.com/IT_reports/.
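The redirect in step 4, which sends the user to the external login server along with a return path, can be sketched in Python. The query parameter name `returnPath` and the search appliance host `search.spanreports.com` are assumptions made for this example; the parameter that a real login server expects depends on that server.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

LOGIN_SERVICE = "http://spanreports.com/cgi-bin/login.php"


def login_redirect(login_service: str, return_path: str, param: str = "returnPath") -> str:
    """Build the login URL, carrying the return path as an encoded parameter."""
    return f"{login_service}?{urlencode({param: return_path})}"


def return_path_from(url: str, param: str = "returnPath") -> str:
    """Recover the return path on the login server's side (illustrative)."""
    return parse_qs(urlsplit(url).query)[param][0]


# Hypothetical search URL on the appliance that the user should come back to.
SEARCH_URL = "http://search.spanreports.com/search?q=XYZ+Corp+IT+Evaluation"

redirect_url = login_redirect(LOGIN_SERVICE, SEARCH_URL)
```

Encoding the return path keeps its own query string intact, so after login the server can send the browser back to exactly the search that triggered the redirect.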

Because Carlos' browser has a session cookie, he doesn't have to enter his credentials again. When Carlos clicks a link in his search results, his browser sends the same cookie that the search appliance used to determine authorization during serve, and Carlos is able to view the document immediately.

The search results page doesn't tell Carlos how many search results match his query or display "Goooooogle" links, since that reveals how many secure documents exist in the index.

Search by an Unauthorized User

Jenny isn't a subscriber, but she's also interested in finding an IT report on XYZ Corp. She opens the search page in a web browser and enters the same query for public and secure results about "XYZ Corp IT Evaluation". The search appliance performs the following steps before sending Jenny's browser to the search results page:

  1. The search appliance queries the index and obtains a list of relevant results for Jenny's query.
  2. The search appliance filters the list of results as specified by the front end that applies to Jenny's search. It applies the filters defined in Serving > Front Ends > Filters and excludes all URLs listed in Serving > Front Ends > Remove URLs.
  3. Because Jenny requested public and secure results, the search appliance needs more information before it can serve results.
  4. The search appliance redirects Jenny to the external login server's web service http://spanreports.com/cgi-bin/login.php, and includes a return path URL that points back to the search appliance.
  5. Jenny hasn't logged in, so the external login server redirects her to a login page.
  6. Jenny isn't a subscriber, so she clicks Cancel to exit the login page. The login server can't set a session cookie. It redirects Jenny's browser back to the return path URL specified by the search appliance.
  7. The search appliance performs an HTTP GET request of 0 bytes for each of the secure documents in the list of results. Because the search appliance doesn't have a session cookie from Jenny to use in its request, the server returns "HTTP status 401" (not authorized) for all documents in it.spanreports.com/IT_reports/. Any IT reports that match Jenny's search are removed from the list of potential results.
  8. The search appliance creates a list of search result snippets and URLs that meet all of the following criteria:
    • URLs match Jenny's search query.
    • URLs are not excluded by a filter in Jenny's front end.
    • URLs are not excluded by a Remove URL in Jenny's front end.
    • The URL is public or Jenny has authorization to view the URL.
  9. The search appliance directs Jenny's browser to the search results page that contains all reports that match the query "XYZ Corp IT Evaluation". Jenny should see results from www.spanreports.com and www.spanreports.com/reports/, but nothing from it.spanreports.com/IT_reports/.

The search results page doesn't tell Jenny how many search results match her query or display "Goooooogle" links, since that reveals how many secure documents exist in the index.