Google Search Appliance (GB-1001 and GB-7007) software version 4.6.2.S.18 and later
Posted June 2009
This document describes the headrequestor, a process on the search appliance that checks whether a user is authorized to view a secure search result.
The search appliance will use the headrequestor under any of the following conditions:
To understand how the headrequestor works, you need some background on what happens when a user sends a search request to the appliance. Here is what happens:
The user will request a certain number of results, determined by the value of the num parameter, which defaults to 10. In addition, the user specifies a start parameter, which defaults to 0. If the start parameter is larger than zero, for example 50, the search appliance will need to fetch 50 + num results.
If some or all of the search results are marked as secure, the search appliance will need to get authentication credentials from the user.
If the search appliance is configured to serve secure results using Basic Auth or NTLM, it will send a 401 Unauthorized response to the user. The user's web browser will pop up a dialog window asking for the user to enter her username and password. The search appliance will not store the user's password. Each subsequent time the user enters a search request, the username/password will be passed in an encoded format to the search appliance through the Authorization HTTP header. The search appliance can serve secure results using HTTPS on port 443, so the user's password is sent securely to the search appliance.
If you have configured Forms Authentication, the user will be redirected to a login form. The exact mechanism for how the user's authentication cookie gets transferred to the search appliance depends on whether you configure Cookie Forwarding, User Impersonation or an External Login Server.
If the search appliance is configured to use the Authorization SPI, the user can pass her identity using a client certificate or through the Authentication SPI.
The search appliance generates a list of the most relevant documents for the query. The number of documents is always 1000, unless there aren't 1000 relevant documents in the index, in which case the number of documents is the total number of relevant results.
The search appliance will initially generate URLs and snippets for slightly more results than the user requested. For example, if the user requested 10 results, the search appliance will generate the top 15 URLs/snippets. This is done for performance reasons. The request for URLs/snippets is expensive. It is likely that some documents will get filtered in the next step. We want to make allowances for filtering, so that we do not need to repeat the URL/snippet generation stage.
By default, the search appliance filters results that have duplicate snippets or duplicate paths. Users can disable this filtering with the filter parameter. In addition, the search appliance will filter any documents with URLs that match patterns in Remove URLs for the frontend specified.
Filtering at serving time is more expensive than applying filters at indexing time. For example, documents that contain a robots noindex meta tag are not returned in the search results and therefore do not need to be filtered at serving time.
For 4.6.2.x and earlier, if the search appliance does not have sufficient results for the user after filtering, it will generate another set of URLs/snippets. The number that it will generate can depend on many factors. In general it is the following:
( Num of results still needed + 1 ) * ( 1 / Percent valid docs ) + 7
For example, if we tried to get 15 documents, and 10 were filtered, we will need 5 more. The total number of URLs/snippets that will be generated will be 25.
For 4.6.4 and later, if the search appliance does not have sufficient results for the user after filtering, it will generate another set of URLs/snippets. The number of URLs in the set will be 30% more than needed plus 1.
For example, if we tried to get 15 documents, and 9 were filtered, we will need 6 more. The total number of URLs/snippets that will be generated will be 9.
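The two sizing rules above can be checked with a few lines of shell arithmetic. This is only a sketch under stated assumptions: the function names are hypothetical, "Percent valid docs" is taken to be the fraction of fetched documents that survived filtering (which reproduces the 25 in the first example), and the 4.6.4 rule is rounded up to a whole number (which reproduces the 9 in the second example). The appliance's actual rounding is not documented here.

```shell
#!/bin/sh
# Sketch only: function names and rounding choices are assumptions.

# 4.6.2.x and earlier: (needed + 1) * (1 / percent valid) + 7,
# with "percent valid" expressed as valid / fetched (integer division).
next_batch_462() {
    needed=$1 valid=$2 fetched=$3
    echo $(( (needed + 1) * fetched / valid + 7 ))
}

# 4.6.4 and later: 30% more than needed, plus 1, rounded up.
# (needed * 13 + 10) / 10 is needed * 1.3 + 1; adding 9 first rounds up.
next_batch_464() {
    needed=$1
    echo $(( (needed * 13 + 10 + 9) / 10 ))
}

next_batch_462 5 5 15   # 15 fetched, 10 filtered, 5 valid, 5 still needed
next_batch_464 6        # 15 fetched, 9 filtered, 6 still needed
```

Run as written, the two calls print 25 and 9, matching the worked examples.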
If the user is running a secure search, the search appliance will check that the user is authorized to view each document in the results.
The search appliance will mark documents in the index that are secure. Any document that is crawled with Basic Auth or NTLM is marked as secure. The crawler on the search appliance will request Basic Auth or NTLM URLs without sending its credentials. If it gets a 401 response, it will send appropriate credentials by matching the URL against patterns in Crawler Access. The URL will only be marked as secure if it gave a 401 response. Any URL that matches a Forms Authentication pattern will be marked as secure. If the search appliance is configured to use the Authorization SPI, you must also configure the search appliance to mark documents as secure using Crawler Access or Forms Authentication patterns. If you are using the Authorization SPI, then the search appliance can authenticate users with client certificates or the Authentication SPI, eliminating the need for getting credentials with a Basic Auth login dialog or a Forms Auth login form.
The search appliance will send every secure URL that it has obtained from the above steps to the headrequestor process in a single batch.
If a URL is protected by NTLM or Basic Auth, the search appliance sends a HEAD request to the web server with the user's Basic Auth or NTLM credentials. A typical Basic Auth HEAD request looks like this:
HEAD /path/to/file.html HTTP/1.0
Host: hostname
Connection: Keep-Alive
User-Agent: gsa-crawler
Authorization: Basic base64-encoded-credentials
A HEAD request using NTLM requires a challenge and response so it requires two HTTP requests and responses. Here is an example of the HTTP headers for each of the three stages of an NTLM request showing the initial request, the challenge from the server and the response from the client.
HEAD /test1/ HTTP/1.0
Connection: Keep-Alive
Host: ntlmserver:8888
Authorization: NTLM TlRMTVNTUAABAAAAA7IAAAYABgAlAAAABQAFACAAAABURVNUMVpFQUxPVA==

HTTP/1.1 401 Access Denied
Server: Microsoft-IIS/5.0
Date: Fri, 11 Oct 2002 17:07:43 GMT
WWW-Authenticate: NTLM TlRMTVNTUAACAAAAAAAAADAAAAABggAAg4oSng5+tKUAAAAAAAAAAAAAAAAwAAAA
Connection: keep-alive
Content-Length: 3245
Content-Type: text/html

HEAD /test1/ HTTP/1.0
Connection: Keep-Alive
Host: ntlmserver:8888
Authorization: NTLM TlRMTVNTUAADAAAAGAAYAHIAAAAYABgAigAAAAwADABAAAAAHAAcAEwAAAAKAAoAaAAAAAAAAACiAAAAAYIAAFoARQBBAEwATwBUAGkAaQBzAC0AZQBuAHQAZQByAHAAcgBpAHMAZQBUAEUAUwBUADEACEvmEYgvvUlIkhJC+fXM59kBexzXKC382THVxiD3mOKu64xGDo7/EKFCgB3Drs5b
The Google Search Appliance uses HTTP/1.0 only, so your web servers must support HTTP/1.0 keep-alive.
If your web server advertises that it supports both Basic Auth and NTLM in its WWW-Authenticate headers, then the search appliance will use Basic. An example of these headers is below:
WWW-Authenticate: Basic Authentication
WWW-Authenticate: NTLM
If a URL is protected by Forms Authentication, the search appliance sends a GET request to the web server with the user's cookie. The GET request includes the Range header which, if supported by the web server, means that no content will be returned in the body of the response. A typical GET request looks like this:
GET /path/to/file.html HTTP/1.0
Cookie: SMSESSION=cookie-value
Range: bytes=0-0
Host: hostname
Connection: Keep-Alive
If the search appliance is configured to use the Authorization SPI, the headrequestor will use the authz checker process to send a SAML request to the Access Connector URL that is configured in the Admin Console. If the SAML response is indeterminate -- i.e. neither Permit nor Deny -- then the search appliance will also try sending a HEAD or GET request from the headrequestor process, if it has Basic Auth, NTLM or Forms Authentication credentials.
If the search appliance doesn't get sufficient authorized URLs back from the batch sent to the headrequestor it will rerun step #6 above to generate a new batch of URLs to send to the headrequestor.
The hostload settings are used to determine how many simultaneous authorization requests to send to each web server. The default hostload is set to 4, meaning that the search appliance, by default, will not send more than 4 concurrent requests to each web server from the headrequestor. The headrequestor doesn't support hostload exceptions on a per-host basis.
The user is authorized to see that URL in the results if the web server returns a 200, 204 or 206 HTTP status code.
Whether the headrequestor follows 301 and 302 redirects depends on the authentication method the search appliance is using.
The headrequestor always sends an HTTP/1.0 keep-alive header so that it can follow redirects without opening a new TCP connection. If there is no redirect, the search appliance closes the TCP connection once it has received the number of bytes specified by the Content-Length header in the HTTP response. If the web server doesn't send a content-length response, the web server itself will close the connection.
If the headrequestor gets a 2XX response from the target of the redirect then it will assume the user is authorized to view that URL.
A 401, 403 or any other response to the headrequestor will cause the URL not to be displayed in that user's search results. Note that an initial 401 response is expected when using NTLM because the search appliance needs to receive a challenge from the web server.
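The status-code rules above can be reproduced by hand when you want to see what the headrequestor would conclude for a given URL and user. The following is a minimal sketch, not the appliance's implementation: the function names is_authorized and check_url are hypothetical, curl is assumed to be installed on the client, and the sketch covers only the Basic Auth case without following redirects.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical; the status rules are the
# ones stated in the text (200, 204 or 206 means authorized).

# Succeeds (exit status 0) when the HTTP status code means "authorized".
is_authorized() {
    case "$1" in
        200|204|206) return 0 ;;
        *)           return 1 ;;
    esac
}

# Sends an HTTP/1.0 HEAD request with Basic Auth credentials, similar in
# shape to the headrequestor's, and prints the status code and the verdict.
# Usage: check_url URL USERNAME:PASSWORD
check_url() {
    code=$(curl -s -o /dev/null -w '%{http_code}' --head --http1.0 \
        --max-time 2.5 -A gsa-crawler -H 'Connection: Keep-Alive' \
        -u "$2" "$1")
    if is_authorized "$code"; then
        echo "$code authorized"
    else
        echo "$code not authorized"
    fi
}
```

For example, check_url http://hostname/path/to/file.html username:password prints the status code followed by the verdict. A request that exceeds the 2.5 second limit is reported by curl as status 000, which the sketch treats as not authorized.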
By default, the request from the search appliance's headrequestor will time out after 2.5 seconds. You can configure the request timeout in the Admin Console. The request timeout period includes the DNS lookup, if needed, as well as the web server response time.
When a head request times out, the search appliance tries to terminate the network connection normally by sending a FIN packet. Most web servers will not close the connection on their end, but will continue to respond after the search appliance has sent a FIN. The headrequestor ignores these responses since the TCP connection on the search appliance is closed. The search appliance will send a RST packet to the web server when it tries to respond to the timed-out request. In these cases, the TCP connection on the web server will be closed when the web server completes its response.
If you are running a script on your web server that doesn't exit, the web server will not close the connection after sending the response. In this case, the search appliance will send a FIN after the request timeout period. The web server will try to respond to this packet and the search appliance will then send a RST. In these cases, the TCP connection on the web server will be closed after the request timeout, which defaults to 2.5 seconds.
If a request from the headrequestor gets a timeout, the search appliance can retry, up to two times, and then it stops trying to authenticate to a particular URL.
The default batch timeout is 5 seconds. If the search appliance doesn't have sufficient results, it will send another batch to the headrequestor. The headrequestor will not return until all URLs in the batch have been tested. The search appliance will return results after 30 seconds, even if the headrequestor is still running. The batch timeout is configurable in the Admin Console, with a maximum permitted value of 25 seconds.
Here is an example to show how long it will take for a user to see a response to a secure query. Let's assume a query has 100 results. We want to display the first 10 in the first results page. Let's assume that five among the first 10 are secure. The search appliance will try to send simultaneous headrequests for about 10 secure results so that it doesn't need to send a second batch of requests if the user is not authorized to view any of the first five results. The headrequestor receives the entire batch of 10 URLs, but it will send no more than 4 requests concurrently to the web server.
The search appliance will send a second batch of headrequests if it doesn't obtain five secure results that the user is authorized to view in the first batch of headrequests. Each batch request can take up to 5 seconds. The search appliance continues to send batches of headrequests until it gets 10 good results or until 30 seconds have passed.
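The numbers in this example can be turned into rough upper bounds. The sketch below is a simplification, not a model of the appliance's scheduler: the function names are hypothetical, and it assumes that all 10 URLs in the batch live on the same web server so that the hostload of 4 applies to every request.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical; the inputs are the default
# values quoted in the text.

# Rounds of requests needed when at most HOSTLOAD requests run at once.
# Usage: request_rounds BATCH_SIZE HOSTLOAD
request_rounds() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# Whole batches that fit before the overall serving limit is reached.
# Usage: max_batches OVERALL_LIMIT_SECS BATCH_TIMEOUT_SECS
max_batches() {
    echo $(( $1 / $2 ))
}

echo "rounds per batch: $(request_rounds 10 4)"
echo "batches in 30 s:  $(max_batches 30 5)"
```

With the defaults, a batch of 10 URLs needs 3 rounds of at most 4 concurrent requests, and at most 6 batches that each run to the 5 second batch timeout fit within the 30 second limit.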
By default, the search appliance caches the results of the headrequestor for one hour. It caches up to 10,000 entries. The least recently used entries are purged to make room for new ones. You can flush the cache or set the timeout on cache entries in the Admin Console in 4.x versions (go to Admin Console > Serving > Authorization for 4.x or Admin Console > Serving > Access Control for 4.6.x). Versions 3.4.14 and earlier do not have this capability in the Admin Console. For these versions, the only way to flush the cache is to reboot the search appliance.
The headrequestor can be configured to add a host to a list of unreachable hosts if requests to that host get 100 request timeouts within 120 seconds. You can specify how long a host will remain on the unreachable list. The headrequestor will not send requests for any user to an unreachable host. This can be used to protect hosts that can be overwhelmed by the headrequestor.
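The unreachable-host rule above is a sliding-window count. As an illustration only, the sketch below applies that rule to a list of timeout timestamps, for instance ones extracted from your own logs. The function name unreachable_at is hypothetical, and whether the appliance counts the window boundary inclusively is an assumption.

```shell
#!/bin/sh
# Sketch only: unreachable_at is a hypothetical helper, not an appliance tool.
# Reads one timeout timestamp (in seconds) per line on stdin, in ascending
# order, and prints the timestamp at which LIMIT timeouts have occurred
# within WINDOW seconds. Prints nothing if that never happens.
# Usage: unreachable_at [LIMIT] [WINDOW]   (defaults: 100 and 120)
unreachable_at() {
    awk -v limit="${1:-100}" -v window="${2:-120}" '
        {
            t[NR] = $1
            # Advance the left edge past timeouts older than the window.
            while (t[NR] - t[start + 1] >= window) start++
            # NR - start timeouts remain inside the window.
            if (NR - start >= limit) { print $1; exit }
        }'
}

# Small demonstration with a limit of 3 timeouts in 10 seconds.
printf '%s\n' 0 4 9 | unreachable_at 3 10
```

Called without arguments, the function uses the 100 timeouts in 120 seconds quoted in the text.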
If secure documents do not appear in the search results, the cause may be that the headrequestor is not getting an authorized response. Here are some ways to verify what responses the headrequestor obtains.
If your web server uses the W3C log format, you should include the "time-taken" field in the log format. Details for this log format are available at: http://www.w3.org/TR/WD-logfile.html
nc hostname 80 <<EOF
GET /path/to/file.html HTTP/1.0
Cookie: SMSESSION=cookie-value
Range: bytes=0-0
Host: hostname
Connection: Keep-Alive

EOF
Here is one method to determine whether the search appliance is getting timeouts from the headrequestor. The information below is designed for search administrators who are familiar with network troubleshooting tools such as tcpdump.
You can run these commands from a Unix/Linux system or from Windows with Cygwin installed. You may have to modify the commands slightly due to slight differences between various operating systems.
First, generate the authentication credentials. For example, you can use the following command to generate a Basic Authorization HTTP header:
$ echo -n "username:password" | uuencode -m foo
begin-base64 640 foo
dXNlcm5hbWU6cGFzc3dvcmQ=
====
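If uuencode is not installed on your system, the same value can usually be produced with the base64 utility. This is a small sketch that assumes base64 is available on the client; basic_auth_header is a hypothetical helper name, not part of any tool mentioned in this document.

```shell
#!/bin/sh
# Sketch only: basic_auth_header is a hypothetical helper name.
# Usage: basic_auth_header USERNAME PASSWORD
basic_auth_header() {
    # printf avoids encoding a trailing newline; tr removes the line
    # wrapping that base64 applies to long input.
    token=$(printf '%s:%s' "$1" "$2" | base64 | tr -d '\n')
    echo "Authorization: Basic $token"
}

basic_auth_header username password
```

For the sample credentials, this prints the header with the same encoded value that the uuencode command produces.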
The search appliance caches results of requests for up to one hour. Therefore, you should select a username that has not made any queries to the search appliance for the past hour.
Next, run a series of search queries against the search appliance using these credentials. The later queries may include results that have already been cached by the headrequestor. In our tests, we have found that there is not a significant number of cache hits if you do a series of 100 queries on a corpus of 80,000 documents.
Run the search queries against an appliance that isn't answering any queries from other clients and which has a paused crawl. Be sure to run this script from a client that has a fast network connection to the search appliance.
Create a file, named qterms, containing one URL-escaped query term per line, then run the following script:
for q in `cat qterms`
do
  echo -n "$q "
  time wget -o /dev/null -q --header="Authorization: Basic dXNlcm5hbWU6cGFzc3dvcmQ=" -O /dev/null \
    "http:///search?q=$q&site=my_collection&output=xml_no_dtd&client=my_collection&num=10&access=a&filter=0" 2>&1 \
    | grep elapsed | sed -e 's/.* \(.*\)elapsed.*/\1/g'
done
The output shows the query term and how long it took to return. It will look something like this:
pager 0:01.58
%22cash+cow%22 0:01.65
swot+analysis 0:01.60
computer 0:01.60
phones 0:00.10
paradigm+shift 0:01.63
Note that the behavior of the time command is often system dependent so you may need to alter the substitutions necessary to correctly display the elapsed time.
While this script is running, you can sniff the connection between the search appliance and the web server to see the requests. Assuming you only have one web server, you can run the following command:
tcpdump -i eth0 -w /tmp/dump.out port 80 and host web-server-hostname
The above command will generate a lot of data. You can usually terminate it after a few seconds to get sufficient packets to analyse. Use the following script to analyse the output:
#!/bin/sh
# Summarize a tcpdump capture, printing one line per appliance-side port.
tcpdumpfile=/tmp/dump.out
echo "Port Packets First packet Last packet Total time"
first_packet_time=""
i=0
# Drop packets addressed to the web server's http port, leaving packets sent
# to the appliance, then cut out the appliance-side port number. The byte
# offsets 15-19 depend on the hostname length in your tcpdump output.
for port in `tcpdump -r $tcpdumpfile | cut -d ">" -f 2 | grep -v http | cut -b 15-19 | sort | uniq`
do
  packets=`tcpdump -r $tcpdumpfile | grep -c "$port"`
  # Timestamps (HH:MM:SS.fraction) of the first and last packets on this port.
  first=`tcpdump -r $tcpdumpfile | grep "$port" | cut -d " " -f 1 | head -1`
  last=`tcpdump -r $tcpdumpfile | grep "$port" | cut -d " " -f 1 | tail -1`
  # Split each timestamp into whole seconds since the epoch and the fraction.
  first_time=`echo $first | cut -d "." -f 1`
  first_secs=`date -d "$first_time" +%s`
  first_frac=`echo $first | cut -d "." -f 2`
  last_time=`echo $last | cut -d "." -f 1`
  last_secs=`date -d "$last_time" +%s`
  last_frac=`echo $last | cut -d "." -f 2`
  # The first port processed defines time zero for the report.
  if [ -z "$first_packet_time" ]; then
    first_packet_time=$first_secs.$first_frac
  fi
  first_display=`echo $first_secs.$first_frac - $first_packet_time | bc`
  last_display=`echo $last_secs.$last_frac - $first_packet_time | bc`
  total=`echo $last_secs.$last_frac - $first_secs.$first_frac | bc`
  printf "%-7d %5d %11.2f %11.2f %11.2f\n" $port $packets $first_display $last_display $total
  i=`expr $i + 1`
done
printf "\nTotal number of connections: %d\n" $i
Note that your version of tcpdump may give slightly different output, which would require you to modify the separators used in the cut command.
The output shows a single record for each head request. We show the port of the request; the number of packets in the request; the number of seconds after the start in which the first and last packets in a connection are seen; the total time between the first and last packets in the connection.
If you see just one packet in a request then the TCP handshake has failed. If you see approximately 7 packets then it is likely that the search appliance sent the request but the web server didn't respond.
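These two rules of thumb can be applied to the script's output automatically. The filter below is only a sketch: flag_connections is a hypothetical name, and the 6 to 8 packet band is an assumed reading of "approximately 7" that you may need to adjust for your own network.

```shell
#!/bin/sh
# Sketch only: flag_connections is a hypothetical helper, and the 6-8 packet
# band is an assumed interpretation of "approximately 7 packets".
# Reads the analysis script's output on stdin and prints suspect ports.
flag_connections() {
    awk '$1 ~ /^[0-9]+$/ {
        if ($2 == 1)                 print $1, "handshake failed"
        else if ($2 >= 6 && $2 <= 8) print $1, "possible timeout"
    }'
}
```

Pipe the output of the analysis script into flag_connections. The header row and the final total line are skipped because their first field is not a port number.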
Here is some example output for a single search query against a slow web server. We are using the default hostload of 4 with a request timeout of 5 seconds.
Port     Packets  First packet  Last packet  Total time
54083         11          0.00        11.78       11.78
54084         11          0.00         7.66        7.66
54085         11          0.00        11.60       11.60
54086         11          0.00        11.12       11.12
54089          8          5.01        10.02        5.01
54090         13          5.01        14.68        9.67
54091         12          5.01        14.77        9.76
...
54452          7        125.40       130.41        5.01
54456          5        130.33       130.55        0.22
54457          5        130.33       130.55        0.22
54458          5        130.39       130.55        0.16
54459          5        130.41       130.55        0.14

Total number of connections: 120
The above analysis does not give any information on the number of cache hits, the number of requests that were not needed for displaying results, or the queueing due to hostload limitations.
If you are getting lots of unexplained timeouts, you should check that there are no network errors, such as excessive collisions, that could indicate a possible speed/duplex mismatch.