Google Search Appliance software version 6.0
Posted June, 2009
Revised September, 2009: Added information about the failure of a primary node in an index replication configuration.
This guide contains the information you need to use distributed crawling and serving (also called distributed crawling) and index replication, two features of the Google Search Appliance.
You can configure distributed crawling and serving and index replication together or separately.
These are beta features. They are not supported by Google at this time.
This document is for you if you are a search appliance administrator, network administrator, or another person who configures search appliances or networks. You need to be familiar with the Google Search Appliance and how to configure crawl, serve, and other features.
Note that on the Admin Console, distributed crawling and index replication are configured under Admin Console > Multibox. In the help system on the Google Search Appliance, the features are referred to as multibox features. Distributed crawling and serving is called multibox collaboration. Index replication is called multibox replication.
Warning: Do not create both a dynamic scalability configuration and a distributed crawling or index replication configuration on the same set of search appliances. Configure either dynamic scalability, or distributed crawling and index replication, but not both.
Distributed crawling and serving and index replication are Google Search Appliance features that expand the search appliance's capacity in different ways.
For example, if you have four search appliances that are each licensed to crawl 10 million documents, you can crawl a total of 40 million documents by putting the search appliances in a distributed crawling configuration. After distributed crawling is enabled, all crawling, indexing, and serving are configured on one search appliance, called the master search appliance.
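The capacity arithmetic in the example above can be sketched as a simple sum. This is only an illustration: the function name `total_crawl_capacity` is hypothetical, and treating the combined capacity as the sum of the individual license limits is an assumption generalized from the four-appliance example.

```python
def total_crawl_capacity(licensed_limits):
    """Return the combined document capacity of a distributed crawling configuration.

    Assumption: the combined capacity is the sum of each appliance's licensed limit,
    as in the four-appliance example in the text.
    """
    return sum(licensed_limits)


# Four search appliances, each licensed to crawl 10 million documents.
limits = [10_000_000] * 4
combined = total_crawl_capacity(limits)
print(combined)
```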
After index replication is enabled, all crawling is automatically performed by the primary search appliance, limiting the load on content servers. Configuration information and the index are replicated to the other search appliances in the configuration. All search appliances serve from the same index and corpus.
You can use distributed crawling and index replication separately or together. Distributed crawling and index replication both work best when latency on your network is low and bandwidth is high, for example, when all search appliances are in the same data center.
In the following diagram, four search appliances are configured with distributed crawling. Each search appliance is designated as a particular shard in the distributed crawling configuration. Shard 0 is the master search appliance. The shard number is incremented by 1 for each additional search appliance in the configuration. The distributed crawling configuration is created on the master and the settings are exported in a configuration file. The configuration file is uploaded to Shard 1, Shard 2, and Shard 3. After the configuration file is uploaded, all search appliance features are configured on the master. The crawl is distributed among the search appliances and a single index is created. Each search appliance is considered a primary (non-replica) search appliance. All of the search appliances can serve results. The results for a search query will be identical regardless of which search appliance serves the results.

After the distributed crawl configuration is set up, the four search appliances behave as if they were a single search appliance. Crawling, serving, collections, front ends, and other features are configured on Shard 0, the master node of the configuration. Feeds are sent only to the master. The crawl process is automatically distributed among the four search appliances. Any of the nodes can serve results. Each search appliance in the distributed crawl configuration communicates with all of the other search appliances. The diagram above does not show each of the connections between search appliances.
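The shard numbering described above (Shard 0 is the master, and each additional search appliance gets the next number) can be sketched as follows. The host names and the `assign_shards` helper are hypothetical and exist only for planning purposes; the actual assignment is made on the Admin Console.

```python
def assign_shards(hosts):
    """Map shard numbers to hosts: the first host is shard 0 (the master),
    and the shard number increases by 1 for each additional host."""
    if not hosts:
        raise ValueError("at least one search appliance is required")
    return dict(enumerate(hosts))


# Hypothetical host names for a four-node distributed crawling configuration.
hosts = [
    "gsa-a.example.com",
    "gsa-b.example.com",
    "gsa-c.example.com",
    "gsa-d.example.com",
]
shards = assign_shards(hosts)
master = shards[0]  # all features are configured on this node
for number, host in shards.items():
    role = "master" if number == 0 else "non-master"
    print(f"Shard {number}: {host} ({role})")
```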
In the following diagram, three search appliances are configured for index replication. A search appliance designated as a primary search appliance has two replicas, Replica 1 and Replica 2. After index replication is enabled, all configuration takes place on the primary search appliance. Configuration information and index data are replicated automatically and continuously to the replicas. After all configuration information is replicated, the new index data is loaded for serving.
Only the primary search appliance crawls the content, minimizing the load on content servers and the network. Each search appliance serves search results. Use a load balancer to distribute search queries among the primary search appliance and the replicas in the configuration.
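How a load balancer might spread queries across the primary and its replicas can be illustrated with a round-robin rotation. This is a minimal sketch: round-robin is only one possible policy, the guide does not prescribe one, and the host names below are hypothetical.

```python
import itertools


def round_robin(nodes):
    """Return an endless iterator that cycles through the serving nodes in order."""
    if not nodes:
        raise ValueError("at least one serving node is required")
    return itertools.cycle(nodes)


# Hypothetical host names: one primary and two replicas, all serving the same index.
nodes = [
    "primary.example.com",
    "replica1.example.com",
    "replica2.example.com",
]
picker = round_robin(nodes)
first_four = [next(picker) for _ in range(4)]
print(first_four)
```

Because every node serves from the same replicated index, any rotation policy is expected to return the same results for a given query once replication has caught up.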

The most common index replication configuration has only one replica.
When the primary search appliance in an index replication configuration fails, the replica nodes continue to serve results. You can manually configure a replica node as the primary node. The new primary node resumes crawling and starts from where the failed primary node stopped.
You can enable distributed crawling and index replication on the same group of search appliances. The following illustration shows a four-node distributed crawling configuration where each node has a replica.

The Google Search Appliance uses secret tokens and private IP addresses to enforce security within a distributed crawling or index replication configuration.
The search appliances in a distributed crawling or index replication configuration authenticate each other using shared secret tokens that you provide during configuration. The shared secret tokens must consist only of printable ASCII characters.
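Before entering a secret token, you can check that it contains only printable ASCII characters. The sketch below assumes that "printable ASCII" means code points 0x20 (space) through 0x7E (tilde) and that an empty token is not acceptable; both are assumptions, and `is_valid_secret_token` is a hypothetical helper, not part of the search appliance.

```python
def is_valid_secret_token(token):
    """Return True if the token is non-empty and uses only printable ASCII.

    Assumption: printable ASCII is the range 0x20 (space) to 0x7E (tilde).
    Control characters such as tab and newline are rejected.
    """
    return len(token) > 0 and all(0x20 <= ord(ch) <= 0x7E for ch in token)


print(is_valid_secret_token("s3cret-Token!"))
print(is_valid_secret_token("t\u00f6ken"))
```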
There are no restrictions on the public IP addresses assigned to the search appliances in the configuration beyond a requirement that a search appliance is able to reach another search appliance's public IP address on port 10999.
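One way to confirm the reachability requirement is to attempt a TCP connection to port 10999 from a machine on the same network path. This is a minimal sketch using Python's standard `socket` module; the helper name `can_reach` and the three-second timeout are assumptions, and a successful connection from your workstation does not guarantee that the appliances can reach each other.

```python
import socket

GSA_PORT = 10999  # port named in this guide


def can_reach(host, port=GSA_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For example, `can_reach("gsa-b.example.com")` (a hypothetical host name) returns `False` if the port is blocked by a firewall or the host name does not resolve.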
Certain communications among the search appliances in a distributed crawling or index replication configuration are conducted over a secure private network, including search requests, search credentials transmitted as sessions, and search results that include snippets, whether the results are authorized or not authorized. When you set up a distributed crawling or index replication configuration, you provide special private network IP addresses that the search appliances use for these secure communications. On the Admin Console interface, the private network IP addresses are called multibox network IP addresses.
The following guidelines apply to the private network IP addresses:
This section provides a checklist of information you need to collect and decisions you need to make before you configure distributed crawling or index replication.
| Task | Description | Your Values |
|---|---|---|
| Determine which search appliances will participate in the configuration. | Any Google Search Appliance model running software version 6.0 or later can participate. | |
| Determine the appliance IDs of the participating search appliances. | The appliance IDs can be found on the Admin Console under Administration > License. | |
| Determine the host names or public IP addresses of the search appliances in the configuration. | The host names or IP addresses are required during the initial configuration process. | |
| Determine the network IP addresses for the search appliances. | The network IP addresses, called multibox IP addresses on the Admin Console, are used for communication among the search appliances in the configuration. The network IP addresses must conform to the private address space as defined in RFC 1918 and must not overlap with any other private address space in use on your network. | |
| Determine whether you are configuring distributed crawling, index replication, or both. | You can use the two features separately or together. | |
| Determine which search appliance is the master or primary search appliance in the configuration. | If you use distributed crawling, you configure crawl, search, and index on the master search appliance. If you use index replication, the primary search appliance is the search appliance whose index is replicated to the other nodes. | |
| Determine the secret token that the search appliances will use to recognize each other within the configuration. | The nodes in the configuration use the secret tokens to authenticate to each other. The secret token must include only printable ASCII characters. Each search appliance in a distributed crawling configuration has its own associated secret token, which you specify on the Multibox > Host Configuration page. | |
| If you are setting up distributed crawling, configure feeds only on the master. | Feeds can only be indexed on the master. | |
| If you are using self-signed SSL certificates, make sure that you install the correct root certificate authorities on the Admin Console, on the Administration > Certificate Authorities page. | If the wrong root certificate authorities are installed, you see errors and results are not returned properly on secondary appliances in an index replication configuration. | |
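The checklist requirement that network IP addresses fall within the RFC 1918 private address space, and do not overlap other private ranges in use, can be checked ahead of time. The sketch below uses Python's standard `ipaddress` module; the helper names and the example subnets are hypothetical. It tests the three RFC 1918 blocks explicitly rather than relying on `is_private`, which also matches other reserved ranges such as loopback.

```python
import ipaddress

# The three private address blocks defined in RFC 1918.
RFC1918_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_rfc1918(address):
    """Return True if the address lies inside one of the RFC 1918 blocks."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in RFC1918_NETWORKS)


def overlaps_existing(candidate, in_use):
    """Return True if the candidate subnet overlaps any subnet already in use."""
    candidate_net = ipaddress.ip_network(candidate)
    return any(candidate_net.overlaps(ipaddress.ip_network(n)) for n in in_use)


# Hypothetical values: a proposed multibox subnet and subnets already in use.
print(is_rfc1918("192.168.255.1"))
print(overlaps_existing("192.168.255.0/24", ["10.0.0.0/8", "172.16.0.0/16"]))
```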
Observe the following precautions in configuring distributed crawling and index replication:
Use these high-level instructions to configure distributed crawling. For more detailed instructions, see the online help page for Admin Console > Multibox Beta.
To configure distributed crawling:
Use these high-level instructions to configure index replication. The configuration you are most likely to use has only one replica.
To configure index replication:
To add a replica to an existing primary node:
The easiest way to add or delete a node from a distributed crawling or index replication configuration is to navigate to the Admin Console > Multibox Configuration > Add Node page and check or uncheck the Enable checkbox. When the checkbox is unchecked, the node does not function in the configuration, but it retains its configuration information. You can add a node in disabled mode, then enable it at a later time.