Google Search Appliance

Configuring Distributed Crawling and Serving and Index Replication BETA

Google Search Appliance software version 6.0
Posted June, 2009
Revised September, 2009: Added information about the failure of a primary node in an index replication configuration.

This guide contains the information you need to use distributed crawling and serving (also called distributed crawling) and index replication, two features of the Google Search Appliance.

  • Distributed crawling and serving is a scalability feature in which several search appliances are configured to act as though they are a single search appliance, which greatly increases the number of documents that can be crawled and served.
  • Index replication is a feature enabling the index on one search appliance, called a primary node, to be replicated to another search appliance, called a replica, providing easily-enabled high-availability serving in an automatic active/passive configuration.

You can configure distributed crawling and serving and index replication together or separately.

These are beta features. They are not supported by Google at this time.

This document is for you if you are a search appliance administrator, network administrator, or another person who configures search appliances or networks. You need to be familiar with the Google Search Appliance and how to configure crawl, serve, and other features.

Note that on the Admin Console, distributed crawling and index replication are configured under Admin Console > Multibox. In the help system on the Google Search Appliance, the features are referred to as multibox features. Distributed crawling and serving is called multibox collaboration. Index replication is called multibox replication.

Warning: Do not create both a dynamic scalability configuration and a distributed crawling or index replication configuration on the same set of search appliances. Configure either dynamic scalability, or distributed crawling and index replication, but not both.

Contents

  1. Introduction to Distributed Crawling and Index Replication
    1. About Distributed Crawling
    2. About Index Replication
      1. When the Primary Search Appliance Fails
    3. Combining Distributed Crawling and Index Replication
  2. About Security
  3. Before You Configure Distributed Crawling or Index Replication
  4. Configuring Distributed Crawling and Index Replication
    1. Distributed Crawling
    2. Index Replication
  5. Adding or Deleting Nodes

Introduction to Distributed Crawling and Index Replication

Distributed crawling and serving and index replication are Google Search Appliance features that expand the search appliance's capacity in different ways.

  • Distributed crawling and serving is a scalability feature in which several search appliances are configured to act as though they are a single search appliance, which greatly increases the number of documents that can be crawled and served. After distributed crawling is enabled, all crawling, indexing, and serving are configured on one search appliance, called the admin master.

    For example, if you have four search appliances that are each licensed to crawl 10 million documents, you can crawl a total of 40 million documents by putting the search appliances in a distributed crawling configuration.

  • Index replication is a feature enabling the index on one search appliance to be replicated to one or more search appliances, providing easily-enabled high-availability serving in an automatic active/passive configuration.

    After index replication is enabled, all crawling is automatically performed by the primary search appliance, limiting the load on content servers. Configuration information and the index are replicated to the other search appliances in the configuration. All search appliances serve from the same index and corpus.

You can use distributed crawling and index replication separately or together. Distributed crawling and index replication both work best when latency on your network is low and bandwidth is high, for example, when all search appliances are in the same data center.

About Distributed Crawling

In the following diagram, four search appliances are configured with distributed crawling. Each search appliance is designated as a particular shard in the distributed crawling configuration. Shard 0 is the master search appliance. The shard number is incremented by 1 for each additional search appliance in the configuration. The distributed crawling configuration is created on the master and the settings are exported in a configuration file. The configuration file is uploaded to Shard 1, Shard 2, and Shard 3. After the configuration file is uploaded, all search appliance features are configured on the master. The crawl is distributed among the search appliances and a single index is created. Each search appliance is considered a primary (non-replica) search appliance. All of the search appliances can serve results. The results for a search query will be identical regardless of which search appliance serves the results.

Graphic showing four search appliances in a distributed crawling configuration, shard 0 through shard 3.

After the distributed crawl configuration is set up, the four search appliances behave as if they are a single search appliance. Crawling, serving, collections, front ends, and other features are configured on Shard 0, the master node of the configuration. Feeds are sent only to the admin master. The crawl process is automatically distributed among the four search appliances. Any of the nodes can serve results. Each search appliance in the distributed crawl configuration communicates with all of the other search appliances. The diagram above does not show each of the connections between search appliances.
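
The shard numbering and the capacity example given earlier (four appliances at 10 million documents each) can be sketched in a few lines of Python. This is only a planning aid under stated assumptions: the host names are hypothetical, and the `Shard` class and `build_shards` helper are illustrative names, not part of the search appliance software.

```python
from dataclasses import dataclass


@dataclass
class Shard:
    number: int          # Shard 0 is the master; each additional appliance is +1
    host: str            # hypothetical host name, for illustration only
    licensed_docs: int   # documents this appliance is licensed to crawl

    @property
    def is_master(self) -> bool:
        return self.number == 0


def build_shards(hosts, licensed_docs_each):
    """Assign shard numbers 0..n-1 in the order the hosts are listed."""
    return [Shard(i, host, licensed_docs_each) for i, host in enumerate(hosts)]


# Hypothetical host names; substitute your own appliances.
shards = build_shards(
    ["gsa0.example.com", "gsa1.example.com", "gsa2.example.com", "gsa3.example.com"],
    10_000_000,
)

# Combined capacity of the configuration.
total_licensed = sum(s.licensed_docs for s in shards)

# The configuration file exported from the master is uploaded to every other shard.
import_targets = [s.host for s in shards if not s.is_master]
```

Here `total_licensed` is the combined document capacity, and `import_targets` lists the appliances that still need the exported configuration file.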

About Index Replication

In the following diagram, three search appliances are configured for index replication. A search appliance designated as a primary search appliance has two replicas, Replica 1 and Replica 2. After index replication is enabled, all configuration takes place on the primary search appliance. Configuration information and index data are replicated automatically and continuously to the replicas. After all configuration information is replicated, the new index data is loaded for serving.

Only the primary search appliance crawls the content, minimizing the load on content servers and the network. Each search appliance serves search results. Use a load balancer to distribute search queries among the primary search appliance and the replicas in the configuration.
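
To illustrate what distributing queries among the primary and the replicas means, here is a minimal round-robin sketch in Python. In practice this job is done by a dedicated load balancer, and this guide does not prescribe a balancing policy; round-robin is simply one common choice, and the host names are hypothetical.

```python
from itertools import cycle

# Hypothetical host names for the primary and its two replicas.
SERVING_NODES = [
    "gsa-primary.example.com",
    "gsa-replica1.example.com",
    "gsa-replica2.example.com",
]


def make_picker(hosts):
    """Return a function that yields the next host in rotation on each call."""
    rotation = cycle(hosts)
    return lambda: next(rotation)


pick_node = make_picker(SERVING_NODES)
# Each incoming query would be forwarded to pick_node().
```

Because every node serves from the same replicated index, any rotation order is acceptable from a results point of view.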

Graphic showing index replication, with one primary search appliance whose index is replicated to two other search appliances.

The most common index replication configuration is with one replica only.

When the Primary Search Appliance Fails

When the primary search appliance in an index replication configuration fails, the replica nodes continue to serve results. You can manually configure a replica node as the primary node. The new primary node resumes crawling and starts from where the failed primary node stopped.

Combining Distributed Crawling and Index Replication

You can enable distributed crawling and index replication on the same group of search appliances. The following illustration shows a four-node distributed crawling configuration where each node has a replica.

Four-node distributed crawling configuration in which each node has a replica

About Security

The Google Search Appliance uses secret tokens and private IP addresses to enforce security within a distributed crawling or index replication configuration.

The search appliances in a distributed crawling or index replication configuration authenticate each other using shared secret tokens that you provide during configuration. The shared secret tokens must consist only of printable ASCII characters.
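
If you generate tokens with a script, a quick check like the following Python sketch can catch non-ASCII or control characters before you type the token into the Admin Console. It treats printable ASCII as code points 0x20 through 0x7E, which includes the space character. Whether the appliance accepts spaces or enforces a length limit is not stated in this guide, so treat both as assumptions. The function name is illustrative.

```python
def is_valid_secret_token(token: str) -> bool:
    """Return True if the token is non-empty and uses only printable ASCII.

    Printable ASCII is taken here as 0x20 (space) through 0x7E (~).
    Accepting the space character is an assumption, not a documented rule.
    """
    return len(token) > 0 and all(0x20 <= ord(ch) <= 0x7E for ch in token)
```

For example, a token containing an accented letter or a tab character is rejected by this check.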

There are no restrictions on the public IP addresses assigned to the search appliances in the configuration beyond a requirement that a search appliance is able to reach another search appliance's public IP address on port 10999.
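
One way to confirm the port 10999 requirement before configuring is a plain TCP connection test run from a machine on the same network path as the search appliance. The sketch below uses only the Python standard library. It checks that a TCP connection can be opened, nothing more. The function name and timeout value are assumptions for illustration.

```python
import socket

MULTIBOX_PORT = 10999  # port named in this guide


def can_reach(host: str, port: int = MULTIBOX_PORT, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

A `False` result can mean a firewall rule, a routing problem, or simply that nothing is listening on the port yet, so it is a starting point for troubleshooting rather than a diagnosis.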

Certain communications among the search appliances in a distributed crawling or index replication configuration are conducted over a secure private network, including search requests, search credentials transmitted as sessions, and search results that include snippets, whether the results are authorized or not authorized. When you set up a distributed crawling or index replication configuration, you provide special private network IP addresses that the search appliances use for these secure communications. On the Admin Console interface, the private network IP addresses are called multibox network IP addresses.

The following guidelines apply to the private network IP addresses:

  • You can assign or change the private IP addresses at any time.
  • The private IP addresses must be different from the IP addresses that will be crawled on your internal network. For example, if you use 10.0.0.0/8 for your intranet, choose the private IP addresses from the 192.168.0.0/24 network. If the 192.168.0.0/24 network is also in use, try 192.168.1.0/24 or the 172.16.0.0/12 range.
  • The private IP addresses must conform to the private address space as defined in RFC 1918 and must not overlap with any other private address space used on your network.
  • The private network addresses cannot be in the range spanning subnet /16 to /8.
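
Two of these guidelines, RFC 1918 conformance and no overlap with networks already in use, can be checked with Python's standard `ipaddress` module. The sketch below is a pre-flight check only: the function name is illustrative, the list of networks in use is something you supply, and the restriction on subnet sizes in the last guideline is not covered.

```python
import ipaddress

# The three private address blocks defined in RFC 1918.
RFC1918_NETWORKS = [
    ipaddress.ip_network(cidr)
    for cidr in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]


def check_private_ip(candidate, networks_in_use):
    """Return a list of problems; an empty list means both checks passed.

    candidate       -- proposed multibox network IP address, as a string
    networks_in_use -- CIDR strings for networks crawled or otherwise used
    """
    addr = ipaddress.ip_address(candidate)
    problems = []
    if not any(addr in net for net in RFC1918_NETWORKS):
        problems.append("not in RFC 1918 private address space")
    for cidr in networks_in_use:
        if addr in ipaddress.ip_network(cidr):
            problems.append(f"overlaps network in use: {cidr}")
    return problems
```

With an intranet on 10.0.0.0/8, as in the example above, `check_private_ip("192.168.0.10", ["10.0.0.0/8"])` reports no problems, while a candidate such as 10.1.2.3 is flagged as overlapping.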

Before You Configure Distributed Crawling or Index Replication

This section provides a checklist of information you need to collect and decisions you need to make before you configure distributed crawling or index replication.

| Task | Description | Your Values |
| Determine which Google Search Appliances will participate in the configuration. | Any Google Search Appliance model running software version 6.0 or later can participate. |  |
| Determine the appliance IDs of the participating search appliances. | The appliance IDs can be found on the Admin Console under Administration > License. |  |
| Determine the host names or public IP addresses of the search appliances in the configuration. | The host names or IP addresses are required during the initial configuration process. |  |
| Determine the network IP addresses for the search appliances. | The network IP addresses, called multibox IP addresses on the Admin Console, are used for communication among the search appliances in the configuration. The network IP addresses must conform to the private address space as defined in RFC 1918 and must not overlap with any other private address space in use on your network. |  |
| Determine whether you are configuring distributed crawling, index replication, or both. | You can use the two features separately or together. |  |
| Determine which search appliance is the master search appliance in the configuration. | If you use distributed crawling, you configure crawl, search, and index on the master search appliance. If you use index replication, the primary search appliance is the search appliance whose index is replicated to the other nodes. |  |
| Determine the secret token that the search appliances will use to recognize each other within the configuration. | The nodes in the configuration use the secret tokens to authenticate to each other. The secret token must include only printable ASCII characters. Each search appliance in a distributed crawling configuration has its own associated secret token, which you specify on the Multibox > Host Configuration page. |  |
| If you are setting up distributed crawling, configure feeds only on the master. | Feeds can only be indexed on the master. |  |
| If you are using self-signed SSL certificates, make sure that you install the correct root certificate authorities on the Admin Console, on the page Administration > Certificate Authorities. | If the wrong root certificate authorities are installed, you see errors and results are not returned properly on secondary appliances in an index replication configuration. |  |

Configuring Distributed Crawling and Index Replication

Observe the following precautions in configuring distributed crawling and index replication:

  • Do not configure dynamic scalability together with either distributed crawling or index replication on the same set of search appliances.
  • In a distributed crawling configuration, feeds must be configured only on the admin master search appliance.

Distributed Crawling

Use these high-level instructions to configure distributed crawling. For more detailed instructions, see the online help page for Admin Console > Multibox Beta.

To configure distributed crawling:

  1. Log in to the Admin Console of the admin master node in the distributed crawling configuration.
  2. On the Multibox Beta page, define all of the shards in the configuration.
  3. After you save the settings, export the multibox configuration file.
  4. Log in to the Admin Console on the search appliance defined as the next shard in the configuration.
  5. On the Multibox Beta page, import the multibox configuration file you exported in step 3.
  6. Perform steps 4 and 5 on each additional primary shard.

Index Replication

Use these instructions to configure index replication. The configuration you are most likely to use has one replica only.

To configure index replication:

  1. Log in to the Admin Console.
  2. Navigate to Multibox > Configuration.
  3. If multibox configuration is not enabled, type in the number of shards you want in the multibox configuration and click Enable Multibox. Shard 0 is the Admin Master for a distributed crawling configuration. All shards are primary for an index replication configuration.
  4. Click the View/Edit link for the shard where you want replication enabled.
  5. Click Add.
  6. On the drop-down list, choose Primary.
  7. Type in the primary node's GSA Appliance ID.
  8. Type in the Appliance hostname or the IP address of the search appliance.
  9. Type in the Multibox network IP of the search appliance.
  10. Type in the Secret token of this search appliance.
  11. Click Save.
  12. Click Add.
  13. On the drop-down list, choose Replica.
  14. Type in the replica node's GSA Appliance ID.
  15. Type in the Appliance hostname or the IP address of the replica search appliance.
  16. Type in the Multibox network IP of the search appliance.
  17. Type in the Secret token of this search appliance.
  18. Click Save.
  19. Click Apply Configuration.
  20. Perform steps 4 through 19 for each shard for which you want to add a primary and replica node.
  21. Export the multibox configuration to a file, using the instructions on the Help page for multibox configurations.
  22. Import the multibox configuration file on the replica search appliance or appliances, using the instructions on the Help page.
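
Before you start the steps above, it can help to write down the values you will type for each node and check them for gaps. The Python sketch below models one entry per node with the same fields the Admin Console asks for. It is a planning worksheet only and does not represent the format of the exported multibox configuration file, which this guide does not describe. The class name, the function name, and the rule of exactly one Primary per shard are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeEntry:
    shard: int
    role: str          # "Primary" or "Replica", as in the drop-down list
    appliance_id: str  # GSA Appliance ID
    host: str          # appliance hostname or IP address
    multibox_ip: str   # multibox network IP
    secret_token: str


def validate_plan(entries):
    """Return a list of problems found in the planned node entries."""
    problems = []
    # Assumption: each shard should have exactly one Primary entry.
    for shard in sorted({e.shard for e in entries}):
        primaries = sum(1 for e in entries if e.shard == shard and e.role == "Primary")
        if primaries != 1:
            problems.append(f"shard {shard}: expected 1 Primary, found {primaries}")
    # Every field the Admin Console asks for should be filled in.
    for e in entries:
        for name in ("appliance_id", "host", "multibox_ip", "secret_token"):
            if not getattr(e, name):
                problems.append(f"shard {e.shard} {e.role}: missing {name}")
    return problems
```

An empty list from `validate_plan` means every shard in the plan has one Primary entry and no field was left blank.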

To add a replica to an existing primary node:

  1. Log in to the Admin Console.
  2. Navigate to Multibox > Configuration.
  3. Click the View/Edit link for the shard where you want replication enabled.
  4. Click Add.
  5. On the drop-down list, choose Replica.
  6. Type in the replica node's GSA Appliance ID.
  7. Type in the Appliance hostname or the IP address of the search appliance.
  8. Type in the Multibox network IP of the search appliance.
  9. Type in the Secret token of this search appliance.
  10. Click Save.
  11. Click Apply Configuration.

Adding or Deleting Nodes

The easiest way to add or delete a node from a distributed crawling or index replication configuration is to navigate to the Admin Console > Multibox Configuration > Add Node page and check or uncheck the Enable checkbox. When the checkbox is unchecked, the node does not function in the configuration, but it retains its configuration information. You can add a node in disabled mode, then enable it at a later time.