ShardsConfandYou  
Explanation of how the shards.conf file works
Updated Aug 22, 2009 by kevin.a....@gmail.com

shards.conf and you

Open up /var/lounge/etc/shards.conf and you should see this:

{
	"shard_map": [[0,1], [1,0]],
	"nodes": [ ["localhost", 5984], ["localhost", 5984] ]
}

The format of this file is JSON (hence the need for the JSON parsing library in the nginx module), and it determines everything about how data will be partitioned and replicated.

The structure is a bit hard to understand, but it made it much easier to parse and build the shard/peer relationships in the nginx module.

The file has two top-level keys, "shard_map" and "nodes". The shard_map value is an array of arrays that determines where a shard is primarily located and where it should replicate/fail over to. So, in the example above, we'd have:

                 shard 0    shard 1
"shard_map": [    [0,1],     [1,0]     ]

The position in the array determines which shard we're talking about. At position 0, we have [0,1]. This array means that shard 0 is on node 0, with replication to node 1. Additionally, if node 0 becomes unreachable, proxying for this shard will fail over to node 1. You can have any number of replication partners for a shard, depending on the amount of redundancy you want/need in your cluster.

                 Node 0                Node 1
"nodes": [ ["localhost", 5984], ["localhost", 5984] ]

The nodes array gives the information about the CouchDB nodes themselves. Pretty straightforward: each array entry is another two-element array of [<hostname>, <port>]. The numbers used in shard_map are positions in this array.

So, that means that our shards.conf has defined a cluster that has 2 shards, each replicating to the other, spread across two nodes (which happen to be pointing at the same couch instance, but could just as easily be pointing at other nodes).
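
To make the reading of the two arrays concrete, here is a minimal Python 3 sketch that parses the example config and lists each shard's primary and failover nodes. It is not part of the lounge code; the function name describe_shards is made up for illustration.

```python
import json

# The example shards.conf from above, inlined so the sketch is self-contained.
SHARDS_CONF = """
{
    "shard_map": [[0,1], [1,0]],
    "nodes": [ ["localhost", 5984], ["localhost", 5984] ]
}
"""

def describe_shards(conf_text):
    """Return one (shard, primary, failovers) tuple per shard_map entry."""
    conf = json.loads(conf_text)
    nodes = conf["nodes"]
    result = []
    for shard, node_ids in enumerate(conf["shard_map"]):
        # Each number in a shard_map entry is an index into the nodes array.
        hosts = ["%s:%d" % (nodes[i][0], nodes[i][1]) for i in node_ids]
        # The first entry is the primary; the rest are replication/failover targets.
        result.append((shard, hosts[0], hosts[1:]))
    return result

if __name__ == "__main__":
    for shard, primary, failovers in describe_shards(SHARDS_CONF):
        print("shard %d: primary %s, failover %s" % (shard, primary, failovers))
```

Because both node entries in the example point at the same couch instance, the primary and the failover come out identical here; with real hostnames they would differ.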

shards.conf generator

Here's a Python script that will generate a shards.conf for you. You provide it with a file listing hostnames (one per line), the desired number of shards, and the desired level of redundancy; it will allocate shards evenly among the nodes. Of course this is not the only way to allocate shards!

#!/usr/bin/python

import simplejson
import sys

def next(i, lst):
        """Increment a list index with wraparound."""
        return (i+1)%len(lst)

def remove_dupes(lst):
        """Remove duplicates from a list, not preserving order."""
        return dict([(el, 1) for el in lst]).keys()

def has_dupes(lst):
        """Determine if a list contains any duplicate entries."""
        return len(lst)!=len(remove_dupes(lst))

def validate(node_map, nodes):
        """Check if a node map is acceptable:

        1. Nodes should all have approximately the same number of primary shards.
        2. A shard should not be replicated to the same node twice.

        This is just a sanity check.  If the map generating algorithm is valid,
        this function should always succeed.
        """
        shard_count = [0 for node in nodes]
        for shard, replicas in enumerate(node_map):
                # condition 2: no repeated nodes in replication map for a given shard
                assert (not has_dupes(replicas)), "Shard %s has duplicates in its replication map" % shard
                primary = replicas[0]
                shard_count[primary] += 1
        
        # condition 1: check that shards are distributed evenly 
        min, max = 9999999, -1
        for count in shard_count:
                if count < min:
                        min = count
                if count > max:
                        max = count
        spread = max - min
        assert (0 <= spread and spread <= 1), "Shards are not distributed evenly"

def main(nodefile, num_shards, redundancy):
        num_shards = int(num_shards)
        redundancy = int(redundancy)
        shards = range(num_shards)
        nodes = file(nodefile).read().split()
        # add port to nodes
        nodes = [(node, 5984) for node in nodes if len(node)>0]
        nodes.sort()

        assert redundancy < len(nodes), "You can't have n+%d redundancy with %d nodes." % (redundancy, len(nodes))

        # assign shards to nodes
        node_map = [[] for shard in shards]
        node_i = 0
        # shard1: A, B, C, D ...
        # shard2: B, C, D, E ...
        # and so, wrapping around
        for shard in shards:
                # node_i is the primary 
                node_map[shard].append(node_i)

                # node_j is the secondaries
                node_j = next(node_i, nodes)
                for k in range(redundancy):
                        node_map[shard].append(node_j)
                        node_j = next(node_j, nodes)

                node_i = next(node_i, nodes)
        
        validate(node_map, nodes)
        print simplejson.dumps(dict(shard_map=node_map, nodes=nodes))

if __name__=='__main__':
        if len(sys.argv)!=4:
                print "Usage: update_shard_map <node list file> <number of shards> <level of redundancy>"
                sys.exit(1)
        try:
                main(*sys.argv[1:4])
        except AssertionError, e:
                print "Replication map is invalid:", str(e)
                sys.exit(1)
        except:
                raise

Examples

kevin@headcheese [~/couchdb-lounge]$ cat nodelist 
lounge101
lounge102
kevin@headcheese [~/couchdb-lounge]$ python update_shard_map.py nodelist 4 0 # 4 shards, no redundancy
{"nodes": [["lounge101", 5984], ["lounge102", 5984]], "shard_map": [[0], [1], [0], [1]]}
kevin@headcheese [~/couchdb-lounge]$ python update_shard_map.py nodelist 8 1 # 8 shards, one slave for each shard
{"nodes": [["lounge101", 5984], ["lounge102", 5984]],
"shard_map": [[0, 1], [1, 0], [0, 1], [1, 0], [0, 1], [1, 0], [0, 1], [1, 0]]}
kevin@headcheese [~/couchdb-lounge]$ python update_shard_map.py nodelist 9 1 # 9 shards, can't be split evenly
{"nodes": [["lounge101", 5984], ["lounge102", 5984]],
"shard_map": [[0, 1], [1, 0], [0, 1], [1, 0], [0, 1], [1, 0], [0, 1], [1, 0], [0, 1]]}
kevin@headcheese [~/couchdb-lounge]$ cat nodelist2
lounge101
lounge102
lounge103
lounge104
kevin@headcheese [~/couchdb-lounge]$ python update_shard_map.py nodelist2 8 1 # 8 shards, +1 redundancy
{"nodes": [["lounge101", 5984], ["lounge102", 5984], ["lounge103", 5984], ["lounge104", 5984]],
"shard_map": [[0, 1], [1, 2], [2, 3], [3, 0], [0, 1], [1, 2], [2, 3], [3, 0]]}
kevin@headcheese [~/couchdb-lounge]$ python update_shard_map.py nodelist2 8 2 # 8 shards, +2 redundancy
{"nodes": [["lounge101", 5984], ["lounge102", 5984], ["lounge103", 5984], ["lounge104", 5984]],
"shard_map": [[0, 1, 2], [1, 2, 3], [2, 3, 0], [3, 0, 1], [0, 1, 2], [1, 2, 3], [2, 3, 0], [3, 0, 1]]}
kevin@headcheese [~/couchdb-lounge]$ cat nodelist3
localhost
kevin@headcheese [~/couchdb-lounge]$ python update_shard_map.py nodelist3 3 0 # 3 shards on localhost, no redundancy
{"nodes": [["localhost", 5984]], "shard_map": [[0], [0], [0]]}
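
The script above is written for Python 2. As a rough illustration of the same round-robin idea in Python 3, here is a compact sketch. The function names build_shard_map and build_conf are made up for this example, and the sketch leaves out the file reading and the validation step.

```python
import json

def build_shard_map(num_nodes, num_shards, redundancy):
    """Round-robin allocation: shard s gets node (s % num_nodes) as primary,
    followed by the next `redundancy` nodes, wrapping around."""
    if redundancy >= num_nodes:
        raise ValueError("You can't have n+%d redundancy with %d nodes."
                         % (redundancy, num_nodes))
    return [
        [(shard + k) % num_nodes for k in range(redundancy + 1)]
        for shard in range(num_shards)
    ]

def build_conf(hostnames, num_shards, redundancy, port=5984):
    """Build the shards.conf structure for a list of hostnames."""
    nodes = [[host, port] for host in sorted(hostnames)]
    shard_map = build_shard_map(len(nodes), num_shards, redundancy)
    return {"nodes": nodes, "shard_map": shard_map}

if __name__ == "__main__":
    hosts = ["lounge101", "lounge102", "lounge103", "lounge104"]
    print(json.dumps(build_conf(hosts, 8, 2)))
```

Called with the four nodelist2 hosts, 8 shards and redundancy 2, build_shard_map produces the same shard_map as the corresponding example above.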
Comment by primary....@gmail.com, Jul 16, 2009

Does this mean that if I have 3 nodes on separate servers that I would have:

{
"shard_map": [[0,0,1], [0,1,0],[1,0,0]],
"nodes": ["server1", 5984, ["server2", 5984] , ["server3", 5984]]
}

Can you please clarify? Thanks

Comment by somers....@gmail.com, Aug 4, 2009

I'm very new to lounge, but I would think it would be more like

{
"shard_map": [0,1,2], [1,2,0], [2,0,1],
"nodes": ["server1", 5984, ["server2", 5984] , ["server3", 5984]] 
}

What I'm trying to understand however is the replication part. Does lounge handle all this? If a node is down and fail-over kicks in, what happens when the node comes back up?

Comment by somers....@gmail.com, Aug 4, 2009

Additional question: what happens if I alter the shards.conf file? If I add/remove shards? Does the data get repartitioned automagically? Or is this not (yet?) supported?

Comment by project member srlind...@gmail.com, Aug 14, 2009

@primary: somers.tim's example is pretty close to being correct (he's just missing some surrounding brackets making the shard_map entry an array of arrays). In his example, your database would be split into three shards. Shard 0 would primarily live on node 0, but be replicated over to node 1 and node 2. Shard 1 would primarily live on node 1 and be replicated to nodes 2 and 0. You could have more than 3 shards (which makes sense if you want to expand the cluster later) and you can also have more or less replication.

@somers.tim: if a node has been down for a while (resulting in changes to the secondary node), you'd need to have the secondary node replicate over to the primary when it comes back up. At the moment, this would have to be done manually. Once the node is reachable again, the proxies will start directing requests to it without you having to do anything.

If you alter the shards.conf file, nothing will change until you tell the dumbproxy to reload (nginx does zero-downtime config reloads), after which requests will use your new shard configuration. Restarting smartproxy will do the same thing.

The trickier part is handling the underlying data. Adding and removing shards is pretty ugly -- you're better off presharding aggressively (we started with 48 shards for a 6 node cluster) and then moving the shards around as your cluster grows. As it stands, we don't do anything about resharding, although it's definitely possible (though very expensive) to do that. The way we're doing the hashing, adding another shard would cause documents to get moved from one shard to another as the modulus changes.

At the moment, there's no plan to handle auto repartitioning, but if someone wanted to code it, that would be awesome.
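
To get a feel for why changing the shard count is so disruptive, here is a small Python 3 sketch that counts how many documents land on a different shard when the modulus changes. The document ids are made up, and zlib.crc32 is only a stand-in hash; whether it matches the hash the proxy uses bit for bit has not been checked.

```python
import zlib

def shard_for(doc_id, num_shards):
    # Assumption: zlib.crc32 stands in for the proxy's hash function.
    return (zlib.crc32(doc_id.encode("utf-8")) & 0xFFFFFFFF) % num_shards

def fraction_moved(doc_ids, old_shards, new_shards):
    """Fraction of documents whose shard changes when the shard count changes."""
    moved = sum(1 for d in doc_ids
                if shard_for(d, old_shards) != shard_for(d, new_shards))
    return moved / len(doc_ids)

if __name__ == "__main__":
    ids = ["doc-%d" % i for i in range(10000)]  # hypothetical document ids
    pct = 100 * fraction_moved(ids, 48, 49)
    print("48 -> 49 shards: %.1f%% of documents change shard" % pct)
```

A hash value h keeps its shard only when h % 48 equals h % 49, which is rare for evenly spread hash values, so nearly every document would have to move. That is the cost presharding avoids.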

Comment by fairwind...@gmail.com, Aug 16, 2009

@srlindsay: It would be great to see your more complex example with 48 shards to illustrate. I am considering couchdb lounge and it would be helpful to see something closer to a production setup since I can't see value in sharding on same machine. I would appreciate if you could elaborate on your hashing scheme. This is an important consideration as adjustments need to be made.

Comment by project member srlind...@gmail.com, Aug 16, 2009

@fairwinds.dp:

{"shard_map": [
[0,5], [1,5], [2,4], [3,4], [4,5], [4,5],
[0,4], [1,4], [2,5], [3,5], [5,4], [5,4],
[0,5], [1,5], [2,4], [3,4], [4,5], [4,5],
[0,4], [1,4], [2,5], [3,5], [5,4], [5,4],
[0,5], [1,5], [2,4], [3,4], [4,5], [4,5],
[0,4], [1,4], [2,5], [3,5], [5,4], [5,4],
[0,5], [1,5], [2,4], [3,4], [4,5], [4,5],
[0,4], [1,4], [2,5], [3,5], [5,4], [5,4]],
"nodes": [
["lounge101",5984],["lounge102",5984],
["lounge103",5984],["lounge104",5984],
["lounge105",5984],["lounge106",5984]
]}

Here's how the hashing works: the document name requested is hashed using nginx's CRC32 function, then modded by the number of shards to determine which shard that document should live on. Once the shard is identified, nginx will try to proxy to the hosts listed in the entry for that shard, in order.

Comment by project member srlind...@gmail.com, Aug 16, 2009

@fairwinds.dp: So, say your document hashes to 50. Since 50 % 48 == 2, that document would live on shard 2. The shard map entry for shard 2 is [2,4], meaning the shard is assigned to lounge103 and lounge105, so those are the hosts that we'll proxy to.
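
The lookup described in the two comments above can be sketched in Python 3 like this. The helper names are made up, and zlib.crc32 is an assumption standing in for nginx's CRC32 function; it has not been verified that the two produce identical values.

```python
import zlib

NODES = [["lounge101", 5984], ["lounge102", 5984], ["lounge103", 5984],
         ["lounge104", 5984], ["lounge105", 5984], ["lounge106", 5984]]

# The 48-shard map from the comment above: two alternating rows, four times over.
ROW_A = [[0, 5], [1, 5], [2, 4], [3, 4], [4, 5], [4, 5]]
ROW_B = [[0, 4], [1, 4], [2, 5], [3, 5], [5, 4], [5, 4]]
SHARD_MAP = (ROW_A + ROW_B) * 4

def hosts_for_hash(hash_value, shard_map, nodes):
    """Map a hash value to (shard, [(host, port), ...]) in proxy order."""
    shard = hash_value % len(shard_map)
    return shard, [tuple(nodes[i]) for i in shard_map[shard]]

def hosts_for_doc(doc_id, shard_map, nodes):
    # Assumption: zlib.crc32 approximates the hash used by the nginx module.
    h = zlib.crc32(doc_id.encode("utf-8")) & 0xFFFFFFFF
    return hosts_for_hash(h, shard_map, nodes)

if __name__ == "__main__":
    # The worked example: a hash of 50 with 48 shards.
    print(hosts_for_hash(50, SHARD_MAP, NODES))
```

For the hash value 50 this returns shard 2 with lounge103 first and lounge105 second, matching the worked example.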

Comment by bchesn...@gmail.com, Sep 13, 2009

I don't really understand how adding a node on the fly would work. Does it mean you preallocate a number of shards, then find the docids for a shard and add them to the new machine? What happens when the number of preallocated shards isn't enough?

Comment by project member kevin.a....@gmail.com, Sep 16, 2009

@bchesneau, here's how you would add a new node on the fly.

Let's say we have two nodes originally, lounge101 and lounge102, and we started with 6 shards. So lounge101 hosts shards 0, 2, 4 and lounge102 hosts shards 1, 3, 5.

We want to add a new node lounge103, which will be the new master for shards 4 and 5. Here's the process:

1. Create http://lounge103:5984/mydb4 and http://lounge103:5984/mydb5

2. Replicate http://lounge101:5984/mydb4 to http://lounge103:5984/mydb4

3. Replicate http://lounge102:5984/mydb5 to http://lounge103:5984/mydb5 (repeat for every database you have on the lounge)

4. Update shards.conf and synchronize to all proxies (dumbproxy and smartproxy)

5. Reload nginx config with HUP. Now lounge103 is the master for shards 4 and 5.

6. Repeat the replication in steps 2 and 3. This is to catch any updates that happened between steps 3 and 5.

We have a script to automate all these steps, which I'll add here. We have done this in production without taking down the database and it works quite seamlessly :)
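
This is not the script mentioned above, just a rough Python 3 sketch of what steps 1-3 (and the repeat in step 6) look like as HTTP requests. It assumes the standard CouchDB HTTP API (PUT on the database URL to create it, POST to /_replicate with a source and target) and only builds the list of requests without sending anything. The function name replication_plan is made up, and steps 4 and 5 (updating shards.conf and reloading nginx) are not covered.

```python
import json

def replication_plan(dbname, moves, new_host, port=5984):
    """Build the HTTP requests for moving shards to a new node.

    moves maps shard number -> current primary host, e.g. {4: "lounge101"}.
    Returns a list of (method, url, body) tuples; nothing is sent.
    """
    creates, replications = [], []
    for shard, old_host in sorted(moves.items()):
        shard_db = "%s%d" % (dbname, shard)
        source = "http://%s:%d/%s" % (old_host, port, shard_db)
        target = "http://%s:%d/%s" % (new_host, port, shard_db)
        # Step 1: create the empty shard database on the new node.
        creates.append(("PUT", target, None))
        # Steps 2-3: ask the new node to copy the shard from its current primary.
        body = json.dumps({"source": source, "target": target})
        replicate_url = "http://%s:%d/_replicate" % (new_host, port)
        replications.append(("POST", replicate_url, body))
    return creates + replications

if __name__ == "__main__":
    moves = {4: "lounge101", 5: "lounge102"}
    for method, url, body in replication_plan("mydb", moves, "lounge103"):
        print(method, url, body or "")
```

Running the POST requests a second time after the proxies have been reloaded corresponds to step 6.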

Changing the number of preallocated shards is trickier. Now that replication is based on changes, we think it will be pretty easy to implement replication filtered by hash key. Then you would be able to replicate a lounge with (let's say) 10 shards to a lounge with 15 shards. That's just an idea at this point though; we haven't tried to implement it yet.

Comment by project member srlind...@gmail.com, Sep 16, 2009

@bchesneau: like kevin said, adding more shards is tricky, since we mod the hash of the key by the number of shards to figure out which shard it lives on. The best plan right now is to aggressively pre-shard. Pick a number of shards greater than the number of machines you think you'll likely ever need for your cluster. For our production cluster, we're using 48 shards. We originally had 2 servers, but have since expanded to 6. Adding nodes and moving shards between nodes is fairly easy -- adding more shards, however, is really hard.

Comment by ff...@gmx.de, Oct 18, 2009

Would it be possible to do the presharding based on Virtual IP addresses?

e.g. define 48 shards on virtual IPs and then have the cluster watchdog spread those 48 IPs over the available machines and do shard moving etc? (virtual IPs being implemented by ARP spoofing)

How would lounge react if something it thinks are two servers with 2 different names are in fact the same?

Comment by project member kevin.a....@gmail.com, Oct 18, 2009

@fforw: "How would lounge react if something it thinks are two servers with 2 different names are in fact the same?" I think this should just work -- smartproxy doesn't make any assumptions about the nodes based on their hostnames.

Comment by klimp...@gmail.com, Feb 16, 2010

Just in case anyone runs into the same issue - this script doesn't support ports in the nodelist.

Comment by klimp...@gmail.com, Feb 16, 2010

In case anyone is interested in this.

I "forked" update_shard_map.py and, thanks to David, the nodelist can now include ports:

shell$ cat nodelist
node1 5984
node1 5985
node1 5986
node1 5987
...

(I used my script lounge-shard-conf.sh to generate it.)

The original format will work as well:

shell$ cat nodelist
node1
node2
node3

(Default port 5984 will be assigned to those.)
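
For illustration, parsing both nodelist formats could look roughly like the Python 3 sketch below. This is not the code from the fork linked below, just one way to do it; the function name parse_nodelist is made up.

```python
DEFAULT_PORT = 5984

def parse_nodelist(text, default_port=DEFAULT_PORT):
    """Parse lines of 'host' or 'host port' into [host, port] pairs."""
    nodes = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        host = fields[0]
        # Use the port from the line if there is one, else the default.
        port = int(fields[1]) if len(fields) > 1 else default_port
        nodes.append([host, port])
    return nodes

if __name__ == "__main__":
    print(parse_nodelist("node1 5984\nnode1 5985\n"))
    print(parse_nodelist("node1\nnode2\nnode3\n"))
```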

Check it out: http://github.com/till/ubuntu/blob/master/couchdb/couchdb-lounge/update_shard_map.py

I have a bunch of scripts revolving around the lounge setup on Github: http://github.com/till/ubuntu/tree/master/couchdb/couchdb-lounge

