shards.conf and you
Open up /var/lounge/etc/shards.conf and you should see this:
{
"shard_map": [[0,1], [1,0]],
"nodes": [ ["localhost", 5984], ["localhost", 5984] ]
}
The format of this file is JSON (hence the need for the JSON parsing library in the nginx module), and it determines everything about how data will be partitioned and replicated.
The structure is a bit hard to understand, but it made it much easier to parse and build the shard/peer relationships in the nginx module.
There are two top-level keys, "shard_map" and "nodes". The shard_map value is an array of arrays that determines where each shard is primarily located and where it should replicate/fail over to. So, in the example above, we'd have:
               shard 0  shard 1
"shard_map": [ [0,1],   [1,0] ]
The position in the array determines which shard we're talking about. At position 0, we have [0,1]. This array means that shard 0 is on node 0, with replication to node 1. Additionally, if node 0 becomes unreachable, proxying for this shard will fail over to node 1. You can have any number of replication partners for a shard, depending on the amount of redundancy you want/need in your cluster.
           Node 0               Node 1
"nodes": [ ["localhost", 5984], ["localhost", 5984] ]
The nodes array gives the information about the CouchDB nodes themselves. Pretty straightforward: each entry is itself a two-element array of [<hostname>, <port>].
So, that means that our shards.conf has defined a cluster that has 2 shards, each replicating to the other, spread across two nodes (which happen to be pointing at the same couch instance, but could just as easily be pointing at other nodes).
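To make the structure concrete, here is a minimal Python sketch that reads the example above and spells out, for each shard, which node is the primary and which nodes it fails over to. The function name describe and the inlined SHARDS_CONF string are illustrative only; they are not part of lounge.

```python
import json

# The two-node example from above, inlined so the sketch runs without the file.
SHARDS_CONF = """
{
  "shard_map": [[0,1], [1,0]],
  "nodes": [ ["localhost", 5984], ["localhost", 5984] ]
}
"""


def describe(conf):
    """Return one (shard, primary, failovers) tuple per shard_map entry."""
    nodes = conf["nodes"]
    rows = []
    for shard, node_ids in enumerate(conf["shard_map"]):
        # The position in shard_map is the shard number. The first node index
        # is the primary; the rest are the replication/failover partners.
        hosts = ["%s:%d" % (nodes[n][0], nodes[n][1]) for n in node_ids]
        rows.append((shard, hosts[0], hosts[1:]))
    return rows


conf = json.loads(SHARDS_CONF)
for shard, primary, failovers in describe(conf):
    print("shard %d: primary %s, fails over to %s" % (shard, primary, failovers))
```

To try it against a real file, replace json.loads(SHARDS_CONF) with json.load(open("/var/lounge/etc/shards.conf")).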
shards.conf generator
Here's a Python script that will generate a shards.conf for you. You provide it with a file listing hostnames, the desired number of shards, and the desired level of redundancy; it will allocate shards evenly among the nodes. Of course this is not the only way to allocate shards!
#!/usr/bin/python
from __future__ import print_function

import sys

import simplejson


def next_index(i, lst):
    """Increment a list index with wraparound."""
    return (i + 1) % len(lst)


def remove_dupes(lst):
    """Remove duplicates from a list, not preserving order."""
    return list(set(lst))


def has_dupes(lst):
    """Determine if a list contains any duplicate entries."""
    return len(lst) != len(remove_dupes(lst))


def validate(node_map, nodes):
    """Check if a node map is acceptable:

    1. Nodes should all have approximately the same number of primary shards.
    2. A shard should not be replicated to the same node twice.

    This is just a sanity check. If the map generating algorithm is valid,
    this function should always succeed.
    """
    shard_count = [0 for node in nodes]
    for shard, replicas in enumerate(node_map):
        # condition 2: no repeated nodes in replication map for a given shard
        assert not has_dupes(replicas), "Shard %s has duplicates in its replication map" % shard
        primary = replicas[0]
        shard_count[primary] += 1

    # condition 1: check that shards are distributed evenly
    spread = max(shard_count) - min(shard_count)
    assert 0 <= spread <= 1, "Shards are not distributed evenly"


def main(nodefile, num_shards, redundancy):
    num_shards = int(num_shards)
    redundancy = int(redundancy)
    shards = range(num_shards)

    with open(nodefile) as f:
        nodes = f.read().split()
    # add port to nodes
    nodes = [(node, 5984) for node in nodes if len(node) > 0]
    nodes.sort()

    assert redundancy < len(nodes), "You can't have n+%d redundancy with %d nodes." % (redundancy, len(nodes))

    # assign shards to nodes
    node_map = [[] for shard in shards]
    node_i = 0
    # shard1: A, B, C, D ...
    # shard2: B, C, D, E ...
    # and so on, wrapping around
    for shard in shards:
        # node_i is the primary
        node_map[shard].append(node_i)
        # node_j walks through the secondaries
        node_j = next_index(node_i, nodes)
        for k in range(redundancy):
            node_map[shard].append(node_j)
            node_j = next_index(node_j, nodes)
        node_i = next_index(node_i, nodes)

    validate(node_map, nodes)
    # sort_keys keeps the key order of the output stable
    print(simplejson.dumps(dict(shard_map=node_map, nodes=nodes), sort_keys=True))


if __name__ == '__main__':
    if len(sys.argv) != 4:
        print("Usage: update_shard_map <node list file> <number of shards> <level of redundancy>")
        sys.exit(1)
    try:
        main(*sys.argv[1:4])
    except AssertionError as e:
        print("Replication map is invalid:", str(e))
        sys.exit(1)
Examples
kevin@headcheese [~/couchdb-lounge]$ cat nodelist
lounge101
lounge102
kevin@headcheese [~/couchdb-lounge]$ python update_shard_map.py nodelist 4 0 # 4 shards, no redundancy
{"nodes": [["lounge101", 5984], ["lounge102", 5984]], "shard_map": [[0], [1], [0], [1]]}
kevin@headcheese [~/couchdb-lounge]$ python update_shard_map.py nodelist 8 1 # 8 shards, one slave for each shard
{"nodes": [["lounge101", 5984], ["lounge102", 5984]],
"shard_map": [[0, 1], [1, 0], [0, 1], [1, 0], [0, 1], [1, 0], [0, 1], [1, 0]]}
kevin@headcheese [~/couchdb-lounge]$ python update_shard_map.py nodelist 9 1 # 9 shards, can't be split evenly
{"nodes": [["lounge101", 5984], ["lounge102", 5984]],
"shard_map": [[0, 1], [1, 0], [0, 1], [1, 0], [0, 1], [1, 0], [0, 1], [1, 0], [0, 1]]}
kevin@headcheese [~/couchdb-lounge]$ cat nodelist2
lounge101
lounge102
lounge103
lounge104
kevin@headcheese [~/couchdb-lounge]$ python update_shard_map.py nodelist2 8 1 # 8 shards, +1 redundancy
{"nodes": [["lounge101", 5984], ["lounge102", 5984], ["lounge103", 5984], ["lounge104", 5984]],
"shard_map": [[0, 1], [1, 2], [2, 3], [3, 0], [0, 1], [1, 2], [2, 3], [3, 0]]}
kevin@headcheese [~/couchdb-lounge]$ python update_shard_map.py nodelist2 8 2 # 8 shards, +2 redundancy
{"nodes": [["lounge101", 5984], ["lounge102", 5984], ["lounge103", 5984], ["lounge104", 5984]],
"shard_map": [[0, 1, 2], [1, 2, 3], [2, 3, 0], [3, 0, 1], [0, 1, 2], [1, 2, 3], [2, 3, 0], [3, 0, 1]]}
kevin@headcheese [~/couchdb-lounge]$ cat nodelist3
localhost
kevin@headcheese [~/couchdb-lounge]$ python update_shard_map.py nodelist3 3 0 # 3 shards on localhost, no redundancy
{"nodes": [["localhost", 5984]], "shard_map": [[0], [0], [0]]}
Does this mean that if I have 3 nodes on separate servers that I would have: {
} Can you please clarify? Thanks
I'm very new to lounge, but I would think it would be more like
{ "shard_map": [0,1,2], [1,2,0], [2,0,1], "nodes": ["server1", 5984, ["server2", 5984] , ["server3", 5984]] }
What I'm trying to understand however is the replication part. Does lounge handle all this? If a node is down and fail-over kicks in, what happens when the node comes back up?
Additional question: what happens if I alter the shards.conf file? If I add/remove shards? Does the data get repartitioned automagically? Or is this not (yet?) supported?
@primary: somers.tim's example is pretty close to being correct (he's just missing some surrounding brackets making the shard_map entry an array of arrays). In his example, your database would be split into three shards. Shard 0 would primarily live on node 0, but be replicated over to node 1 and node 2. Shard 1 would primarily live on node 1 and be replicated to nodes 2 and 0. You could have more than 3 shards (which makes sense if you want to expand the cluster later) and you can also have more or less replication.
@somers.tim: if a node has been down for a while (resulting in changes to the secondary node), you'd need to have the secondary node replicate over to the primary when it comes back up. At the moment, this would have to be done manually. Once the node is reachable again, the proxies will start directing requests to it without you having to do anything.
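As a rough illustration of that manual catch-up step, the sketch below builds a request for CouchDB's _replicate endpoint that pushes one shard database from the secondary back to the recovered primary. Several things here are assumptions rather than lounge behaviour: the helper names, the hostnames, and the shard database naming (database name followed by the shard number, as in the mydb4 example further down this page).

```python
import json
import urllib.request


def catch_up_request(db, shard, secondary, primary):
    """Build (url, body) for a CouchDB _replicate call that copies a shard
    database from the secondary node to the recovered primary node."""
    # Assumption: shard databases are named <db><shard>, e.g. "mydb4".
    shard_db = "%s%d" % (db, shard)
    body = {
        "source": "http://%s:%d/%s" % (secondary[0], secondary[1], shard_db),
        "target": "http://%s:%d/%s" % (primary[0], primary[1], shard_db),
    }
    # The request is sent to the secondary, which then pushes to the primary.
    url = "http://%s:%d/_replicate" % (secondary[0], secondary[1])
    return url, body


def send(url, body):
    """POST the replication request. Needs a reachable CouchDB; not called here."""
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


url, body = catch_up_request("mydb", 0, ("lounge102", 5984), ("lounge101", 5984))
print(url)
print(json.dumps(body, sort_keys=True))
```

Calling send(url, body) would actually start the replication; you would repeat it for every shard whose primary was the node that went down.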
If you alter the shards.conf file, nothing will change until you tell the dumbproxy to reload (nginx does zero-downtime config reloads), after which requests will use your new shard configuration. Restarting smartproxy will do the same thing.
The trickier part is handling the underlying data. Adding and removing shards is pretty ugly -- you're better off presharding aggressively (we started with 48 shards for a 6 node cluster) and then moving the shards around as your cluster grows. As it stands, we don't do anything about resharding, although it's definitely possible (though very expensive) to do that. The way we're doing the hashing, adding another shard would cause documents to get moved from one shard to another as the modulus changes.
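To get a feel for why changing the shard count is so expensive, here is a small sketch that counts how many documents would land on a different shard when going from 48 to 49 shards. It assumes the hash-then-modulus scheme described further down this thread, and it uses Python's zlib.crc32 as a stand-in for nginx's CRC32 function; the document IDs are made up. With a uniform hash, only about 1 in 49 documents would be expected to stay where it is.

```python
import zlib


def shard_for(doc_id, num_shards):
    """Shard index for a document ID: CRC32 of the ID, modulo the shard count."""
    # zlib.crc32 is an assumption standing in for nginx's CRC32 function.
    return (zlib.crc32(doc_id.encode("utf-8")) & 0xFFFFFFFF) % num_shards


def moved_fraction(doc_ids, old_shards, new_shards):
    """Fraction of documents whose shard changes when the shard count changes."""
    moved = sum(
        1 for d in doc_ids
        if shard_for(d, old_shards) != shard_for(d, new_shards)
    )
    return moved / len(doc_ids)


doc_ids = ["doc-%d" % i for i in range(10000)]  # made-up document IDs
fraction = moved_fraction(doc_ids, 48, 49)
print("48 -> 49 shards: %.1f%% of documents change shard" % (100 * fraction))
```

Moving a whole shard from one node to another, by contrast, leaves the modulus alone, so no document changes shard.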
At the moment, there's no plan to handle auto repartitioning, but if someone wanted to code it, that would be awesome.
@srlindsay: It would be great to see your more complex example with 48 shards to illustrate. I am considering couchdb lounge and it would be helpful to see something closer to a production setup since I can't see value in sharding on same machine. I would appreciate if you could elaborate on your hashing scheme. This is an important consideration as adjustments need to be made.
@fairwinds.dp:
{"shard_map": [
[0,5], [1,5], [2,4], [3,4], [4,5], [4,5], [0,4], [1,4], [2,5], [3,5], [5,4], [5,4],
[0,5], [1,5], [2,4], [3,4], [4,5], [4,5], [0,4], [1,4], [2,5], [3,5], [5,4], [5,4],
[0,5], [1,5], [2,4], [3,4], [4,5], [4,5], [0,4], [1,4], [2,5], [3,5], [5,4], [5,4],
[0,5], [1,5], [2,4], [3,4], [4,5], [4,5], [0,4], [1,4], [2,5], [3,5], [5,4], [5,4]],
"nodes": [ ["lounge101",5984],["lounge102",5984], ["lounge103",5984],["lounge104",5984], ["lounge105",5984],["lounge106",5984] ]}
Here's how the hashing works: the document name requested is hashed using nginx's CRC32 function, then modded by the number of shards to determine which shard that document should live on. Once the shard is identified, nginx will try to proxy to the hosts listed in the entry for that shard.
@fairwinds.dp: So, say your document hashes to 50. 50 % 48 == 2, so that document would live on shard 2. The shard map entry for shard 2 is [2,4], meaning the shard is assigned to lounge103 and lounge105, so those are the hosts that we'll proxy to.
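The same lookup can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the nginx module's code: the function names are made up, and zlib.crc32 stands in for nginx's CRC32 function, so real document names may hash differently in production.

```python
import zlib

NODES = [["lounge101", 5984], ["lounge102", 5984], ["lounge103", 5984],
         ["lounge104", 5984], ["lounge105", 5984], ["lounge106", 5984]]

# The 48-entry map above is the same 12-entry pattern repeated four times.
PATTERN = [[0, 5], [1, 5], [2, 4], [3, 4], [4, 5], [4, 5],
           [0, 4], [1, 4], [2, 5], [3, 5], [5, 4], [5, 4]]
SHARD_MAP = PATTERN * 4


def hosts_for_hash(hash_value, shard_map, nodes):
    """Map a hash value to (shard, hosts in the order the shard map lists them)."""
    shard = hash_value % len(shard_map)
    return shard, [nodes[n][0] for n in shard_map[shard]]


def hosts_for_doc(doc_id, shard_map, nodes):
    """Same lookup, starting from a document name."""
    # zlib.crc32 is an assumption standing in for nginx's CRC32 function.
    h = zlib.crc32(doc_id.encode("utf-8")) & 0xFFFFFFFF
    return hosts_for_hash(h, shard_map, nodes)


# The worked example: a document whose hash is 50.
print(hosts_for_hash(50, SHARD_MAP, NODES))  # (2, ['lounge103', 'lounge105'])
```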
I don't really understand how adding a node on the fly would work. Does it mean you preallocate a number of shards, then find the doc IDs for a shard and add it to the new machine? What happens when the number of preallocated shards isn't enough?
@bchesneau, here's how you would add a new node on the fly.
Let's say we have two nodes originally, lounge101 and lounge102, and we started with 6 shards. So lounge101 hosts shards 0, 2, 4 and lounge102 hosts shards 1, 3, 5.
We want to add a new node lounge103, which will be the new master for shards 4 and 5. Here's the process:
1. Create http://lounge103:5984/mydb4 and http://lounge103:5984/mydb5
2. Replicate http://lounge101:5984/mydb4 to http://lounge103:5984/mydb4
3. Replicate http://lounge102:5984/mydb5 to http://lounge103:5984/mydb5 (repeat for every database you have on the lounge)
4. Update shards.conf and synchronize to all proxies (dumbproxy and smartproxy)
5. Reload nginx config with HUP. Now lounge103 is the master for shards 4 and 5.
6. Repeat the replication in steps 2 and 3. This is to catch any updates that happened between steps 3 and 5.
We have a script to automate all these steps, which I'll add here. We have done this in production without taking down the database and it works quite seamlessly :)
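Until that script is posted, here is a hypothetical Python sketch of the planning part of the process: given the current map, it works out which replications steps 2 and 3 need and what the shard map looks like after step 4. It is not the production script mentioned above. The function name plan_move is made up, and the shard database naming (mydb4, mydb5) is taken from the steps above.

```python
def plan_move(db, shard_map, nodes, new_node, shards_to_move):
    """Return (replications, new_nodes, new_shard_map) for making new_node
    the primary of the given shards. Nothing is contacted or modified."""
    new_nodes = nodes + [new_node]
    new_index = len(new_nodes) - 1
    new_map = [list(entry) for entry in shard_map]  # copy, keep the input intact
    replications = []
    for shard in shards_to_move:
        old_host, old_port = nodes[shard_map[shard][0]]  # current primary
        shard_db = "%s%d" % (db, shard)
        replications.append((
            "http://%s:%d/%s" % (old_host, old_port, shard_db),
            "http://%s:%d/%s" % (new_node[0], new_node[1], shard_db),
        ))
        new_map[shard][0] = new_index  # the new node becomes the primary
    return replications, new_nodes, new_map


nodes = [["lounge101", 5984], ["lounge102", 5984]]
shard_map = [[0], [1], [0], [1], [0], [1]]  # 6 shards, no redundancy
replications, new_nodes, new_map = plan_move(
    "mydb", shard_map, nodes, ["lounge103", 5984], [4, 5])

for source, target in replications:
    print("replicate %s -> %s" % (source, target))
print(new_map)  # [[0], [1], [0], [1], [2], [2]]
```

The replication pairs would be run once before the config reload and once after it (step 6), and the new map is what goes into shards.conf on every proxy.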
Changing the number of preallocated shards is trickier. Now that replication is based on changes, we think it will be pretty easy to implement replication filtered by hash key. Then you would be able to replicate a lounge with (let's say) 10 shards to a lounge with 15 shards. That's just an idea at this point though; we haven't tried to implement it yet.
@bchesneau: like kevin said, adding more shards is tricky, since we mod the hash of the key by the number of shards to figure out which node it lives on. The best plan right now is to aggressively pre-shard. Pick a number of shards greater than the number of machines you think you'll likely ever need for your cluster. For our production cluster, we're using 48 shards. We originally had 2 servers, but have since expanded to 6. Adding nodes and moving shards between nodes is fairly easy -- adding more shards, however, is really hard.
Would it be possible to do the presharding based on Virtual IP addresses?
e.g. define 48 shards on virtual IPs and then have the cluster watchdog spread those 48 IPs over the available machines and do shard moving etc? (virtual IPs being implemented by ARP spoofing)
How would lounge react if something it thinks are two servers with 2 different names are in fact the same?
@fforw: "How would lounge react if something it thinks are two servers with 2 different names are in fact the same?" I think this should just work; smartproxy doesn't make any assumptions about the nodes based on their hostnames.
Just in case anyone runs into the same issue - this script doesn't support ports in the nodelist.
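One way to work around this is to parse the port out of each node list entry before building the nodes array. The sketch below assumes a host:port syntax for entries that carry a port; that syntax is an assumption made for illustration, not something the original script or lounge defines. Entries without a port fall back to 5984, as the original script does.

```python
DEFAULT_PORT = 5984


def parse_nodes(text):
    """Parse whitespace-separated node entries into sorted (host, port) tuples.

    Each entry is either "host" or "host:port" (the latter is an assumed
    syntax). IPv6 literals are not handled.
    """
    nodes = []
    for entry in text.split():
        host, _, port = entry.partition(":")
        nodes.append((host, int(port) if port else DEFAULT_PORT))
    return sorted(nodes)


print(parse_nodes("lounge101\nlounge102:5985\n"))
# [('lounge101', 5984), ('lounge102', 5985)]
```

In the generator script, this would take the place of the two lines that split the file contents and add port 5984 to every node.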
In case anyone is interested in this.
I "forked" the update_shard_map.py and thanks to David, we now got:
(I used my script lounge-shard-conf.sh to generate it.)
The original format will work as well:
(Default port 5984 will be assigned to those.)
Check it out: http://github.com/till/ubuntu/blob/master/couchdb/couchdb-lounge/update_shard_map.py
I have a bunch of scripts revolving around the lounge setup on Github: http://github.com/till/ubuntu/tree/master/couchdb/couchdb-lounge