
weed-fs - issue #51

Assign on empty cluster sometimes fails under high concurrency load


Posted on Oct 29, 2013 by Happy Horse

What steps will reproduce the problem?

1. Launch an empty weed master server, e.g. "weed master -mdir=."
2. Launch an empty weed volume server, e.g. "weed volume -port=7301 -dir=. -mserver=10.9.2.72:9333 -ip=homer -readTimeout=20"
3. Launch a lot of concurrent assign requests in parallel (e.g. 10 curls launched with clusterssh from the volume server; a minimal Go sketch of such a load generator follows below).
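For reference, the same load can also be generated from a single machine. Here is a minimal Go sketch; the master address and the concurrency of 10 come from this report, everything else (file name, output format) is illustrative only:

// loadtest.go - fires concurrent /dir/assign requests at the master,
// roughly equivalent to 10 parallel curls.
package main

import (
        "fmt"
        "io/ioutil"
        "net/http"
        "sync"
)

func main() {
        const concurrency = 10
        var wg sync.WaitGroup
        for i := 0; i < concurrency; i++ {
                wg.Add(1)
                go func(id int) {
                        defer wg.Done()
                        resp, err := http.Get("http://10.9.2.72:9333/dir/assign")
                        if err != nil {
                                fmt.Printf("request %d failed: %v\n", id, err)
                                return
                        }
                        defer resp.Body.Close()
                        body, _ := ioutil.ReadAll(resp.Body)
                        fmt.Printf("request %d: %s\n", id, body)
                }(i)
        }
        wg.Wait()
}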

What is the expected output? What do you see instead?

All assign requests should return no errors and an assigned fid, but sometimes the master sends this error:

curl http://10.9.2.72:9333/dir/assign
{"error":"Cannot grow volume group! Failed to grown volume for data center "}
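For comparison, a successful assign returns a JSON document along these lines (the field values here are only illustrative, not taken from this run):

{"fid":"3,01637037d6","url":"homer:7301","publicUrl":"homer:7301"}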

The master says in its logs:

I1029 14:09:44 04931 topology.go:76] Topology does not have free space left!

Here are the full logs of the master:

I1029 14:09:38 04931 sequence.go:35] Setting file id sequence 10000
I1029 14:09:38 04931 topology.go:58] Using default configurations.
I1029 14:09:38 04931 master.go:218] Volume Size Limit is 32768 MB
I1029 14:09:38 04931 master.go:232] Start Weed Master 0.43 at port 9333
I1029 14:09:41 04931 node.go:158] topo adds child DefaultDataCenter
I1029 14:09:41 04931 node.go:158] topo:DefaultDataCenter adds child DefaultRack
I1029 14:09:41 04931 node.go:158] topo:DefaultDataCenter:DefaultRack adds child homer:7301
I1029 14:09:44 04931 topology.go:105] adding replication type 000
I1029 14:09:44 04931 topology.go:105] adding replication type 000
I1029 14:09:44 04931 volume_growth.go:207] Created Volume 1 on topo:DefaultDataCenter:DefaultRack:homer:7301
I1029 14:09:44 04931 sequence.go:68] Saving file id sequence 20000 to weed.seq
I1029 14:09:44 04931 volume_growth.go:207] Created Volume 2 on topo:DefaultDataCenter:DefaultRack:homer:7301
I1029 14:09:44 04931 volume_growth.go:207] Created Volume 3 on topo:DefaultDataCenter:DefaultRack:homer:7301
I1029 14:09:44 04931 volume_growth.go:207] Created Volume 4 on topo:DefaultDataCenter:DefaultRack:homer:7301
I1029 14:09:44 04931 volume_growth.go:207] Created Volume 5 on topo:DefaultDataCenter:DefaultRack:homer:7301
I1029 14:09:44 04931 volume_growth.go:207] Created Volume 6 on topo:DefaultDataCenter:DefaultRack:homer:7301
I1029 14:09:44 04931 volume_growth.go:207] Created Volume 7 on topo:DefaultDataCenter:DefaultRack:homer:7301
I1029 14:09:44 04931 topology.go:76] Topology does not have free space left!

Here are the full logs of the volume server:

I1029 14:08:51 28863 volume.go:329] Volume Folder .
I1029 14:08:51 28863 volume.go:330] Permission: -rwxrwxr-x
I1029 14:08:51 28863 store.go:173] Store started on dir: . with 0 volumes max 7
I1029 14:08:51 28863 volume.go:371] store joined at 10.9.2.72:9333
I1029 14:08:51 28863 volume.go:373] Start Weed volume server 0.43 at http://homer:7301
I1029 14:08:55 28863 store.go:98] In dir . adds volume = 1 , replicationType = 000
I1029 14:08:55 28863 volume.go:99] loading file 1.idx readonly false
I1029 14:08:55 28863 store.go:98] In dir . adds volume = 2 , replicationType = 000
I1029 14:08:55 28863 volume.go:99] loading file 2.idx readonly false
I1029 14:08:55 28863 store.go:98] In dir . adds volume = 3 , replicationType = 000
I1029 14:08:55 28863 volume.go:99] loading file 3.idx readonly false
I1029 14:08:55 28863 store.go:98] In dir . adds volume = 4 , replicationType = 000
I1029 14:08:55 28863 volume.go:99] loading file 4.idx readonly false
I1029 14:08:55 28863 store.go:98] In dir . adds volume = 5 , replicationType = 000
I1029 14:08:55 28863 volume.go:99] loading file 5.idx readonly false
I1029 14:08:55 28863 store.go:98] In dir . adds volume = 6 , replicationType = 000
I1029 14:08:55 28863 volume.go:99] loading file 6.idx readonly false
I1029 14:08:55 28863 store.go:98] In dir . adds volume = 7 , replicationType = 000
I1029 14:08:55 28863 volume.go:99] loading file 7.idx readonly false

If I increase the maximum volumes to 20 (-max="20" on the volume server), I no longer get the error, but the master then creates 14 (!) volumes instead of 7.

What version of the product are you using? On what operating system?

0.43; the volume server runs on Ubuntu Linux and the master on macOS, with Gigabit Ethernet between the master and the volume server.

Comment #1

Posted on Oct 29, 2013 by Happy Horse

Comment deleted

Comment #2

Posted on Oct 29, 2013 by Happy Horse

AFAICT, the VolumeGrowth struct is locked way too late: multiple requests try to grow volumes at the same time, but the first request has already succeeded in growing them, so by the time VolumeGrowth.accessLock is released and the next request proceeds, there is no free space left in the topology...
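Put differently, it is a check-then-act race. The sketch below shows the double-checked locking idea that the patch further down applies; fakeTopo and its numbers are simplified stand-ins for illustration, not the actual weed-fs types:

package main

import (
        "fmt"
        "sync"
)

// fakeTopo is a simplified stand-in for the topology.
type fakeTopo struct {
        mu            sync.Mutex
        activeVolumes int
        freeSlots     int // plays the role of the volume server's -max setting
}

// grow creates up to n volumes, limited by the remaining free slots,
// loosely mirroring what a grow call does against the real topology.
func (t *fakeTopo) grow(n int) error {
        if t.freeSlots <= 0 {
                return fmt.Errorf("no free space left in topology")
        }
        if n > t.freeSlots {
                n = t.freeSlots
        }
        t.freeSlots -= n
        t.activeVolumes += n
        return nil
}

// assign takes the lock *before* re-checking the active volume count, so only
// the first concurrent request actually grows the volumes.
func (t *fakeTopo) assign() error {
        t.mu.Lock()
        defer t.mu.Unlock()
        if t.activeVolumes <= 0 { // re-check under the lock
                return t.grow(7) // 7 volumes per grow, matching the logs above
        }
        return nil
}

func main() {
        t := &fakeTopo{freeSlots: 7}
        var wg sync.WaitGroup
        for i := 0; i < 10; i++ {
                wg.Add(1)
                go func() {
                        defer wg.Done()
                        if err := t.assign(); err != nil {
                                fmt.Println("assign error:", err)
                        }
                }()
        }
        wg.Wait()
        fmt.Println("active volumes:", t.activeVolumes) // 7, with no errors
}

Without the re-check under the lock, every request that had already passed the initial active-volume check would grow again once the lock was free, which would explain both the "no free space" error with -max=7 and the 14 volumes seen with -max=20.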

Instead, the lock should be taken earlier. Here is a quick & dirty patch that fixes the issue:

From 6fef495467227fd00a8d87931c9621cf7a2cbe36 Mon Sep 17 00:00:00 2001
Date: Tue, 29 Oct 2013 15:36:50 +0100
Subject: [PATCH] Issue #51: lock early and double check before trying to grow volumes.

 go/weed/master.go | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/go/weed/master.go b/go/weed/master.go
index 7aa48e5..369e1a1 100644
--- a/go/weed/master.go
+++ b/go/weed/master.go
@@ -15,6 +15,7 @@ import (
     "strconv"
     "strings"
     "time"
+    "sync"
 )
 
 func init() {
@@ -48,6 +49,7 @@ var (
 
 var topo *topology.Topology
 var vg *replication.VolumeGrowth
+var vgLock sync.Mutex
 
 func dirLookupHandler(w http.ResponseWriter, r *http.Request) {
     vid := r.FormValue("volumeId")
@@ -97,9 +99,14 @@ func dirAssignHandler(w http.ResponseWriter, r *http.Request) {
             writeJsonQuiet(w, r, map[string]string{"error": "No free volumes left!"})
             return
         } else {
-            if _, err = vg.AutomaticGrowByType(rt, dataCenter, topo); err != nil {
-                writeJsonQuiet(w, r, map[string]string{"error": "Cannot grow volume group! " + err.Error()})
-                return
+            vgLock.Lock()
+            defer vgLock.Unlock()
+            // redo the check for active volumes to avoid concurrency issues.
+            if topo.GetVolumeLayout(rt).GetActiveVolumeCount(dataCenter) <= 0 {
+                if _, err = vg.AutomaticGrowByType(rt, dataCenter, topo); err != nil {
+                    writeJsonQuiet(w, r, map[string]string{"error": "Cannot grow volume group! " + err.Error()})
+                    return
+                }
             }
         }
     }

1.8.3.4 (Apple Git-47)+GitX

Comment #3

Posted on Oct 29, 2013 by Happy Horse

I've committed the change to a GitHub fork: https://github.com/zenria/weed-fs/commit/779c22d74073b327959b48cd53684509ebcabe56

Comment #4

Posted on Oct 29, 2013 by Grumpy Cat

Thanks for the fix! Checked into HEAD now.

Status: Fixed

Labels:
Type-Defect Priority-Medium