Issue 270: drbd sync fails on readded node
Status:  NeedFeedback
Owner:  ----


Reported by eyelessf...@gmail.com, Sep 25, 2012
What software version are you running? Please provide the output of "gnt-cluster --version" and "gnt-cluster version".
gnt-cluster (ganeti v2.4.5) 2.4.5

Software version: 2.4.5
Internode protocol: 2040000
Configuration format: 2040000
OS api version: 20
Export interface: 0

What distribution are you using?
Gentoo and Ubuntu 12.04

Both Ganeti and DRBD are the same version on both nodes:
version: 8.3.11 (api:88/proto:86-96)
What steps will reproduce the problem?
1. Reinstalled node2
2. Re-added node2
3. gnt-instance replace-disks -s test1

What is the expected output? What do you see instead?

Please provide any additional information below.

The sync fails after running for some time, reporting that the disks are degraded.

gnt-instance replace-disks -s test1
Tue Sep 25 21:36:25 2012 Replacing disk(s) 0 for test1.lan
Tue Sep 25 21:36:36 2012 STEP 1/6 Check device existence
Tue Sep 25 21:36:36 2012  - INFO: Checking disk/0 on node1.lan
Tue Sep 25 21:36:40 2012  - INFO: Checking disk/0 on bestefar.ss.uib.no
Tue Sep 25 21:36:40 2012  - INFO: Checking volume groups
Tue Sep 25 21:36:42 2012 STEP 2/6 Check peer consistency
Tue Sep 25 21:36:42 2012  - INFO: Checking disk/0 consistency on node node1.lan
Failure: command execution error:
Node node1.lan has degraded storage, unsafe to replace disks for instance test1.lan


primary:
Sep 25 21:35:16 node1 ganeti-rapi[17004]: INFO 127.0.0.1:43227 GET /version HTTP/1.1 200
Sep 25 21:35:16 node1 ganeti-masterd[676]: (ClientReq1) INFO Received config values query request for ['master_node']
Sep 25 21:35:16 node1 ganeti-masterd[676]: (ClientReq5) INFO Received job autoarchive request for age 21600, timeout 29
Sep 25 21:35:16 node1 ganeti-masterd[676]: (ClientReq5) INFO Archiving jobs with age more than 21600 seconds
Sep 25 21:35:16 node1 ganeti-masterd[676]: (ClientReq15) INFO Received new job
Sep 25 21:35:17 node1 ganeti-masterd[676]: (ClientReq11) INFO Received job poll request for 81612
Sep 25 21:35:17 node1 ganeti-masterd[676]: (ClientReq7) INFO Received job poll request for 81612
Sep 25 21:35:17 node1 ganeti-masterd[676]: (JobQueue3/Job81612) INFO Op 1/2: opcode INSTANCE_QUERY waiting for locks
Sep 25 21:35:17 node1 ganeti-masterd[676]: (ClientReq10) INFO Received job poll request for 81612
Sep 25 21:35:17 node1 ganeti-masterd[676]: (ClientReq13) INFO Received job poll request for 81612
Sep 25 21:35:17 node1 ganeti-noded[17015]: INFO 192.168.1.35:51808 PUT /all_instances_info HTTP/1.1 200
Sep 25 21:35:17 node1 ganeti-masterd[676]: (ClientReq3) INFO Received job poll request for 81612
Sep 25 21:35:17 node1 ganeti-masterd[676]: (JobQueue22/Job81612) INFO Op 2/2: opcode NODE_QUERY waiting for locks
Sep 25 21:35:17 node1 ganeti-masterd[676]: (ClientReq4) INFO Received job poll request for 81612
Sep 25 21:35:18 node1 ganeti-masterd[676]: (ClientReq8) INFO Received job poll request for 81612
Sep 25 21:35:19 node1 ganeti-noded[17038]: INFO 192.168.1.35:51813 PUT /node_info HTTP/1.1 200
Sep 25 21:35:19 node1 ganeti-masterd[676]: (JobQueue22/Job81612) INFO Finished job 81612, status = success
Sep 25 21:35:19 node1 ganeti-masterd[676]: (ClientReq2) INFO Received job query request for 81612
Sep 25 21:35:19 node1 ganeti-masterd[676]: (ClientReq14) INFO Received job archive request for 81612
Sep 25 21:35:19 node1 ganeti-masterd[676]: (ClientReq14) INFO Archiving job 81612
Sep 25 21:35:19 node1 ganeti-masterd[676]: (ClientReq12) INFO Received new job
Sep 25 21:35:20 node1 ganeti-masterd[676]: (ClientReq16) INFO Received job poll request for 81613
Sep 25 21:35:20 node1 ganeti-masterd[676]: (ClientReq9) INFO Received job poll request for 81613
Sep 25 21:35:20 node1 ganeti-masterd[676]: (JobQueue13/Job81613) INFO Op 1/1: opcode CLUSTER_VERIFY_DISKS waiting for locks
Sep 25 21:35:20 node1 ganeti-masterd[676]: (ClientReq6) INFO Received job poll request for 81613
Sep 25 21:35:20 node1 ganeti-masterd[676]: (ClientReq1) INFO Received job poll request for 81613
Sep 25 21:35:21 node1 ganeti-noded[17055]: INFO 192.168.1.35:51821 PUT /lv_list HTTP/1.1 200
Sep 25 21:35:21 node1 ganeti-masterd[676]: (JobQueue13/Job81613) INFO Finished job 81613, status = success
Sep 25 21:35:22 node1 ganeti-masterd[676]: (ClientReq5) INFO Received job query request for 81613
Sep 25 21:35:22 node1 ganeti-masterd[676]: (ClientReq15) INFO Received job archive request for 81613
Sep 25 21:35:22 node1 ganeti-masterd[676]: (ClientReq15) INFO Archiving job 81613
Sep 25 21:36:24 node1 gnt-instance replace-disks[17172]: INFO run with arguments '-s test1'
Sep 25 21:36:24 node1 ganeti-masterd[676]: (MainThread) INFO Accepted connection from pid=17172, uid=0, gid=0
Sep 25 21:36:24 node1 ganeti-masterd[676]: (ClientReq11) INFO Received new job
Sep 25 21:36:24 node1 ganeti-masterd[676]: (ClientReq7) INFO Received job poll request for 81614
Sep 25 21:36:24 node1 ganeti-masterd[676]: (ClientReq10) INFO Received job poll request for 81614
Sep 25 21:36:25 node1 ganeti-masterd[676]: (JobQueue25/Job81614) INFO Op 1/1: opcode INSTANCE_REPLACE_DISKS(test1) waiting for locks
Sep 25 21:36:25 node1 ganeti-masterd[676]: (ClientReq13) INFO Received job poll request for 81614
Sep 25 21:36:25 node1 ganeti-masterd[676]: (ClientReq3) INFO Received job poll request for 81614
Sep 25 21:36:25 node1 ganeti-noded[17183]: INFO 192.168.1.35:51829 PUT /hooks_runner HTTP/1.1 200
Sep 25 21:36:25 node1 ganeti-masterd[676]: (ClientReq4) INFO Received job poll request for 81614
Sep 25 21:36:29 node1 kernel: [10648783.529804] block drbd1: Starting worker thread (from drbdsetup [17196])
Sep 25 21:36:29 node1 kernel: [10648783.529911] block drbd1: disk( Diskless -> Attaching ) 
Sep 25 21:36:29 node1 kernel: [10648783.531221] block drbd1: No usable activity log found.
Sep 25 21:36:29 node1 kernel: [10648783.531225] block drbd1: Method to ensure write ordering: flush
Sep 25 21:36:29 node1 kernel: [10648783.531232] block drbd1: drbd_bm_resize called with capacity == 41943040
Sep 25 21:36:29 node1 kernel: [10648783.531443] block drbd1: resync bitmap: bits=5242880 words=81920 pages=160
Sep 25 21:36:29 node1 kernel: [10648783.531447] block drbd1: size = 20 GB (20971520 KB)
Sep 25 21:36:29 node1 kernel: [10648783.536060] block drbd1: bitmap READ of 160 pages took 5 jiffies
Sep 25 21:36:29 node1 kernel: [10648783.536797] block drbd1: recounting of set bits took additional 0 jiffies
Sep 25 21:36:29 node1 kernel: [10648783.536800] block drbd1: 17 GB (4388992 bits) marked out-of-sync by on disk bit-map.
Sep 25 21:36:29 node1 kernel: [10648783.536807] block drbd1: disk( Attaching -> UpToDate ) pdsk( DUnknown -> Outdated ) 
Sep 25 21:36:29 node1 kernel: [10648783.536812] block drbd1: attached to UUIDs 3534D005D75C18E0:0005000000000004:0004000000000004:0003000000000004
Sep 25 21:36:29 node1 kernel: [10648783.575210] block drbd1: conn( StandAlone -> Unconnected ) 
Sep 25 21:36:29 node1 kernel: [10648783.575239] block drbd1: Starting receiver thread (from drbd1_worker [17197])
Sep 25 21:36:29 node1 kernel: [10648783.575288] block drbd1: receiver (re)started
Sep 25 21:36:29 node1 kernel: [10648783.575297] block drbd1: conn( Unconnected -> WFConnection ) 
Sep 25 21:36:29 node1 ganeti-noded[17187]: INFO 192.168.1.35:51830 PUT /blockdev_assemble HTTP/1.1 200
Sep 25 21:36:30 node1 kernel: [10648784.383073] block drbd1: Handshake successful: Agreed network protocol version 96
Sep 25 21:36:30 node1 kernel: [10648784.383365] block drbd1: Peer authenticated using 16 bytes of 'md5' HMAC
Sep 25 21:36:30 node1 kernel: [10648784.383375] block drbd1: conn( WFConnection -> WFReportParams ) 
Sep 25 21:36:30 node1 kernel: [10648784.383396] block drbd1: Starting asender thread (from drbd1_receiver [17212])
Sep 25 21:36:30 node1 kernel: [10648784.383869] block drbd1: data-integrity-alg: <not-used>
Sep 25 21:36:30 node1 kernel: [10648784.383896] block drbd1: drbd_sync_handshake:
Sep 25 21:36:30 node1 kernel: [10648784.383900] block drbd1: self 3534D005D75C18E0:0005000000000004:0004000000000004:0003000000000004 bits:4388992 flags:0
Sep 25 21:36:30 node1 kernel: [10648784.383905] block drbd1: peer 0005000000000004:0000000000000000:0000000000000000:0000000000000000 bits:4388992 flags:0
Sep 25 21:36:30 node1 kernel: [10648784.383909] block drbd1: uuid_compare()=1 by rule 70
Sep 25 21:36:30 node1 kernel: [10648784.383912] block drbd1: Becoming sync source due to disk states.
Sep 25 21:36:30 node1 kernel: [10648784.383922] block drbd1: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapS ) pdsk( Outdated -> Inconsistent ) 
Sep 25 21:36:30 node1 kernel: [10648784.431059] block drbd1: helper command: /bin/true before-resync-source minor-1
Sep 25 21:36:30 node1 kernel: [10648784.431734] block drbd1: helper command: /bin/true before-resync-source minor-1 exit code 0 (0x0)
Sep 25 21:36:30 node1 kernel: [10648784.431741] block drbd1: conn( WFBitMapS -> SyncSource ) 
Sep 25 21:36:30 node1 kernel: [10648784.431751] block drbd1: Began resync as SyncSource (will sync 17555968 KB [4388992 bits set]).
Sep 25 21:36:30 node1 kernel: [10648784.431756] block drbd1: updated sync UUID 3534D005D75C18E0:0006000000000004:0005000000000004:0004000000000004
Sep 25 21:36:36 node1 kernel: [10648790.672360] block drbd1: role( Secondary -> Primary ) 
Sep 25 21:36:36 node1 ganeti-noded[17222]: INFO 192.168.1.35:51832 PUT /blockdev_assemble HTTP/1.1 200
Sep 25 21:36:36 node1 ganeti-masterd[676]: (JobQueue25/Job81614) INFO Checking disk/0 on node1.lan
Sep 25 21:36:37 node1 ganeti-masterd[676]: (ClientReq8) INFO Received job poll request for 81614
Sep 25 21:36:37 node1 ganeti-masterd[676]: (ClientReq2) INFO Received job poll request for 81614
Sep 25 21:36:40 node1 ganeti-noded[17277]: INFO 192.168.1.35:51833 PUT /blockdev_find HTTP/1.1 200
Sep 25 21:36:40 node1 ganeti-masterd[676]: (JobQueue25/Job81614) INFO Checking disk/0 on bestefar.ss.uib.no
Sep 25 21:36:40 node1 ganeti-masterd[676]: (ClientReq14) INFO Received job poll request for 81614
Sep 25 21:36:40 node1 ganeti-masterd[676]: (JobQueue25/Job81614) INFO Checking volume groups
Sep 25 21:36:40 node1 ganeti-masterd[676]: (ClientReq12) INFO Received job poll request for 81614
Sep 25 21:36:42 node1 ganeti-noded[17288]: INFO 192.168.1.35:51835 PUT /vg_list HTTP/1.1 200
Sep 25 21:36:42 node1 ganeti-masterd[676]: (JobQueue25/Job81614) INFO Checking disk/0 consistency on node node1.lan
Sep 25 21:36:42 node1 ganeti-masterd[676]: (ClientReq16) INFO Received job poll request for 81614
Sep 25 21:36:42 node1 ganeti-masterd[676]: (ClientReq9) INFO Received job poll request for 81614
Sep 25 21:36:45 node1 ganeti-noded[17295]: INFO 192.168.1.35:51837 PUT /blockdev_find HTTP/1.1 200
Sep 25 21:36:45 node1 ganeti-noded[17303]: INFO 192.168.1.35:51838 PUT /instance_list HTTP/1.1 200
Sep 25 21:36:48 node1 kernel: [10648802.423076] block drbd1: role( Primary -> Secondary ) 
Sep 25 21:36:48 node1 kernel: [10648802.482407] block drbd1: peer( Secondary -> Unknown ) conn( SyncSource -> Disconnecting ) 
Sep 25 21:36:48 node1 kernel: [10648802.482514] block drbd1: asender terminated
Sep 25 21:36:48 node1 kernel: [10648802.482520] block drbd1: Terminating drbd1_asender
Sep 25 21:36:48 node1 kernel: [10648802.493169] block drbd1: bitmap WRITE of 152 pages took 3 jiffies
Sep 25 21:36:48 node1 kernel: [10648802.516910] block drbd1: 16 GB (4112512 bits) marked out-of-sync by on disk bit-map.
Sep 25 21:36:48 node1 kernel: [10648802.516922] block drbd1: Connection closed
Sep 25 21:36:48 node1 kernel: [10648802.516932] block drbd1: conn( Disconnecting -> StandAlone ) 
Sep 25 21:36:48 node1 kernel: [10648802.516954] block drbd1: receiver terminated
Sep 25 21:36:48 node1 kernel: [10648802.516957] block drbd1: Terminating drbd1_receiver
Sep 25 21:36:48 node1 kernel: [10648802.516997] block drbd1: disk( UpToDate -> Failed ) 
Sep 25 21:36:48 node1 kernel: [10648802.517023] block drbd1: Sending state for detaching disk failed
Sep 25 21:36:48 node1 kernel: [10648802.517037] block drbd1: disk( Failed -> Diskless ) 
Sep 25 21:36:48 node1 kernel: [10648802.517221] block drbd1: drbd_bm_resize called with capacity == 0
Sep 25 21:36:48 node1 kernel: [10648802.517270] block drbd1: worker terminated
Sep 25 21:36:48 node1 kernel: [10648802.517273] block drbd1: Terminating drbd1_worker
Sep 25 21:36:50 node1 ganeti-noded[17305]: INFO 192.168.1.35:51839 PUT /blockdev_shutdown HTTP/1.1 200
Sep 25 21:36:50 node1 ganeti-masterd[676]: (JobQueue25/Job81614) ERROR Op 1/1: Caught exception in INSTANCE_REPLACE_DISKS(test1)
Sep 25 21:36:50 node1 Traceback (most recent call last):
Sep 25 21:36:50 node1 File "/usr/lib64/python2.7/site-packages/ganeti/jqueue.py", line 936, in _ExecOpCodeUnlocked
Sep 25 21:36:50 node1 timeout=timeout, priority=op.priority)
Sep 25 21:36:50 node1 File "/usr/lib64/python2.7/site-packages/ganeti/mcpu.py", line 352, in ExecOpCode
Sep 25 21:36:50 node1 priority)
Sep 25 21:36:50 node1 File "/usr/lib64/python2.7/site-packages/ganeti/mcpu.py", line 296, in _LockAndExecLU
Sep 25 21:36:50 node1 result = self._LockAndExecLU(lu, level + 1, calc_timeout, priority)
Sep 25 21:36:50 node1 File "/usr/lib64/python2.7/site-packages/ganeti/mcpu.py", line 305, in _LockAndExecLU
Sep 25 21:36:50 node1 result = self._LockAndExecLU(lu, level + 1, calc_timeout, priority)
Sep 25 21:36:50 node1 File "/usr/lib64/python2.7/site-packages/ganeti/mcpu.py", line 296, in _LockAndExecLU
Sep 25 21:36:50 node1 result = self._LockAndExecLU(lu, level + 1, calc_timeout, priority)
Sep 25 21:36:50 node1 File "/usr/lib64/python2.7/site-packages/ganeti/mcpu.py", line 256, in _LockAndExecLU
Sep 25 21:36:50 node1 result = self._ExecLU(lu)
Sep 25 21:36:50 node1 File "/usr/lib64/python2.7/site-packages/ganeti/mcpu.py", line 231, in _ExecLU
Sep 25 21:36:50 node1 result = lu.Exec(self.Log)
Sep 25 21:36:50 node1 File "/usr/lib64/python2.7/site-packages/ganeti/cmdlib.py", line 270, in Exec
Sep 25 21:36:50 node1 tl.Exec(feedback_fn)
Sep 25 21:36:50 node1 File "/usr/lib64/python2.7/site-packages/ganeti/cmdlib.py", line 8365, in Exec
Sep 25 21:36:50 node1 result = fn(feedback_fn)
Sep 25 21:36:50 node1 File "/usr/lib64/python2.7/site-packages/ganeti/cmdlib.py", line 8536, in _ExecDrbd8DiskOnly
Sep 25 21:36:50 node1 False)
Sep 25 21:36:50 node1 File "/usr/lib64/python2.7/site-packages/ganeti/cmdlib.py", line 8431, in _CheckDisksConsistency
Sep 25 21:36:50 node1 (node_name, self.instance.name))
Sep 25 21:36:50 node1 OpExecError: Node node1.lan has degraded storage, unsafe to replace disks for instance test1.lan
Sep 25 21:36:50 node1 ganeti-masterd[676]: (JobQueue25/Job81614) INFO Finished job 81614, status = error
Sep 25 21:36:50 node1 ganeti-masterd[676]: (ClientReq6) INFO Received job query request for 81614
Sep 25 21:36:50 node1 gnt-instance replace-disks[17172]: ERROR Error during command processing
Sep 25 21:36:50 node1 Traceback (most recent call last):
Sep 25 21:36:50 node1 File "/usr/lib64/python2.7/site-packages/ganeti/cli.py", line 1939, in GenericMain
Sep 25 21:36:50 node1 result = func(options, args)
Sep 25 21:36:50 node1 File "/usr/lib64/python2.7/site-packages/ganeti/client/gnt_instance.py", line 781, in ReplaceDisks
Sep 25 21:36:50 node1 SubmitOrSend(op, opts)
Sep 25 21:36:50 node1 File "/usr/lib64/python2.7/site-packages/ganeti/cli.py", line 1757, in SubmitOrSend
Sep 25 21:36:50 node1 return SubmitOpCode(op, cl=cl, feedback_fn=feedback_fn, opts=opts)
Sep 25 21:36:50 node1 File "/usr/lib64/python2.7/site-packages/ganeti/cli.py", line 1734, in SubmitOpCode
Sep 25 21:36:50 node1 reporter=reporter)
Sep 25 21:36:50 node1 File "/usr/lib64/python2.7/site-packages/ganeti/cli.py", line 1715, in PollJob
Sep 25 21:36:50 node1 return GenericPollJob(job_id, _LuxiJobPollCb(cl), reporter)
Sep 25 21:36:50 node1 File "/usr/lib64/python2.7/site-packages/ganeti/cli.py", line 1537, in GenericPollJob
Sep 25 21:36:50 node1 errors.MaybeRaise(msg)
Sep 25 21:36:50 node1 File "/usr/lib64/python2.7/site-packages/ganeti/errors.py", line 448, in MaybeRaise
Sep 25 21:36:50 node1 raise errcls, args
Sep 25 21:36:50 node1 OpExecError: Node node1.lan has degraded storage, unsafe to replace disks for instance test1.lan




secondary:
Sep 25 21:36:29 node2 kernel: [  597.351382] block drbd1: disk( Diskless -> Attaching ) 
Sep 25 21:36:29 node2 kernel: [  597.352712] block drbd1: No usable activity log found.
Sep 25 21:36:29 node2 kernel: [  597.352717] block drbd1: Method to ensure write ordering: flush
Sep 25 21:36:29 node2 kernel: [  597.352725] block drbd1: drbd_bm_resize called with capacity == 41943040
Sep 25 21:36:29 node2 kernel: [  597.352878] block drbd1: resync bitmap: bits=5242880 words=81920 pages=160
Sep 25 21:36:29 node2 kernel: [  597.352882] block drbd1: size = 20 GB (20971520 KB)
Sep 25 21:36:29 node2 kernel: [  597.355536] block drbd1: bitmap READ of 160 pages took 1 jiffies
Sep 25 21:36:29 node2 kernel: [  597.355673] block drbd1: recounting of set bits took additional 0 jiffies
Sep 25 21:36:29 node2 kernel: [  597.355676] block drbd1: 17 GB (4388992 bits) marked out-of-sync by on disk bit-map.
Sep 25 21:36:29 node2 kernel: [  597.355682] block drbd1: disk( Attaching -> Inconsistent ) 
Sep 25 21:36:29 node2 kernel: [  597.355685] block drbd1: attached to UUIDs 0005000000000004:0000000000000000:0000000000000000:0000000000000000
Sep 25 21:36:29 node2 kernel: [  597.395008] block drbd1: conn( StandAlone -> Unconnected ) 
Sep 25 21:36:29 node2 kernel: [  597.395058] block drbd1: Starting receiver thread (from drbd1_worker [3587])
Sep 25 21:36:29 node2 kernel: [  597.395300] block drbd1: receiver (re)started
Sep 25 21:36:29 node2 kernel: [  597.395308] block drbd1: conn( Unconnected -> WFConnection ) 
Sep 25 21:36:30 node2 kernel: [  597.895093] block drbd1: Handshake successful: Agreed network protocol version 96
Sep 25 21:36:30 node2 kernel: [  597.895335] block drbd1: Peer authenticated using 16 bytes of 'md5' HMAC
Sep 25 21:36:30 node2 kernel: [  597.895343] block drbd1: conn( WFConnection -> WFReportParams ) 
Sep 25 21:36:30 node2 kernel: [  597.895451] block drbd1: Starting asender thread (from drbd1_receiver [3601])
Sep 25 21:36:30 node2 kernel: [  597.895703] block drbd1: data-integrity-alg: <not-used>
Sep 25 21:36:30 node2 kernel: [  597.895720] block drbd1: drbd_sync_handshake:
Sep 25 21:36:30 node2 kernel: [  597.895724] block drbd1: self 0005000000000004:0000000000000000:0000000000000000:0000000000000000 bits:4388992 flags:0
Sep 25 21:36:30 node2 kernel: [  597.895729] block drbd1: peer 3534D005D75C18E0:0005000000000004:0004000000000004:0003000000000004 bits:4388992 flags:0
Sep 25 21:36:30 node2 kernel: [  597.895732] block drbd1: uuid_compare()=-1 by rule 50
Sep 25 21:36:30 node2 kernel: [  597.895735] block drbd1: Becoming sync target due to disk states.
Sep 25 21:36:30 node2 kernel: [  597.895742] block drbd1: peer( Unknown -> Secondary ) conn( WFReportParams -> WFBitMapT ) pdsk( DUnknown -> UpToDate ) 
Sep 25 21:36:30 node2 kernel: [  597.941628] block drbd1: conn( WFBitMapT -> WFSyncUUID ) 
Sep 25 21:36:30 node2 kernel: [  597.985957] block drbd1: updated sync uuid 0006000000000004:0000000000000000:0000000000000000:0000000000000000
Sep 25 21:36:30 node2 kernel: [  597.994754] block drbd1: helper command: /bin/true before-resync-target minor-1
Sep 25 21:36:30 node2 kernel: [  597.995590] block drbd1: helper command: /bin/true before-resync-target minor-1 exit code 0 (0x0)
Sep 25 21:36:30 node2 kernel: [  597.995597] block drbd1: conn( WFSyncUUID -> SyncTarget ) 
Sep 25 21:36:30 node2 kernel: [  597.995606] block drbd1: Began resync as SyncTarget (will sync 17555968 KB [4388992 bits set]).
Sep 25 21:36:36 node2 kernel: [  604.174503] block drbd1: peer( Secondary -> Primary ) 
Sep 25 21:36:48 node2 kernel: [  615.966592] block drbd1: peer( Primary -> Secondary ) 
Sep 25 21:36:48 node2 kernel: [  615.966815] block drbd1: peer( Secondary -> Unknown ) conn( SyncTarget -> TearDown ) pdsk( UpToDate -> DUnknown ) 
Sep 25 21:36:48 node2 kernel: [  615.967136] block drbd1: meta connection shut down by peer.
Sep 25 21:36:48 node2 kernel: [  615.967627] block drbd1: asender terminated
Sep 25 21:36:48 node2 kernel: [  615.967651] block drbd1: Terminating drbd1_asender
Sep 25 21:36:48 node2 kernel: [  615.977086] block drbd1: bitmap WRITE of 152 pages took 2 jiffies
Sep 25 21:36:48 node2 kernel: [  615.997371] block drbd1: 16 GB (4112512 bits) marked out-of-sync by on disk bit-map.
Sep 25 21:36:48 node2 kernel: [  615.997392] block drbd1: Connection closed
Sep 25 21:36:48 node2 kernel: [  615.997398] block drbd1: conn( TearDown -> Unconnected ) 
Sep 25 21:36:48 node2 kernel: [  615.997404] block drbd1: receiver terminated
Sep 25 21:36:48 node2 kernel: [  615.997406] block drbd1: Restarting drbd1_receiver
Sep 25 21:36:48 node2 kernel: [  615.997409] block drbd1: receiver (re)started
Sep 25 21:36:48 node2 kernel: [  615.997414] block drbd1: conn( Unconnected -> WFConnection ) 
Sep 25 21:36:50 node2 kernel: [  618.163815] block drbd1: conn( WFConnection -> Disconnecting ) 
Sep 25 21:36:50 node2 kernel: [  618.163905] block drbd1: Discarding network configuration.
Sep 25 21:36:50 node2 kernel: [  618.164037] block drbd1: Connection closed
Sep 25 21:36:50 node2 kernel: [  618.164051] block drbd1: conn( Disconnecting -> StandAlone ) 
Sep 25 21:36:50 node2 kernel: [  618.164238] block drbd1: receiver terminated
Sep 25 21:36:50 node2 kernel: [  618.164243] block drbd1: Terminating drbd1_receiver
Sep 25 21:36:50 node2 kernel: [  618.164301] block drbd1: disk( Inconsistent -> Failed ) 
Sep 25 21:36:50 node2 kernel: [  618.164369] block drbd1: Sending state for detaching disk failed
Sep 25 21:36:50 node2 kernel: [  618.164848] block drbd1: disk( Failed -> Diskless ) 
Sep 25 21:36:50 node2 kernel: [  618.166200] block drbd1: drbd_bm_resize called with capacity == 0
Sep 25 21:36:50 node2 kernel: [  618.166243] block drbd1: worker terminated
Sep 25 21:36:50 node2 kernel: [  618.166246] block drbd1: Terminating drbd1_worker

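The bitmap figures in the two kernel logs above are internally consistent. As a quick cross-check (assuming DRBD 8.3's usual layout: 512-byte sectors, one bitmap bit per 4 KiB of data, 64-bit bitmap words, 4 KiB bitmap pages), the logged numbers can be reproduced:

```python
# Cross-check the DRBD bitmap figures from the kernel logs above.
# Assumptions (standard for DRBD 8.3): 512-byte sectors, one bitmap bit
# per 4 KiB of data, 64-bit bitmap words, 4 KiB bitmap pages.

capacity_sectors = 41943040          # "drbd_bm_resize called with capacity == 41943040"
size_kb = capacity_sectors * 512 // 1024
bits = size_kb // 4                  # one bit per 4 KiB of data
words = bits // 64
pages = words * 8 // 4096            # 8 bytes per word, 4 KiB per page

print(size_kb, bits, words, pages)   # -> 20971520 5242880 81920 160

# "17 GB (4388992 bits) marked out-of-sync" matches
# "Began resync ... (will sync 17555968 KB [4388992 bits set])":
print(4388992 * 4)                   # -> 17555968 (KiB)
```

So the metadata itself looks sane; the failure is in the connection being torn down mid-sync, not in the bitmap accounting.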
Sep 25, 2012
#1 eyelessf...@gmail.com
It seems to work with gnt-instance replace-disks -a test1
Dec 9, 2013
Project Member #2 ultrot...@gmail.com
Have you tried with --ignore-consistency as well? Are you sure the primary of the instance was correct?

Thanks,

Guido
Status: NeedFeedback
Jan 29, 2014
#3 regna...@gmail.com
--ignore-consistency is not a valid option for replace-disks, and -a fails as well (see below).

I am suffering from exactly the same problem on a production 2-node cluster I'm in the middle of upgrading. In my test environment (and in prod) we have two Debian 6 (squeeze) nodes with Ganeti 2.1, which we in turn upgraded to Debian 7 and Ganeti 2.9 from wheezy-backports.

Due to hardware changes on the platform, we have to rebuild each node in turn. This works perfectly in our test environment (virtualized), with node A (master) running a dist-upgraded 7.3 with Ganeti 2.9.2, and node B (slave) running a freshly installed 7.3 with Ganeti 2.9.2. KVM was also upgraded from 1.1.2 to 1.7 on node A to match the KVM version on node B. DRBD is version 8.3.13 on A, 8.14 on B.

I simulated the entire rebuild with the above configuration yesterday, and now we are stuck as replace-disks fails in the manner described above.

On the test setup, replace-disks -s just worked.

On the prod setup, it failed:

OpExecError: Can't find disk/0 on node B: disk not found
Disks seem to be not properly activated. Try running activate-disks on the instance before using replace-disks.

So I did; it returned:
nodeA:disk/0:/dev/drbd3

Tried replace-disks again with -s:

Wed Jan 29 18:03:17 2014 STEP 2/6 Check peer consistency
Wed Jan 29 18:03:17 2014  - INFO: Checking disk/0 consistency on node A
Failure: command execution error:
Node A has degraded storage, unsafe to replace disks for instance instancename

Per the report above I tried with '-a':

Wed Jan 29 18:02:43 2014  - INFO: Checking disk/0 on B
Wed Jan 29 18:02:43 2014  - INFO: Checking disk/0 on A
Wed Jan 29 18:02:44 2014 No disks need replacement for instance 'instancename'

But clearly the disks are NOT in sync.

Any suggestions at this point?
Jan 29, 2014
#4 regna...@gmail.com
Forgot to include the log messages when running replace-disks -s:


2014-01-29 18:11:11,812: ganeti-masterd pid=9336/Jq9/Job394610/I_REPLACE_DISKS INFO Checking volume groups
2014-01-29 18:11:12,006: ganeti-masterd pid=9336/Jq9/Job394610/I_REPLACE_DISKS INFO Checking disk/0 consistency on node A
2014-01-29 18:11:12,232: ganeti-masterd pid=9336/Jq9/Job394610 ERROR Op 1/1: Caught exception in INSTANCE_REPLACE_DISKS(instancename)
Traceback (most recent call last):
  File "/usr/share/ganeti/ganeti/jqueue.py", line 1115, in _ExecOpCodeUnlocked
    timeout=timeout)
  File "/usr/share/ganeti/ganeti/jqueue.py", line 1426, in _WrapExecOpCode
    return execop_fn(op, *args, **kwargs)
  File "/usr/share/ganeti/ganeti/mcpu.py", line 517, in ExecOpCode
    calc_timeout)
  File "/usr/share/ganeti/ganeti/mcpu.py", line 459, in _LockAndExecLU
    result = self._LockAndExecLU(lu, level + 1, calc_timeout)
  File "/usr/share/ganeti/ganeti/mcpu.py", line 468, in _LockAndExecLU
    result = self._LockAndExecLU(lu, level + 1, calc_timeout)
  File "/usr/share/ganeti/ganeti/mcpu.py", line 468, in _LockAndExecLU
    result = self._LockAndExecLU(lu, level + 1, calc_timeout)
  File "/usr/share/ganeti/ganeti/mcpu.py", line 459, in _LockAndExecLU
    result = self._LockAndExecLU(lu, level + 1, calc_timeout)
  File "/usr/share/ganeti/ganeti/mcpu.py", line 459, in _LockAndExecLU
    result = self._LockAndExecLU(lu, level + 1, calc_timeout)
  File "/usr/share/ganeti/ganeti/mcpu.py", line 468, in _LockAndExecLU
    result = self._LockAndExecLU(lu, level + 1, calc_timeout)
  File "/usr/share/ganeti/ganeti/mcpu.py", line 407, in _LockAndExecLU
    result = self._ExecLU(lu)
  File "/usr/share/ganeti/ganeti/mcpu.py", line 374, in _ExecLU
    result = _ProcessResult(submit_mj_fn, lu.op, lu.Exec(self.Log))
  File "/usr/share/ganeti/ganeti/cmdlib/base.py", line 250, in Exec
    tl.Exec(feedback_fn)
  File "/usr/share/ganeti/ganeti/cmdlib/instance_storage.py", line 2158, in Exec
    result = fn(feedback_fn)
  File "/usr/share/ganeti/ganeti/cmdlib/instance_storage.py", line 2347, in _ExecDrbd8DiskOnly
    False)
  File "/usr/share/ganeti/ganeti/cmdlib/instance_storage.py", line 2235, in _CheckDisksConsistency
    self.instance.name))
OpExecError: Node A has degraded storage, unsafe to replace disks for instance instancename
2014-01-29 18:11:12,241: ganeti-masterd pid=9336/ClientReq3 INFO Received job poll request for 394610
2014-01-29 18:11:12,243: ganeti-masterd pid=9336/ClientReq4 INFO Received job poll request for 394610
2014-01-29 18:11:12,370: ganeti-masterd pid=9336/Jq9/Job394610 INFO Finished job 394610, status = error
2014-01-29 18:11:12,486: ganeti-masterd pid=9336/ClientReq5 INFO Received job query request for 394610



Jan 29, 2014
#5 regna...@gmail.com
And some more traces. I tried reproducing this on node B by starting from scratch, and indeed:

# gnt-instance replace-disks -a instancename
Wed Jan 29 18:18:49 2014  - INFO: Checking disk/0 on B
Failure: prerequisites not met for this operation:
error type: wrong_state, error details:
Please run activate-disks on instance instancename first

# gnt-instance activate-disks instancename
nodeA:disk/0:/dev/drbd3

# gnt-instance replace-disks -a instancename
Wed Jan 29 18:19:55 2014  - INFO: Checking disk/0 on B
Wed Jan 29 18:19:55 2014  - INFO: Checking disk/0 on A
Wed Jan 29 18:19:56 2014 Replacing disk(s) 0 for instance 'instancename'
Wed Jan 29 18:19:56 2014 Current primary node: A
Wed Jan 29 18:19:56 2014 Current seconary node: B
Wed Jan 29 18:19:56 2014 STEP 1/6 Check device existence
Wed Jan 29 18:19:56 2014  - INFO: Checking disk/0 on A
Wed Jan 29 18:19:56 2014  - INFO: Checking disk/0 on B
Wed Jan 29 18:19:56 2014  - INFO: Checking volume groups
Wed Jan 29 18:19:56 2014 STEP 2/6 Check peer consistency
Wed Jan 29 18:19:56 2014  - INFO: Checking disk/0 consistency on node A
Wed Jan 29 18:19:57 2014 STEP 3/6 Allocate new storage
Wed Jan 29 18:19:57 2014  - INFO: Adding storage on B for disk/0
Wed Jan 29 18:19:58 2014 STEP 4/6 Changing drbd configuration
Wed Jan 29 18:19:58 2014  - INFO: Detaching disk/0 drbd from local storage
Wed Jan 29 18:19:58 2014  - INFO: Renaming the old LVs on the target node
Wed Jan 29 18:19:58 2014  - INFO: Renaming the new LVs on the target node
Wed Jan 29 18:19:59 2014  - INFO: Adding new mirror component on B
Wed Jan 29 18:20:01 2014 STEP 5/6 Sync devices
Wed Jan 29 18:20:01 2014  - INFO: Waiting for instance instancename to sync disks
Wed Jan 29 18:20:13 2014  - INFO: Instance instancename's disks are in sync

[here it seems to be running for about 10 seconds]

Failure: command execution error:
DRBD device disk/0 is degraded!


In the log, during the 10 second pause:


2014-01-29 18:20:01,574: ganeti-masterd pid=9336/Jq16/Job394618/I_REPLACE_DISKS INFO Waiting for instance instancename to sync disks
2014-01-29 18:20:01,819: ganeti-masterd pid=9336/Jq16/Job394618/I_REPLACE_DISKS INFO Degraded disks found, 10 retries left
2014-01-29 18:20:02,190: ganeti-masterd pid=9336/ClientReq1 INFO Received job poll request for 394618
2014-01-29 18:20:02,990: ganeti-masterd pid=9336/Jq16/Job394618/I_REPLACE_DISKS INFO Degraded disks found, 9 retries left
2014-01-29 18:20:04,192: ganeti-masterd pid=9336/Jq16/Job394618/I_REPLACE_DISKS INFO Degraded disks found, 8 retries left
2014-01-29 18:20:05,355: ganeti-masterd pid=9336/Jq16/Job394618/I_REPLACE_DISKS INFO Degraded disks found, 7 retries left
2014-01-29 18:20:06,527: ganeti-masterd pid=9336/Jq16/Job394618/I_REPLACE_DISKS INFO Degraded disks found, 6 retries left
2014-01-29 18:20:07,690: ganeti-masterd pid=9336/Jq16/Job394618/I_REPLACE_DISKS INFO Degraded disks found, 5 retries left
2014-01-29 18:20:08,855: ganeti-masterd pid=9336/Jq16/Job394618/I_REPLACE_DISKS INFO Degraded disks found, 4 retries left
2014-01-29 18:20:10,022: ganeti-masterd pid=9336/Jq16/Job394618/I_REPLACE_DISKS INFO Degraded disks found, 3 retries left
2014-01-29 18:20:11,184: ganeti-masterd pid=9336/Jq16/Job394618/I_REPLACE_DISKS INFO Degraded disks found, 2 retries left
2014-01-29 18:20:12,348: ganeti-masterd pid=9336/Jq16/Job394618/I_REPLACE_DISKS INFO Degraded disks found, 1 retries left
2014-01-29 18:20:13,507: ganeti-masterd pid=9336/Jq16/Job394618/I_REPLACE_DISKS INFO Instance instancename's disks are in sync
2014-01-29 18:20:13,710: ganeti-masterd pid=9336/Jq16/Job394618 ERROR Op 1/1: Caught exception in INSTANCE_REPLACE_DISKS(instancename)
Traceback (most recent call last):
  File "/usr/share/ganeti/ganeti/jqueue.py", line 1115, in _ExecOpCodeUnlocked
    timeout=timeout)
  File "/usr/share/ganeti/ganeti/jqueue.py", line 1426, in _WrapExecOpCode
    return execop_fn(op, *args, **kwargs)
  File "/usr/share/ganeti/ganeti/mcpu.py", line 517, in ExecOpCode
    calc_timeout)
  File "/usr/share/ganeti/ganeti/mcpu.py", line 459, in _LockAndExecLU
    result = self._LockAndExecLU(lu, level + 1, calc_timeout)
  File "/usr/share/ganeti/ganeti/mcpu.py", line 468, in _LockAndExecLU
    result = self._LockAndExecLU(lu, level + 1, calc_timeout)
  File "/usr/share/ganeti/ganeti/mcpu.py", line 468, in _LockAndExecLU
    result = self._LockAndExecLU(lu, level + 1, calc_timeout)
  File "/usr/share/ganeti/ganeti/mcpu.py", line 459, in _LockAndExecLU
    result = self._LockAndExecLU(lu, level + 1, calc_timeout)
  File "/usr/share/ganeti/ganeti/mcpu.py", line 459, in _LockAndExecLU
    result = self._LockAndExecLU(lu, level + 1, calc_timeout)
  File "/usr/share/ganeti/ganeti/mcpu.py", line 468, in _LockAndExecLU
    result = self._LockAndExecLU(lu, level + 1, calc_timeout)
  File "/usr/share/ganeti/ganeti/mcpu.py", line 407, in _LockAndExecLU
    result = self._ExecLU(lu)
  File "/usr/share/ganeti/ganeti/mcpu.py", line 374, in _ExecLU
    result = _ProcessResult(submit_mj_fn, lu.op, lu.Exec(self.Log))
  File "/usr/share/ganeti/ganeti/cmdlib/base.py", line 250, in Exec
    tl.Exec(feedback_fn)
  File "/usr/share/ganeti/ganeti/cmdlib/instance_storage.py", line 2158, in Exec
    result = fn(feedback_fn)
  File "/usr/share/ganeti/ganeti/cmdlib/instance_storage.py", line 2453, in _ExecDrbd8DiskOnly
    self._CheckDevices(self.instance.primary_node, iv_names)
  File "/usr/share/ganeti/ganeti/cmdlib/instance_storage.py", line 2300, in _CheckDevices
    raise errors.OpExecError("DRBD device %s is degraded!" % name)
OpExecError: DRBD device disk/0 is degraded!
2014-01-29 18:20:13,831: ganeti-masterd pid=9336/Jq16/Job394618 INFO Finished job 394618, status = error

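The countdown above comes from a bounded degraded-state poll: once the last poll sees the disks clean, the job logs "disks are in sync", and only a later, stricter per-device check raises "DRBD device disk/0 is degraded!". A minimal sketch of that pattern (illustrative only, not Ganeti's actual WaitForSync code; the predicate and retry count are made up for the example):

```python
import time

def wait_for_sync(is_degraded, retries=10, delay=0.0):
    """Poll a degraded-state predicate with a bounded retry budget,
    mimicking the "Degraded disks found, N retries left" messages in
    the log above. Illustrative sketch, not Ganeti's implementation."""
    for remaining in range(retries, 0, -1):
        if not is_degraded():
            return True
        print("Degraded disks found, %d retries left" % remaining)
        time.sleep(delay)
    # Retry budget exhausted: report whatever state we see now.
    return not is_degraded()

# A device that only stops looking degraded on the final check: the
# loop then reports success, even though a separate, later device
# check (connectivity, as in comment #5) can still fail.
state = {"polls": 0}
def fake_degraded():
    state["polls"] += 1
    return state["polls"] <= 10

print(wait_for_sync(fake_degraded))
```

This is why "disks are in sync" followed immediately by a degraded error is not contradictory: the two checks look at different things.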


Jan 29, 2014
#6 regna...@gmail.com
OK, so it seems I've found the origin of the problem. For some reason, even after doing rmmod / depmod -a / modprobe, the settings in /etc/modules (including usermode_helper) hadn't been picked up.

Looking at https://groups.google.com/forum/#!msg/ganeti/7h1i-yWcp4s/FHQzVY_6DB0J I was reminded to check /sys/module/drbd/parameters/, and indeed the helper was still /sbin/drbdadm. I rebooted the prod slave node, and now it behaves as expected.
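For anyone hitting the same thing, the effective parameters of the loaded module can be read straight from sysfs (the function argument below is only there to make the snippet testable against an arbitrary file; the default path is the standard one for a loaded drbd module):

```shell
# Print the usermode helper the loaded drbd module is actually using.
# An argument overrides the path (useful for testing); the default is
# the standard sysfs location for module parameters.
drbd_helper() {
    cat "${1:-/sys/module/drbd/parameters/usermode_helper}"
}

# On an affected node this still prints /sbin/drbdadm even though the
# module options configure a different helper; a module reload or
# reboot is needed before the new value takes effect.
```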

Still, something odd: modinfo drbd tells me the in-kernel DRBD is 8.3.11, but the userland utils are 8.4:

ii  drbd8-utils                        2:8.4.4-1~bpo70+1             amd64

I'll ask about this on the discussion group.
