
redis - issue #522

Master/Slave connection flapping


Posted on Apr 14, 2011 by Massive Monkey

What version of Redis are you using, and on what kind of operating system? Redis 2.2.4 on both master and slave. The slave uses TCMALLOC, the master does not. Linux, dedicated hardware.

What is the problem you are experiencing? Very strange behavior where the master-slave connection times out and reconnects repeatedly, forcing multiple BGSAVEs for SYNC. From what we can tell, the network is solid. Related to this, a client connected to the master has a keepalive ping that occasionally takes exactly 107.3 seconds to return (it does eventually return), though the period between events is inconsistent -- anywhere between 5 and 14 minutes.
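For reference, a probe along these lines is enough to catch one of the stalls (a minimal sketch in Python over a raw socket; the master address matches the INFO dump below, and the 1-second threshold and 5-second period are arbitrary choices, not anything from our actual client):

#!/usr/bin/env python
# ping_probe.py -- time PING round-trips against the master and report
# any that take abnormally long. Minimal sketch; adjust HOST/PORT.
import socket
import time

HOST, PORT = "10.0.0.5", 6379   # master address, per the INFO dump below

sock = socket.create_connection((HOST, PORT))
while True:
    start = time.time()
    sock.sendall(b"PING\r\n")    # inline-command form of PING
    reply = sock.recv(64)        # expect b"+PONG\r\n"
    elapsed = time.time() - start
    if elapsed > 1.0:
        print("%s slow PING: %.1fs (reply=%r)" % (time.ctime(), elapsed, reply))
    time.sleep(5)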

What steps will reproduce the problem? Great question! Not sure yet, but still digging.

Do you have an INFO output? Please paste it here. The INFO below was captured during a period when the client was not immediately receiving a response from the master.

Master:
redis_version:2.2.4
redis_git_sha1:00000000
redis_git_dirty:0
arch_bits:64
multiplexing_api:epoll
process_id:12013
uptime_in_seconds:43962
uptime_in_days:0
lru_clock:256199
used_cpu_sys:1.67
used_cpu_user:2.54
used_cpu_sys_childrens:0.68
used_cpu_user_childrens:0.26
connected_clients:3
connected_slaves:1
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:0
used_memory:1752264
used_memory_human:1.67M
used_memory_rss:2899968
mem_fragmentation_ratio:1.65
use_tcmalloc:0
loading:0
aof_enabled:0
changes_since_last_save:1
bgsave_in_progress:0
last_save_time:1302795404
bgrewriteaof_in_progress:0
total_connections_received:198
total_commands_processed:2693
expired_keys:1
evicted_keys:0
keyspace_hits:733
keyspace_misses:23
hash_max_zipmap_entries:64
hash_max_zipmap_value:512
pubsub_channels:0
pubsub_patterns:0
vm_enabled:0
role:master
allocation_stats:2=204,6=1,7=1,8=2798,9=696,10=5798,11=212,12=590,13=2737,14=188,15=6540,16=27115,17=866,18=1740,19=149,20=167,21=32,22=293,23=154,24=8117,25=7,26=5,27=105,28=6,29=6,30=10,31=4,32=2047,34=5,35=6,36=9,37=52,38=27,39=66,40=122,41=6,42=170,43=4,44=2,45=7,46=4,47=43,48=810,49=2,50=25,51=134,52=328,53=3,55=1,56=22,57=7,58=4,59=87,60=159,61=1,62=1,63=1,64=958,65=10,66=1,67=3,68=4,69=4,70=24,71=8,72=51,73=4,74=3,75=13,76=3,77=3,78=7,79=7,80=13,81=18,82=20,83=33,84=25,85=18,86=16,87=12,88=1153,89=9,90=3,91=17,99=2,100=22,101=1,107=2,108=16,114=6,115=2,117=16,123=2,127=16,128=10,129=8,131=2,139=2,141=16,147=3,155=2,157=2,163=2,165=1,167=1,171=2,176=16,179=2,181=10,182=6,187=2,195=2,203=2,211=2,213=1,219=2,223=1,227=2,233=1,235=2,243=3,244=16,251=8,>=256=1296
db0:keys=1153,expires=0

Slave:
redis_version:2.2.4
redis_git_sha1:00000000
redis_git_dirty:0
arch_bits:64
multiplexing_api:epoll
process_id:8561
uptime_in_seconds:822
uptime_in_days:0
lru_clock:256198
used_cpu_sys:0.07
used_cpu_user:0.14
used_cpu_sys_childrens:0.00
used_cpu_user_childrens:0.00
connected_clients:3
connected_slaves:0
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:0
used_memory:1580400
used_memory_human:1.51M
used_memory_rss:3108864
mem_fragmentation_ratio:1.97
use_tcmalloc:1
loading:0
aof_enabled:0
changes_since_last_save:1
bgsave_in_progress:0
last_save_time:1302795403
bgrewriteaof_in_progress:0
total_connections_received:5
total_commands_processed:192
expired_keys:0
evicted_keys:0
keyspace_hits:1
keyspace_misses:0
hash_max_zipmap_entries:512
hash_max_zipmap_value:64
pubsub_channels:0
pubsub_patterns:0
vm_enabled:0
role:slave
master_host:10.0.0.5
master_port:6379
master_link_status:up
master_last_io_seconds_ago:34
master_sync_in_progress:0
allocation_stats:2=408,6=1,7=2,8=2092,9=65,10=11412,11=304,12=1022,13=495,14=158,15=4264,16=33369,17=271,18=3441,19=206,20=95,21=58,22=486,23=89,24=15069,25=9,26=5,27=12,28=6,29=6,30=16,31=2,32=2244,34=6,35=9,36=10,37=5,38=45,39=116,40=105,41=10,42=338,43=5,44=4,45=11,46=5,47=1,48=48,49=2,50=93,51=265,52=655,53=2,56=8,57=12,58=7,59=169,60=318,61=1,62=1,63=1,64=1898,65=15,66=1,67=5,68=4,69=4,70=46,71=11,72=98,73=4,74=3,75=5,76=3,77=3,78=7,79=7,80=13,81=18,82=20,83=57,84=25,85=18,86=19,87=12,88=1900,89=9,90=3,91=6,99=6,100=44,101=2,107=4,108=32,112=2,114=12,115=4,117=32,123=4,125=2,127=32,128=5,131=5,138=2,139=4,141=32,147=6,151=2,155=4,157=4,163=4,164=2,167=2,171=4,176=32,177=2,179=4,182=12,187=4,190=2,195=4,203=6,211=4,213=2,217=2,219=4,223=2,227=4,231=2,233=2,235=4,243=6,244=32,245=2,251=16,>=256=2355
db0:keys=1153,expires=0

If it is a crash, can you please paste the stack trace that you can find in the log file or on standard output? This is really useful for us!

Please provide any additional information below. With loglevel DEBUG enabled on both master and slave, there is no log entry at the time the "long ping" is initiated by the client, though there are entries within a few seconds on either side of that time on both master and slave.

At one point when the client was waiting for a response, I was able to immediately connect to both master and slave via redis-cli and get the info dumps, so neither process was blocked. I've also been watching top and vmstat on both boxes -- no high cpu, no swapping, nothing out of the ordinary during the event.

Looking at packets on the machine the master runs on, the ping packets just stop during the event (which is expected, since the ping is blocking on the client side).

From the following logs, it appears that the master never notices that the connection to the slave is down, while the slave detects the drop and reconnects:

Master:
[19584] 14 Apr 10:41:47 - 1 clients connected (1 slaves), 1740336 bytes in use
[19584] 14 Apr 10:41:52 - DB 0: 1152 keys (0 volatile) in 2048 slots HT.
[19584] 14 Apr 10:41:52 - 1 clients connected (1 slaves), 1740336 bytes in use
[19584] 14 Apr 10:41:53 - Accepted 10.0.0.6:56416
[19584] 14 Apr 10:41:53 * Slave ask for synchronization
[19584] 14 Apr 10:41:53 * Starting BGSAVE for SYNC
[19584] 14 Apr 10:41:53 * Background saving started by pid 19876
[19876] 14 Apr 10:41:53 * DB saved on disk
[19584] 14 Apr 10:41:53 * Background saving terminated with success
[19584] 14 Apr 10:41:53 * Synchronization with slave succeeded
[19584] 14 Apr 10:41:53 - Client closed connection
[19584] 14 Apr 10:41:57 - DB 0: 1152 keys (0 volatile) in 2048 slots HT.
[19584] 14 Apr 10:41:57 - 1 clients connected (1 slaves), 1740336 bytes in use
[19584] 14 Apr 10:42:02 - DB 0: 1152 keys (0 volatile) in 2048 slots HT.

Slave:
[8561] 14 Apr 10:41:46 - 2 clients connected (0 slaves), 1572280 bytes in use
[8561] 14 Apr 10:41:51 - DB 0: 1152 keys (0 volatile) in 3072 slots HT.
[8561] 14 Apr 10:41:51 - 2 clients connected (0 slaves), 1572280 bytes in use
[8561] 14 Apr 10:41:52 # MASTER time out: no data nor PING received...
[8561] 14 Apr 10:41:52 * Connecting to MASTER...
[8561] 14 Apr 10:41:52 * MASTER <-> SLAVE sync started: SYNC sent
[8561] 14 Apr 10:41:52 * MASTER <-> SLAVE sync: receiving 147029 bytes from master
[8561] 14 Apr 10:41:52 * MASTER <-> SLAVE sync: Loading DB in memory
[8561] 14 Apr 10:41:52 . LOADING OBJECT 2 (at 22)
[8561] 14 Apr 10:41:52 . LOADING OBJECT 2 (at 44)

Sorry for the long post, but this is a real head-scratcher. No VM on either box. The only other non-standard thing is that I have activerehashing turned off on the slave, and slave-serve-stale-data set to yes.
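For completeness, that means the slave's only non-default redis.conf lines should look like this (a sketch covering just the two settings mentioned above):

activerehashing no
slave-serve-stale-data yes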

What is the timeout for the master/slave connection, and can it be adjusted? What is the slave's threshold for deciding the master is disconnected -- a single missing ping, or several?
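(From what I can tell, 2.2 seems to hard-code the slave-side timeout at 60 seconds; later releases appear to expose the relevant knobs in redis.conf, along these lines -- a sketch assuming a newer build, not something I've confirmed is tunable in 2.2:)

# how often the master sends PING to its slaves (default 10 seconds)
repl-ping-slave-period 10
# seconds of silence on the replication link before the slave declares the master down (default 60)
repl-timeout 60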

Cheers!

-Dean

Comment #1

Posted on Apr 14, 2011 by Massive Monkey

One more piece of data -- in reference to the log excerpts at the bottom of the original post, the "long ping" was issued at 10:40:44 and returned at 10:42:31, i.e. the same ~107-second window. Both processes seem to be up and happy during that time.

Thanks for any insights!

-Dean

Comment #2

Posted on Apr 19, 2011 by Massive Monkey

Still looking into the root cause of this issue. While using redis-cli, I noticed that a simple DEL command appeared to hang, so I quickly fired up another redis-cli connection to the same server to get an INFO dump. The second session connected normally and returned INFO results without delay (see below).

So what I'm seeing is that the master just ignores one connected client for a long period of time (107.3 seconds, consistently), during which other clients are not blocked (at least clients that connect after the blocking starts). There is nothing suspicious in the master's logfile (set to DEBUG level logging).
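A two-connection check along these lines reproduces it programmatically (a minimal sketch using the redis-py client; the key name is a placeholder, and this just mirrors what I did by hand with redis-cli):

#!/usr/bin/env python
# two_conn_check.py -- time a DEL on one connection while a second,
# fresh connection issues INFO. Minimal sketch; adjust HOST and key.
import threading
import time
import redis

HOST = "10.0.0.5"   # master address, per the INFO dumps above

def timed_del():
    r1 = redis.Redis(host=HOST, port=6379)
    start = time.time()
    r1.delete("some-key")          # placeholder key
    print("DEL took %.1fs" % (time.time() - start))

t = threading.Thread(target=timed_del)
t.start()
time.sleep(2)                            # give the DEL a moment to stall
r2 = redis.Redis(host=HOST, port=6379)   # fresh connection, like the second redis-cli
start = time.time()
r2.info()
print("INFO on the fresh connection took %.2fs" % (time.time() - start))
t.join()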

I'm less convinced that this is related to the master/slave configuration, but still digging...any suggestions as to what else to look for are welcome!

INFO from Master:

redis_version:2.2.4
redis_git_sha1:00000000
redis_git_dirty:0
arch_bits:64
multiplexing_api:epoll
process_id:19584
uptime_in_seconds:428064
uptime_in_days:4
lru_clock:299559
used_cpu_sys:23.18
used_cpu_user:41.58
used_cpu_sys_childrens:18.78
used_cpu_user_childrens:6.35
connected_clients:26
connected_slaves:1
client_longest_output_list:0
client_biggest_input_buf:0
blocked_clients:1
used_memory:4056784
used_memory_human:3.87M
used_memory_rss:5447680
mem_fragmentation_ratio:1.34
use_tcmalloc:0
loading:0
aof_enabled:0
changes_since_last_save:0
bgsave_in_progress:0
last_save_time:1303229826
bgrewriteaof_in_progress:0
total_connections_received:7505
total_commands_processed:112278
expired_keys:391
evicted_keys:0
keyspace_hits:78584
keyspace_misses:16666
hash_max_zipmap_entries:64
hash_max_zipmap_value:512
pubsub_channels:0
pubsub_patterns:0
vm_enabled:0
role:master
allocation_stats:2=2035,6=1,7=1,8=28229,9=358775,10=34286,11=11109,12=38353,13=102671,14=15433,15=60355,16=1236807,17=382380,18=18275,19=49790,20=4894,21=4904,22=6028,23=4091,24=64686,25=2708,26=4169,27=1133,28=6,29=2173,30=421,31=13,32=78609,33=17,34=5,35=240,36=28,37=465,38=848,39=2372,40=712,41=4741,42=170,43=4144,44=2,45=55,46=19,47=300,48=38067,49=6172,50=1768,51=1139,52=328,53=14,55=2024,56=3501,57=19,58=54,59=261,60=161,61=34,62=1,63=12,64=1117,65=13,66=1,67=224,68=113,69=1355,70=1750,71=337,72=1798,73=26,74=3,75=246,76=3,77=21,78=7,79=176,80=50,81=44,82=711,83=1139,84=105,85=23,86=86,87=45,88=8461,89=9,90=10,91=339,93=13,96=1359,97=13,98=10,99=889,100=1240,101=206,102=17,103=2225,104=9,105=3,106=577,107=905,108=179,109=129,110=12,111=17,112=503,113=525,114=851,115=183,116=29,117=343,118=46,119=11,120=499,121=373,122=118,123=95,124=23,125=33,126=300,127=211,128=256,129=48,130=56,131=282,132=109,133=1360,134=10,135=155,136=261,137=66,138=1242,139=104,140=18,141=50,142=29,143=297,144=42,145=33,146=12,147=49,148=54,149=48,150=11,151=56,152=20,153=10,154=12,155=52,156=8,157=12,158=8,159=7,160=13,161=36,162=191,163=198,164=17,165=21,166=4,167=37,168=1,169=2,170=111,171=33,172=1,173=5,174=7,175=36,176=91,177=148,178=33,179=50,180=5,181=80,182=33,183=18,184=72,185=12,186=24,187=40,188=11,189=7,190=18,191=12,192=158,193=13,194=4,195=52,198=1,199=1941,200=1,203=65,205=11,206=160,207=9,208=2,209=66,210=344,211=77,212=10,213=11,214=4,215=15,216=29,217=36,218=9,219=91,220=14,221=14,222=37,223=28,224=166,225=134,226=7,227=72,228=11,229=71,230=95,231=287,232=453,233=23,234=26,235=94,236=41,237=79,238=41,239=189,240=5,241=7,242=4,243=67,244=60,245=48,246=59,247=21,248=32,249=61,250=12,251=59,252=32,253=30,254=171,255=19,>=256=56742
db0:keys=2981,expires=0

Comment #3

Posted on Apr 21, 2011 by Massive Monkey

I've traced this back to a packet loss issue between nodes. It has nothing to do with Redis. Please close the ticket.

Comment #4

Posted on Apr 21, 2011 by Happy Elephant

Dean, sorry for not replying earlier here. Good to see you were able to resolve this.

Status: Invalid

Labels: Type-Defect, Priority-Medium