Export to GitHub

redis - issue #543

One of our Redis servers crashes frequently


Posted on May 5, 2011 by Quick Bear

What version of Redis you are using, in what kind of Operating System? Redis versions: 2.2.1/2.2.2/2.2.4 System: Red Hat Enterprise Linux Server release 5.5

What is the problem you are experiencing? We have dozens of Redis in 3 servers. Every service is robust, except one crashes frequently. We have tested 2.2.1/2.2.2/2.2.4, none solves this problem.

What steps will reproduce the problem? Not very clearly. It seems to crash randomly.

Do you have an INFO output? Please past it here.

If it is a crash, can you please paste the stack trace that you can find in the log file or on standard output? This is really useful for us! [25846] 01 May 20:46:48 # ======= Ooops! Redis 2.2.4 got signal: -11- ======= [25846] 01 May 20:46:48 # redis_version:2.2.4 redis_git_sha1:00000000 redis_git_dirty:0 arch_bits:64 multiplexing_api:epoll process_id:25846 uptime_in_seconds:775965 uptime_in_days:8 lru_clock:401976 used_cpu_sys:12823.11 used_cpu_user:9897.50 used_cpu_sys_childrens:0.00 used_cpu_user_childrens:0.00 connected_clients:12 connected_slaves:0 client_longest_output_list:1 client_biggest_input_buf:0 blocked_clients:0 used_memory:3046435192 used_memory_human:2.84G used_memory_rss:3699609600 mem_fragmentation_ratio:1.21 use_tcmalloc:0 loading:0 aof_enabled:0 changes_since_last_save:98054217 bgsave_in_progress:0 last_save_time:1303478043 total_connections_received:28092933 total_commands_processed:1684214170 expired_keys:278925 evicted_keys:0 keyspace_hits:1674902439 keyspace_misses:7533554 hash_max_zipmap_entries:512 hash_max_zipmap_value:512 pubsub_channels:0 pubsub_patterns:0 vm_enabled:0 role:master allocation_stats:2=6449197,6=139,7=79,8=4322558,9=614635210,10=657357700, [25846] 01 May 20:46:48 # /home/redis3/bin/redis-server(ziplistNext+0x1d) [0x4141cd] [25846] 01 May 20:46:48 # /home/redis3/bin/redis-server(ziplistNext+0x1d) [0x4141cd] [25846] 01 May 20:46:48 # /home/redis3/bin/redis-server(listTypeNext+0xa0) [0x41da20] [25846] 01 May 20:46:48 # /home/redis3/bin/redis-server(listTypeConvert+0x66) [0x41dad6] [25846] 01 May 20:46:48 # /home/redis3/bin/redis-server(listTypePush+0xea) [0x41e03a] [25846] 01 May 20:46:48 # /home/redis3/bin/redis-server(pushGenericCommand+0xac) [0x41e83c] [25846] 01 May 20:46:48 # /home/redis3/bin/redis-server(call+0x23) [0x40db13] [25846] 01 May 20:46:48 # /home/redis3/bin/redis-server(processCommand+0x202) [0x40ddc2] [25846] 01 May 20:46:48 # /home/redis3/bin/redis-server(processInputBuffer+0x54) [0x415f14] [25846] 01 May 20:46:48 # /home/redis3/bin/redis-server(readQueryFromClient+0x6b) [0x415fdb] [25846] 01 May 20:46:48 # /home/redis3/bin/redis-server(aeProcessEvents+0x173) [0x40aa83] [25846] 01 May 20:46:48 # /home/redis3/bin/redis-server(aeMain+0x2e) [0x40ac9e] [25846] 01 May 20:46:48 # /home/redis3/bin/redis-server(main+0xed) [0x40f84d] [25846] 01 May 20:46:48 # /lib64/libc.so.6(__libc_start_main+0xf4) [0x3789e1d994] [25846] 01 May 20:46:48 # /home/redis3/bin/redis-server [0x40a079]

Please provide any additional information below.

Comment #1

Posted on May 5, 2011 by Happy Elephant

Thanks for the report. The stack trace shows that this happens while the list is being converted from a ziplist to a linked list. Can you tell some more about the average length of your lists and length of the elements inside it? Also, could you provide INFO output and a dump of the configuration file? I'm investigating in the mean time.

Comment #2

Posted on May 5, 2011 by Happy Elephant

You mention the crashes happen randomly, does this mean with the same stack trace or not? When one of the three machines is showing problems, while every machine is subject to the same access patterns, this might hav something to do with corrupt memory of something like that. Do the machines have ECC memory? When the stack trace is identical across crashes, we can assume there's a bug in the code somewhere and can work from there. Could you post these details?

Ps. Never mind the INFO output: I overlooked it for the previous post, and went immediately to the stack trace ;-).

Comment #3

Posted on May 5, 2011 by Happy Camel

thanks for ur answer, im the nuaaknighte's colleague who works with him on Redis, and we guess the problem might be that when the lengh of a list over the conf which we set, then it should be unzipped to a normal date type which u mentioned in the doc, and then because of some op this list became shorter than the conf it will be zipped again, the problem happened. thats only our guess, and i hope it will help u to identify and to solve it.

Comment #4

Posted on May 5, 2011 by Grumpy Dog

It seems like there is a problem in the ziplist implementation or in the conversion from ziplist to normal list, but to confirm this it will be very useful to have some past back trace as Pieter suggested. If the stack trace always looks similar we can narrow our investigation to the ziplist implementation and/or to the ziplist -> linkedlist conversion.

Thanks for your help! Salvatore

Comment #5

Posted on May 5, 2011 by Happy Camel

there is another thing i have to mention, nuaaknighte said 'It seems to crash randomly.' he means the time of it happened is randomly but not the op commands. actually, last time (yesterday)before it crashed we have dumped the data, but during we restore the data it crashed again. And we tried several times to restore the data, but it crashed everytime when it restored in similer percent, so we guess there is might be some special commands cause it to be crashed, in the last, thank u for reading it again.

Comment #6

Posted on May 5, 2011 by Grumpy Dog

Give us this dump that cause the crash and we can fix the whole problem in minutes :)

Comment #7

Posted on May 5, 2011 by Happy Camel

We have tested 2.2.1/2.2.2/2.2.4, none solves this problem. but the only one server crashed. the different between them is: only the crashed redis has to operate overlength list.

Comment #8

Posted on May 5, 2011 by Happy Camel

T_T badly, we dont have that dump file which cause the crash because it was overwrited after we restart the redis...

Comment #9

Posted on May 5, 2011 by Grumpy Dog

ok it was helpful but not a 'sine qua none' condition to solve the problem. Will just be a bit harder. Please can you send use the exact redis-server 2.2.4 binary that generated the stack trace you posted above? This can be helpful in order to turn the offsets in the stack trace into source code lines.

Comment #10

Posted on May 5, 2011 by Grumpy Dog

Pieter: I just disassembled the same version in the Linux box and I think I got the same offsets of our user's binary. The segfault appears to be here:

628 p = ZIPLIST_ENTRY_TAIL(zl); 629 return (p[0] == ZIP_END) ? NULL : p;

Basically p points to an invalid address, so p[0] causes the exception:

0x0000000000416320 : push %rbx 0x0000000000416321 : mov %rsi,%rbx 0x0000000000416324 : sub $0x20,%rsp 0x0000000000416328 : cmpb $0xff,(%rsi) 0x000000000041632b : je 0x416350 0x000000000041632d : mov %rsp,%rdi 0x0000000000416330 : callq 0x416140 0x0000000000416335 : mov 0xc(%rsp),%eax 0x0000000000416339 : add 0x10(%rsp),%eax 0x000000000041633d : lea (%rbx,%rax,1),%rax 0x0000000000416341 : cmpb $0xff,(%rax)

In the stack trace we have:

[25846] 01 May 20:46:48 # /home/redis3/bin/redis-server(ziplistNext+0x1d) [0x4141cd]

and ziplistNext+0x1d is what you see in the disassemble at ziplistNext+29

For some lame reason the stack trace uses hexadecimal offsets and gdb decimal ones... :)

Not enough to fix the problem but I wanted to show this info with you.

Salvatore

Comment #11

Posted on May 5, 2011 by Grumpy Dog

@nuaaknighte: please could you provide a list of commands used against lists?

For instance do you use LREM, LINSERT, LPUSH, RPUSH, ... or what? The list of all the commands used will be very helpful. Also another useful information is the kind of values you store inside the strings. Are numbers or strings? What is the average size of the string more or less? Thank you.

Salvatore

p.s. please if you obtain a new dump.rdb file that is able to crash the server save it!!! :)

Comment #12

Posted on May 5, 2011 by Grumpy Dog

Another thing we need is the output of:

redis-cli config GET '*'

thank you

Comment #13

Posted on May 5, 2011 by Grumpy Dog

Problem fixed in all the branches: 2.2, 2.4, unstable, 2.2-scripting, scripting.

Releasing 2.2.5 after adding the regression test.

Thanks to the original reporter of the problem and to Pieter Noordhuis for the actual fix!

Salvatore

Comment #14

Posted on May 10, 2011 by Happy Dog

I'm still seeing a very similar crash on 2.2.6:

May 10 22:00:58 domU-12-31-38-00-95-53 redis[25679]: ======= Ooops! Redis 2.2.6 got signal: -11- ======= May 10 22:00:58 domU-12-31-38-00-95-53 redis[25679]: redis_version:2.2.6#015#012redis_git_sha1:00000000#015#012redis_git_dirty:0#015#012arch_bits:32#015#012multiplexing_api:epoll#015#012process_id:25679#015#012uptime_in_seconds:1762#015#012uptime_in_days:0#015#012lru_clock:483061#015#012used_cpu_sys:0.95#015#012used_cpu_user:0.60#015#012used_cpu_sys_childrens:3.06#015#012used_cpu_user_childrens:0.58#015#012connected_clients:6#015#012connected_slaves:0#015#012client_longest_output_list:0#015#012client_biggest_input_buf:0#015#012blocked_clients:1#015#012used_memory:43397212#015#012used_memory_human:41.39M#015#012used_memory_rss:29003776#015#012mem_fragmentation_ratio:0.67#015#012use_tcmalloc:0#015#012loading:0#015#012aof_enabled:0#015#012changes_since_last_save:106#015#012bgsave_in_progress:0#015#012last_save_time:1305064608#015#012bgrewriteaof_in_progress:0#015#012total_connections_received:42#015#012total_commands_processed:2866#015#012expired_keys:0#015#012evicted_keys:0#015#012keyspace_hits:836#015#012keyspace_misses:1941#015#012hash_max_zipmap_entries:512#015#012hash_max_zipmap_value:64#015#012pubsub_channels:1#015#012pubsub_patterns:0#015#012vm_enabled:1#015#012role:master#015#012vm_conf_max_memory:270582939#015#012vm_conf_page_size:32#015#012vm_conf_pages:134217728#015#012vm_stats_used_pages:0#015#012vm_stats_swapped_ob May 10 22:00:58 domU-12-31-38-00-95-53 redis[25679]: /usr/local/bin/redis-server(handleClientsWaitingListPush+0xb6) [0x8068346] May 10 22:00:58 domU-12-31-38-00-95-53 redis[25679]: /usr/local/bin/redis-server(pushGenericCommand+0x131) [0x8068ae1] May 10 22:00:58 domU-12-31-38-00-95-53 redis[25679]: /usr/local/bin/redis-server(lpushCommand+0x19) [0x8068b49] May 10 22:00:58 domU-12-31-38-00-95-53 redis[25679]: /usr/local/bin/redis-server(call+0x25) [0x80555e5] May 10 22:00:58 domU-12-31-38-00-95-53 redis[25679]: /usr/local/bin/redis-server(processCommand+0x260) [0x8055930] May 10 22:00:58 domU-12-31-38-00-95-53 redis[25679]: /usr/local/bin/redis-server(processInputBuffer+0x5f) [0x805e30f] May 10 22:00:58 domU-12-31-38-00-95-53 redis[25679]: /usr/local/bin/redis-server(readQueryFromClient+0x70) [0x805e400] May 10 22:00:58 domU-12-31-38-00-95-53 redis[25679]: /usr/local/bin/redis-server(aeProcessEvents+0x132) [0x80520a2] May 10 22:00:58 domU-12-31-38-00-95-53 redis[25679]: /usr/local/bin/redis-server(aeMain+0x37) [0x80522b7] May 10 22:00:58 domU-12-31-38-00-95-53 redis[25679]: /usr/local/bin/redis-server(main+0x112) [0x8057752] May 10 22:00:58 domU-12-31-38-00-95-53 redis[25679]: /lib/tls/i686/nosegneg/libc.so.6(__libc_start_main+0xe6) [0xb756bbd6] May 10 22:00:58 domU-12-31-38-00-95-53 redis[25679]: /usr/local/bin/redis-server() [0x80516a1]

$ uname -a Linux 2.6.32-312-ec2 #24-Ubuntu SMP Fri Jan 7 18:31:34 UTC 2011 i686 GNU/Linux

Attached are the server binary, the configuration used and the database as it was on disk after the crash.

Attachments

Comment #15

Posted on May 11, 2011 by Grumpy Dog

Thanks for reporting, this looks like a different issue. Do you use BLPOP or other blocking list operations? Thanks for everything you attached, this is useful.

Comment #16

Posted on May 11, 2011 by Grumpy Dog

The bug is now fixed, sorry for the inconvenient.

You can either use latest 2.2 branch on github, or simply download 2.2.6 and remove list 820 from t_list.c:

Where there is:

    unblockClientWaitingData(receiver);
    redisAssert(ln != listFirst(clients));

Make sure it will actually be just the first line:

    unblockClientWaitingData(receiver);
    // redisAssert(ln != listFirst(clients));

(In the above example I just commented the line, but you can remove it if you wish).

Redis 2.2.7 with this fix will be released today.

Comment #17

Posted on May 11, 2011 by Grumpy Dog

Sorry I mean "line 820" not "list 820"...

Status: New

Labels:
Type-Defect Priority-Medium