Export to GitHub

memcached - issue #404

memcached 1.4.23 fails test binary_prependq on Solaris 10 Sparc 64 Bit


Posted on Apr 20, 2015 by Happy Camel

What steps will reproduce the problem?

Building memcached 1.4.23 and running the testsuite

What is the expected output? What do you see instead?

These are the last lines from the testsuite: ok 46 - binary_prepend ok 47 - binary_prependq Timeout.. killing the process Assertion failed: nr != 0, file testapp.c, line 725 Makefile:1482: recipe for target 'test' failed gmake: * [test] Abort (core dumped) gmake: Leaving directory '/home/dam/mgar/pkg/memcached/trunk/work/solaris10-sparc/build-isa-sparcv9/memcached-1.4.23'

What version of the product are you using? On what operating system? 1.4.23, Solaris 10 Sparc with Sun Studio 12

Please provide any additional information below.

Comment #1

Posted on Apr 20, 2015 by Happy Camel

There are also some different tests failing on Solaris 10 amd64:

slab class 42: chunk size 1048576 perslab 1 <24 server listening (binary) <27 server listening (binary) <28 send buffer was 57344, now 2097152 <32 send buffer was 57344, now 2097152 <31 server listening (udp) <28 server listening (udp) <30 server listening (udp) <29 server listening (udp) <35 server listening (udp) <32 server listening (udp) <34 server listening (udp) <33 server listening (udp) <36 new binary client connection. <36 connection closed. Invalid value for binding protocol: http -- should be one of auto, binary, or ascii t/00-startup.t ....... 17/18 Number of threads must be greater than 0 t/00-startup.t ....... ok
t/64bit.t ............ ok
t/binary-get.t ....... ok
t/binary-sasl.t ...... This server is not built with SASL support. t/binary-sasl.t ...... ok
t/binary.t ........... 1371/3639

Failed test at t/binary.t line 422.

got: '1024'

expected: '0'

t/binary.t ........... 3621/3639 # Looks like you failed 1 test of 3639. t/binary.t ........... Dubious, test returned 1 (wstat 256, 0x100) Failed 1/3639 subtests t/bogus-commands.t ... ok
t/cas.t .............. ok
t/daemonize.t ........ ok
t/dash-M.t ........... ok
t/evictions.t ........ ok
t/expirations.t ...... ok
t/flags.t ............ ok
t/flush-all.t ........ ok
t/getset.t ........... ok
t/incrdecr.t ......... ok
t/issue_104.t ........ ok
t/issue_108.t ........ ok
t/issue_14.t ......... ok
t/issue_140.t ........ skipped: Fix for Issue 140 was only an illusion t/issue_152.t ........ ok
t/issue_163.t ........ ok
t/issue_183.t ........ ok
t/issue_192.t ........ ok
t/issue_22.t ......... ok
t/issue_260.t ........ skipped: Only possible to test #260 under artificial conditions t/issue_29.t ......... ok
t/issue_3.t .......... ok
t/issue_41.t ......... ok
t/issue_42.t ......... ok
t/issue_50.t ......... ok
t/issue_61.t ......... ok
t/issue_67.t ......... ok
t/issue_68.t ......... ok
t/issue_70.t ......... ok
t/item_size_max.t .... 1/7 Item max size cannot be less than 1024 bytes. t/item_size_max.t .... 2/7 Cannot set item size limit higher than 128 mb. t/item_size_max.t .... 3/7 WARNING: Setting item max size above 1MB is not recommended! Raising this limit increases the minimum memory requirements and will decrease your memory efficiency. t/item_size_max.t .... ok
t/line-lengths.t ..... ok
t/lru-crawler.t ...... ok
t/lru-maintainer.t ... 11/224

Failed test 'moved some items to cold'

at t/lru-maintainer.t line 33.

got: '0'

expected: anything else

Looks like you failed 1 test of 224.

t/lru-maintainer.t ... Dubious, test returned 1 (wstat 256, 0x100) Failed 1/224 subtests t/lru.t .............. ok
t/maxconns.t ......... ok
t/multiversioning.t .. ok
t/noreply.t .......... ok
t/refhang.t .......... ok
t/slabs_reassign.t ... ok
t/stats-conns.t ...... ok
t/stats-detail.t ..... ok
t/stats.t ............ 1/97

Failed test at t/stats.t line 178.

got: '1024'

expected: '0'

Looks like you failed 1 test of 97.

t/stats.t ............ Dubious, test returned 1 (wstat 256, 0x100) Failed 1/97 subtests t/touch.t ............ ok
t/udp.t .............. ok
t/unixsocket.t ....... ok
t/whitespace.t ....... fatal: Not a git repository (or any parent up to mount point /home/dam) Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set). t/whitespace.t ....... skipped: Skipping tests probably because you don't have git.

Test Summary Report

t/binary.t (Wstat: 256 Tests: 3639 Failed: 1) Failed test: 1625 Non-zero exit status: 1 t/lru-maintainer.t (Wstat: 256 Tests: 224 Failed: 1) Failed test: 43 Non-zero exit status: 1 t/stats.t (Wstat: 256 Tests: 97 Failed: 1) Failed test: 74 Non-zero exit status: 1 Files=51, Tests=7496, 192 wallclock secs ( 3.70 usr 0.61 sys + 14.37 cusr 6.69 csys = 25.37 CPU) Result: FAIL

Comment #2

Posted on Apr 23, 2015 by Happy Bear

I'm seeing this on Arch Linux as well.

This is similar to an earlier issue I have seen in that I think it might have to do with hardware lock elision: https://groups.google.com/d/msg/memcached/Tw6t_W-a6Xc/lXgz8LQ_vS0J

It doesn't lock up on my much older x86_64 desktop machine, but it does lock up on our build box. gdb output below of the hung process during the binary_prependq test. Notice Thread 3 especially.

$ sudo gdb -p 10999 GNU gdb (GDB) 7.9 Copyright (C) 2015 Free Software Foundation, Inc. License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html This is free software: you are free to change and redistribute it. There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details. This GDB was configured as "x86_64-unknown-linux-gnu". Type "show configuration" for configuration details. For bug reporting instructions, please see: http://www.gnu.org/software/gdb/bugs/. Find the GDB manual and other documentation resources online at: http://www.gnu.org/software/gdb/documentation/. For help, type "help". Type "apropos word" to search for commands related to "word". Attaching to process 10999 Reading symbols from /home/dan/svn-packages/memcached/trunk/src/memcached-1.4.23/memcached-debug...done. Reading symbols from /usr/lib/libevent-2.0.so.5...(no debugging symbols found)...done. Reading symbols from /usr/lib/libpthread.so.0...(no debugging symbols found)...done. [New LWP 11004] [New LWP 11003] [New LWP 11002] [New LWP 11001] [New LWP 11000] [Thread debugging using libthread_db enabled] Using host libthread_db library "/usr/lib/libthread_db.so.1". Reading symbols from /usr/lib/libc.so.6...(no debugging symbols found)...done. Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done. 0x00007fad9dec6883 in epoll_wait () from /usr/lib/libc.so.6 (gdb) thread apply all backtrace

Thread 6 (Thread 0x7fad9dddd700 (LWP 11000)):

0 0x00007fad9dec6883 in epoll_wait () from /usr/lib/libc.so.6

1 0x00007fad9e3c34d8 in ?? () from /usr/lib/libevent-2.0.so.5

2 0x00007fad9e3ae61a in event_base_loop () from /usr/lib/libevent-2.0.so.5

3 0x000000000041add4 in worker_libevent (arg=0x1f99480) at thread.c:379

4 0x00007fad9e188374 in start_thread () from /usr/lib/libpthread.so.0

5 0x00007fad9dec627d in clone () from /usr/lib/libc.so.6

Thread 5 (Thread 0x7fad9d5dc700 (LWP 11001)):

0 0x00007fad9dec6883 in epoll_wait () from /usr/lib/libc.so.6

1 0x00007fad9e3c34d8 in ?? () from /usr/lib/libevent-2.0.so.5

2 0x00007fad9e3ae61a in event_base_loop () from /usr/lib/libevent-2.0.so.5

3 0x000000000041add4 in worker_libevent (arg=0x1f9a5c8) at thread.c:379

4 0x00007fad9e188374 in start_thread () from /usr/lib/libpthread.so.0

5 0x00007fad9dec627d in clone () from /usr/lib/libc.so.6

Thread 4 (Thread 0x7fad9cddb700 (LWP 11002)):

0 0x00007fad9dec6883 in epoll_wait () from /usr/lib/libc.so.6

1 0x00007fad9e3c34d8 in ?? () from /usr/lib/libevent-2.0.so.5

2 0x00007fad9e3ae61a in event_base_loop () from /usr/lib/libevent-2.0.so.5

3 0x000000000041add4 in worker_libevent (arg=0x1f9b710) at thread.c:379

4 0x00007fad9e188374 in start_thread () from /usr/lib/libpthread.so.0

5 0x00007fad9dec627d in clone () from /usr/lib/libc.so.6

Thread 3 (Thread 0x7fad9c5da700 (LWP 11003)):

0 0x00007fad9e19064c in __lll_lock_wait () from /usr/lib/libpthread.so.0

1 0x00007fad9e193090 in __lll_lock_elision () from /usr/lib/libpthread.so.0

2 0x00000000004183c9 in item_stats_totals (add_stats=add_stats@entry=0x4067d0 , c=c@entry=0x7fad940008c0) at items.c:506

3 0x0000000000414c57 in get_stats (stat_type=stat_type@entry=0x0, nkey=nkey@entry=0, add_stats=add_stats@entry=0x4067d0 , c=c@entry=0x7fad940008c0) at slabs.c:309

4 0x000000000040cee0 in process_bin_stat (c=0x7fad940008c0) at memcached.c:1514

5 complete_nread_binary (c=0x7fad940008c0) at memcached.c:2247

6 complete_nread (c=c@entry=0x7fad940008c0) at memcached.c:2293

7 0x00000000004101d0 in drive_machine (c=0x7fad940008c0) at memcached.c:4179

8 event_handler (fd=, which=, arg=0x7fad940008c0) at memcached.c:4386

9 0x00007fad9e3aeca6 in event_base_loop () from /usr/lib/libevent-2.0.so.5

10 0x000000000041add4 in worker_libevent (arg=0x1f9c858) at thread.c:379

11 0x00007fad9e188374 in start_thread () from /usr/lib/libpthread.so.0

12 0x00007fad9dec627d in clone () from /usr/lib/libc.so.6

Thread 2 (Thread 0x7fad9bdd9700 (LWP 11004)):

0 0x00007fad9e18d9af in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0

1 0x000000000041a510 in assoc_maintenance_thread (arg=) at assoc.c:261

2 0x00007fad9e188374 in start_thread () from /usr/lib/libpthread.so.0

3 0x00007fad9dec627d in clone () from /usr/lib/libc.so.6

Thread 1 (Thread 0x7fad9e7f2700 (LWP 10999)):

0 0x00007fad9dec6883 in epoll_wait () from /usr/lib/libc.so.6

1 0x00007fad9e3c34d8 in ?? () from /usr/lib/libevent-2.0.so.5

2 0x00007fad9e3ae61a in event_base_loop () from /usr/lib/libevent-2.0.so.5

3 0x0000000000404eea in main (argc=0, argv=0x7fff82d49fd5) at memcached.c:5724

ok 47 - binary_prependq Timeout.. killing the process testapp: testapp.c:725: safe_recv: Assertion `nr != 0' failed. Makefile:1482: recipe for target 'test' failed make: * [test] Aborted

processor : 7 vendor_id : GenuineIntel cpu family : 6 model : 60 model name : Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz stepping : 3 microcode : 0x12 flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm ida arat pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm xsaveopt

Comment #3

Posted on Apr 24, 2015 by Helpful Bird

Can you folks please try https://github.com/memcached/memcached (master branch) and ping this issue on if it repairs your problem completey? All tests should run.

Comment #4

Posted on Apr 24, 2015 by Happy Bear

I ran the same build/test/package steps that triggered the above, pulling the one (369845f086) patch from git and applying it, and we seem to be in business. All tests pass, no hangups or timeouts anymore.

Comment #5

Posted on Apr 24, 2015 by Happy Camel

I also just patched in the latest commit f086 to the tarball.

The test suite now passes on sparcv9 cleanly, the only issue on amd64 left is

t/lru-maintainer.t ... 11/224

Failed test 'moved some items to cold'

at t/lru-maintainer.t line 33.

got: '0'

expected: anything else

t/lru-maintainer.t ... 195/224 # Looks like you failed 1 test of 224. t/lru-maintainer.t ... Dubious, test returned 1 (wstat 256, 0x100) Failed 1/224 subtests

Comment #6

Posted on Apr 24, 2015 by Happy Camel

BTW, what happened to the buildfarm? https://code.google.com/p/memcached/wiki/BuildFarm The waterfall display is stuck for me.

As I already run a buildbot instance on our farm I can gladly provide some slaves on Solaris: http://buildfarm.opencsw.org/buildbot/waterfall

Comment #7

Posted on Apr 24, 2015 by Helpful Bird

buildfarm's been dead a long while... it requires quite a lot of time to keep maintained.

honkman42: That is failing a test under solaris amd64? Can you attach the output of "prove -v t/lru-maintainer.t" please? Does it fail every time or just sometimes?

Comment #8

Posted on Apr 25, 2015 by Helpful Bird

Another push to master for a workaround in the test. Solaris seems to not handle the background juggler's sleeps in the same way. I'll look into it more, but need to cut another release since the off by one is pretty serious.

Comment #9

Posted on Apr 25, 2015 by Helpful Bird

1.4.24 is out. please try if you haven't already.

Comment #10

Posted on Apr 27, 2015 by Happy Camel

Yes, 1.4.24 looks good, all tests pass on Solaris 10 Sparc + x86, both in 32 and 64 bit.

I also added memcached to my Solaris buildbot and 'master' with some more options enabled: https://buildfarm.opencsw.org/buildbot/waterfall?category=memcached Now t/whitespace.t is failing sporadically as you can see from the above logs. Should I open a separate bug report for that?

Also I noticed that some tests failed when I run 32 and 64 bit tests in parallel, I guess I need to serialize it per host?

If considered useful I could also enable notifications on build failure via mail or irc or such.

Thanks! -- Dago

Comment #11

Posted on Apr 27, 2015 by Helpful Bird

please open other issues for those things. you can have the thing e-mail me I guess? don't want it to spam a public place just yet.

Comment #12

Posted on Apr 27, 2015 by Happy Camel

Sure, I added your email to the list of notifiers. Just let me know if you need adjusted settings.

Status: Fixed

Labels:
Type-Defect Priority-Medium