
xmemcached - issue #194
Spontaneous batches of TimeoutExceptions after a few hours of running
There is a platform which uses xmemcached under a small load (about 1000-2000 transactions per second). Everything works fine for some time; then, after a few hours, a batch (~100-200) of TimeoutExceptions appears in the log:
2012-04-27 04:40:18,447 [ER-3] ERROR mem-api.cache : XMemcachedClientConnector.get() key:DRT:pcef1.Gx;1317897826;24601921, getBeforeCount:49206377, getAfterCount:49206368, getDeltaCount:-9, getLatencyMax:126, getLatencyAvg:0.21611751552156827
2012-04-27 04:40:18,448 [ER-3] WARN sm.generated.TestApplication : #### Unexpected exception caught: java.util.concurrent.TimeoutException: Timed out(1000) waiting for operation
java.lang.RuntimeException: java.util.concurrent.TimeoutException: Timed out(1000) waiting for operation
    at com.test.memcache.impl.XMemcachedClientConnector.get(XMemcachedClientConnector.java:91)
    at com.test.memcache.impl.MemCacheImpl.get(MemCacheImpl.java:223)
    at com.test.app.d2d.MemCacheManager.loadSessionState(MemCacheManager.java:55)
Then everything works fine for another few hours, until the next batch of TimeoutExceptions. I took a tcpdump during the TimeoutExceptions, and according to it there are no packets towards repcached corresponding to the timed-out "get" operations. This means that xmemcached retains the application's get commands and doesn't send them to repcached.
I also took a thread dump during the TimeoutExceptions. All application threads (ER-[number]) are waiting in latchWait():
"ER-8" prio=10 tid=0x00002aaabd0f5000 nid=0x43cb waiting on condition [0x00000000446d6000] java.lang.Thread.State: TIMED_WAITING (parking) at sun.misc.Unsafe.park(Native Method) - parking to wait for <0x000000077f570b98> (a java.util.concurrent.CountDownLatch$Sync) at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:198) at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1011) at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1303) at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:253) at net.rubyeye.xmemcached.XMemcachedClient.latchWait(XMemcachedClient.java:2526) at net.rubyeye.xmemcached.XMemcachedClient.fetch0(XMemcachedClient.java:574) at net.rubyeye.xmemcached.XMemcachedClient.get0(XMemcachedClient.java:971) at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:929) at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:940) at net.rubyeye.xmemcached.XMemcachedClient.get(XMemcachedClient.java:962) at com.test.memcache.impl.XMemcachedClientConnector.get(XMemcachedClientConnector.java:61)
GC pauses before and during the TimeoutExceptions are less than 1 second. Please find attached the gc.log, the application log, TCP dumps from before and after the TimeoutExceptions, and thread dumps from before and after the TimeoutExceptions.
Version of the product:
- memcached 1.4.10 with the repcached patch: http://mdounin.ru/files/repcached-2.3.1-1.4.10.patch.gz
- xmemcached-1.3.6
- OS: RHEL 5.5
- Java 1.7u3
- 2 CPUs E5462 @ 2.80GHz, 8 HW threads in total
- xmemcached-timeout.zip 2.2MB
Comment #1
Posted on May 6, 2012 by Happy Lion
I am sorry, I am on holiday right now.
Are you sure that repcached was running well when this exception was thrown? Or may it have been blocked at that moment?
From your description, I can't give you a reasonable explanation.
Comment #2
Posted on May 6, 2012 by Swift Hippo
No, repcached is not blocked. If repcached were the problem, I would see outgoing GET requests in the TCP dump without the corresponding responses. Instead, the TCP dump shows neither outgoing GET requests nor responses.
Also, we replaced xmemcached with spymemcached, and everything works just fine. The issue is in xmemcached.
Comment #3
Posted on May 6, 2012 by Happy Lion
When you used spymemcached, did you still set the operation timeout to one second via future.get?
Comment #4
Posted on May 6, 2012 by Happy Lion
Seeing neither outgoing GET requests nor responses in the TCP dump doesn't mean repcached had no problem. There is a situation where xmemcached could not send requests to repcached for some reason.
Also, have you checked the JVM heap at that moment?
Comment #5
Posted on May 6, 2012 by Happy Lion
What's your xmemcached configuration? I will try to reproduce it on my machine. Thanks.
Comment #6
Posted on May 6, 2012 by Happy Lion
And did you use noreply operations, or any methods that you don't use with spymemcached?
Comment #7
Posted on May 6, 2012 by Swift Hippo"When you used spymemcached,you set the operation timeout still to be one second by future.get?" Yes, timeout is the same 1 second. We have a wrapper component on top of memcached client, which allows us to choose either xmemcached or spymemcached. Everything (call flow, configuration, ...) is the same.
"You didn't see in TCP dump neither outgoing GET requests, no responses,but it didn't mean repcached was no problem.There is a situation that xmemcached could not send requests to repcached for some reason." In this case why spymemcached is able to send requests, when xmemcached is not able?
"I don't know have you checked the JVM heap at that moment?" There is a gc.log in the attachment above. All stop-the-world pauses are much less than 1 second.
"And did you use noreply operation,or some difference methods that you don't use in spymemcached?" there is no difference from application point of view. In both xmemcached and spymemcached configurations the application is doing "GET" and "SET noreply" operations.
Comment #8
Posted on May 6, 2012 by Happy Lion
What's the meaning of this: "2012-04-27 04:40:34,715 [ER-2] ERROR mem-api.cache : XMemcachedClientConnector.get() key:DRT:pcef1.Gx;1317897826;24602895, getBeforeCount:49208437, getAfterCount:49208149, getDeltaCount:-288, getLatencyMax:126, getLatencyAvg:0.21612556489373336"?
Does it mean that get returned a wrong value?
And are you using the memcached text protocol or the binary one?
What are your key and value, for example?
Comment #9
Posted on May 6, 2012 by Swift Hippo"What's the meaning of this" This is the code, it increments counters before and after "get" operation, also latency is mesaured:
// getBeforeCount, getAfterCount, getLatencyMax and getLatencySum are
// presumably java.util.concurrent.atomic.AtomicLong fields.
public Object get(String key) {
    try {
        getBeforeCount.incrementAndGet();
        long before = System.currentTimeMillis();
        Object result = client.get(key);
        long latency = System.currentTimeMillis() - before;
        if (getLatencyMax.get() < latency) {
            getLatencyMax.set(latency);
        }
        getLatencySum.addAndGet(latency);
        getAfterCount.incrementAndGet();
        return result;
    } catch (Exception e) {
        long lGetBeforeCount = getBeforeCount.get();
        long lGetAfterCount = getAfterCount.get();
        log.error("XMemcachedClientConnector.get() key:" + key + ", getBeforeCount:" + lGetBeforeCount
                + ", getAfterCount:" + lGetAfterCount + ", getDeltaCount:" + (lGetAfterCount - lGetBeforeCount)
                + ", getLatencyMax:" + getLatencyMax.get()
                + ", getLatencyAvg:" + ((double) getLatencySum.get() / getAfterCount.get()));
        throw new RuntimeException(e);
    }
}
"And are you using the memcached text protocol or the binary one?" Text protocol, but the issue also reproduces with the binary protocol.
"What are your key and value, for example?"
key: DRT:pcef1.Gx;1317897826;5360
value: 1%server;18050538;2187704472|2%server;18050538;2187704473|
Comment #10
Posted on May 6, 2012 by Happy Lion
Do you have any periodic tasks? If you do, were these tasks executing while the timeout exceptions happened?
Comment #11
Posted on May 6, 2012 by Swift Hippo
No periodic tasks.
Comment #12
Posted on May 6, 2012 by Happy Lion
As far as I know, spymemcached doesn't have a noreply operation API; how do you use it?
Comment #13
Posted on May 6, 2012 by Happy Lion
I found many "omf.overload-manager : System OverloadDetected event:" logs in system.log.
Does that mean your system is overloaded? What was the TPS of your set-noreply operations at that moment?
Xmemcached's noreply operations just put set commands into a queue in the connection; if there are many set-noreply requests at one time, the queue may become full and get operations cannot be sent to memcached.
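For reference, a hedged sketch of the two write paths being compared here (the key, expiry, and value are illustrative assumptions, client is a built xmemcached MemcachedClient, and exception handling is omitted):

// setWithNoReply returns as soon as the command is queued in the connection,
// so the caller gets no backpressure if the write queue keeps growing.
client.setWithNoReply("DRT:example-key", 3600, "example-value");

// A plain set waits for the server's reply, bounded by the operation timeout.
client.set("DRT:example-key", 3600, "example-value");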
Comment #14
Posted on May 6, 2012 by Swift Hippo"As i know that spymemcached doesn't have noreply operation API,how do you use it?" There is a method net.spy.memcached.MemcachedClient: public Future set(String key, int exp, Object o) We call it and don't use "Future" object, so it is almost like "noreply" operation.
"I found that many " omf.overload-manager : System OverloadDetected event:" logs in system.log. It that means your system is over load?"
It means that incoming queue size for working thread pool reached 5000 threshold. And that happended after first TimeoutExceptions, that happened because of TimeoutExceptions (and not vice versa), because xmemcached was holding working threads for 1 second.
"What's the TPS of your set-noreply operations at that moment?"
it was 200 TPS, the same as TPS of "get" operations.
Comment #15
Posted on May 7, 2012 by Swift Hippo"What's your xmemcached configuration,i will try to reproduce it on my machine.Thanks."
//singleAdresList corresponds to "127.0.0.1:8091"
XMemcachedClientBuilder builder = new XMemcachedClientBuilder(singleAddressList);
builder.setSocketOption(StandardSocketOption.TCP_NODELAY, true);
builder.setConnectionPoolSize(8);
builder.setFailureMode(true);
MemcachedClient client = builder.build();
Comment #16
Posted on May 7, 2012 by Happy Lion
Thank you.
I will try my best to reproduce this issue. And if you can, please use set instead of setWithNoReply in your application and run it again, to check whether the issue happens again.
I think the setWithNoReply method may be the root cause.
Comment #17
Posted on Jul 17, 2012 by Happy Lion
Hi, it seems that this issue is caused by too many noreply operations. Because a noreply operation doesn't wait for a response from the server, it is just put into a queue in application memory, waiting to be sent to memcached. If there are too many noreply operations in the queue, they will block the other operations and make their response times too large.
In xmemcached 1.3.8 I've added a flow-control mechanism for noreply operations: use the setMaxQueuedNoReplyOperations method on the builder or factory bean to set the maximum number of noreply operations queued in memory. The default value is based on your JVM's maximum memory (the -Xmx option).
If you still want to try xmemcached, please try 1.3.8. Thanks.
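A minimal sketch of enabling that flow control on the builder (the address reuses the one from comment #15; the limit of 10000 is an illustrative assumption, not a recommended value):

import net.rubyeye.xmemcached.MemcachedClient;
import net.rubyeye.xmemcached.XMemcachedClientBuilder;
import net.rubyeye.xmemcached.utils.AddrUtil;

XMemcachedClientBuilder builder =
        new XMemcachedClientBuilder(AddrUtil.getAddresses("127.0.0.1:8091"));
// Cap how many noreply operations may queue up in memory, so a burst of
// setWithNoReply calls cannot grow the write queue without bound and
// starve the get operations waiting behind it.
builder.setMaxQueuedNoReplyOperations(10000);
MemcachedClient client = builder.build();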
Comment #18
Posted on Aug 3, 2012 by Swift Hippo
We worked around this issue by replacing "setWithNoReply" with "set".
"Hi, it seems that this issue is caused by too many noreply operations. Because a noreply operation doesn't wait for a response from the server, it is just put into a queue in application memory, waiting to be sent to memcached. If there are too many noreply operations in the queue, they will block the other operations and make their response times too large."
That is really strange, because in my use case there are the same number of "get" and "setWithNoReply" operations. A "get" operation waits for its response, while a "setWithNoReply" operation doesn't, so "setWithNoReply" is expected to be faster. Instead we see that the "setWithNoReply" operations get stuck while the "get" operations don't.
"In xmemcached 1.3.8 I've added a flow-control mechanism for noreply operations: use the setMaxQueuedNoReplyOperations method on the builder or factory bean to set the maximum number of noreply operations queued in memory. The default value is based on your JVM's maximum memory (the -Xmx option)."
I think the root cause of the issue is a concurrency bug in xmemcached, and a flow-control mechanism doesn't fix that bug.
Comment #19
Posted on Mar 15, 2013 by Helpful Giraffe
We too see this problem after a few hours of operation. We used 1.3.8 and upgraded yesterday to 1.4.1, but the problem still exists.
java.util.concurrent.TimeoutException: Timed out(5000) waiting for operation
    at net.rubyeye.xmemcached.XMemcachedClient.latchWait(XMemcachedClient.java:2565)
    at net.rubyeye.xmemcached.XMemcachedClient.getAndTouch(XMemcachedClient.java:2016)
    at net.rubyeye.xmemcached.XMemcachedClient.getAndTouch(XMemcachedClient.java:2029)
We use this to create the connection(s):
MemcachedClientBuilder builder = new XMemcachedClientBuilder(AddrUtil.getAddresses(cacheConnection));
builder.setCommandFactory(new BinaryCommandFactory()); // use binary protocol
_memcachedClient = builder.build();
and then use it this way:
if (_hasGetAndTouch) {
    // Unfortunately getAndTouch is not implemented on the server side
    retVal = _mClient.getAndTouch(finalKey, MAX_CACHE_LIFETIME);
} else {
    retVal = _mClient.get(finalKey);
}
So as you can see, we have the problem with the getAndTouch command, not with the set command.
Could it be that the connection is idle for too long and is therefore half-closed in some way by either the server or the client? The errors start to appear ~18 hours after we first make contact with the cache, and the connections we establish are held open for reuse.
Comment #20
Posted on Mar 15, 2013 by Helpful Giraffe
Here is a URL with more details on timeouts in memcached
Comment #21
Posted on Mar 15, 2013 by Happy Lion
@andre
Have you tried all the solutions in this document: http://code.google.com/p/memcached/wiki/Timeouts ?
If so, maybe there is an issue in xmemcached; I will have to find it.
If the connection is idle, xmemcached sends a 'version' command to the server to make sure the connection is alive. If the heartbeating fails, it closes the connection and tries to heal it.
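Heartbeating is on by default and can be toggled on the client; a sketch (assuming client is a built MemcachedClient):

// With heartbeating disabled, idle connections are no longer probed
// with 'version' commands; pass true to turn it back on.
client.setEnableHeartBeat(false);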
Comment #22
Posted on Mar 15, 2013 by Helpful Giraffe
Yes, but with no obvious problems found:
- listen_disabled_num is not a problem
- the servers are all 64-bit, are not using swap, and have low load
- dropped packets are not an issue either, since on one of the test servers we have no (local) firewall active
- TIME_WAIT is also not a problem
Comment #23
Posted on Mar 15, 2013 by Happy Lion
@andre
Did you disable the heartbeating?
And can you find any text like "heartbeat fail" in your log?
Finally, please turn on your Java GC log to check whether a full GC was running when the timeout happened.
Comment #24
Posted on Mar 20, 2013 by Helpful Giraffe
No, I neither enabled nor disabled heartbeating.
No, I don't see any log lines containing "heartbeat".
Here is the Tomcat error log:
2013-03-20 03:24:28,837 [webseite2] [Thread-412] ERROR com.aarboard.aarcatcore.cache.xmemcached.XMemcacheCache - Error in cache for key: all
java.util.concurrent.TimeoutException: Timed out(5000) waiting for operation
    at net.rubyeye.xmemcached.XMemcachedClient.latchWait(XMemcachedClient.java:2565)
    at net.rubyeye.xmemcached.XMemcachedClient.getAndTouch(XMemcachedClient.java:2016)
    at net.rubyeye.xmemcached.XMemcachedClient.getAndTouch(XMemcachedClient.java:2029)
    at com.aarboard.aarcatcore.cache.xmemcached.XMemcacheCache.getEntry(XMemcacheCache.java:73)
    at
And the corresponding GC logs:
2013-03-20T03:24:00.628+0100: 69631.200: [GC 69631.200: [ParNew: 26240K->674K(29504K), 0.0366990 secs] 1563479K->1537913K(1569600K) icms_dc=100 , 0.0373240 secs
2013-03-20T03:24:00.907+0100: 69631.478: [GC 69631.479: [ParNew: 26914K->442K(29504K), 0.0321190 secs] 1564153K->1537681K(1569600K) icms_dc=100 , 0.0325630 secs
2013-03-20T03:24:01.134+0100: 69631.705: [GC 69631.706: [ParNew: 26682K->605K(29504K), 0.0299910 secs] 1563921K->1537844K(1569600K) icms_dc=100 , 0.0304360 secs
2013-03-20T03:24:01.345+0100: 69631.917: [GC 69631.917: [ParNew: 26845K->794K(29504K), 0.0300840 secs] 1564084K->1538034K(1569600K) icms_dc=100 , 0.0305380 secs
2013-03-20T03:24:01.565+0100: 69632.137: [GC 69632.137: [ParNew: 27034K->743K(29504K), 0.0355330 secs] 1564274K->1538187K(1569600K) icms_dc=100 , 0.0361180 secs
2013-03-20T03:24:02.230+0100: 69632.802: [GC 69632.802: [ParNew: 26983K->713K(29504K), 0.0311850 secs] 1564427K->1538274K(1569600K) icms_dc=100 , 0.0316440 secs
2013-03-20T03:24:03.049+0100: 69633.621: [GC 69633.621: [ParNew: 26953K->937K(29504K), 0.0320370 secs] 1564514K->1538604K(1569600K) icms_dc=100 , 0.0325100 secs
2013-03-20T03:24:03.857+0100: 69634.429: [GC 69634.429: [ParNew: 27177K->782K(29504K), 0.0304030 secs] 1564844K->1538547K(1569600K) icms_dc=100 , 0.0308760 secs
2013-03-20T03:24:04.719+0100: 69635.290: [GC 69635.301: [ParNew: 27022K->1058K(29504K), 0.0350630 secs] 1564787K->1538913K(1569600K) icms_dc=100 , 0.0356500 sec
2013-03-20T03:24:05.602+0100: 69636.173: [GC 69636.174: [ParNew: 27298K->612K(29504K), 0.0350860 secs] 1565153K->1538577K(1569600K) icms_dc=100 , 0.0355900 secs
2013-03-20T03:24:06.443+0100: 69637.015: [GC 69637.015: [ParNew: 26852K->806K(29504K), 0.0329730 secs] 1564817K->1538881K(1569600K) icms_dc=100 , 0.0334430 secs
2013-03-20T03:24:07.279+0100: 69637.851: [GC 69637.851: [ParNew: 27046K->940K(29504K), 0.0326340 secs] 1565121K->1539122K(1569600K) icms_dc=100 , 0.0331870 secs
2013-03-20T03:24:08.102+0100: 69638.673: [GC 69638.674: [ParNew: 27180K->633K(29504K), 0.0307200 secs] 1565362K->1538902K(1569600K) icms_dc=100 , 0.0313650 secs
2013-03-20T03:24:08.951+0100: 69639.522: [GC 69639.523: [ParNew: 26873K->700K(29504K), 0.0307750 secs] 1565142K->1539057K(1569600K) icms_dc=100 , 0.0312460 secs
2013-03-20T03:24:09.197+0100: 69639.768: [CMS-concurrent-mark: 8.422/8.953 secs] [Times: user=12.14 sys=0.86, real=8.95 secs]
2013-03-20T03:24:09.197+0100: 69639.769: [CMS-concurrent-preclean-start]
2013-03-20T03:24:09.209+0100: 69639.781: [CMS-concurrent-preclean: 0.011/0.012 secs] [Times: user=0.02 sys=0.00, real=0.02 secs]
2013-03-20T03:24:09.210+0100: 69639.782: [CMS-concurrent-abortable-preclean-start]
2013-03-20T03:24:09.803+0100: 69640.375: [GC 69640.375: [ParNew: 26940K->884K(29504K), 0.0306980 secs] 1565297K->1539325K(1569600K) icms_dc=100 , 0.0413230 secs
2013-03-20T03:24:10.297+0100: 69640.868: [CMS-concurrent-abortable-preclean: 0.097/1.086 secs] [Times: user=0.50 sys=0.08, real=1.09 secs]
2013-03-20T03:24:10.298+0100: 69640.870: [GC[YG occupancy: 15521 K (29504 K)]69640.870: [Rescan (parallel) , 0.0418540 secs]69640.912: [weak refs processing, 0.
2013-03-20T03:24:10.342+0100: 69640.913: [CMS-concurrent-sweep-start]
2013-03-20T03:24:10.720+0100: 69641.292: [GC 69641.292: [ParNew: 27124K->1554K(29504K), 0.0325720 secs] 1565564K->1540080K(1569600K) icms_dc=100 , 0.0330720 sec
2013-03-20T03:24:11.148+0100: 69641.720: [GC 69641.720: [ParNew (promotion failed): 27794K->29504K(29504K), 0.0895560 secs]69641.810: [CMS2013-03-20T03:24:12.98 (concurrent mode failure): 1539703K->1539063K(1540096K), 14.0319360 secs] 1566320K->1539063K(1569600K), [CMS Perm : 141329K->141254K(266856K)] icms_dc=100 , 14
2013-03-20T03:24:25.344+0100: 69655.916: [GC [1 CMS-initial-mark: 1539063K(1540096K)] 1542126K(1569600K), 0.0301270 secs] [Times: user=0.03 sys=0.00, real=0.03
2013-03-20T03:24:25.377+0100: 69655.948: [CMS-concurrent-mark-start]
2013-03-20T03:24:30.158+0100: 69660.730: [GC 69660.730: [ParNew: 26240K->26240K(29504K), 0.0000450 secs]69660.730: [CMS2013-03-20T03:24:33.221+0100: 69663.792: (concurrent mode failure): 1539063K->1538864K(1540096K), 15.2470910 secs] 1565303K->1538864K(1569600K), [CMS Perm : 141299K->141296K(266856K)] icms_dc=100 , 15
2013-03-20T03:24:45.436+0100: 69676.007: [GC [1 CMS-initial-mark: 1538864K(1540096K)] 1539159K(1569600K), 0.0233200 secs] [Times: user=0.03 sys=0.00, real=0.03
2013-03-20T03:24:45.460+0100: 69676.032: [CMS-concurrent-mark-start]
2013-03-20T03:24:45.868+0100: 69676.440: [GC 69676.440: [ParNew: 26240K->2725K(29504K), 0.0297190 secs] 1565104K->1541589K(1569600K) icms_dc=100 , 0.0301970 sec
2013-03-20T03:24:46.153+0100: 69676.725: [GC 69676.725: [ParNew: 28965K->709K(29504K), 0.0317680 secs] 1567829K->1539573K(1569600K) icms_dc=100 , 0.0322110 secs
2013-03-20T03:24:46.427+0100: 69676.999: [GC 69676.999: [ParNew: 26949K->486K(29504K), 0.0298070 secs] 1565813K->1539351K(1569600K) icms_dc=100 , 0.0302500 secs
2013-03-20T03:24:46.693+0100: 69677.264: [GC 69677.265: [ParNew: 26726K->682K(29504K), 0.0302160 secs] 1565591K->1539547K(1569600K) icms_dc=100 , 0.0306790 secs
2013-03-20T03:24:46.934+0100: 69677.506: [GC 69677.506: [ParNew: 26922K->648K(29504K), 0.0307740 secs] 1565787K->1539645K(1569600K) icms_dc=100 , 0.0312550 secs
I noticed that the first "(concurrent mode failure)" shows up at 2013-03-20T03:16:59.060+0100, and then it shows up more and more often until it appears on every GC run.
Unfortunately I'm not sure what this means... :(
Comment #25
Posted on Mar 20, 2013 by Happy Lion
Hi, from your GC log it seems that your application's full-GC pauses are too long:
[CMS-concurrent-mark: 8.422/8.953 secs]
The mark phase alone took almost 9 seconds, and the "concurrent mode failure" collections that follow it paused all application threads (stop-the-world) for 14-15 seconds, so at those points there would be many timeout exceptions in xmemcached.
A "concurrent mode failure" means your JVM arguments have some problems: when it happens, the JVM cannot collect in the concurrent CMS mode and instead stops the whole world. For more information, see this blog post:
https://blogs.oracle.com/jonthecollector/entry/the_unspoken_cms_and_printgcdetails
I recommend you tune the GC arguments for your application.
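For illustration only, a hedged sketch of the kind of CMS tuning being suggested (the sizes and threshold below are assumptions, not values from this issue; in the log above the ~1.5 GB old generation is almost completely full, so the heap needs headroom and CMS needs to start earlier):

# Illustrative JVM options (values are assumptions):
-Xms2g -Xmx2g                           # give the old generation headroom
-XX:+UseConcMarkSweepGC                 # keep CMS, as in the log above
-XX:CMSInitiatingOccupancyFraction=70   # start CMS before the old gen fills up
-XX:+UseCMSInitiatingOccupancyOnly      # honor the threshold, not JVM heuristics
-XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:gc.log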
Comment #26
Posted on Mar 22, 2013 by Helpful Giraffe
Thanks. So in my case it is "normal" (or rather, explained by the GC behavior) that we see the timeout exceptions.
Comment #27
Posted on Nov 18, 2013 by Happy Horse
Hi. I think I'm facing the same timeout problem. I'm using xmemcached 1.4.1 and Java 6 on a Tomcat 7 app server. My test runs throughout the night and performs get, add, and delete operations against memcached. The process makes many I/O operations, but there is only a single client user, the CPU consumption is always under 30%, and the maximum memory heap (1 GB) is never reached (the maximum observed is < 400 MB), so full GC collections are not enough to explain the issue. After several hours of testing, when the final operations begin, the CPU increases from 8% to 25%. In the log file I begin to see some memcached operation timeout exceptions:
java.util.concurrent.TimeoutException: Timed out(5000 milliseconds) waiting for operation while connected to (myIpMemcached:port n°)
    at net.rubyeye.xmemcached.XMemcachedClient.latchWait(XMemcachedClient.java:2617)
    at net.rubyeye.xmemcached.XMemcachedClient.fetch0(XMemcachedClient.java:617)
    at net.rubyeye.xmemcached.XMemcachedClient.get0(XMemcachedClient.java:1030)
    at net.rubyeye.xmemcached.XMemcachedClient.gets(XMemcachedClient.java:1043)
    at net.rubyeye.xmemcached.XMemcachedClient.gets(XMemcachedClient.java:1078)
As you can see, I'm using the default operation timeout of 5000 milliseconds, which is already very high compared with the typical turnaround time of memcached (< 2 ms). I also monitored the memcached server from a dedicated JVM with a Java client that makes periodic get operations, and I did not find any error during this monitoring. So I don't think the problem comes from the server; it probably comes from the xmemcached library and its behavior in the JVM during this test.
Status: New
Labels:
Type-Defect
Priority-Medium