My favorites | Sign in
Project Home Downloads Wiki Issues Source
New issue   Search
for
  Advanced search   Search tips   Subscriptions
Issue 192: Rare total failure caused by input queue full
27 people starred this issue and may be notified of changes. Back to list
Status:  Accepted
Owner:  michael....@gmail.com


Sign in to add a comment
 
Reported by fiz...@gmail.com, Aug 1, 2011
What version of the product are you using? On what operating system?
Memcache 2.7 on Linux (EC2 instance using the Ubuntu Maverick AMI ami-08f40561)

Tell me more...

I think that there is a state the memcache client can get into where all network traffic ceases and all calls hang for ten seconds, then throw IllegalStateException.

I got our program into this state twice over the weekend by running some aggressive loadtests with some memcache servers that were ~ 0.5ms away from the client machine; I have not since then been able to reproduce this failure in an environment where I could hook up a debugger to see what's going on.

The observed exception is this:

java.lang.IllegalStateException: Timed out waiting to add net.spy.memcached.protocol.ascii.StoreOperationImpl@727b65b4(max wait=10000ms)
at net.spy.memcached.protocol.TCPMemcachedNodeImpl.addOp(TCPMemcachedNodeImpl.java:292)
at net.spy.memcached.MemcachedConnection.addOperation(MemcachedConnection.java:739)
at net.spy.memcached.MemcachedConnection.addOperation(MemcachedConnection.java:719)
at net.spy.memcached.MemcachedClient.addOp(MemcachedClient.java:410)
at net.spy.memcached.MemcachedClient.asyncStore(MemcachedClient.java:447)
at net.spy.memcached.MemcachedClient.set(MemcachedClient.java:868)
... (our code after this)

Meanwhile, the IO thread is sitting waiting for the Selector to select something. (at net.spy.memcached.MemcachedConnection.handleIO(MemcachedConnection.java:252) The IO thread never wakes up.

Once the client enters this state, it's all over: every memcache client operation times out because the input queue is full. Because the add call in TCPMemcachedNodeImpl can't add to the queue, the selector is never woken up, and the client never recovers. The only thing to do in this case is shutdown() the current memcache client and throw it away.

It might be possible to fix this with a judicious application of finally inside MemcachedConnection.addOperation().
Aug 19, 2011
#2 mewmewb...@gmail.com
I'm reliably observing the same thing in my load test, though I'm on version 2.6rc1. What I see is that both the write queue and input queues are filled up (we have 2000 for write queue and 1000 for input queue), and then everything just seem to stop (queues no longer draining, etc.).
Aug 25, 2011
#3 mewmewb...@gmail.com
I saw this again in my load test. The queue sizes are stuck to:

input queue 20000
read queue 6
write queue 98

...and they are no longer moving at all, even when I removed load altogether.
Sep 19, 2011
#4 vishy.ka...@gmail.com
We saw this issue with 2.7.1 under heavy load test.
Sep 20, 2011
Project Member #5 mikewie...@gmail.com
Can someone post a test that can be used to reproduce this issue?
Sep 27, 2011
#6 a...@one2team.com
I've got the same problem in my production application.

memcached server 1.4.4-3.el6 runs on a Red Hat Enterprise Linux Server release 6.1
and the client (v2.7) runs on a tomcat 6 in a CentOS release 4.8


java.lang.IllegalStateException: Interrupted while waiting to add net.spy.memcached.protocol.ascii.GetOperationImpl@53cb966
at net.spy.memcached.protocol.TCPMemcachedNodeImpl.addOp(TCPMemcachedNodeImpl.java:298)
at net.spy.memcached.MemcachedConnection.addOperation(MemcachedConnection.java:739)
at net.spy.memcached.MemcachedConnection.addOperation(MemcachedConnection.java:719)
at net.spy.memcached.MemcachedClient.addOp(MemcachedClient.java:410)
at net.spy.memcached.MemcachedClient.asyncGet(MemcachedClient.java:974)
at net.spy.memcached.MemcachedClient.asyncGet(MemcachedClient.java:988)
... our code after this


I fail to reprocude it in test case event while querying on a local or a distant server..

Could i suggest to add the Exception root cause in the IllegalStateException
which is thrown in the TCPMemcachedNodeImpl.addOp(Operation) method to continue investigation ?

Two calls in the addOp method could be interrupted, the authLatch.await() and the inputQueue.offer().
Statistically, I think it's authLatch.await() call that is interrupted, but cannot be sure without the InterruptedException initialy thrown.

i.e. :

	try {
	//.....
		if(!inputQueue.offer(op, opQueueMaxBlockTime,
				TimeUnit.MILLISECONDS)) {
			throw new IllegalStateException("Timed out waiting to add "
					+ op + "(max wait=" + opQueueMaxBlockTime + "ms)");
		}
	} catch(InterruptedException e) {
		// Restore the interrupted status
		Thread.currentThread().interrupt();
		throw new IllegalStateException("Interrupted while waiting to add " + op, e);
	}
Oct 15, 2011
Project Member #7 ingen...@gmail.com
It's possible this has been fixed with some thread safety issues uncovered recently.  I think it makes sense to add the additional exception information as requested.
Status: Accepted
Owner: ingen...@gmail.com
Nov 24, 2011
#8 a...@one2team.com
Is the 2.7.3 release should take advantage of these ?
Nov 24, 2011
Project Member #9 ingen...@gmail.com
I haven't made the change yet.  please post a changeset to review.couchbase.com or send me a patch. 
Dec 7, 2011
#10 smilingai2004@gmail.com
I got the same problem in an application with spymemcached 2.5 and memcached 1.4.3. I can easily reproduce the issue by changing a few classes:
1) Reduce op queue length and queue block time (DefaultConnectionFactory.java).
2) Throw OutOfMemoryError in handleIO() (MemcachedConnection.java)
3) write a benchmark program to fetch and push data (InputQueueStressTest)
Run the benchmark program and A large amount of IllegalExceptions are seen in console. In the meanwhile, memcached thread stops.

OutOfMemoryError may happen in some situations. The problem is, after system recovers, how memcached thread is initiated again. So can MemcachedClient ignore system error?
		while(running) {
			try {
				conn.handleIO();				
			} catch(IOException e) {
				logRunException(e);
			} catch(CancelledKeyException e) {
				logRunException(e);
			} catch(ClosedSelectorException e) {
				logRunException(e);
			} catch(IllegalStateException e) {
				logRunException(e);
			}  catch (Error e) { // ignore it. System may recover soon or later.
			    
			}

ChangedClasses.rar
11.7 KB   Download
InputQueueStressTest.java
2.7 KB   View   Download
Dec 7, 2011
#11 smilingai2004@gmail.com
The root cause is same with Issue 191, although symptom is different. I have comment on issue 191 and request to escalate priority.
Dec 7, 2011
#12 mewmewb...@gmail.com
Hmm, issue 191 said its fixes are in 2.7.3, and we've upgraded to 2.7.3 but are still seeing these "total failures".
Dec 7, 2011
#13 smilingai2004@gmail.com
It is not said "for sure". OutOfMemoryError may happen under rare conditions. I have seen it in our production. In the meanwhile, memcached thread dies and there is no recovery mechanism when it happens. So I think MemcachedClient.run() should catch Throwable and log specific error information.
Dec 8, 2011
Project Member #14 ingen...@gmail.com
Yes, jonathanz's recommendations there I would like to implement.

The thing is, the behavior wouldn't necessarily be that much different.  If we're in OOM, we'll just log a lot and not get real work done with jonathanz's recommended change.  

I'll probably get to this in a few weeks.  If anyone has more time in the near term to get a patch together, I'd gladly accept it.
Dec 8, 2011
#15 smilingai2004@gmail.com
I have a patch based on 2.7.3. Can you get it into V2.8 preview?
MemcachedClient.java
74.1 KB   View   Download
Dec 8, 2011
Project Member #16 mikewie...@gmail.com
Can you post a patch by using git patch. It's difficult to see what you changed here.
Dec 8, 2011
Project Member #17 mikewie...@gmail.com
You can also post your code to our code review site at review.couchbase.org. Instructions for doing this are here: http://www.couchbase.org/wiki/display/membase/Contributing+changes
Dec 11, 2011
#18 mountain...@gmail.com
Post patch
fix_full_inputqueue.patch
1.9 KB   View   Download
Feb 16, 2012
#19 joshg...@gmail.com
Did this fix make it into a release yet?
Feb 16, 2012
Project Member #20 ingen...@gmail.com
It did not yet, will try to look into this soon and provide a snapshot.  did you hit this issue?
Feb 17, 2012
#21 joshg...@gmail.com
I "think" we are encountering this issue (if it isn't it seems very close).  We have been struggling with a condition in which we run into a cascade of the IllegalStateExceptions in our production environment.  We haven't pieced together the conditions that cause the cause the issue so are hoping someone can pinpoint the issue better than we have so far.

The stacktrace we see repeated while this is occuring:

java.lang.IllegalStateException: Timed out waiting to add
net.spy.memcached.protocol.binary.GetOperationImpl@7cb95c23(max wait=10000ms)
        at net.spy.memcached.protocol.TCPMemcachedNodeImpl.addOp(TCPMemcachedNodeImpl.java:292)
        at net.spy.memcached.MemcachedConnection.addOperation(MemcachedConnection.java:611)
        at net.spy.memcached.MemcachedConnection.addOperation(MemcachedConnection.java:591)
        at net.spy.memcached.MemcachedClient.addOp(MemcachedClient.java:279)
        at net.spy.memcached.MemcachedClient.asyncGet(MemcachedClient.java:799)
        at net.spy.memcached.MemcachedClient.get(MemcachedClient.java:919)
Feb 17, 2012
Project Member #22 ingen...@gmail.com
No, that's different, I believe.

The stack trace there indicates you're trying to do an additional asynchronous get, but the client is overloaded with asynchronous gets.  It looks like you've done some tuning there.  

I'd recommend taking this to the mailing list.  If you can send over how you're setting up the client and a description of the workload, I'll look out for it there to try to help.
Feb 23, 2012
#23 joshg...@gmail.com
Thanks Matt.  I posted a question to the mailing list to see if I can get guidance on how to avoid this condition.  Thanks for the help!
May 13, 2014
#24 kitt...@gmail.com
Hello. Are there any updates ? We are seeing this on 2.7.3:

Caused by: java.lang.IllegalStateException: Timed out waiting to add net.spy.memcached.protocol.ascii.StoreOperationIm
pl@5da920ca(max wait=0ms)
        at net.spy.memcached.protocol.TCPMemcachedNodeImpl.addOp(TCPMemcachedNodeImpl.java:294)
        at net.spy.memcached.MemcachedConnection.addOperation(MemcachedConnection.java:750)
        at net.spy.memcached.MemcachedConnection.addOperation(MemcachedConnection.java:730)
        at net.spy.memcached.MemcachedClient.addOp(MemcachedClient.java:450)
        at net.spy.memcached.MemcachedClient.asyncStore(MemcachedClient.java:487)
        at net.spy.memcached.MemcachedClient.set(MemcachedClient.java:908)
        at net.spy.memcached.MemcachedClient.set(MemcachedClient.java:114)
        at net.spy.memcached.PooledMemcachedClient.set(PooledMemcachedClient.java:104)

May 13, 2014
#25 kitt...@gmail.com
Once that happens, the client does not work properly anymore.
May 13, 2014
Project Member #26 michael....@gmail.com
Can you see if it works with the latest version? 2.11.2 .. your is quite dated by now

http://search.maven.org/#artifactdetails%7Cnet.spy%7Cspymemcached%7C2.11.2%7Cjar
Jul 1, 2014
#27 js...@photobucket.com
I believe I'm getting the same error in 2.11.2. One or two of our production servers will get into this state every day or so resulting in a total lockup of the client.

Error message
java.lang.IllegalStateException: Timed out waiting to add Cmd: 0 Opaque: 10073687 Key: v2.silo_map.silo_id_22(max wait=10000ms)

Stack trace
  net.spy.memcached.protocol.TCPMemcachedNodeImpl.addOp (TCPMemcachedNodeImpl.java:362)

     net.spy.memcached.MemcachedConnection.addOperation (MemcachedConnection.java:1228)

     net.spy.memcached.MemcachedConnection.addOperation (MemcachedConnection.java:1194)

 net.spy.memcached.MemcachedConnection.enqueueOperation (MemcachedConnection.java:1150)

             net.spy.memcached.MemcachedClient.asyncGet (MemcachedClient.java:1045)

                  net.spy.memcached.MemcachedClient.get (MemcachedClient.java:1230)

                  net.spy.memcached.MemcachedClient.get (MemcachedClient.java:1257)
Jul 1, 2014
#28 js...@photobucket.com
The attached unit test should show how to reproduce the issue. Basically, set the queue sizes arbitrarily small (4 on my laptop) and then throw 5000 reads into Callable objects and throw them at an executor service.

Sometimes, you'll see this exception:
java.lang.IllegalStateException: Queue full
	at java.util.AbstractQueue.add(AbstractQueue.java:98) ~[na:1.7.0_51]
	at java.util.concurrent.ArrayBlockingQueue.add(ArrayBlockingQueue.java:283) ~[na:1.7.0_51]
	at net.spy.memcached.protocol.TCPMemcachedNodeImpl.getNextWritableOp(TCPMemcachedNodeImpl.java:251) ~[spymemcached-2.11.2.jar:2.11.2]
	at net.spy.memcached.protocol.TCPMemcachedNodeImpl.fillWriteBuffer(TCPMemcachedNodeImpl.java:220) ~[spymemcached-2.11.2.jar:2.11.2]
	at net.spy.memcached.MemcachedConnection.handleWrites(MemcachedConnection.java:760) [spymemcached-2.11.2.jar:2.11.2]
	at net.spy.memcached.MemcachedConnection.handleReadsAndWrites(MemcachedConnection.java:688) [spymemcached-2.11.2.jar:2.11.2]
	at net.spy.memcached.MemcachedConnection.handleIO(MemcachedConnection.java:648) [spymemcached-2.11.2.jar:2.11.2]
	at net.spy.memcached.MemcachedConnection.handleIO(MemcachedConnection.java:419) [spymemcached-2.11.2.jar:2.11.2]
	at net.spy.memcached.MemcachedConnection.run(MemcachedConnection.java:1403) [spymemcached-2.11.2.jar:2.11.2]

Other times you'll see ours:
java.lang.IllegalStateException: Timed out waiting to add Cmd: 0 Opaque: 97 Key: test.key(max wait=10000ms)
	at net.spy.memcached.protocol.TCPMemcachedNodeImpl.addOp(TCPMemcachedNodeImpl.java:362) ~[spymemcached-2.11.2.jar:2.11.2]
	at net.spy.memcached.MemcachedConnection.addOperation(MemcachedConnection.java:1228) ~[spymemcached-2.11.2.jar:2.11.2]
	at net.spy.memcached.MemcachedConnection.redistributeOperation(MemcachedConnection.java:1012) ~[spymemcached-2.11.2.jar:2.11.2]
	at net.spy.memcached.MemcachedConnection.redistributeOperations(MemcachedConnection.java:984) ~[spymemcached-2.11.2.jar:2.11.2]
	at net.spy.memcached.MemcachedConnection.queueReconnect(MemcachedConnection.java:956) ~[spymemcached-2.11.2.jar:2.11.2]
	at net.spy.memcached.MemcachedConnection.lostConnection(MemcachedConnection.java:599) ~[spymemcached-2.11.2.jar:2.11.2]
	at net.spy.memcached.MemcachedConnection.handleIO(MemcachedConnection.java:668) ~[spymemcached-2.11.2.jar:2.11.2]
	at net.spy.memcached.MemcachedConnection.handleIO(MemcachedConnection.java:419) ~[spymemcached-2.11.2.jar:2.11.2]
	at net.spy.memcached.MemcachedConnection.run(MemcachedConnection.java:1403) ~[spymemcached-2.11.2.jar:2.11.2]

I haven't dug any deeper to see if threads are hung or some sort of loop or race is achieved. But in the event of my unit test the program never terminates... Though it  looks like infinite retries:

2014-07-01 21:53:31 pool-1-thread-2 WARN  n.s.m.MemcachedConnection - Could not redistribute to another node, retrying primary node for test.key.
MemcachedDefectTest.java
2.3 KB   View   Download
Jul 1, 2014
Project Member #29 michael....@gmail.com
Could you try with 2.11.4? We made some improvements in that area
Jul 2, 2014
#30 js...@photobucket.com
Confirmed the same behavior exists in 2.11.4.
Jul 2, 2014
Project Member #31 ingen...@gmail.com
(No comment was entered for this change.)
Owner: michael....@gmail.com
Sign in to add a comment

Powered by Google Project Hosting