Issue 3013: Stream events abruptly stops with Gerrit 2.9.1
Status: AwaitingInformation
Owner: ----
Reported by burmawal...@gmail.com, Nov 12, 2014

Affected Version:

What steps will reproduce the problem?
1. The stream-events command abruptly stops delivering events.
2. Restarting Gerrit resolves the problem, but the event stream stops again within a few hours of the restart.
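To watch for the symptom directly, one can leave a stream-events subscription running and note when its output goes silent. A minimal sketch; the host and user below are placeholders, not taken from this report:

```shell
# Placeholder host/user -- adjust for your installation. The standard
# Gerrit ssh command for subscribing to events is `gerrit stream-events`.
GERRIT_USER=jenkins
GERRIT_HOST=gerrit.example.com
STREAM_CMD="ssh -p 29418 ${GERRIT_USER}@${GERRIT_HOST} gerrit stream-events"

# Run the command manually and watch the JSON events; when the bug hits,
# the stream simply stops producing output while the ssh connection
# itself stays established.
echo "$STREAM_CMD"
```
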


What is the expected output? What do you see instead?
The stream-events command should keep delivering notifications without interruption; instead, the stream goes silent after a few hours.

Please provide any additional information below.
Every time stream-events stops responding and we stop Gerrit, we see the stack trace below in the log:

WARN  com.google.gerrit.server.git.WorkQueue$Lifecycle : Failed to stop
com.google.inject.ProvisionException: Guice provision errors:

1) Error in custom provider, com.google.inject.OutOfScopeException: Not in command/request
  while locating com.google.gerrit.sshd.SshScope$ContextProvider
  while locating com.google.gerrit.sshd.SshScope$Context

1 error
        at com.google.inject.internal.InjectorImpl$3.get(InjectorImpl.java:1014)
        at com.google.gerrit.sshd.SshLog.onExecute(SshLog.java:154)
        at com.google.gerrit.sshd.CommandFactoryProvider$Trampoline.log(CommandFactoryProvider.java:212)
        at com.google.gerrit.sshd.CommandFactoryProvider$Trampoline.access$800(CommandFactoryProvider.java:104)
        at com.google.gerrit.sshd.CommandFactoryProvider$Trampoline$2.onExit(CommandFactoryProvider.java:184)
        at com.google.gerrit.sshd.BaseCommand.onExit(BaseCommand.java:305)
        at com.google.gerrit.sshd.commands.StreamEvents.onExit(StreamEvents.java:138)
        at com.google.gerrit.sshd.commands.StreamEvents$3.cancel(StreamEvents.java:87)
        at com.google.gerrit.server.git.WorkQueue$Task.cancel(WorkQueue.java:321)
        at java.util.concurrent.ScheduledThreadPoolExecutor.onShutdown(ScheduledThreadPoolExecutor.java:366)
        at java.util.concurrent.ThreadPoolExecutor.shutdown(ThreadPoolExecutor.java:1393)
        at java.util.concurrent.ScheduledThreadPoolExecutor.shutdown(ScheduledThreadPoolExecutor.java:759)
        at com.google.gerrit.server.git.WorkQueue.stop(WorkQueue.java:154)
        at com.google.gerrit.server.git.WorkQueue.access$000(WorkQueue.java:51)
        at com.google.gerrit.server.git.WorkQueue$Lifecycle.stop(WorkQueue.java:66)
        at com.google.gerrit.lifecycle.LifecycleManager.stop(LifecycleManager.java:88)
        at com.google.gerrit.pgm.Daemon$2.run(Daemon.java:209)
        at com.google.gerrit.pgm.util.RuntimeShutdown$ShutdownCallback.run(RuntimeShutdown.java:86)
Caused by: com.google.inject.OutOfScopeException: Not in command/request
        at com.google.gerrit.sshd.SshScope.requireContext(SshScope.java:153)
        at com.google.gerrit.sshd.SshScope.access$300(SshScope.java:38)
        at com.google.gerrit.sshd.SshScope$ContextProvider.get(SshScope.java:118)
        at com.google.gerrit.sshd.SshScope$ContextProvider.get(SshScope.java:115)
        at com.google.inject.internal.ProviderInternalFactory.provision(ProviderInternalFactory.java:86)
        at com.google.inject.internal.BoundProviderFactory.provision(BoundProviderFactory.java:73)
        at com.google.inject.internal.ProviderInternalFactory.circularGet(ProviderInternalFactory.java:66)
        at com.google.inject.internal.BoundProviderFactory.get(BoundProviderFactory.java:63)
        at com.google.inject.internal.InjectorImpl$3$1.call(InjectorImpl.java:1005)
        at com.google.inject.internal.InjectorImpl.callInContext(InjectorImpl.java:1051)
        at com.google.inject.internal.InjectorImpl$3.get(InjectorImpl.java:1001)

Nov 12, 2014
Project Member #2 huga...@gmail.com
Your problem seems to be the same as the one reported in this discussion[1].

There is a regression in sshd[2] that causes ssh threads to get stuck; once all the stream-events threads are stuck, the stream-events command no longer receives events.

This problem is supposed to be fixed in sshd 0.13.0. If you build the latest stable-2.9 branch, the problem should be gone, since that branch includes sshd 0.13.0[3].

This will eventually be released as 2.9.2.

Let us know if this fixes your issue.

[1]https://groups.google.com/forum/#!searchin/repo-discuss/stream$20events/repo-discuss/4va1DH520to
[2]https://issues.apache.org/jira/browse/SSHD-348
[3]https://gerrit-review.googlesource.com/#/c/61353/
Nov 12, 2014
#3 burmawal...@gmail.com
Thanks for the details (very helpful).

I have built a custom version of Gerrit based on v2.9.1 plus the following patches,
SSHD: Update to 0.13.0
Bump SSHD Mina version to 2.0.8
Bump Bouncycastle version to 1.51
Update EncryptedContactStore to not use deprecated/removed methods

Running validation tests now.
Nov 14, 2014
Project Member #4 huga...@gmail.com
Any updates? Did it fix your problem?
Nov 18, 2014
Project Member #5 david.pu...@sonymobile.com
Yusuf, any update on this?
Nov 19, 2014
#6 burmawal...@gmail.com
Hi David,

Apologies for the delayed response, but I wanted to give myself some time to test these patches before declaring the issue fixed. After running the patched version for almost a week, I couldn't reproduce the bug; then again, there was no systematic way to reproduce it on v2.9.1 either.

Can you recommend some test cases that I should run in order to validate these patches?

Do we have a release date for v2.9.2 where these patches will be officially released?

Br,
Yusuf
Nov 19, 2014
Project Member #7 david.pu...@sonymobile.com
We don't experience the error, so we also don't know any way to reproduce it.

2.9.2 is pending verification that the SSHD upgrade fixes the issue, and there is also a fix for the primary key order that needs to be included. Hoping it will be out within the next couple of weeks. 2.10 should follow not long after.

Nov 29, 2014
#8 gustavo@gnustavo.com
We upgraded yesterday from 2.9.1 to 2.9.2 to get this issue fixed, but it still seems to be there.

We restarted Gerrit today at 2:35pm and the ssh connections have been piling up since then. Now, at 5:11pm, "lsof -n -i :29418 | grep -c ESTABLISHED" tells me there are 77 connections in the ESTABLISHED state. They all connect to our Jenkins servers.
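That connection check can be scripted to track growth over time; here is a minimal sketch with the counting factored into a function so it can be tried against sample output. The PID and addresses below are made up for illustration; in practice you would pipe live `lsof -n -i :29418` output into it:

```shell
# Count connections in the ESTABLISHED state from lsof-style output.
count_established() {
  grep -c 'ESTABLISHED'
}

# Made-up sample of what `lsof -n -i :29418` might print: two
# established peers plus the listening socket.
sample='java 13374 gerrit 270u IPv4 TCP 10.0.0.1:29418->10.0.0.2:51234 (ESTABLISHED)
java 13374 gerrit 271u IPv4 TCP 10.0.0.1:29418->10.0.0.3:51235 (ESTABLISHED)
java 13374 gerrit 272u IPv4 TCP *:29418 (LISTEN)'

printf '%s\n' "$sample" | count_established   # prints 2
```

Running the real pipeline from cron every few minutes and appending the count with a timestamp would show whether the connections climb steadily or plateau.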

In the error_log I see only 25 messages like this since the restart:

[2014-11-29 16:58:41,689] WARN  com.google.gerrit.sshd.GerritServerSession : Exception caught
java.io.IOException: Connection reset by peer
	at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
	at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
	at sun.nio.ch.IOUtil.read(IOUtil.java:197)
	at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
	at org.apache.mina.transport.socket.nio.NioProcessor.read(NioProcessor.java:302)
	at org.apache.mina.transport.socket.nio.NioProcessor.read(NioProcessor.java:45)
	at org.apache.mina.core.polling.AbstractPollingIoProcessor.read(AbstractPollingIoProcessor.java:694)
	at org.apache.mina.core.polling.AbstractPollingIoProcessor.process(AbstractPollingIoProcessor.java:668)
	at org.apache.mina.core.polling.AbstractPollingIoProcessor.process(AbstractPollingIoProcessor.java:657)
	at org.apache.mina.core.polling.AbstractPollingIoProcessor.access$600(AbstractPollingIoProcessor.java:67)
	at org.apache.mina.core.polling.AbstractPollingIoProcessor$Processor.run(AbstractPollingIoProcessor.java:1121)
	at org.apache.mina.util.NamePreservingRunnable.run(NamePreservingRunnable.java:64)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:744)

But these messages have always occurred at this frequency, even before we upgraded from 2.8.4 to 2.9.1 last Wednesday, 2014-11-26.

I'm attaching the file jstack.13374.gz, produced with "jstack -F <pid>", containing thread dumps from Gerrit. I couldn't find any stack trace similar to the one described at https://issues.apache.org/jira/browse/SSHD-348, though. Is there a better way to get thread dumps? (I don't speak Java, sigh.)
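On the thread-dump question, two standard alternatives to "jstack -F" (whose forced mode often yields less complete dumps) are sketched below. The PID is taken from the attachment name purely as an illustration:

```shell
PID=13374   # Gerrit daemon PID (from the attachment name; adjust as needed)

# Option 1: plain jstack, without -F. It attaches cooperatively and
# usually produces a fuller dump than the forced mode:
#   jstack "$PID" > threaddump.txt

# Option 2: send SIGQUIT; the JVM prints a complete thread dump to its
# stdout/stderr (look in the daemon's console log afterwards):
#   kill -3 "$PID"

echo "$PID"
```
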

PS: For the record, now, at 5:27pm, there are 83 connections in the ESTABLISHED state.
Attachment: jstack.13374.gz (1.9 KB)
Dec 1, 2014
Project Member #9 huga...@gmail.com
The issue is/was threads getting stuck because the ssh library waits for a disconnected client to empty its buffer. This depletes the stream-events thread pool, which causes stream-events to stop.

What you are describing could be normal behaviour. The Jenkins Gerrit Trigger plugin (GT) uses the stream-events command, which only ends when you stop GT, so it is normal to have a lot of established ssh connections: one per Jenkins instance that uses GT.

If you have more connections than Jenkins instances with GT, there could be a problem on the GT side. I remember we had some connection issues at one point with GT's connection watchdog feature, but we fixed them and all the fixes are included in 2.12.0.

Dec 1, 2014
#10 gustavo@gnustavo.com
I'm discussing this in the mailing list. It seems that the problem is on the Jenkins side.

https://groups.google.com/d/msg/repo-discuss/NuFti4SVNQM/WAqBaV0bGFAJ