My favorites | Sign in
Project Home Downloads Wiki Issues Code Search
New issue   Search
for
  Advanced search   Search tips   Subscriptions
Issue 254166: VMTest fails due to "Connection timed out due to banner exchange."
16 people starred this issue and may be notified of changes. Back to list
 
Project Member Reported by wi...@chromium.org, Jun 25, 2013
1) http://build.chromium.org/buildbot/chromiumos/builders/amd64%20generic%20full/builds/8471

2) http://build.chromium.org/buildbot/chromiumos/builders/x86%20generic%20incremental/builds/12316

This bug manifests as random commands timing out in the VMTests.  For instance:

For 1), login_CryptohomeIncognitoMounted failed, but it failed like:

06/25 11:17:08.412 DEBUG|  ssh_host:0114| Running (ssh) 'ls /usr/local/autotest/bin/autotest > /dev/null 2>&1'
06/25 11:17:38.458 ERROR|base_utils:0119| [stderr] Connection timed out during banner exchange
06/25 11:17:38.460 ERROR|server_job:0563| Exception escaped control file, job aborting:
Traceback (most recent call last):
  File "/build/amd64-generic/usr/local/autotest/server/server_job.py", line 555, in run
    self._execute_code(server_control_file, namespace)
  File "/build/amd64-generic/usr/local/autotest/server/server_job.py", line 1038, in _execute_code
    execfile(code_file, namespace, namespace)
  File "/tmp/cbuildbotmHQCJX/test_harness/all/SimpleTestUpdateAndVerify/2_autotest_tests/login_CryptohomeIncognitoMounted/control.srv", line 10, in <module>
    job.parallel_simple(run_client, machines)
  File "/build/amd64-generic/usr/local/autotest/server/server_job.py", line 457, in parallel_simple
    return_results=return_results)
  File "/build/amd64-generic/usr/local/autotest/server/subcommand.py", line 92, in parallel_simple
    function(arg)
  File "/tmp/cbuildbotmHQCJX/test_harness/all/SimpleTestUpdateAndVerify/2_autotest_tests/login_CryptohomeIncognitoMounted/control.srv", line 7, in run_client
    at.run(control, host=host)
  File "/build/amd64-generic/usr/local/autotest/server/autotest.py", line 358, in run
    client_disconnect_timeout)
  File "/build/amd64-generic/usr/local/autotest/server/autotest.py", line 379, in _do_run
    atrun.verify_machine()
  File "/build/amd64-generic/usr/local/autotest/server/autotest.py", line 500, in verify_machine
    "Autotest does not appear to be installed")


For 2) login_CryptohomeUnmounted failed with:

06/25 10:59:10.218 DEBUG|  ssh_host:0114| Running (ssh) 'mktemp -d /tmp/autoserv-XXXXXX'
06/25 10:59:40.261 ERROR|base_utils:0119| [stderr] Connection timed out during banner exchange
06/25 10:59:40.262 DEBUG|  ssh_host:0114| Running (ssh) 'true'
06/25 10:59:40.262 INFO |abstract_s:0628| Master ssh connection to 127.0.0.1 is down.
06/25 10:59:40.263 INFO |abstract_s:0640| Starting master ssh connection '/usr/bin/ssh -a -x -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_hYp3gbssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=180 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 9227 127.0.0.1'
06/25 10:59:40.263 DEBUG|base_utils:0081| Running '/usr/bin/ssh -a -x -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_hYp3gbssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=180 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 9227 127.0.0.1'
06/25 11:00:10.306 WARNI|base_utils:0845| run process timeout (29.9999990463) fired on: /usr/bin/ssh -a -x  -o ControlPath=/tmp/_autotmp_hYp3gbssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=180 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 9227 127.0.0.1 "export LIBC_FATAL_STDERR_=1; true"
06/25 11:00:12.311 DEBUG|abstract_s:0457| Host 127.0.0.1 is still down after waiting 32 seconds
06/25 11:00:12.312 DEBUG|base_utils:0081| Running 'rsync   --timeout=1800 --rsh='/usr/bin/ssh -a -x -o ControlPath=/tmp/_autotmp_hYp3gbssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=180 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 9227' -az root@127.0.0.1:"/usr/local/autotest/results/default/" /tmp/cbuildbotcJMjXx/test_harness/all/SimpleTestUpdateAndVerify/2_autotest_tests/login_CryptohomeUnmounted'
06/25 11:00:12.456 WARNI|abstract_s:0290| trying scp, rsync failed: Command <rsync   --timeout=1800 --rsh='/usr/bin/ssh -a -x -o ControlPath=/tmp/_autotmp_hYp3gbssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=180 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 9227' -az root@127.0.0.1:"/usr/local/autotest/results/default/" /tmp/cbuildbotcJMjXx/test_harness/all/SimpleTestUpdateAndVerify/2_autotest_tests/login_CryptohomeUnmounted> failed, rc=23, Command returned non-zero exit status
* Command: 
    rsync   --timeout=1800 --rsh='/usr/bin/ssh -a -x -o ControlPath=/tmp
    /_autotmp_hYp3gbssh-master/socket -o StrictHostKeyChecking=no -o
    UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o
    ServerAliveInterval=180 -o ServerAliveCountMax=3 -o ConnectionAttempts=4
    -o Protocol=2 -l root -p 9227' -az
    root@127.0.0.1:"/usr/local/autotest/results/default/" /tmp/cbuildbotcJMjXx
    /test_harness/all/SimpleTestUpdateAndVerify/2_autotest_tests/login_Cryptoh
    omeUnmounted
Exit status: 23
Duration: 0.137355089188

stderr:
rsync: change_dir "/usr/local/autotest/results/default" failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1518) [Receiver=3.0.8]


However, note that in both cases, what's actually going on is that we seem to fail some random SSH command, which has repercussions down the line, rather than something test related failing.
Jun 25, 2013
#1 wi...@chromium.org
I don't really understand the directory structure of the logs.  When I go into the logs for 1), I notice that for the test that failed, I can find a log of the failure here:

test_harness/failed/SimpleTestUpdateAndVerify/2_autotest_tests/login_CryptohomeUnmounted/debug/autoserv.DEBUG

But there are no system logs in this directory.  However, if I go to:

/usr/local/google/home/wiley/Desktop/test_harness/failed/SimpleTestUpdateAndVerify/2_autotest_tests/login_CryptohomeUnmounted/login_CryptohomeUnmounted/debug

I see logs for a passing instance of the test, and there are indeed system logs.

What gives?  Lab guys?
Owner: be...@chromium.org
Jun 25, 2013
#2 be...@chromium.org
Dan did some work around there recently calculation a log-diff and only pulling back what's necessary, so he's probably more qualified to answer. cc'ing sheriff also.
Owner: d...@chromium.org
Cc: mille...@chromium.org
Jun 25, 2013
#3 d...@chromium.org
While I'm not sure about the ssh issue, I can certainly explain the log location. A job stores everything in its job result folder, e.g.,
http://dshi.mtv/results/804-dshi/172.22.75.225/

In that folder you can find a folder with name login_CryptohomeMounted, that's the |test| name. In this case, it's the same as job name. In some cases a job can have multiple tests, so you can find several folders with name of each test. In that test result folder, you can find a folder hidden deep in sysinfo/var/log_diff/, that is where all the system logs are saved, well the diff of system logs. That is, the logs generated during that particular test run.
Jun 25, 2013
#4 wi...@chromium.org
Ok, so the debug directory in the job results directory is the debug output of the... job?  Not the test?
Jun 28, 2013
#6 wi...@chromium.org
(No comment was entered for this change.)
Labels: Hotlist-TreeCloser Gardening
Jun 28, 2013
#7 d...@chromium.org
Reply to #4, that debug directory is the debug output of job, not test.
crbug.com/254222 is created to update doc with changes about /var/log_diff information.

Owner: wi...@chromium.org
Jul 17, 2013
#8 d...@chromium.org
 Issue 261235  has been merged into this issue.
Jul 17, 2013
#9 r...@chromium.org
Saw this twice times on amd64-generic-full today:

http://chromegw.corp.google.com/i/chromiumos/builders/amd64%20generic%20full/builds/8742
http://chromegw.corp.google.com/i/chromiumos/builders/amd64%20generic%20full/builds/8736

Bumping the priority.

From the autotest logs:

07/17 16:24:05.726 ERROR|base_utils:0119| [stderr] Warning: Permanently added '[127.0.0.1]:9227' (RSA) to the list of known hosts.
07/17 16:24:06.343 ERROR|base_utils:0119| [stderr] Warning: Permanently added '[127.0.0.1]:9227' (RSA) to the list of known hosts.
07/17 16:24:36.435 ERROR|base_utils:0119| [stderr] Connection timed out during banner exchange
07/17 16:24:36.469 ERROR|base_utils:0119| [stderr] Warning: Permanently added '[127.0.0.1]:9227' (RSA) to the list of known hosts.
07/17 16:24:36.722 ERROR|server_job:0602| Exception escaped control file, job aborting:


Labels: -Pri-2 Pri-1
Jul 17, 2013
#10 r...@chromium.org
 Issue 215784  has been merged into this issue.
Jul 24, 2013
#11 bhthomp...@chromium.org
I believe this may have just happened again on amd64 generic ASAN, http://build.chromium.org/p/chromiumos/builders/amd64%20generic%20ASAN/builds/6471

If I dig into the logs, I see a similar sequence to comment #9 with the messages:
Connection timed out during banner exchange
Exception escaped control file, job aborting:
Aug 6, 2013
#13 dlau...@chromium.org
Sheriff checkin: this happens occasionally still, but there is not any more useful info in the logs that has not already been posted.

Latest is http://build.chromium.org/p/chromiumos/builders/x86%20generic%20incremental/builds/13158


Aug 7, 2013
#14 ty...@chromium.org
Two similar failures just occurred. One on x86 generic ASAN:
http://build.chromium.org/p/chromiumos/builders/x86%20generic%20ASAN/builds/13324

08/06 23:36:17.599 DEBUG|  ssh_host:0114| Running (ssh) '/bin/uname -r'
08/06 23:36:47.616 ERROR|base_utils:0119| [stderr] Connection timed out during banner exchange
08/06 23:36:47.618 ERROR|server_job:0602| Exception escaped control file, job aborting:

The other one on amd64 generic ASAN:
http://build.chromium.org/p/chromiumos/builders/amd64%20generic%20ASAN/builds/6689

08/06 15:29:02.729 ERROR|base_utils:0119| [stderr] ssh: connect to host 127.0.0.1 port 9227: Connection refused
08/06 15:29:02.731 INFO |    remote:0072| Failed to copy /var/log/messages at startup: command execution error

Aug 7, 2013
#15 ty...@chromium.org
Yet another one on x86 generic full:
http://build.chromium.org/p/chromiumos/builders/x86%20generic%20full/builds/10603

08/07 01:14:44.774 WARNI|abstract_s:0365| trying scp, rsync failed: Command <rsync -L  --timeout=1800 --rsh='/usr/bin/ssh -a -x -o ControlPath=/tmp/_autotmp_0jix1Ussh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=180 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 9227' -az /tmp/tmpJNjcQU root@127.0.0.1:"/usr/local/autotest/global_config.ini"> failed, rc=255, Command returned non-zero exit status
* Command: 
    rsync -L  --timeout=1800 --rsh='/usr/bin/ssh -a -x -o ControlPath=/tmp
    /_autotmp_0jix1Ussh-master/socket -o StrictHostKeyChecking=no -o
    UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o
    ServerAliveInterval=180 -o ServerAliveCountMax=3 -o ConnectionAttempts=4
    -o Protocol=2 -l root -p 9227' -az /tmp/tmpJNjcQU
    root@127.0.0.1:"/usr/local/autotest/global_config.ini"
Exit status: 255
Duration: 30.0332450867

stderr:
Connection timed out during banner exchange
rsync: connection unexpectedly closed (0 bytes received so far) [sender]
rsync error: unexplained error (code 255) at io.c(601) [sender=3.0.8]
Aug 7, 2013
#16 scottz@chromium.org
Aviv can you try to reproduce this problem with one of the builds listed above. It is interesting to note that this is only being reported from the external waterfalls. 
Status: Assigned
Owner: akes...@chromium.org
Labels: vmtest Hardware-Lab
Aug 7, 2013
#17 mille...@chromium.org
http://chromegw/i/chromiumos/builders/x86%20generic%20ASAN/builds/13330
Aug 7, 2013
#18 akes...@chromium.org
I'm at a loss here. All I can think of is trying to add some logging code to make these failures more traceable. I'll look into the autotest ssh_host code and see if there's somewhere that traceability can be improved.
Aug 8, 2013
#19 mille...@chromium.org
This appears to have hit an internal builder too.

http://chromegw/i/chromeos/builders/parrot32%20canary/builds/101/steps/VMTest/logs/stdio
Aug 9, 2013
#20 dkr...@chromium.org
Hit this three times today so far:
http://chromegw/i/chromeos/builders/parrot32%20canary/builds/105
http://chromegw/i/chromiumos/builders/x86%20generic%20incremental/builds/13247
http://chromegw/i/chromiumos/builders/amd64%20generic%20ASAN/builds/6749
Aug 12, 2013
#21 scottz@chromium.org
(No comment was entered for this change.)
Labels: Iteration-88
Aug 12, 2013
#22 akes...@chromium.org
(No comment was entered for this change.)
Blockedon: chromium:271621
Aug 13, 2013
#23 charliem...@chromium.org
 Issue 271287  has been merged into this issue.
Aug 13, 2013
#24 bugdro...@chromium.org
Project: chromiumos/platform/crostestutils
Branch : master
Author : Aviv Keshet <akeshet@chromium.org>
Commit : 32e9fba24aa9eb8069f2305f6384c513810421ed

Code Review +2: Chris Sosa
Verified    +1: Aviv Keshet
Change-Id     : I966e379d9a82ce1bd1c68b9ab6ba34cdc40223a5
Reviewed-at   : https://gerrit.chromium.org/gerrit/65704

turn on ssh_verbosity=1 for VMTest

BUG=chromium:254166
TEST=None

Commit-Queue: Aviv Keshet <akeshet@chromium.org>

M  cros_run_vm_test
Aug 13, 2013
#25 akes...@chromium.org
VMTest is now using ssh verbosity level of 1. Hopefully this means we will have more useful information in the autoserv logs next time this occurs.
Aug 14, 2013
#26 charliem...@chromium.org
Just saw this again but now with the verbosity turned up hopefully there's something in there that's useful to you.

http://build.chromium.org/p/chromiumos/builders/x86%20generic%20incremental/builds/13322
vm_test_results (3).tgz
3.3 MB   Download
Aug 14, 2013
#27 dhend...@chromium.org
Just to elaborate a bit on what Charlie mentioned, the failure we saw just now (and on two other occasions in the past 24 hours) happens during login_CryptohomeIncognitoUnmounted. Here is some of the log output:

AutoservRunError: command execution error
* Command: 
    /usr/bin/ssh -a -x -v  -o ControlPath=/tmp/_autotmp_ov6r27ssh-
    master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null
    -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=180 -o
    ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p
    9227 127.0.0.1 "export LIBC_FATAL_STDERR_=1; mkdir -p /tmp"
Exit status: 255
Duration: 30.0368580818

stderr:
OpenSSH_5.2p1, OpenSSL 1.0.1c 10 May 2012
debug1: Reading configuration data /home/chrome-bot/.ssh/config
debug1: Applying options for *
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: Control socket "/tmp/_autotmp_ov6r27ssh-master/socket" does not exist
debug1: Connecting to 127.0.0.1 [127.0.0.1] port 9227.
debug1: fd 15 clearing O_NONBLOCK
debug1: Connection established.
debug1: identity file /home/chrome-bot/.ssh/chrome-bot-private type -1
Connection timed out during banner exchange
08/14 14:58:35.184 ERROR|base_utils:0119| [stderr] OpenSSH_5.2p1, OpenSSL 1.0.1c 10 May 2012
08/14 14:58:35.186 ERROR|base_utils:0119| [stderr] debug1: Reading configuration data /home/chrome-bot/.ssh/config
08/14 14:58:35.186 ERROR|base_utils:0119| [stderr] debug1: Applying options for *
08/14 14:58:35.186 ERROR|base_utils:0119| [stderr] debug1: Reading configuration data /etc/ssh/ssh_config
08/14 14:58:35.186 ERROR|base_utils:0119| [stderr] debug1: Control socket "/tmp/_autotmp_HYNdzXssh-master/socket" does not exist
08/14 14:58:35.187 ERROR|base_utils:0119| [stderr] debug1: Connecting to 127.0.0.1 [127.0.0.1] port 9227.
08/14 14:58:35.187 ERROR|base_utils:0119| [stderr] debug1: fd 15 clearing O_NONBLOCK
08/14 14:58:35.187 ERROR|base_utils:0119| [stderr] debug1: Connection established.
08/14 14:58:35.187 ERROR|base_utils:0119| [stderr] debug1: identity file /home/chrome-bot/.ssh/chrome-bot-private type -1
08/14 14:58:35.192 ERROR|base_utils:0119| [stderr] debug1: Remote protocol version 2.0, remote software version OpenSSH_5.2
08/14 14:58:35.192 ERROR|base_utils:0119| [stderr] debug1: match: OpenSSH_5.2 pat OpenSSH*
08/14 14:58:35.193 ERROR|base_utils:0119| [stderr] debug1: Enabling compatibility mode for protocol 2.0
08/14 14:58:35.193 ERROR|base_utils:0119| [stderr] debug1: Local version string SSH-2.0-OpenSSH_5.2
08/14 14:58:35.193 ERROR|base_utils:0119| [stderr] debug1: SSH2_MSG_KEXINIT sent
08/14 14:58:35.193 ERROR|base_utils:0119| [stderr] debug1: SSH2_MSG_KEXINIT received
08/14 14:58:35.194 ERROR|base_utils:0119| [stderr] debug1: kex: server->client aes128-ctr hmac-md5 none
08/14 14:58:35.194 ERROR|base_utils:0119| [stderr] debug1: kex: client->server aes128-ctr hmac-md5 none
08/14 14:58:35.194 ERROR|base_utils:0119| [stderr] debug1: SSH2_MSG_KEX_DH_GEX_REQUEST(1024<1024<8192) sent
08/14 14:58:35.194 ERROR|base_utils:0119| [stderr] debug1: expecting SSH2_MSG_KEX_DH_GEX_GROUP
08/14 14:58:35.196 ERROR|base_utils:0119| [stderr] debug1: SSH2_MSG_KEX_DH_GEX_INIT sent
08/14 14:58:35.196 ERROR|base_utils:0119| [stderr] debug1: expecting SSH2_MSG_KEX_DH_GEX_REPLY
08/14 14:58:35.209 ERROR|base_utils:0119| [stderr] debug1: checking without port identifier
08/14 14:58:35.209 ERROR|base_utils:0119| [stderr] Warning: Permanently added '[127.0.0.1]:9227' (RSA) to the list of known hosts.
08/14 14:58:35.210 ERROR|base_utils:0119| [stderr] debug1: ssh_rsa_verify: signature correct
08/14 14:58:35.210 ERROR|base_utils:0119| [stderr] debug1: SSH2_MSG_NEWKEYS sent
08/14 14:58:35.210 ERROR|base_utils:0119| [stderr] debug1: expecting SSH2_MSG_NEWKEYS
08/14 14:58:35.210 ERROR|base_utils:0119| [stderr] debug1: SSH2_MSG_NEWKEYS received
08/14 14:58:35.210 ERROR|base_utils:0119| [stderr] debug1: SSH2_MSG_SERVICE_REQUEST sent
08/14 14:58:35.249 ERROR|base_utils:0119| [stderr] debug1: SSH2_MSG_SERVICE_ACCEPT received
08/14 14:58:35.251 ERROR|base_utils:0119| [stderr] debug1: Authentications that can continue: publickey,keyboard-interactive
08/14 14:58:35.251 ERROR|base_utils:0119| [stderr] debug1: Next authentication method: publickey
08/14 14:58:35.251 ERROR|base_utils:0119| [stderr] debug1: Offering public key: /tmp/cbuildbotJtlDAg/test_harness/all/SimpleTestUpdateAndVerify/2_autotest_tests/autotest_key
08/14 14:58:35.252 ERROR|base_utils:0119| [stderr] debug1: Server accepts key: pkalg ssh-rsa blen 277
08/14 14:58:35.256 ERROR|base_utils:0119| [stderr] debug1: Authentication succeeded (publickey).
08/14 14:58:35.256 ERROR|base_utils:0119| [stderr] debug1: channel 0: new [client-session]
08/14 14:58:35.257 ERROR|base_utils:0119| [stderr] debug1: Requesting no-more-sessions@openssh.com
08/14 14:58:35.257 ERROR|base_utils:0119| [stderr] debug1: Entering interactive session.
08/14 14:58:35.258 ERROR|base_utils:0119| [stderr] debug1: Sending command: export LIBC_FATAL_STDERR_=1; python -c 'import cPickle, glob, sys;cPickle.dump(glob.glob(sys.argv[1]), sys.stdout, 0)' "/var/spool/crash/*"
08/14 14:58:35.295 ERROR|base_utils:0119| [stderr] debug1: client_input_channel_req: channel 0 rtype exit-status reply 0
08/14 14:58:35.295 ERROR|base_utils:0119| [stderr] debug1: channel 0: free: client-session, nchannels 1
08/14 14:58:35.295 ERROR|base_utils:0119| [stderr] debug1: fd 0 clearing O_NONBLOCK
08/14 14:58:35.295 ERROR|base_utils:0119| [stderr] debug1: fd 1 clearing O_NONBLOCK
08/14 14:58:35.296 ERROR|base_utils:0119| [stderr] debug1: fd 2 clearing O_NONBLOCK
08/14 14:58:35.296 ERROR|base_utils:0119| [stderr] Transferred: sent 2336, received 2248 bytes, in 0.0 seconds
08/14 14:58:35.296 ERROR|base_utils:0119| [stderr] Bytes per second: sent 60410.3, received 58134.6
08/14 14:58:35.296 ERROR|base_utils:0119| [stderr] debug1: Exit status 0
08/14 14:58:35.307 ERROR|base_utils:0119| [stderr] OpenSSH_5.2p1, OpenSSL 1.0.1c 10 May 2012
08/14 14:58:35.308 ERROR|base_utils:0119| [stderr] debug1: Reading configuration data /home/chrome-bot/.ssh/config
08/14 14:58:35.309 ERROR|base_utils:0119| [stderr] debug1: Applying options for *
08/14 14:58:35.309 ERROR|base_utils:0119| [stderr] debug1: Reading configuration data /etc/ssh/ssh_config
08/14 14:58:35.511 ERROR|base_utils:0119| [stderr] OpenSSH_5.2p1, OpenSSL 1.0.1c 10 May 2012
08/14 14:58:35.512 ERROR|base_utils:0119| [stderr] debug1: Reading configuration data /home/chrome-bot/.ssh/config
08/14 14:58:35.513 ERROR|base_utils:0119| [stderr] debug1: Applying options for *
08/14 14:58:35.513 ERROR|base_utils:0119| [stderr] debug1: Reading configuration data /etc/ssh/ssh_config
08/14 14:58:37.531 ERROR| traceback:0013| Traceback (most recent call last):
08/14 14:58:37.531 ERROR| traceback:0013|   File "./server/autoserv", line 184, in run_autoserv
08/14 14:58:37.532 ERROR| traceback:0013|   File "/build/x86-generic/usr/local/autotest/server/site_server_job.py", line 65, in run
08/14 14:58:37.532 ERROR| traceback:0013|     super(site_server_job, self).run(*args, **dargs)
08/14 14:58:37.532 ERROR| traceback:0013|   File "/build/x86-generic/usr/local/autotest/server/server_job.py", line 604, in run
08/14 14:58:37.533 ERROR| traceback:0013|     {'job_abort_reason': str(e)})
08/14 14:58:37.533 ERROR| traceback:0013|   File "/build/x86-generic/usr/local/autotest/client/common_lib/base_job.py", line 1201, in record
08/14 14:58:37.533 ERROR| traceback:0013|     optional_fields)
08/14 14:58:37.534 ERROR| traceback:0013|   File "/build/x86-generic/usr/local/autotest/client/common_lib/base_job.py", line 476, in __init__
08/14 14:58:37.534 ERROR| traceback:0013|     % (key, value))
08/14 14:58:37.534 ERROR| traceback:0013| ValueError: Invalid character in 'job_abort_reason'='command execution error\n* Command: \n    /usr/bin/ssh -a -x -v  -o ControlPath=/tmp/_autotmp_ov6r27ssh-\n    master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null\n    -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=180 -o\n    ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p\n    9227 127.0.0.1 "export LIBC_FATAL_STDERR_=1; mkdir -p /tmp"\nExit status: 255\nDuration: 30.0368580818\n\nstderr:\nOpenSSH_5.2p1, OpenSSL 1.0.1c 10 May 2012\ndebug1: Reading configuration data /home/chrome-bot/.ssh/config\r\ndebug1: Applying options for *\r\ndebug1: Reading configuration data /etc/ssh/ssh_config\r\ndebug1: Control socket "/tmp/_autotmp_ov6r27ssh-master/socket" does not exist\r\ndebug1: Connecting to 127.0.0.1 [127.0.0.1] port 9227.\r\ndebug1: fd 15 clearing O_NONBLOCK\r\ndebug1: Connection established.\r\ndebug1: identity file /home/chrome-bot/.ssh/chrome-bot-private type -1\r\nConnection timed out during banner exchange' field

Aug 20, 2013
#28 akes...@chromium.org
I'm not seeing anything particularly enlightening in these ssh debug logs. It's possible that there is more useful information buried in ssh debug levels 2 or 3. However, turning up the debug level seems to make autoserv randomly deadlock (see https://code.google.com/p/chromium/issues/detail?id=276473 ).
Aug 21, 2013
#29 dgarrett@chromium.org
 Issue 277186  has been merged into this issue.
Aug 21, 2013
#30 dgarrett@chromium.org
Another case seen.

Aug 23, 2013
#31 scottz@chromium.org
 Issue 278315  has been merged into this issue.
Cc: thie...@chromium.org shawnn@chromium.org
Aug 23, 2013
#33 thie...@chromium.org
Happened again:
http://chromegw/i/chromeos/builders/mario%20paladin/builds/12727/steps/VMTest/logs/stdio

13:24:12 INFO | autoserv| [stderr] Connection timed out during banner exchange
13:24:12 INFO | autoserv| Exception escaped control file, job aborting:
13:24:12 INFO | autoserv| Traceback (most recent call last):
13:24:12 INFO | autoserv| File "/build/x86-mario/usr/local/autotest/server/server_job.py", line 599, in run
13:24:12 INFO | autoserv| self._execute_code(server_control_file, namespace)
13:24:12 INFO | autoserv| File "/build/x86-mario/usr/local/autotest/server/server_job.py", line 1092, in _execute_code
13:24:12 INFO | autoserv| execfile(code_file, namespace, namespace)
13:24:12 INFO | autoserv| File "/tmp/cbuildbotR9y8kR/test_harness/all/SimpleTestUpdateAndVerify/2_autotest_tests/results-08-security_OpenSSLBlacklist/control.srv", line 10, in <module>
13:24:12 INFO | autoserv| job.parallel_simple(run_client, machines)
13:24:12 INFO | autoserv| File "/build/x86-mario/usr/local/autotest/server/server_job.py", line 489, in parallel_simple
13:24:12 INFO | autoserv| return_results=return_results)
13:24:12 INFO | autoserv| File "/build/x86-mario/usr/local/autotest/server/subcommand.py", line 92, in parallel_simple
13:24:12 INFO | autoserv| function(arg)
13:24:12 INFO | autoserv| File "/tmp/cbuildbotR9y8kR/test_harness/all/SimpleTestUpdateAndVerify/2_autotest_tests/results-08-security_OpenSSLBlacklist/control.srv", line 7, in run_client
13:24:12 INFO | autoserv| at.run(control, host=host)
13:24:12 INFO | autoserv| File "/build/x86-mario/usr/local/autotest/server/autotest.py", line 347, in run
13:24:12 INFO | autoserv| host = self._get_host_and_setup(host)
13:24:12 INFO | autoserv| File "/build/x86-mario/usr/local/autotest/server/autotest.py", line 365, in _get_host_and_setup
13:24:12 INFO | autoserv| self.install(host)
13:24:12 INFO | autoserv| File "/build/x86-mario/usr/local/autotest/server/site_autotest.py", line 110, in install
13:24:12 INFO | autoserv| super(SiteAutotest, self).install(host=host, autodir=autodir)
13:24:12 INFO | autoserv| File "/build/x86-mario/usr/local/autotest/server/autotest.py", line 161, in install
13:24:12 INFO | autoserv| self._install(host=host, autodir=autodir)
13:24:12 INFO | autoserv| File "/build/x86-mario/usr/local/autotest/server/site_autotest.py", line 131, in _install
13:24:12 INFO | autoserv| use_packaging)
13:24:12 INFO | autoserv| File "/build/x86-mario/usr/local/autotest/server/autotest.py", line 266, in _install
13:24:12 INFO | autoserv| self._install_using_send_file(host, autodir)
13:24:12 INFO | autoserv| File "/build/x86-mario/usr/local/autotest/server/autotest.py", line 210, in _install_using_send_file
13:24:12 INFO | autoserv| host.run(';'.join(commands))
13:24:12 INFO | autoserv| File "/build/x86-mario/usr/local/autotest/server/hosts/ssh_host.py", line 123, in run
13:24:12 INFO | autoserv| stdin, args)
13:24:12 INFO | autoserv| File "/build/x86-mario/usr/local/autotest/server/hosts/ssh_host.py", line 94, in _run
13:24:12 INFO | autoserv| raise error.AutoservRunError("command execution error", result)
13:24:12 INFO | autoserv| AutoservRunError: command execution error
13:24:12 INFO | autoserv| * Command:
13:24:12 INFO | autoserv| /usr/bin/ssh -a -x -v  -o ControlPath=/tmp/_autotmp_1NlfmSssh-
13:24:12 INFO | autoserv| master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null
13:24:12 INFO | autoserv| -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=180 -o
13:24:12 INFO | autoserv| ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p
13:24:12 INFO | autoserv| 9227 127.0.0.1 "export LIBC_FATAL_STDERR_=1; mkdir -p
13:24:12 INFO | autoserv| '/usr/local/autotest/site_tests';touch
13:24:12 INFO | autoserv| '/usr/local/autotest/site_tests'/__init__.py;mkdir -p
13:24:12 INFO | autoserv| '/usr/local/autotest/tests';touch
13:24:12 INFO | autoserv| '/usr/local/autotest/tests'/__init__.py;mkdir -p
13:24:12 INFO | autoserv| '/usr/local/autotest/packages';touch
13:24:12 INFO | autoserv| '/usr/local/autotest/packages'/__init__.py;mkdir -p
13:24:12 INFO | autoserv| '/usr/local/autotest/deps';touch
13:24:12 INFO | autoserv| '/usr/local/autotest/deps'/__init__.py;mkdir -p
13:24:12 INFO | autoserv| '/usr/local/autotest/profilers';touch
13:24:12 INFO | autoserv| '/usr/local/autotest/profilers'/__init__.py"
13:24:12 INFO | autoserv| Exit status: 255
13:24:12 INFO | autoserv| Duration: 30.034981966

Aug 23, 2013
#34 thie...@chromium.org
The CQ is waaaay backed up.  Happened again, upping priority.

http://build.chromium.org/p/chromiumos/builders/amd64%20generic%20paladin/builds/15460/steps/VMTest/logs/stdio

14:28:46 INFO | autoserv| Exception escaped control file, job aborting:
14:28:46 INFO | autoserv| Traceback (most recent call last):
14:28:46 INFO | autoserv| File "/build/amd64-generic/usr/local/autotest/server/server_job.py", line 599, in run
14:28:46 INFO | autoserv| self._execute_code(server_control_file, namespace)
14:28:46 INFO | autoserv| File "/build/amd64-generic/usr/local/autotest/server/server_job.py", line 1092, in _execute_code
14:28:46 INFO | autoserv| execfile(code_file, namespace, namespace)
14:28:46 INFO | autoserv| File "/tmp/cbuildbotqq58Dk/test_harness/all/SimpleTestUpdateAndVerify/2_autotest_tests/results-07-security_Minijail0/control.srv", line 10, in <module>
14:28:46 INFO | autoserv| job.parallel_simple(run_client, machines)
14:28:46 INFO | autoserv| File "/build/amd64-generic/usr/local/autotest/server/server_job.py", line 489, in parallel_simple
14:28:46 INFO | autoserv| return_results=return_results)
14:28:46 INFO | autoserv| File "/build/amd64-generic/usr/local/autotest/server/subcommand.py", line 92, in parallel_simple
14:28:46 INFO | autoserv| function(arg)
14:28:46 INFO | autoserv| File "/tmp/cbuildbotqq58Dk/test_harness/all/SimpleTestUpdateAndVerify/2_autotest_tests/results-07-security_Minijail0/control.srv", line 6, in run_client
14:28:46 INFO | autoserv| host.log_kernel()
14:28:46 INFO | autoserv| File "/build/amd64-generic/usr/local/autotest/client/common_lib/hosts/base_classes.py", line 576, in log_kernel
14:28:46 INFO | autoserv| kernel = self.get_kernel_ver()
14:28:46 INFO | autoserv| File "/build/amd64-generic/usr/local/autotest/client/common_lib/hosts/base_classes.py", line 535, in get_kernel_ver
14:28:46 INFO | autoserv| return self.run('/bin/uname -r').stdout.rstrip()
14:28:46 INFO | autoserv| File "/build/amd64-generic/usr/local/autotest/server/hosts/ssh_host.py", line 123, in run
14:28:46 INFO | autoserv| stdin, args)
14:28:46 INFO | autoserv| File "/build/amd64-generic/usr/local/autotest/server/hosts/ssh_host.py", line 94, in _run
14:28:46 INFO | autoserv| raise error.AutoservRunError("command execution error", result)
14:28:46 INFO | autoserv| AutoservRunError: command execution error
14:28:46 INFO | autoserv| * Command:
14:28:46 INFO | autoserv| /usr/bin/ssh -a -x -v  -o ControlPath=/tmp/_autotmp_uHAu_3ssh-
14:28:46 INFO | autoserv| master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null
14:28:46 INFO | autoserv| -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=180 -o
14:28:46 INFO | autoserv| ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p
14:28:46 INFO | autoserv| 9227 127.0.0.1 "export LIBC_FATAL_STDERR_=1; /bin/uname -r"
14:28:46 INFO | autoserv| Exit status: 255
14:28:46 INFO | autoserv| Duration: 30.0347189903
14:28:46 INFO | autoserv| 
14:28:46 INFO | autoserv| stderr:
14:28:46 INFO | autoserv| OpenSSH_5.2p1, OpenSSL 1.0.1c 10 May 2012
14:28:46 INFO | autoserv| debug1: Reading configuration data /home/chrome-bot/.ssh/config
14:28:46 INFO | autoserv| debug1: Applying options for *
14:28:46 INFO | autoserv| debug1: Reading configuration data /etc/ssh/ssh_config
14:28:46 INFO | autoserv| debug1: Control socket "/tmp/_autotmp_uHAu_3ssh-master/socket" does not exist
14:28:46 INFO | autoserv| debug1: Connecting to 127.0.0.1 [127.0.0.1] port 9227.
14:28:46 INFO | autoserv| debug1: fd 25 clearing O_NONBLOCK
14:28:46 INFO | autoserv| debug1: Connection established.
14:28:46 INFO | autoserv| debug1: identity file /home/chrome-bot/.ssh/chrome-bot-private type -1
14:28:46 INFO | autoserv| Connection timed out during banner exchange

Labels: -Pri-1 Pri-0
Aug 23, 2013
#35 dgarrett@chromium.org
https://doc-0d5k8-1g46q-s-googleusercontent.commondatastorage.googleapis.com/gs/r4fmh8k7nf3bsp3i1qvpp91pc7rt7r0r/i2i7a0h45e18fjvgiaj914ic0f8bkv3v/1377279000000/chromeos-autotest-results/01191233689280036672/ChljaHJvbWVvcy1hdXRvdGVzdC1yZXN1bHRzEkU0MzcyNTc0LWNocm9tZW9zLXRlc3QvY2hyb21lb3MyLXJvdzItcmFjazctaG9zdDYvZGVidWcvYXV0b3NlcnYuREVCVUc?a=AGjQbs5pFFZEWu-Lx877xNQEIuHs_hVNJ87PhWVIE5_WBbH37q0cjHrBe7rmfyKFWJNBw-cqxEiUxlZjwBhUjUO0MxOE7l5rBo4Eyvm_hOOi8D-4nz2XyK7T5pUbBv82JBS3ZtXjvnk9JOJr-2wr9a25ODd5IdRkG_Yi2u45aHatUbqhamnT1B7rs_NiE9jKFBf6FoHI21L1RjxCTOgVyNej_xKxfkJmJZTDhESvZjhQ4QyR3eWK3GbqnVBm6KkiBod90FxUFRUhwYT957u-VERB0oEa3WSO-jWpjz6xBHUONRcGz0l-D0uRDehJ3aB8no3JKELZNn9nSdbnxQs_t2jWYvbubdjT59UiraZ-2rErDrHjZcLHj-b4rmH8M6tWiHZA02yjGvKfG32GGrDwIh7I7qCn7-Z9UUHBLdq6ZSUEJxJnuFcKTq821uHCHPqXe4X4uIwMV4b_wS8cU9eAx4Mv-xkGg8tdSdUExVjKeWQtv2Wu-ZpllciiOyI30apn6dGdl2IredOjlEYOqmRzIyGKrjSIkRyhSzeyMpBezj9h93uUS2zPmuCmxOn6YNFNg9IyiHGZrYJrWX35hXH5TWhw2NPiCyI8phg1p6dTCCPNcaKu_qYL8sKD9FgEY44MiuHOnoLDkq18XtJ0wZMGKdA8XTE7xNcUubx5T4Wsw1tvLebaJzLvktH7t7UoJagdUPZoqMi8uD_OomwHnOYlPKt6ZlEu2bTR3gByL2VECqj2E0LdTya2f3P9FhxA8UV3QidjL850Wo-9a6FjhzACB4sJG7LxLTwuESHx03huZ42ZrAAUJTn6AWWus2EL5vYDKCtSiVgGLIyPSeKmZc9Ms94sHK7kW-EK495Czau_7qv3JggP39QZW5KdFi1ZS-ocs21aA3oGeSOgukGAHgtLHdXbv33Zca0j7d0Ty4LZ2tQsUvjd-Fmiet5ee5IRRaDoCt6UQJqaSNGMhOyr2sdFVClAz9z9aY5Mehl4j1rmC82F4RiMn-9YeU-hO-41hhOIN-k2cR-6soHQ0ljxu3nEphzQ8e0tVy4x4Vo1nbwe6rLbWAGqOVcrMRCMKW6_5uXEZ_Yik9wYgxllJ3pBjfXrGgG5N6PBVFpx1g&nonce=a4pbvu5di8ai6&user=01191233689280036672&hash=cimd7edol18shdl0lntup0bd28sd1l7b

08/23 10:01:38.584 DEBUG|base_utils:0081| Running 'rsync -L --delete --timeout=1800 --rsh='/usr/bin/ssh -a -x  -o ControlPath=/tmp/_autotmp_IyKxNmssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=180 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22' -az /usr/local/google/chromeos/src/platform/dev/stateful_update root@chromeos2-row2-rack7-host6:"/tmp/stateful_update"'
08/23 10:01:38.671 DEBUG|  ssh_host:0114| Running (ssh) '/tmp/stateful_update --stateful_change=reset 2>&1'
08/23 10:01:38.835 DEBUG|base_utils:0119| [stdout] Resetting stateful update state.
08/23 10:01:40.545 INFO |autoupdate:0336| Collecting update engine logs...
08/23 10:01:40.546 DEBUG|base_utils:0081| Running 'rsync -L  --timeout=1800 --rsh='/usr/bin/ssh -a -x  -o ControlPath=/tmp/_autotmp_IyKxNmssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=180 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22' -az root@chromeos2-row2-rack7-host6:"/var/log/messages\\ /var/log/update_engine" /usr/local/autotest/results/hosts/chromeos2-row2-rack7-host6/7093537-provision/sysinfo'
08/23 10:01:40.700 WARNI|abstract_s:0288| trying scp, rsync failed: Command <rsync -L  --timeout=1800 --rsh='/usr/bin/ssh -a -x  -o ControlPath=/tmp/_autotmp_IyKxNmssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=180 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 22' -az root@chromeos2-row2-rack7-host6:"/var/log/messages\\ /var/log/update_engine" /usr/local/autotest/results/hosts/chromeos2-row2-rack7-host6/7093537-provision/sysinfo> failed, rc=23, Command returned non-zero exit status
* Command: 
    rsync -L  --timeout=1800 --rsh='/usr/bin/ssh -a -x  -o ControlPath=/tmp
    /_autotmp_IyKxNmssh-master/socket -o StrictHostKeyChecking=no -o
    UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o
    ServerAliveInterval=180 -o ServerAliveCountMax=3 -o ConnectionAttempts=4
    -o Protocol=2 -l root -p 22' -az
    root@chromeos2-row2-rack7-host6:"/var/log/messages\\
    /var/log/update_engine" /usr/local/autotest/results/hosts/chromeos2-row2-r
    ack7-host6/7093537-provision/sysinfo
Exit status: 23
Duration: 0.148719072342

stderr:
rsync: change_dir "/var/log/messages /var/log" failed: No such file or directory (2)
rsync error: some files/attrs were not transferred (see previous errors) (code 23) at main.c(1536) [Receiver=3.0.9]
08/23 10:01:40.700 DEBUG|base_utils:0081| Running 'scp -rq -o ControlPath=/tmp/_autotmp_IyKxNmssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/tmp/tmpDG5XKd -P 22 root@chromeos2-row2-rack7-host6:"/var/log/messages /var/log/update_engine" '/usr/local/autotest/results/hosts/chromeos2-row2-rack7-host6/7093537-provision/sysinfo''
08/23 10:01:40.802 ERROR|provision_:0072| update-engine failed on chromeos2-row2-rack7-host6
Aug 23, 2013
#36 akes...@chromium.org
I'm investigating why turning up the ssh verbosity causes deadlocks. I think that should be traceable. Once that is fixed, I'll turn up the verbosity and see if we get more useful information.
Blockedon: chromium:276473
Aug 26, 2013
#37 scottz@chromium.org
Is this related to 212879? They seem quite similar.
Aug 26, 2013
#38 wi...@chromium.org
Possibly, but not necessarily.  SSH not coming up at all could be all sort of things, even something as simple as "VM failed to boot."  SSH working and then not working  seems like something more bizarre.  I think the message we're getting on failures indicates that we're actually able to open the TCP connection, but then no data flows over the connection.
Aug 26, 2013
#39 shawnn@chromium.org
Seems like another instance of the same root issue?

http://build.chromium.org/p/chromiumos/builders/x86%20generic%20incremental/builds/13525

10:50:37 INFO | autoserv| trying scp, rsync failed: Command <rsync   --timeout=1800 --rsh='/usr/bin/ssh -a -x -v -o ControlPath=/tmp/_autotmp_xZG9tVssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=180 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 9227' -az root@127.0.0.1:"/usr/local/autotest/results/default/" /tmp/cbuildbotO75GaA/test_harness/all/SimpleTestUpdateAndVerify/2_autotest_tests/results-15-security_HardlinkRestrictions> failed, rc=23, Command returned non-zero exit status

Aug 26, 2013
#40 akes...@chromium.org
(No comment was entered for this change.)
Labels: Iteration-89
Aug 26, 2013
#41 shawnn@chromium.org
http://build.chromium.org/p/chromiumos/builders/x86%20generic%20incremental/builds/13529/steps/VMTest/logs/stdio

I believe this is another instance:

16:39:47 INFO | autoserv| [stderr] OpenSSH_5.2p1, OpenSSL 1.0.1c 10 May 2012
16:39:47 INFO | autoserv| [stderr] debug1: Reading configuration data /home/chrome-bot/.ssh/config
16:39:47 INFO | autoserv| [stderr] debug1: Applying options for *
16:39:47 INFO | autoserv| [stderr] debug1: Reading configuration data /etc/ssh/ssh_config
16:39:47 INFO | autoserv| [stderr] debug1: Control socket "/tmp/_autotmp_7GpAS7ssh-master/socket" does not exist
16:39:47 INFO | autoserv| [stderr] debug1: Connecting to 127.0.0.1 [127.0.0.1] port 9227.
16:39:47 INFO | autoserv| [stderr] debug1: fd 25 clearing O_NONBLOCK
16:39:47 INFO | autoserv| [stderr] debug1: Connection established.
16:39:47 INFO | autoserv| [stderr] debug1: identity file /home/chrome-bot/.ssh/chrome-bot-private type -1
16:40:17 INFO | autoserv| [stderr] Connection timed out during banner exchange
16:40:17 INFO | autoserv| Exception escaped control file, job aborting:
16:40:17 INFO | autoserv| Traceback (most recent call last):
16:40:17 INFO | autoserv| File "/build/x86-generic/usr/local/autotest/server/server_job.py", line 599, in run
16:40:17 INFO | autoserv| self._execute_code(server_control_file, namespace)
16:40:17 INFO | autoserv| File "/build/x86-generic/usr/local/autotest/server/server_job.py", line 1092, in _execute_code
16:40:17 INFO | autoserv| execfile(code_file, namespace, namespace)
16:40:17 INFO | autoserv| File "/tmp/cbuildbotrjjSbj/test_harness/all/SimpleTestUpdateAndVerify/2_autotest_tests/results-08-security_Minijail_seccomp/control.srv", line 10, in <module>
16:40:17 INFO | autoserv| job.parallel_simple(run_client, machines)
16:40:17 INFO | autoserv| File "/build/x86-generic/usr/local/autotest/server/server_job.py", line 489, in parallel_simple
16:40:17 INFO | autoserv| return_results=return_results)
16:40:17 INFO | autoserv| File "/build/x86-generic/usr/local/autotest/server/subcommand.py", line 92, in parallel_simple
16:40:17 INFO | autoserv| function(arg)
16:40:17 INFO | autoserv| File "/tmp/cbuildbotrjjSbj/test_harness/all/SimpleTestUpdateAndVerify/2_autotest_tests/results-08-security_Minijail_seccomp/control.srv", line 7, in run_client
16:40:17 INFO | autoserv| at.run(control, host=host)
16:40:17 INFO | autoserv| File "/build/x86-generic/usr/local/autotest/server/autotest.py", line 358, in run
16:40:17 INFO | autoserv| client_disconnect_timeout)
16:40:17 INFO | autoserv| File "/build/x86-generic/usr/local/autotest/server/autotest.py", line 379, in _do_run
16:40:17 INFO | autoserv| atrun.verify_machine()
16:40:17 INFO | autoserv| File "/build/x86-generic/usr/local/autotest/server/autotest.py", line 500, in verify_machine
16:40:17 INFO | autoserv| "Autotest does not appear to be installed")
16:40:17 INFO | autoserv| AutoservInstallError: Autotest does not appear to be installed

Aug 26, 2013
#42 akes...@chromium.org
The debug level is about to be turned up to 3. (that change is in the CQ). So next time this happens, maybe we'll learn something.
Sep 2, 2013
#45 akes...@chromium.org
ssh verbosity in VMTest was turned up to 3, finally, in https://chromium-review.googlesource.com/#/c/66968/ . Future failures of this type should have more useful information.
Sep 3, 2013
#46 mille...@chromium.org
 Issue 284725  has been merged into this issue.
Sep 4, 2013
#48 akes...@chromium.org
Hmm. The banner exchange failure always seems to be preceded by a bunch of key-read errors, and references to badly formed key files on disk. But test_that is running within an ssh-agent environment that already has the necessary test key associated with it, so there should be no need to even look at the contents of the ~/.ssh/ directory. Presumably there is some ssh flag to tell ssh to ignore that. Will investigate further tomorrow.
Sep 4, 2013
#49 bugdro...@chromium.org
Project: chromiumos/platform/crostestutils
Branch : master
Author : Aviv Keshet <akeshet@chromium.org>
Commit : 5ab7a26d0ff9b2d58f3b55789bb066ffa9619b03

Code-Review  0 : Chris Sosa, ChromeBot, chrome-internal-fetch
Code-Review  +2: Aviv Keshet
Verified     0 : Chris Sosa, ChromeBot, chrome-internal-fetch
Verified     +1: Aviv Keshet
Commit Queue   : Chumped
Change-Id      : I94fa11f3a04850ee405d0781aa8b535e044a5967
Reviewed-at    : https://chromium-review.googlesource.com/66968

turn up VMTest ssh verbosity to 3

This CL increases the ssh verbosity of VMTest, hopefully to help in
diagnosing ssh connection failures.

CQ-DEPEND=CL:I7e91c89cbee138a57015b4b8578778afda090ca0
CQ-DEPEND=CL:I1a6861501eb4fc8439add213c79eaa3777978df0
CQ-DEPEND=CL:Iee7722b55cc5e7497888ee86154c3ddb3b7bdf6e
CQ-DEPEND=CL:Ib92d7491eb45f50e0b95b4e9e32e72778d002b08
BUG=chromium:254166
TEST=None

M  cros_run_vm_test
Sep 4, 2013
#50 bugdro...@chromium.org
Project: chromiumos/platform/crostestutils
Branch : master
Author : Aviv Keshet <akeshet@chromium.org>
Commit : 5ab7a26d0ff9b2d58f3b55789bb066ffa9619b03

Code-Review  0 : Chris Sosa, ChromeBot, chrome-internal-fetch
Code-Review  +2: Aviv Keshet
Verified     0 : Chris Sosa, ChromeBot, chrome-internal-fetch
Verified     +1: Aviv Keshet
Commit Queue   : Chumped
Change-Id      : I94fa11f3a04850ee405d0781aa8b535e044a5967
Reviewed-at    : https://chromium-review.googlesource.com/66968

turn up VMTest ssh verbosity to 3

This CL increases the ssh verbosity of VMTest, hopefully to help in
diagnosing ssh connection failures.

CQ-DEPEND=CL:I7e91c89cbee138a57015b4b8578778afda090ca0
CQ-DEPEND=CL:I1a6861501eb4fc8439add213c79eaa3777978df0
CQ-DEPEND=CL:Iee7722b55cc5e7497888ee86154c3ddb3b7bdf6e
CQ-DEPEND=CL:Ib92d7491eb45f50e0b95b4e9e32e72778d002b08
BUG=chromium:254166
TEST=None

M  cros_run_vm_test
Sep 4, 2013
#52 scottz@chromium.org
Are there any stop gap solutions we can put in place while we further diagnose this?
Sep 4, 2013
#53 sosa@chromium.org
one could retry vm tests?
Sep 4, 2013
#54 akes...@chromium.org
Per comment #48 -- I strongly suspect that running VMTests with ssh options "-F /dev/null -i /dev/null" will fix this. Am working on threading an ssh_options parameter through autoserv and test_that.
Status: Started
Sep 4, 2013
#55 akes...@chromium.org
Here's a tryjob with a bit of a hammer patched in -- adding "-F /dev/null -i /dev/null" globally to all of autotest's ssh commands. This has eliminted a large number of complaints about badly formed key files and rejected keys are now no longer there. http://chromegw/i/chromiumos.tryserver/builders/lumpy-paladin/builds/1142
Sep 4, 2013
#56 akes...@chromium.org
(No comment was entered for this change.)
Blockedon: chromium:285378
Sep 4, 2013
#57 mtennant@chromium.org
Possibly related vmtest timeout failure.  akeshet speculated it could be related, but if not then this should be spun off into separate issue.

http://build.chromium.org/p/chromiumos/builders/x86%20generic%20full/builds/10951/steps/VMTest/logs/stdio

Snippet follows:
************************************************************
** Start Stage VMTest - 13:21:29
** 
** Run autotests in a virtual machine.
************************************************************
13:21:30: INFO: RunCommand: 'bin/ctest' '--board=x86-generic' '--type=vm' '--no_graphics' '--target_image=/b/cbuild/src/build/images/x86-generic/latest-cbuildbot/chromiumos_test_image.bin' '--test_results_root=/b/cbuild/chroot/tmp/cbuildbotHiypdv/test_harness' '--quick' '--whitelist_chrome_crashes' in /b/cbuild/src/scripts
2013/09/04 13:21:30 - ctest - INFO    : Could not find a latest image to use. Using target instead.
2013/09/04 13:21:30 - cros_build_lib.py - INFO    : RunCommand: '../platform/crostestutils/generate_test_payloads/cros_generate_test_payloads.py' '--target=/b/cbuild/src/build/images/x86-generic/latest-cbuildbot/chromiumos_test_image.bin' '--base=/b/cbuild/src/build/images/x86-generic/latest-cbuildbot/chromiumos_test_image.bin' '--board=x86-generic' '--jobs=1' '--basic_suite' in /b/cbuild/src/scripts
2013/09/04 13:21:30 - cros_generate_test_payloads.py - INFO    : Generating updates required for this test suite in parallel.
2013/09/04 13:21:30 - cros_generate_test_payloads.py - INFO    : Generating an unsigned delta payload from /b/cbuild/src/build/images/x86-generic/latest-cbuildbot/chromiumos_qemu_image.bin to /b/cbuild/src/build/images/x86-generic/latest-cbuildbot/chromiumos_test_image.bin and not patching the kernel.
2013/09/04 13:26:31 - cros_generate_test_payloads.py - INFO    : Dumping /b/cbuild/src/build/images/x86-generic/latest-cbuildbot/update.cache
2013/09/04 13:26:31 - cros_build_lib.py - INFO    : RunCommand: '/b/cbuild/src/platform/crostestutils/au_test_harness/cros_au_test_harness.py' '--base_image=/b/cbuild/src/build/images/x86-generic/latest-cbuildbot/chromiumos_test_image.bin' '--target_image=/b/cbuild/src/build/images/x86-generic/latest-cbuildbot/chromiumos_test_image.bin' '--board=x86-generic' '--type=vm' '--remote=0.0.0.0' '--verbose' '--jobs=1' '--test_prefix=SimpleTest' '--test_results_root=/b/cbuild/chroot/tmp/cbuildbotHiypdv/test_harness' '--no_graphics' '--whitelist_chrome_crashes' in /b/cbuild/src/scripts
2013/09/04 13:26:32 - cros_build_lib.py - INFO    : Loading update cache from /b/cbuild/src/build/images/x86-generic/latest-cbuildbot/update.cache
2013/09/04 13:26:32 - cros_build_lib.py - DEBUG   : RunCommand: 'cros_sdk' '--' 'sudo' 'CROS_CACHEDIR=/b/cbuild/.cache' 'CROS_SUDO_KEEP_ALIVE=unknown' '--' 'start_devserver' '--pidfile' '/tmp/devserver_wrapperPy3MWH' '--logfile' '/tmp/cbuildbotHiypdv/test_harness/dev_server.log' in /b/cbuild
2013/09/04 13:26:35 - cros_build_lib.py - INFO    : Running tests in test suite in parallel.
2013/09/04 13:26:35 - cros_build_lib.py - INFO    : SimpleTestUpdateAndVerify: Performing a delta update from /b/cbuild/src/build/images/x86-generic/latest-cbuildbot/chromiumos_qemu_image.bin to /b/cbuild/src/build/images/x86-generic/latest-cbuildbot/chromiumos_test_image.bin in a VM
2013/09/04 13:28:45 - cros_build_lib.py - INFO    : SimpleTestUpdateAndVerify: Running smoke suite to verify image.
EStopping the KVM instance

======================================================================
ERROR: SimpleTestUpdateAndVerify (crostestutils.au_test_harness.au_test.AUTest)
Test that updates to itself.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/b/cbuild/src/platform/crostestutils/au_test_harness/../../crostestutils/au_test_harness/au_test.py", line 185, in SimpleTestUpdateAndVerify
    self.assertTrue(self.worker.VerifyImage())
  File "/b/cbuild/src/platform/crostestutils/au_test_harness/../../crostestutils/au_test_harness/vm_au_worker.py", line 149, in VerifyImage
    cwd=constants.CROSUTILS_DIR, error_code_ok=True)
  File "/b/cbuild/chromite/lib/cros_build_lib.py", line 722, in RunCommandCaptureOutput
    redirect_stderr=kwds.pop('redirect_stderr', True), **kwds)
  File "/b/cbuild/chromite/lib/cros_build_lib.py", line 425, in RunCommand
    (cmd_result.output, cmd_result.error) = proc.communicate(input)
  File "/usr/lib/python2.7/subprocess.py", line 751, in communicate
    self.wait()
  File "/usr/lib/python2.7/subprocess.py", line 1291, in wait
    pid, sts = _eintr_retry_call(os.waitpid, self.pid, 0)
  File "/usr/lib/python2.7/subprocess.py", line 478, in _eintr_retry_call
    return func(*args)
  File "/b/cbuild/chromite/lib/cros_build_lib.py", line 994, in kill_us
    raise TimeoutError("Timeout occurred- waited %i seconds." % max_run_time)
TimeoutError: Timeout occurred- waited 2400 seconds.

----------------------------------------------------------------------
Ran 1 test in 2401.407s

FAILED (errors=1)
2013/09/04 14:06:36 - cros_build_lib.py - INFO    : SimpleTestVerify: Running smoke suite to verify image.
Stopping the KVM instance
Sep 4, 2013
#58 akes...@chromium.org
I just landed https://chromium-review.googlesource.com/#/c/168120/ , which I hope fixes this. If we don't see any more timeout during banner exchange failures in the next few days, I'll mark this as fixed.
Sep 4, 2013
#59 dgarrett@chromium.org
Could that hide legitimate ssh errors?
Sep 4, 2013
#60 akes...@chromium.org
This CL won't hide any errors. It will hopefully eliminate some.

(it's not a VMTest retrying CL, it's a don't-try-to-use-broken-ssh-keys-CL)
Sep 4, 2013
#61 akes...@chromium.org
I have to temporarily revert my (possible-) fix to VMTest. It is breaking the incremental builders. Need to wait until my autotest changes make it through the CQ.
Sep 4, 2013
#62 akes...@chromium.org
The (possible-) fix is back in.
Sep 4, 2013
#63 akes...@chromium.org
Gah. This may have happened again, post-"fix". http://chromegw/i/chromiumos/builders/x86%20generic%20incremental/builds/13662

Will investigate tomorrow.
Sep 5, 2013
#65 akes...@chromium.org
Typical failure mode:

22:46:31 INFO | autoserv| [stderr] OpenSSH_5.2p1, OpenSSL 1.0.1c 10 May 2012
22:46:31 INFO | autoserv| [stderr] debug1: Reading configuration data /dev/null
22:46:31 INFO | autoserv| [stderr] debug1: Control socket "/tmp/_autotmp_oWOo1mssh-master/socket" does not exist
22:46:31 INFO | autoserv| [stderr] debug2: ssh_connect: needpriv 0
22:46:31 INFO | autoserv| [stderr] debug1: Connecting to 127.0.0.1 [127.0.0.1] port 9227.
22:46:31 INFO | autoserv| [stderr] debug2: fd 25 setting O_NONBLOCK
22:46:31 INFO | autoserv| [stderr] debug1: fd 25 clearing O_NONBLOCK
22:46:31 INFO | autoserv| [stderr] debug1: Connection established.
22:46:31 INFO | autoserv| [stderr] debug3: timeout: 30000 ms remain after connect
22:46:31 INFO | autoserv| [stderr] debug3: Not a RSA1 key file /dev/null.
22:46:31 INFO | autoserv| [stderr] debug1: identity file /dev/null type -1
22:47:01 INFO | autoserv| [stderr] Connection timed out during banner exchange


Normal output:
22:46:31 INFO | autoserv| [stderr] OpenSSH_5.2p1, OpenSSL 1.0.1c 10 May 2012
22:46:31 INFO | autoserv| [stderr] debug1: Reading configuration data /dev/null
22:46:31 INFO | autoserv| [stderr] debug1: Control socket "/tmp/_autotmp_oWOo1mssh-master/socket" does not exist
22:46:31 INFO | autoserv| [stderr] debug2: ssh_connect: needpriv 0
22:46:31 INFO | autoserv| [stderr] debug1: Connecting to 127.0.0.1 [127.0.0.1] port 9227.
22:46:31 INFO | autoserv| [stderr] debug2: fd 25 setting O_NONBLOCK
22:46:31 INFO | autoserv| [stderr] debug1: fd 25 clearing O_NONBLOCK
22:46:31 INFO | autoserv| [stderr] debug1: Connection established.
22:46:31 INFO | autoserv| [stderr] debug3: timeout: 30000 ms remain after connect
22:46:31 INFO | autoserv| [stderr] debug3: Not a RSA1 key file /dev/null.
22:46:31 INFO | autoserv| [stderr] debug1: identity file /dev/null type -1
22:46:31 INFO | autoserv| [stderr] debug1: Remote protocol version 2.0, remote software version OpenSSH_5.2
22:46:31 INFO | autoserv| [stderr] debug1: match: OpenSSH_5.2 pat OpenSSH*




I don't see any instances of normal output where there is any delay between
22:47:42 INFO | autoserv| [stderr] debug1: identity file /dev/null type -1
and
22:47:42 INFO | autoserv| [stderr] debug1: Remote protocol version 2.0, remote software version OpenSSH_5.2
so it doesn't look like we can fix the timeout by increasing the ConnectionTimeout option (currently 30s). Increasing ConnectionAttempts (currently 4) will not help either, this failure is occuring after a connection is established but before the "banner exchange" handshake.
Sep 5, 2013
#66 akes...@chromium.org
The banner exchange message is being emitted by openssh-5.2.tar.gz/ssh/sshconnect.c , in the function `static void ssh_exchange_identification(int timeout_ms)`. The snippet looks like:
                                ms_to_timeval(&t_remaining, remaining);
                                FD_SET(connection_in, fdset);
                                rc = select(connection_in + 1, fdset, NULL,
                                    fdset, &t_remaining);
                                ms_subtract_diff(&t_start, &remaining);
                                if (rc == 0 || remaining <= 0)
                                        fatal("Connection timed out during "
                                            "banner exchange");

Nothing particularly surprising or fishy there. The select() call is simply timing out while waiting for data to read from connection_in.
Sep 5, 2013
#67 akes...@chromium.org
Looks like the cause of this is some sort of flakey hang in either sshd on the VM (unlikely), or in the VM or VM port forward or VM network stack (more likely). Some small fraction of TCP sockets are being successfully opened and then immediately hanging, manifesting as this ssh banner exchange error. Sockets opened immediately thereafter do not seem to be affected.

While digging in the logs, I noticed that though we are configured to use master-ssh connecions in VMTest, the autoserv's individual ssh commands are failing to use the master ssh socket, and hence each establishing their own ssh connection. This means that whatever rare flake is responsible is being exercised ~100 times per test. Getting the master-ssh connection to be correctly reused should substantially decrease the rate of occurance of this failure, while doing nothing about the root cause.

If that proves difficult to do, having VMTest retry tests at a test-by-test level will also help.
Sep 5, 2013
#68 akes...@chromium.org
Blocking on a fix to the master-ssh connection not always being used bug.
Blockedon: chromium:286014
Sep 5, 2013
#69 bugdro...@chromium.org
Project: chromiumos/platform/crostestutils
Branch : master
Author : Aviv Keshet <akeshet@chromium.org>
Commit : 7d2c624d9a5ab65c4268967c8843a07bcdc89c29

Code-Review  0 : Aviv Keshet
Code-Review  +2: Chris Sosa
Commit-Queue 0 : Chris Sosa
Commit-Queue +1: Aviv Keshet
Verified     0 : Chris Sosa
Verified     +1: Aviv Keshet
Commit Queue   : Chumped
Change-Id      : Iadfa46c9fc843fa2e86ff2eb901018b096b24958
Reviewed-at    : chromium-review.googlesource.com/168120

Run VMTests with "-F /dev/null -i /dev/null" ssh arguments

These arguments will disable a lot of unncessary parsing and checking of
malformed keys in the ~/.ssh/ directory, which I believe are responsible
for the `Connection timed out during banner exchange` failures we are
seeing in VMTest.

CQ-DEPEND=CL:Ib19fe8c378ae71f5c5432f82f919fd761a10ca80
BUG=chromium:254166
TEST=Trybot

M  cros_run_vm_test
Sep 6, 2013
#70 akes...@chromium.org
https://chromium-review.googlesource.com/#/c/168302/ is now merged. This should reduce the frequency of VMTest flake, by being more aggressive about reviving dead master-ssh connections (and therefore resulting in fewer total ssh connections).
Sep 6, 2013
#71 thie...@chromium.org
(No comment was entered for this change.)
Cc: -thie...@chromium.org
Sep 9, 2013
#72 akes...@chromium.org
Apparent cause:
Some fraction of attempts to open a TCP socket to the VM DUT's ssh sever will allow a connection to be established but then hang the socket on the first server->client data transmission. This causes the ssh client to die with "Connection timed out during banner exchange.", while waiting for the first piece of server->client data in the ssh protocol. [however, I have not been able to reproduce this locally, running a VM on my desktop]

Aggravated by:
Some fraction of the time, autotest's master-ssh connection will die in an unusual way (deleting it's /tmp/[tmpdirectory]/socket file, but not terminating; possibly hanging during its shutdown). Prior to https://chromium-review.googlesource.com/#/c/168302/ this would not be detected as a dead master-ssh connection, so the all the following ssh commands for a given test would take place on their own individual ssh connections. The rapid number of opened and closed ssh sockets would then often hit the above flake.


https://chromium-review.googlesource.com/#/c/168302/ makes autotest revive its master ssh connection in the above circumstance, and seems to have either eliminated or dramatically reduced the rate at which we hit this flake during the autotest phase of VMTest. We are still occasionally hitting "Connection timed out during banner exchange" right after VM start-up, before autotest.


Not marking as fixed, because we haven't really traced the root cause of hung sockets. But bumping down the priority, since this is no longer closing the tree several times per day.
Labels: -Pri-0 Pri-2
Sep 10, 2013
#73 be...@chromium.org
we might benefit from backporting any fixes we have for this as it looks like we're hitting this on older branches as well:

https://uberchromegw.corp.google.com/i/chromiumos.release/builders/butterfly%20full%20release-R30-4537.B/builds/49

2013/09/10 04:49:17 - devinstall_test.py - WARNING : return code: 255
Connection timed out during banner exchange
Sep 10, 2013
#74 akes...@chromium.org
That branch failure looks like ssh connection failure during vm startup, not during the autotest phase; the current fixes won't help.
Sep 12, 2013
#77 akes...@chromium.org
The rate of failure has gone down significantly, but we are still seeing sporadic VMTest failures with "Connection timed out due to banner exchange." Most of the time, we are now seeing this immediately after VM startup (see https://code.google.com/p/chromium/issues/detail?id=288757 ).

https://code.google.com/p/chromium/issues/detail?id=288757 is in fact a good clue that this might be caused or exacerbated by too much load on the buildslaves.
Sep 12, 2013
#78 akes...@chromium.org
 Issue 288757  has been merged into this issue.
Cc: sosa@chromium.org cmp@chromium.org petermayo@chromium.org
Sep 12, 2013
#79 akes...@chromium.org
(No comment was entered for this change.)
Summary: VMTest fails due to "Connection timed out due to banner exchange." (was: SSH connectivity drops temporarily in VMTests)
Sep 12, 2013
#80 scottz@chromium.org
Looking at x86-mario incremental's system stats around the time the VM starts running/ultimately fails the system doesn't seem to be too taxed:

DSK |          sda  | busy     23%  | read       4  | write 267931 |  KiB/w     51 |  MBr/s   0.00 |  MBw/s  22.63 |  avio 0.52 ms |


CPU | sys     833%  | user    340%  | irq       1%  | idle   3586% |  wait     40% |  guest     0% |  curf 2.30GHz |  curscal   ?% |


Remind me again what exactly is going on here at this point. We are running ctest which kicks off two vmtest's in parallel, one with an update and one with just the normal test run right? 
Sep 18, 2013
#81 pstew@chromium.org
Another instance of this failure is in http://chromegw/i/chromiumos/builders/amd64%20generic%20ASAN/builds/7263/steps/VMTest/logs/stdio
Sep 25, 2013
#83 sabercro...@google.com
Another:

http://chromegw/i/chromiumos/builders/x86%20generic%20ASAN/builds/13929/steps/VMTest/logs/stdio
Sep 25, 2013
#84 be...@chromium.org
#83 that went through a couple of timeouts, eventually connected and ran tests. I believe it failed because of platform_CrosDisksDBus.

/tmp/cbuildbote2cdM1/test_harness/all/SimpleTestUpdateAndVerify/2_autotest_tests/results-04-platform_CrosDisksDBus                                                [  FAILED  ]

Sep 26, 2013
#86 benhe...@google.com
Moving all non essential bugs to the next Milestone.
Labels: M-32 MovedFrom-31
Sep 27, 2013
#87 rongch...@chromium.org
http://chromegw/i/chromeos/builders/mario%20incremental/builds/16216

stdio:
======================================================================
ERROR: SimpleTestUpdateAndVerify (crostestutils.au_test_harness.au_test.AUTest)
Test that updates to itself.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/b/cbuild/mario-incremental-bin-master/src/platform/crostestutils/au_test_harness/../../crostestutils/au_test_harness/au_test.py", line 184, in SimpleTestUpdateAndVerify
    self.worker.PerformUpdate(self.target_image_path, self.target_image_path)
  File "/b/cbuild/mario-incremental-bin-master/src/platform/crostestutils/au_test_harness/../../crostestutils/au_test_harness/au_worker.py", line 138, in PerformUpdate
    key_to_use)
  File "/b/cbuild/mario-incremental-bin-master/src/platform/crostestutils/au_test_harness/../../crostestutils/au_test_harness/vm_au_worker.py", line 78, in UpdateImage
    self.RunUpdateCmd(cmd, log_directory)
  File "/b/cbuild/mario-incremental-bin-master/src/platform/crostestutils/au_test_harness/../../crostestutils/au_test_harness/au_worker.py", line 212, in RunUpdateCmd
    raise update_exception.UpdateException(result.returncode, 'Update failed')
UpdateException: Update failed

----------------------------------------------------------------------
Ran 1 test in 386.352s


update.log:
Starting a KVM instance
QEMU 1.0 monitor - type 'help' for more information
(qemu) INFO    : KVM started with pid stored in /tmp/kvm.9227
Could not initiate first contact with remote host
Connection timed out during banner exchange
Connection timed out during banner exchange
Failed to connect to virtual machine, retrying ... 
Persist requested.  Use --ssh_port 9227  --kvm_pid /tmp/kvm.9227 to re-connect to it.
Using a pre-created KVM instance specified by /tmp/kvm.9227.
Could not initiate first contact with remote host
Connection timed out during banner exchange
Connection timed out during banner exchange
Failed to connect to virtual machine, retrying ... 
Persist requested.  Use --ssh_port 9227  --kvm_pid /tmp/kvm.9227 to re-connect to it.
Oct 4, 2013
#89 djku...@chromium.org
And another (x86 generic incremental): http://build.chromium.org/p/chromiumos/builders/x86%20generic%20incremental/builds/14224/steps/VMTest/logs/stdio

01:26:30 INFO | autoserv| Starting master ssh connection '/usr/bin/ssh -a -x -vvv -F /dev/null -i /dev/null -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_VrFR9Bssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=180 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 9227 127.0.0.1'
01:26:30 INFO | autoserv| Running '/usr/bin/ssh -a -x -vvv -F /dev/null -i /dev/null -N -o ControlMaster=yes -o ControlPath=/tmp/_autotmp_VrFR9Bssh-master/socket -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o ServerAliveInterval=180 -o ServerAliveCountMax=3 -o ConnectionAttempts=4 -o Protocol=2 -l root -p 9227 127.0.0.1'
01:26:30 INFO | autoserv| [stderr] OpenSSH_5.2p1, OpenSSL 1.0.1c 10 May 2012
01:26:30 INFO | autoserv| [stderr] debug1: Reading configuration data /dev/null
01:26:30 INFO | autoserv| [stderr] debug1: Control socket "/tmp/_autotmp_VrFR9Bssh-master/socket" does not exist
01:26:30 INFO | autoserv| [stderr] debug2: ssh_connect: needpriv 0
01:26:30 INFO | autoserv| [stderr] debug1: Connecting to 127.0.0.1 [127.0.0.1] port 9227.
01:26:30 INFO | autoserv| [stderr] debug2: fd 25 setting O_NONBLOCK
01:26:30 INFO | autoserv| [stderr] debug1: fd 25 clearing O_NONBLOCK
01:26:30 INFO | autoserv| [stderr] debug1: Connection established.
01:26:30 INFO | autoserv| [stderr] debug3: timeout: 30000 ms remain after connect
01:26:30 INFO | autoserv| [stderr] debug3: Not a RSA1 key file /dev/null.
01:26:30 INFO | autoserv| [stderr] debug1: identity file /dev/null type -1
01:27:00 INFO | autoserv| [stderr] Connection timed out during banner exchange
01:27:00 INFO | autoserv| Exception escaped control file, job aborting:
01:27:00 INFO | autoserv| Traceback (most recent call last):
01:27:00 INFO | autoserv| File "/build/x86-generic/usr/local/autotest/server/server_job.py", line 616, in run
01:27:00 INFO | autoserv| self._execute_code(server_control_file, namespace)
01:27:00 INFO | autoserv| File "/build/x86-generic/usr/local/autotest/server/server_job.py", line 1112, in _execute_code
01:27:00 INFO | autoserv| execfile(code_file, namespace, namespace)
01:27:00 INFO | autoserv| File "/tmp/cbuildbot9q2l8J/test_harness/all/SimpleTestUpdateAndVerify/2_autotest_tests/results-07-security_Minijail_seccomp/control.srv", line 10, in <module>
01:27:00 INFO | autoserv| job.parallel_simple(run_client, machines)
01:27:00 INFO | autoserv| File "/build/x86-generic/usr/local/autotest/server/server_job.py", line 510, in parallel_simple
01:27:00 INFO | autoserv| return_results=return_results)
01:27:00 INFO | autoserv| File "/build/x86-generic/usr/local/autotest/server/subcommand.py", line 92, in parallel_simple
01:27:00 INFO | autoserv| function(arg)
01:27:00 INFO | autoserv| File "/tmp/cbuildbot9q2l8J/test_harness/all/SimpleTestUpdateAndVerify/2_autotest_tests/results-07-security_Minijail_seccomp/control.srv", line 6, in run_client
01:27:00 INFO | autoserv| host.log_kernel()
01:27:00 INFO | autoserv| File "/build/x86-generic/usr/local/autotest/client/common_lib/hosts/base_classes.py", line 576, in log_kernel
01:27:00 INFO | autoserv| kernel = self.get_kernel_ver()
01:27:00 INFO | autoserv| File "/build/x86-generic/usr/local/autotest/client/common_lib/hosts/base_classes.py", line 535, in get_kernel_ver
01:27:00 INFO | autoserv| return self.run('/bin/uname -r').stdout.rstrip()
01:27:00 INFO | autoserv| File "/build/x86-generic/usr/local/autotest/server/hosts/ssh_host.py", line 136, in run
01:27:00 INFO | autoserv| options, stdin, args, ignore_timeout)
01:27:00 INFO | autoserv| File "/build/x86-generic/usr/local/autotest/server/hosts/ssh_host.py", line 104, in _run
01:27:00 INFO | autoserv| raise error.AutoservRunError("command execution error", result)
01:27:00 INFO | autoserv| AutoservRunError: command execution error
01:27:00 INFO | autoserv| * Command:
01:27:00 INFO | autoserv| /usr/bin/ssh -a -x -vvv -F /dev/null -i /dev/null  -o ControlPath=/tmp
01:27:00 INFO | autoserv| /_autotmp_VrFR9Bssh-master/socket -o StrictHostKeyChecking=no -o
01:27:00 INFO | autoserv| UserKnownHostsFile=/dev/null -o BatchMode=yes -o ConnectTimeout=30 -o
01:27:00 INFO | autoserv| ServerAliveInterval=180 -o ServerAliveCountMax=3 -o ConnectionAttempts=4
01:27:00 INFO | autoserv| -o Protocol=2 -l root -p 9227 127.0.0.1 "export LIBC_FATAL_STDERR_=1;
01:27:00 INFO | autoserv| /bin/uname -r"
01:27:00 INFO | autoserv| Exit status: 255
01:27:00 INFO | autoserv| Duration: 30.0380280018

Oct 4, 2013
#91 djku...@chromium.org
Third time today, something seems consistently wrong, upping priority:
http://build.chromium.org/p/chromiumos/builders/amd64%20generic%20ASAN/builds/7496/steps/VMTest/logs/stdio
Labels: -Pri-2 Pri-0
Oct 4, 2013
#92 akes...@chromium.org
"Connection timed out due to banner exchange." is caused, to the best of my knowledge, by a TCP socket hanging immediately after being opened. This bug needs the attention of someone who understands the network stack and the VM network stack in particular. I've done all I can here.
Oct 8, 2013
#93 akes...@chromium.org
Some possible mitigations that I will try to work on this iteration (after which I'll toss this bug back into the lab pool):
  - Better retrying logic for the initial ssh-ping that ctest does.
  - Running the two VM Tests serially rather than in parallel. We have seen cases where the VM socket hanging is exacerbated by heavy load on the builder, so this may reduce the flake frequency.
  - Eliminating one of SimpleTestVerify or SimpleTestUpdateAndVerify. It's not clear to me what extra coverage we get by running both of them.
Status: Available
Owner: ---
Labels: -Pri-0 Pri-1 Iteration-92
Oct 9, 2013
#94 scottz@chromium.org
This happened again today: 

http://chromegw/i/chromiumos/builders/amd64%20generic%20ASAN/builds/7562

Aviv is looking at doing some of the mitigations above today.
Status: Assigned
Owner: akes...@chromium.org
Oct 9, 2013
#95 rharrison@chromium.org
Another occurrence:
https://chromegw/i/chromeos/builders/mario%20incremental/builds/16459
Oct 9, 2013
#96 bugdro...@chromium.org
Project: chromiumos/platform/crostestutils
Branch : master
Author : Aviv Keshet <akeshet@chromium.org>
Commit : d1ed2374cd5770191ad5959b4fdd86fc2a2ce2ae

Code-Review  0 : Aviv Keshet, Scott Zawalski, chrome-internal-fetch
Code-Review  +2: Chris Sosa
Commit-Queue 0 : Chris Sosa, Scott Zawalski, chrome-internal-fetch
Commit-Queue +1: Aviv Keshet
Verified     0 : Chris Sosa, Scott Zawalski, chrome-internal-fetch
Verified     +1: Aviv Keshet
Change-Id      : I3d98b5c1fb5163f82320c18f4ac8510d305ba9c3
Reviewed-at    : https://chromium-review.googlesource.com/172460

cros_au_test_harness: Run VM tests serially

BUG=chromium:254166
TEST=Tryjob

M  au_test_harness/cros_au_test_harness.py
Oct 11, 2013
#97 scottz@chromium.org
Have we seen any failures since this change?
Oct 11, 2013
#98 akes...@chromium.org
No, I don't think so. Looks like serializing helped a lot. (a lot of builds in the last 2 days haven't made it to VMTest due to build_packages failure, so I'm not ready to call it definitive).
Oct 14, 2013
#99 bugdro...@chromium.org
Project: chromiumos/platform/crostestutils
Branch : master
Author : Aviv Keshet <akeshet@chromium.org>
Commit : f4c4b7015034f8515b2fea68add19b1227dd5e1f

Code-Review  0 : Aviv Keshet, chrome-internal-fetch
Code-Review  +2: Chris Sosa
Commit-Queue 0 : Chris Sosa, chrome-internal-fetch
Commit-Queue +1: Aviv Keshet
Verified     0 : Chris Sosa, chrome-internal-fetch
Verified     +1: Aviv Keshet
Change-Id      : I88fb93acbc4e80ce9bb6799fc23687a9d2e8f5c9
Reviewed-at    : https://chromium-review.googlesource.com/172945

cros_run_vm_test: turn down ssh verbosity in VMTest

The ssh verbosity was turned up to 3 in VMTest to help trace the
referenced bug ("banner exchange" flake). The issue seems to be mostly
mitigated now, and from an ssh log standpoint we already know all we
need to know. This CL turns the verbosity back down, to reduce the
amount of noise in the logs.

BUG=chromium:254166
TEST=None

M  cros_run_vm_test
Oct 15, 2013
#100 akes...@chromium.org
Sigh. We hit the banner exchange flake again, on parrot64 canary:

http://chromegw/i/chromeos/builders/parrot64%20canary/builds/382

So, the mitigation seemed to help a lot, but we're not out of the woods.
Oct 15, 2013
#101 bugdro...@chromium.org
Project: chromiumos/platform/crostestutils
Branch : master
Author : Aviv Keshet <akeshet@chromium.org>
Commit : 5dfd56356c5cf1c7331964ee3240bd1769e76def

Code-Review  0 : Aviv Keshet, chrome-internal-fetch
Code-Review  +2: Chris Sosa
Commit-Queue 0 : Chris Sosa, chrome-internal-fetch
Commit-Queue +1: Aviv Keshet
Verified     0 : Chris Sosa, chrome-internal-fetch
Verified     +1: Aviv Keshet
Change-Id      : I912780a72b3b2d66afdf627613b84e256c5f27b3
Reviewed-at    : https://chromium-review.googlesource.com/172928

cros_run_vm_test: log builder load/uptime stats

It is believed that VMTest banner exchange flake is associated with
builder load. This CL logs the builder load right before starting
the autotest phase of VMTest.

BUG=chromium:254166
TEST=None

M  cros_run_vm_test
Oct 15, 2013
#102 bugdro...@chromium.org
Project: chromiumos/third_party/autotest
Branch : master
Author : Aviv Keshet <akeshet@chromium.org>
Commit : 31b9ec8dd07f5321ba7823420cb5ea7bea3acfd9

Code-Review  0 : Alex Miller, Scott Zawalski, chrome-internal-fetch
Code-Review  +2: Aviv Keshet
Commit-Queue 0 : Alex Miller, Scott Zawalski, chrome-internal-fetch
Commit-Queue +1: Aviv Keshet
Verified     0 : Alex Miller, Scott Zawalski, chrome-internal-fetch
Verified     +1: Aviv Keshet
Change-Id      : Ie6f203230ac90ae4b5e3b8dd5cc589c37f0d0beb
Reviewed-at    : https://chromium-review.googlesource.com/172929

[autotest] abstract.py: log system load before starting master-ssh

It is believed that builder load contributes to VMTest banner exchange
flake, which occurs almost exclusively during either master-ssh
connection establishment, or during the first ssh-ping after booting
a VM. This CL logs the system load right before each master-ssh
connection establishment.

BUG=chromium:254166
TEST=None

M  server/hosts/abstract_ssh.py
Oct 16, 2013
#103 scottz@chromium.org
 Issue 308297  has been merged into this issue.
Cc: bhthomp...@chromium.org jere...@chromium.org
Oct 18, 2013
#104 scottz@chromium.org
http://chromegw/i/chromiumos/builders/x86-generic%20paladin/builds/16239/steps/VMTest/logs/stdio Looks like we got bit in start up here. 
Oct 18, 2013
#105 scottz@chromium.org
(No comment was entered for this change.)
Labels: -M-32
Oct 21, 2013
#108 be...@chromium.org
#107 those logs indicate a failure in:
/tmp/cbuildbotk0yPLF/test_harness/all/SimpleTestUpdateAndVerify/2_autotest_tests/results-08-security_RestartJob                                              [  FAILED  ]

is that the right link?
Oct 22, 2013
#109 bugdro...@chromium.org
Project: chromiumos/chromite
Branch : master
Author : David James <davidjames@chromium.org>
Commit : 6b7042454efc63d22637180025cdcb555914f2d0

Code-Review  0 : Aviv Keshet, Yu-Ju Hong, chrome-internal-fetch
Code-Review  +2: David James, Mike Frysinger
Commit-Queue 0 : Aviv Keshet, Mike Frysinger, Yu-Ju Hong, chrome-internal-fetch
Commit-Queue +1: David James
Verified     0 : Aviv Keshet, Mike Frysinger, Yu-Ju Hong, chrome-internal-fetch
Verified     +1: David James
Change-Id      : I49c6ec5cb8255c1d664100a6562959cf1e9aad97
Reviewed-at    : https://chromium-review.googlesource.com/173751

Teach cbuildbot to retry the VMTest stage.

BUG=chromium:309041, chromium:254166
TEST=Add flake, verify it retries.
TEST=Run all unit tests.

M  buildbot/builderstage.py
M  buildbot/cbuildbot_commands.py
M  buildbot/cbuildbot_commands_unittest.py
M  buildbot/cbuildbot_results.py
M  buildbot/cbuildbot_stages.py
M  buildbot/constants.py
M  scripts/cbuildbot.py
Oct 25, 2013
#110 akes...@chromium.org
 Issue 311478  has been merged into this issue.
Cc: scottz@chromium.org sp...@chromium.org sna...@chromium.org dhend...@chromium.org chrome-i...@google.com
Oct 25, 2013
#111 akes...@chromium.org
In above bug, happened twice in a row, so failed with retry. Both times during the initial connection to the VM.
Oct 25, 2013
#112 akes...@chromium.org
This builder is failing every VMTest actually. A builder load problem?
Oct 30, 2013
#113 oshima@chromium.org
FYI: ChromiumOS ASAN bots on main chromium waterfall are having the same issue (basically same as 311478).
Is there an ETA for this issue?


http://build.chromium.org/p/chromium.memory/builders/Chromium%20OS%20(amd64)%20ASAN
http://build.chromium.org/p/chromium.memory/builders/Chromium%20OS%20(x86)%20ASAN

@@@BUILD_STEP VMTest (attempt 2)@@@

************************************************************
** Start Stage VMTest (attempt 2) - 08:51:33
** 
** Run autotests in a virtual machine.
************************************************************
08:51:33: INFO: RunCommand: 'bin/ctest' '--board=x86-generic' '--type=vm' '--no_graphics' '--target_image=/b/cbuild.x86.asan/src/build/images/x86-generic/latest-cbuildbot/chromiumos_test_image.bin' '--test_results_root=/b/cbuild.x86.asan/chroot/tmp/cbuildbotu75QdX/test_harness' '--quick' '--only_verify' '--suite=smoke' in /b/cbuild.x86.asan/src/scripts
2013/10/30 08:51:34 - ctest - INFO    : Could not find a latest image to use. Using target instead.
2013/10/30 08:51:34 - cros_build_lib.py - INFO    : RunCommand: '/b/cbuild.x86.asan/src/platform/crostestutils/au_test_harness/cros_au_test_harness.py' '--base_image=/b/cbuild.x86.asan/src/build/images/x86-generic/latest-cbuildbot/chromiumos_test_image.bin' '--target_image=/b/cbuild.x86.asan/src/build/images/x86-generic/latest-cbuildbot/chromiumos_test_image.bin' '--board=x86-generic' '--type=vm' '--remote=0.0.0.0' '--verbose' '--jobs=1' '--verify_suite_name=smoke' '--test_prefix=SimpleTestVerify' '--test_results_root=/b/cbuild.x86.asan/chroot/tmp/cbuildbotu75QdX/test_harness' '--no_graphics' in /b/cbuild.x86.asan/src/scripts
2013/10/30 08:51:35 - cros_build_lib.py - INFO    : No update cache found. Update testing will not work.  Run  cros_generate_update_payloads if this was not intended.
2013/10/30 08:51:35 - cros_build_lib.py - INFO    : SimpleTestVerify: Running smoke suite to verify image.
FStarting a KVM instance
QEMU 1.0 monitor - type 'help' for more information
(qemu) INFO    : KVM started with pid stored in /tmp/kvm.9228
Could not initiate first contact with remote host
Connection timed out during banner exchange
Connection timed out during banner exchange
Failed to connect to virtual machine, retrying ... 
Persist requested.  Use --ssh_port 9228  --kvm_pid /tmp/kvm.9228 to re-connect to it.
Using a pre-created KVM instance specified by /tmp/kvm.9228.
Could not initiate first contact with remote host
Connection timed out during banner exchange
Connection timed out during banner exchange
Failed to connect to virtual machine, retrying ... 
Persist requested.  Use --ssh_port 9228  --kvm_pid /tmp/kvm.9228 to re-connect to it.
Using a pre-created KVM instance specified by /tmp/kvm.9228.
Could not initiate first contact with remote host
Connection timed out during banner exchange
Connection timed out during banner exchange
Failed to connect to virtual machine, retrying ... 
Persist requested.  Use --ssh_port 9228  --kvm_pid /tmp/kvm.9228 to re-connect to it.
Using a pre-created KVM instance specified by /tmp/kvm.9228.
Could not initiate first contact with remote host
Connection timed out during banner exchange
Connection timed out during banner exchange
Persist requested.  Use --ssh_port 9228  --kvm_pid /tmp/kvm.9228 to re-connect to it.

Oct 30, 2013
#114 scottz@chromium.org
Oshima I believe that is bug: https://code.google.com/p/chromium/issues/detail?id=311478, not this one. 
Jan 28, 2014
#115 scottz@chromium.org
This has been fixed for some time now. 
Status: Fixed
Feb 20, 2014
#116 kr...@chromium.org
(No comment was entered for this change.)
Labels: VerifyIn-34
Feb 24, 2014
#117 kr...@chromium.org
(No comment was entered for this change.)
Status: Verified
Sign in to add a comment

Powered by Google Project Hosting