My favorites | Sign in
Project Home Downloads Wiki Issues Code Search
New issue   Search
for
  Advanced search   Search tips   Subscriptions
Issue 202298: test harness / kvm flakiness (ssh seems to fail in the VM)
4 people starred this issue and may be notified of changes. Back to list
 
Project Member Reported by tbroch@chromium.org, Aug 15, 2011
Looks like it might related to:
crosbug.com/13027
crosbug.com/13070
crosbug.com/19154
crosbug.com/11629

From:
http://chromeos-botmaster.mtv.corp.google.com:8026/builders/stumpy-binary/builds/444/steps/Test/logs/stdio

INFO: Running tests in test suite in parallel.
Process Pool Active: Waiting on 3/3 jobs to complete
Process Pool Active: Waiting on 2/3 jobs to complete
Process Pool Active: Waiting on 2/3 jobs to complete
Process Pool Active: Waiting on 1/3 jobs to complete
Process Pool Active: Waiting on 1/3 jobs to complete

INFO: testUpdateWipeStateful: Performing a delta update from /b/cbuild/stumpy-private-bin/src/scripts/latest_download/chromiumos_qemu_image.bin to /b/cbuild/stumpy-private-bin/src/build/images/stumpy/latest-cbuildbot/chromiumos_test_image.bin in a VM

INFO: testInterruptedUpdate: Performing a delta update from /b/cbuild/stumpy-private-bin/src/build/images/stumpy/latest-cbuildbot/chromiumos_qemu_image.bin to /b/cbuild/stumpy-private-bin/src/build/images/stumpy/latest-cbuildbot/chromiumos_test_image.bin in a VM

INFO: testUpdateKeepStateful: Performing a delta update from /b/cbuild/stumpy-private-bin/src/scripts/latest_download/chromiumos_qemu_image.bin to /b/cbuild/stumpy-private-bin/src/build/images/stumpy/latest-cbuildbot/chromiumos_test_image.bin in a VM

INFO: testDelayedUpdate: Performing a delta update from /b/cbuild/stumpy-private-bin/src/build/images/stumpy/latest-cbuildbot/chromiumos_qemu_image.bin to /b/cbuild/stumpy-private-bin/src/build/images/stumpy/latest-cbuildbot/chromiumos_test_image.bin in a VM
Stopping the KVM instance
E
======================================================================
ERROR: Tests what happens if we interrupt payload delivery 3 times.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/b/cbuild/stumpy-private-bin/src/platform/crostestutils/au_test_harness/au_test.py", line 126, in tearDown
    self.worker.CleanUp()
  File "/b/cbuild/stumpy-private-bin/src/platform/crostestutils/au_test_harness/vm_au_worker.py", line 35, in CleanUp
    self._KillExistingVM(self._kvm_pid_file)
  File "/b/cbuild/stumpy-private-bin/src/platform/crostestutils/au_test_harness/vm_au_worker.py", line 29, in _KillExistingVM
    cwd=self.crosutilsbin, print_cmd=False)
  File "/b/cbuild/stumpy-private-bin/src/platform/crostestutils/au_test_harness/../../../../src/scripts/lib/cros_build_lib.py", line 120, in RunCommand
    raise RunCommandException(error_info)
RunCommandException: Command "['./cros_stop_vm', '--kvm_pid=/tmp/kvm.9224']" failed.


----------------------------------------------------------------------
Ran 1 test in 319.269s

FAILED (errors=1)
Stopping the KVM instance
Sep 9, 2011
#2 r...@chromium.org
Saw a kvm flakiness failure today on zgb binary.

http://chromeos-botmaster.mtv.corp.google.com:8026/builders/x86-zgb-binary/builds/1157/steps/VMTest/logs/stdio

In the update.log of the failing test (testUpdateWipeStateful) I see these messages:

/b/cbuild/x86-zgb-private-bin/src/platform/dev/host/../../../scripts/remote_access.sh: line 79:  4002 Killed                  check_if_rebooted
ssh_exchange_identification: Connection closed by remote host
Connection timed out during banner exchange
/b/cbuild/x86-zgb-private-bin/src/platform/dev/host/../../../scripts/remote_access.sh: line 79: 26284 Killed                  check_if_rebooted

Sep 12, 2011
#3 davidjames@google.com
From http://chromeos-botmaster.mtv.corp.google.com:8026/builders/stumpy%20canary/builds/317

INFO    : Rebooting.
Warning: Permanently added '[127.0.0.1]:9224' (RSA) to the list of known hosts.
INFO    : Not yet rebooted
Warning: Permanently added '[127.0.0.1]:9224' (RSA) to the list of known hosts.
INFO    : Not yet rebooted
/b/cbuild/stumpy-canary-master/src/platform/dev/host/../../../scripts/remote_access.sh: line 79: 28777 Killed                  check_if_rebooted
Warning: Permanently added '[127.0.0.1]:9224' (RSA) to the list of known hosts.
INFO    : Rebooted and responding
INFO    : New updated in /home/chrome-bot/.ssh/known_hosts, backup made.
INFO    : Update was successful and rebooted to 0.16.1031.0 (Official Build) dev-channel stumpy test
Elapsed time: 6m25s
Persist requested.  Use --ssh_port 9224  --kvm_pid /tmp/kvm.9224 to re-connect to it.
ssh_exchange_identification: Connection closed by remote host
Sep 14, 2011
#4 davidjames@chromium.org
http://chromeos-botmaster.mtv.corp.google.com:8026/builders/stumpy-binary/builds/711/steps/VMTest/logs/stdio

parallel_test_job.ParallelJobTimeoutError: Exceeded max time of 2400 seconds to wait for job completion.

Performing standard stateful update.
INFO    : Rebooting.
Warning: Permanently added '[127.0.0.1]:9225' (RSA) to the list of known hosts.
INFO    : Not yet rebooted
Warning: Permanently added '[127.0.0.1]:9225' (RSA) to the list of known hosts.
INFO    : Not yet rebooted
Warning: Permanently added '[127.0.0.1]:9225' (RSA) to the list of known hosts.
INFO    : Not yet rebooted
Warning: Permanently added '[127.0.0.1]:9225' (RSA) to the list of known hosts.
INFO    : Not yet rebooted
Warning: Permanently added '[127.0.0.1]:9225' (RSA) to the list of known hosts.
INFO    : Not yet rebooted
ssh_exchange_identification: Connection closed by remote host
ssh_exchange_identification: Connection closed by remote host
ssh_exchange_identification: Connection closed by remote host
ssh_exchange_identification: Connection closed by remote host
ssh_exchange_identification: Connection closed by remote host
ssh_exchange_identification: Connection closed by remote host
ssh_exchange_identification: Connection closed by remote host
ssh_exchange_identification: Connection closed by remote host
ssh_exchange_identification: Connection closed by remote host
/b/cbuild/stumpy-private-bin/src/platform/dev/host/../../../scripts/remote_access.sh: line 79:  8267 Killed                  check_if_rebooted
Warning: Permanently added '[127.0.0.1]:9225' (RSA) to the list of known hosts.
INFO    : Rebooted and responding
INFO    : New updated in /home/chrome-bot/.ssh/known_hosts, backup made.
INFO    : Update was successful and rebooted to 0.16.1037.0 (Continuous Build - Builder: N/A) stumpy
Elapsed time: 6m39s
Persist requested.  Use --ssh_port 9225  --kvm_pid /tmp/kvm.9225 to re-connect to it.
ssh_exchange_identification: Connection closed by remote host

... and also ...

INFO    : State of updater has changed to: UPDATE_STATUS_UPDATE_AVAILABLE
.....Connection timed out during banner exchange
INFO    : State of updater has changed to: 
Connection timed out during banner exchange
.Connection timed out during banner exchange
.Connection timed out during banner exchange
.Connection timed out during banner exchange
.Connection timed out during banner exchange
.Connection timed out during banner exchange
.Connection timed out during banner exchange
.Connection timed out during banner exchange
.Connection timed out during banner exchange
.Connection timed out during banner exchange
.Connection timed out during banner exchange
.Connection timed out during banner exchange
.Connection timed out during banner exchange
.Connection timed out during banner exchange
.Connection timed out during banner exchange
.Connection timed out during banner exchange
.Connection timed out during banner exchange
.Connection timed out during banner exchange
.
Sep 14, 2011
#5 tlamb...@chromium.org
"ssh_exchange_identification: Connection closed by remote host"

Is a result of an ssh configuration error on the server.  The prophylactic method of resolving this issue would be to ensure a clean ssh setup on the server each time.  This can include, but is not limited to corrupt key files or an incorrect DNS configuration, particularly for reverse DNS.

It is possible to identify a large portion (but not all!) corrupt server configurations via:

  /usr/sbin/sshd -t

...which should be part of the kvm log capture in all cases, to assist in post-failure analysis.
Sep 16, 2011
#6 jhorwich@chromium.org
Saw another one similar to the original report:

http://chromeos-botmaster.mtv.corp.google.com:8026/builders/x86-alex-binary/builds/1376/steps/VMTest/logs/stdio

INFO: Generating updates required for this test suite in parallel.

INFO: Generating a signed delta update payload from /b/cbuild/x86-alex-private-bin/src/scripts/ctest_target/chromiumos_test_image.bin to /b/cbuild/x86-alex-private-bin/src/scripts/ctest_target/chromiumos_qemu_image.bin

INFO: Generating a signed delta update payload from /b/cbuild/x86-alex-private-bin/src/scripts/ctest_target/chromiumos_test_image.bin to /b/cbuild/x86-alex-private-bin/src/scripts/latest_download/chromiumos_qemu_image.bin

INFO: Generating a signed delta update payload from /b/cbuild/x86-alex-private-bin/src/scripts/ctest_target/chromiumos_test_image.bin to /b/cbuild/x86-alex-private-bin/src/scripts/ctest_target/chromiumos_test_image.bin

INFO: Running tests in test suite in parallel.
Process Pool Active: Waiting on 3/3 jobs to complete
Process Pool Active: Waiting on 1/3 jobs to complete
Process Pool Active: Waiting on 1/3 jobs to complete

INFO: testInterruptedUpdate: Performing a delta update from /b/cbuild/x86-alex-private-bin/src/scripts/ctest_target/chromiumos_qemu_image.bin to /b/cbuild/x86-alex-private-bin/src/scripts/ctest_target/chromiumos_test_image.bin in a VM

INFO: testUpdateKeepStateful: Performing a delta update from /b/cbuild/x86-alex-private-bin/src/scripts/latest_download/chromiumos_qemu_image.bin to /b/cbuild/x86-alex-private-bin/src/scripts/ctest_target/chromiumos_test_image.bin in a VM

INFO: testDelayedUpdate: Performing a delta update from /b/cbuild/x86-alex-private-bin/src/scripts/ctest_target/chromiumos_qemu_image.bin to /b/cbuild/x86-alex-private-bin/src/scripts/ctest_target/chromiumos_test_image.bin in a VM in a VM

EStopping the KVM instance
kill: No such process

======================================================================
ERROR: Tests if we can update normally.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/b/cbuild/x86-alex-private-bin/src/platform/crostestutils/au_test_harness/au_test.py", line 140, in testUpdateKeepStateful
    self.worker.PerformUpdate(self.target_image_path, self.base_image_path)
  File "/b/cbuild/x86-alex-private-bin/src/platform/crostestutils/au_test_harness/au_worker.py", line 134, in PerformUpdate
    proxy_port, key_to_use)
  File "/b/cbuild/x86-alex-private-bin/src/platform/crostestutils/au_test_harness/vm_au_worker.py", line 64, in UpdateImage
    self.RunUpdateCmd(cmd, log_directory)
  File "/b/cbuild/x86-alex-private-bin/src/platform/crostestutils/au_test_harness/au_worker.py", line 215, in RunUpdateCmd
    raise update_exception.UpdateException(1, str(e))
UpdateException: Command "['/b/cbuild/x86-alex-private-bin/src/scripts/lib/../bin/cros_run_vm_update', '--vm_image_path=/b/cbuild/x86-alex-private-bin/src/scripts/latest_download/chromiumos_qemu_image.bin', '--update_log=/b/cbuild/x86-alex-private-bin/chroot/tmp/cbuildbotQx4x85/test_harness/testUpdateKeepStateful/1_update/update_engine.log', '--snapshot', '--no_graphics', '--persist', '--kvm_pid=/tmp/kvm.9222', '--ssh_port=9222', '--stateful_update_flag=old', '--update_url=http://172.22.15.176:8080/update/cache/990be4e46044179c41240c596e65cd6f_0dcf5d16762fbc6d84789c60a5f034c7+207bb906877a2667a7c736e1e4cad4b8' failed.

Output logged to /b/cbuild/x86-alex-private-bin/chroot/tmp/cbuildbotQx4x85/test_harness/testUpdateKeepStateful/1_update/update.log

----------------------------------------------------------------------
Ran 1 test in 111.804s

FAILED (errors=1)

Sep 20, 2011
#9 cmasone@chromium.org
(No comment was entered for this change.)
Summary: test harness / kvm flakiness (ssh seems to fail in the VM)
Sep 20, 2011
#10 cmasone@chromium.org
This or something like it closes the tree not-infrequently.  Can someone who understands the parallel VM test harness and/or the AU test harness step up here?
Status: Available
Owner: ---
Cc: sosa@chromium.org dgarrett@chromium.org petkov@chromium.org
Labels: -Pri-1 -Area-Build Pri-0 Area-Test Mstone-R16
Sep 20, 2011
#11 sosa@chromium.org
Don can you take a look?
Status: Assigned
Owner: dgarrett@chromium.org
Sep 21, 2011
#13 dgarrett@chromium.org
My belief is that this error is always being seen in a call to image_to_live.sh from some nested part of the test framework, and that we are seeing the problem primarily because of the heavily loads and large number of invocations to image_to_live.

I also believe the problems relate to detecting the reboot of the VM and if it came up afterwards. If I'm right, this is probably the relevant code....


# Wait function given an additional timeout argument.
# $1 - pid to wait on.
# $2 - timeout to wait for.
function wait_with_timeout() {
  local pid=$1
  local timeout=$2
  local -r TIMEOUT_INC=1
  local current_timeout=0
  while is_pid_running ${pid} && [ ${current_timeout} -lt ${timeout} ]; do
    sleep ${TIMEOUT_INC}
    current_timeout=$((current_timeout + TIMEOUT_INC))
  done
  ! is_pid_running ${pid}
}

# Checks to see if a machine has rebooted using the presence of a tmp file.
function check_if_rebooted() {
  local output_file="${TMP}/output"
  while true; do
    REMOTE_OUT=""
    # This may fail while the machine is down so generate output and a
    # boolean result to distinguish between down/timeout and real failure
    ! remote_sh_allow_changed_host_key \
        "echo 0; [ -e /tmp/awaiting_reboot ] && echo '1'; true"
    echo "${REMOTE_OUT}" > "${output_file}"
    if grep -q "0" "${output_file}"; then
      if grep -q "1" "${output_file}"; then
        info "Not yet rebooted"
        sleep .5
      else
        info "Rebooted and responding"
        break
      fi
    fi
  done
}

function remote_reboot() {
  info "Rebooting."
  remote_sh "touch /tmp/awaiting_reboot; reboot"
  while true; do
    check_if_rebooted &
    local pid=$!
    wait_with_timeout ${pid} 30 && break
    ! kill -9 ${pid} 2> /dev/null
  done
}



Sep 21, 2011
#14 dgarrett@chromium.org
This is output from a successful image_to_live.sh which matches the above logs. Perhaps that killed line is normal?

ssh: connect to host 172.22.75.45 port 22: Connection refused
ssh: connect to host 172.22.75.45 port 22: Connection refused
ssh: connect to host 172.22.75.45 port 22: Connection refused
ssh: connect to host 172.22.75.45 port 22: Connection refused
/usr/local/google/home/dgarrett/chromiumos/src/platform/dev/host/../../../scripts/remote_access.sh: line 79: 22999 Killed                  check_if_rebooted
ssh: connect to host 172.22.75.45 port 22: Connection refused
ssh: connect to host 172.22.75.45 port 22: Connection refused
ssh: connect to host 172.22.75.45 port 22: Connection refused
ssh: connect to host 172.22.75.45 port 22: Connection refused
ssh: connect to host 172.22.75.45 port 22: Connection refused
ssh: connect to host 172.22.75.45 port 22: Connection refused
ssh: connect to host 172.22.75.45 port 22: Connection refused
Warning: Permanently added '172.22.75.45' (RSA) to the list of known hosts.
INFO    : Rebooted and responding
INFO    : New updated in /home/dgarrett/.ssh/known_hosts, backup made.

Sep 21, 2011
#15 dgarrett@chromium.org
Okay, given the way sshd is configured on CROS machines, if you have 10 sessions that are in the middle of authentication, further connection attempts are closed with this error message.

dgarrett$ssh chronos@172.22.75.45
ssh_exchange_identification: Connection closed by remote host

Sep 22, 2011
#16 dgarrett@chromium.org
An alternative way to gather logs which might help would be to attempt to ssh into a KVM with "-v". I believe that we can do this remotely as long as you know what port the KVM is on. I expect this to fail, but want to see the verbose output. Hopefully, it'll tell us why.

ssh -v chronos@buildbot-yada -p 92xx
Sep 24, 2011
#17 petermayo@chromium.org
20799 may be a duplicate of this issue.
Sep 26, 2011
#18 sosa@chromium.org
The kill line is normal AFAIK because the update engine status is checked in a seperate thread and killed when we receive a completed update_engine_client --update return
Sep 26, 2011
#19 sosa@chromium.org
(No comment was entered for this change.)
Owner: sosa@chromium.org
Cc: -sosa@chromium.org
Sep 26, 2011
#20 sosa@chromium.org
(No comment was entered for this change.)
Status: Started
Sep 26, 2011
#21 sosa@chromium.org
 Issue 20318  has been merged into this issue.
Cc: i...@chromium.org r...@chromium.org bdavi...@chromium.org
Sep 26, 2011
#22 sosa@chromium.org
(No comment was entered for this change.)
Labels: Iteration-39
Sep 26, 2011
#23 bugdroid1@chromium.org
Commit: faeee5f47cac2bfac8b8fbfa45a606b7c8d8d221
 Email: sosa@chromium.org

Add more debugging by re-running ssh commands that failed verbosely.

This CL changes remote_ssh to re-run the command verbosely if the
command failed with an SSH error (code 255).

BUG=chromium-os:19198
TEST=Ran it.  Simulated SSH error by closing KVM instance in cros_run_vm_update
while it was intiializing.  Saw re-run verbosely run correctly.

Change-Id: I9f95aac3fb2d00bc7d26b0b9298409d8ce2db594
Reviewed-on: http://gerrit.chromium.org/gerrit/8313
Commit-Ready: Chris Sosa <sosa@chromium.org>
Reviewed-by: Chris Sosa <sosa@chromium.org>
Tested-by: Chris Sosa <sosa@chromium.org>

M	remote_access.sh
Sep 27, 2011
#24 bugdroid1@chromium.org
Commit: 10af6d3cb034f5c6f1f8d9e709df57ed1c637df7
 Email: sosa@chromium.org

We should not update the hostkey for vm testing.

BUG=chromium-os:19198
TEST=Ran

Change-Id: I40dceb66f14036b22e5ccc8fe665aad4a049ca64
Reviewed-on: http://gerrit.chromium.org/gerrit/8401
Reviewed-by: Chris Sosa <sosa@chromium.org>
Tested-by: Chris Sosa <sosa@chromium.org>

M	cros_run_vm_update
Sep 28, 2011
#25 sosa@chromium.org
Fix submitted in cros_run_vm_update.  If you see again, please re-open
Status: Fixed
Oct 10, 2011
#26 kr...@chromium.org
(No comment was entered for this change.)
Status: Verified
Nov 8, 2011
#27 chromeos...@chromium.org
(No comment was entered for this change.)
Labels: FixedIn-1098.0.0
Jan 9, 2012
#28 chromeos...@chromium.org
(No comment was entered for this change.)
Labels: -FixedIn-1098.0.0 FixedIn-1099.0.0
Jan 20, 2012
#29 chromeos...@chromium.org
(No comment was entered for this change.)
Labels: FixedInIndex-25
Mar 6, 2013
#30 bugdroid1@chromium.org
(No comment was entered for this change.)
Labels: OS-Chrome
Mar 9, 2013
#31 bugdroid1@chromium.org
(No comment was entered for this change.)
Labels: -Area-Test -TreeCloser -Mstone-R16 Cr-Test Hotlist-TreeCloser M-16
Sign in to add a comment

Powered by Google Project Hosting