| Issue 202298: | test harness / kvm flakiness (ssh seems to fail in the VM) | |
| 4 people starred this issue and may be notified of changes. | Back to list |
Sign in to add a comment
|
Looks like it might related to: crosbug.com/13027 crosbug.com/13070 crosbug.com/19154 crosbug.com/11629 From: http://chromeos-botmaster.mtv.corp.google.com:8026/builders/stumpy-binary/builds/444/steps/Test/logs/stdio INFO: Running tests in test suite in parallel. Process Pool Active: Waiting on 3/3 jobs to complete Process Pool Active: Waiting on 2/3 jobs to complete Process Pool Active: Waiting on 2/3 jobs to complete Process Pool Active: Waiting on 1/3 jobs to complete Process Pool Active: Waiting on 1/3 jobs to complete INFO: testUpdateWipeStateful: Performing a delta update from /b/cbuild/stumpy-private-bin/src/scripts/latest_download/chromiumos_qemu_image.bin to /b/cbuild/stumpy-private-bin/src/build/images/stumpy/latest-cbuildbot/chromiumos_test_image.bin in a VM INFO: testInterruptedUpdate: Performing a delta update from /b/cbuild/stumpy-private-bin/src/build/images/stumpy/latest-cbuildbot/chromiumos_qemu_image.bin to /b/cbuild/stumpy-private-bin/src/build/images/stumpy/latest-cbuildbot/chromiumos_test_image.bin in a VM INFO: testUpdateKeepStateful: Performing a delta update from /b/cbuild/stumpy-private-bin/src/scripts/latest_download/chromiumos_qemu_image.bin to /b/cbuild/stumpy-private-bin/src/build/images/stumpy/latest-cbuildbot/chromiumos_test_image.bin in a VM INFO: testDelayedUpdate: Performing a delta update from /b/cbuild/stumpy-private-bin/src/build/images/stumpy/latest-cbuildbot/chromiumos_qemu_image.bin to /b/cbuild/stumpy-private-bin/src/build/images/stumpy/latest-cbuildbot/chromiumos_test_image.bin in a VM Stopping the KVM instance E ====================================================================== ERROR: Tests what happens if we interrupt payload delivery 3 times. ---------------------------------------------------------------------- Traceback (most recent call last): File "/b/cbuild/stumpy-private-bin/src/platform/crostestutils/au_test_harness/au_test.py", line 126, in tearDown self.worker.CleanUp() File "/b/cbuild/stumpy-private-bin/src/platform/crostestutils/au_test_harness/vm_au_worker.py", line 35, in CleanUp self._KillExistingVM(self._kvm_pid_file) File "/b/cbuild/stumpy-private-bin/src/platform/crostestutils/au_test_harness/vm_au_worker.py", line 29, in _KillExistingVM cwd=self.crosutilsbin, print_cmd=False) File "/b/cbuild/stumpy-private-bin/src/platform/crostestutils/au_test_harness/../../../../src/scripts/lib/cros_build_lib.py", line 120, in RunCommand raise RunCommandException(error_info) RunCommandException: Command "['./cros_stop_vm', '--kvm_pid=/tmp/kvm.9224']" failed. ---------------------------------------------------------------------- Ran 1 test in 319.269s FAILED (errors=1) Stopping the KVM instance
Aug 31, 2011
#1
ellyjo...@chromium.org
Labels:
-Pri-2 Pri-1
Sep 9, 2011
Saw a kvm flakiness failure today on zgb binary. http://chromeos-botmaster.mtv.corp.google.com:8026/builders/x86-zgb-binary/builds/1157/steps/VMTest/logs/stdio In the update.log of the failing test (testUpdateWipeStateful) I see these messages: /b/cbuild/x86-zgb-private-bin/src/platform/dev/host/../../../scripts/remote_access.sh: line 79: 4002 Killed check_if_rebooted ssh_exchange_identification: Connection closed by remote host Connection timed out during banner exchange /b/cbuild/x86-zgb-private-bin/src/platform/dev/host/../../../scripts/remote_access.sh: line 79: 26284 Killed check_if_rebooted
Sep 12, 2011
From http://chromeos-botmaster.mtv.corp.google.com:8026/builders/stumpy%20canary/builds/317 INFO : Rebooting. Warning: Permanently added '[127.0.0.1]:9224' (RSA) to the list of known hosts. INFO : Not yet rebooted Warning: Permanently added '[127.0.0.1]:9224' (RSA) to the list of known hosts. INFO : Not yet rebooted /b/cbuild/stumpy-canary-master/src/platform/dev/host/../../../scripts/remote_access.sh: line 79: 28777 Killed check_if_rebooted Warning: Permanently added '[127.0.0.1]:9224' (RSA) to the list of known hosts. INFO : Rebooted and responding INFO : New updated in /home/chrome-bot/.ssh/known_hosts, backup made. INFO : Update was successful and rebooted to 0.16.1031.0 (Official Build) dev-channel stumpy test Elapsed time: 6m25s Persist requested. Use --ssh_port 9224 --kvm_pid /tmp/kvm.9224 to re-connect to it. ssh_exchange_identification: Connection closed by remote host
Sep 14, 2011
http://chromeos-botmaster.mtv.corp.google.com:8026/builders/stumpy-binary/builds/711/steps/VMTest/logs/stdio parallel_test_job.ParallelJobTimeoutError: Exceeded max time of 2400 seconds to wait for job completion. Performing standard stateful update. INFO : Rebooting. Warning: Permanently added '[127.0.0.1]:9225' (RSA) to the list of known hosts. INFO : Not yet rebooted Warning: Permanently added '[127.0.0.1]:9225' (RSA) to the list of known hosts. INFO : Not yet rebooted Warning: Permanently added '[127.0.0.1]:9225' (RSA) to the list of known hosts. INFO : Not yet rebooted Warning: Permanently added '[127.0.0.1]:9225' (RSA) to the list of known hosts. INFO : Not yet rebooted Warning: Permanently added '[127.0.0.1]:9225' (RSA) to the list of known hosts. INFO : Not yet rebooted ssh_exchange_identification: Connection closed by remote host ssh_exchange_identification: Connection closed by remote host ssh_exchange_identification: Connection closed by remote host ssh_exchange_identification: Connection closed by remote host ssh_exchange_identification: Connection closed by remote host ssh_exchange_identification: Connection closed by remote host ssh_exchange_identification: Connection closed by remote host ssh_exchange_identification: Connection closed by remote host ssh_exchange_identification: Connection closed by remote host /b/cbuild/stumpy-private-bin/src/platform/dev/host/../../../scripts/remote_access.sh: line 79: 8267 Killed check_if_rebooted Warning: Permanently added '[127.0.0.1]:9225' (RSA) to the list of known hosts. INFO : Rebooted and responding INFO : New updated in /home/chrome-bot/.ssh/known_hosts, backup made. INFO : Update was successful and rebooted to 0.16.1037.0 (Continuous Build - Builder: N/A) stumpy Elapsed time: 6m39s Persist requested. Use --ssh_port 9225 --kvm_pid /tmp/kvm.9225 to re-connect to it. ssh_exchange_identification: Connection closed by remote host ... and also ... INFO : State of updater has changed to: UPDATE_STATUS_UPDATE_AVAILABLE .....Connection timed out during banner exchange INFO : State of updater has changed to: Connection timed out during banner exchange .Connection timed out during banner exchange .Connection timed out during banner exchange .Connection timed out during banner exchange .Connection timed out during banner exchange .Connection timed out during banner exchange .Connection timed out during banner exchange .Connection timed out during banner exchange .Connection timed out during banner exchange .Connection timed out during banner exchange .Connection timed out during banner exchange .Connection timed out during banner exchange .Connection timed out during banner exchange .Connection timed out during banner exchange .Connection timed out during banner exchange .Connection timed out during banner exchange .Connection timed out during banner exchange .Connection timed out during banner exchange .
Sep 14, 2011
"ssh_exchange_identification: Connection closed by remote host" Is a result of an ssh configuration error on the server. The prophylactic method of resolving this issue would be to ensure a clean ssh setup on the server each time. This can include, but is not limited to corrupt key files or an incorrect DNS configuration, particularly for reverse DNS. It is possible to identify a large portion (but not all!) corrupt server configurations via: /usr/sbin/sshd -t ...which should be part of the kvm log capture in all cases, to assist in post-failure analysis.
Sep 16, 2011
Saw another one similar to the original report: http://chromeos-botmaster.mtv.corp.google.com:8026/builders/x86-alex-binary/builds/1376/steps/VMTest/logs/stdio INFO: Generating updates required for this test suite in parallel. INFO: Generating a signed delta update payload from /b/cbuild/x86-alex-private-bin/src/scripts/ctest_target/chromiumos_test_image.bin to /b/cbuild/x86-alex-private-bin/src/scripts/ctest_target/chromiumos_qemu_image.bin INFO: Generating a signed delta update payload from /b/cbuild/x86-alex-private-bin/src/scripts/ctest_target/chromiumos_test_image.bin to /b/cbuild/x86-alex-private-bin/src/scripts/latest_download/chromiumos_qemu_image.bin INFO: Generating a signed delta update payload from /b/cbuild/x86-alex-private-bin/src/scripts/ctest_target/chromiumos_test_image.bin to /b/cbuild/x86-alex-private-bin/src/scripts/ctest_target/chromiumos_test_image.bin INFO: Running tests in test suite in parallel. Process Pool Active: Waiting on 3/3 jobs to complete Process Pool Active: Waiting on 1/3 jobs to complete Process Pool Active: Waiting on 1/3 jobs to complete INFO: testInterruptedUpdate: Performing a delta update from /b/cbuild/x86-alex-private-bin/src/scripts/ctest_target/chromiumos_qemu_image.bin to /b/cbuild/x86-alex-private-bin/src/scripts/ctest_target/chromiumos_test_image.bin in a VM INFO: testUpdateKeepStateful: Performing a delta update from /b/cbuild/x86-alex-private-bin/src/scripts/latest_download/chromiumos_qemu_image.bin to /b/cbuild/x86-alex-private-bin/src/scripts/ctest_target/chromiumos_test_image.bin in a VM INFO: testDelayedUpdate: Performing a delta update from /b/cbuild/x86-alex-private-bin/src/scripts/ctest_target/chromiumos_qemu_image.bin to /b/cbuild/x86-alex-private-bin/src/scripts/ctest_target/chromiumos_test_image.bin in a VM in a VM EStopping the KVM instance kill: No such process ====================================================================== ERROR: Tests if we can update normally. ---------------------------------------------------------------------- Traceback (most recent call last): File "/b/cbuild/x86-alex-private-bin/src/platform/crostestutils/au_test_harness/au_test.py", line 140, in testUpdateKeepStateful self.worker.PerformUpdate(self.target_image_path, self.base_image_path) File "/b/cbuild/x86-alex-private-bin/src/platform/crostestutils/au_test_harness/au_worker.py", line 134, in PerformUpdate proxy_port, key_to_use) File "/b/cbuild/x86-alex-private-bin/src/platform/crostestutils/au_test_harness/vm_au_worker.py", line 64, in UpdateImage self.RunUpdateCmd(cmd, log_directory) File "/b/cbuild/x86-alex-private-bin/src/platform/crostestutils/au_test_harness/au_worker.py", line 215, in RunUpdateCmd raise update_exception.UpdateException(1, str(e)) UpdateException: Command "['/b/cbuild/x86-alex-private-bin/src/scripts/lib/../bin/cros_run_vm_update', '--vm_image_path=/b/cbuild/x86-alex-private-bin/src/scripts/latest_download/chromiumos_qemu_image.bin', '--update_log=/b/cbuild/x86-alex-private-bin/chroot/tmp/cbuildbotQx4x85/test_harness/testUpdateKeepStateful/1_update/update_engine.log', '--snapshot', '--no_graphics', '--persist', '--kvm_pid=/tmp/kvm.9222', '--ssh_port=9222', '--stateful_update_flag=old', '--update_url=http://172.22.15.176:8080/update/cache/990be4e46044179c41240c596e65cd6f_0dcf5d16762fbc6d84789c60a5f034c7+207bb906877a2667a7c736e1e4cad4b8' failed. Output logged to /b/cbuild/x86-alex-private-bin/chroot/tmp/cbuildbotQx4x85/test_harness/testUpdateKeepStateful/1_update/update.log ---------------------------------------------------------------------- Ran 1 test in 111.804s FAILED (errors=1)
Sep 20, 2011
This might be another similar one: http://chromeos-botmaster.mtv.corp.google.com:8026/builders/x86-zgb-binary/builds/1326/steps/VMTest/logs/stdio
Sep 20, 2011
This one too: http://chromeos-botmaster.mtv.corp.google.com:8026/builders/stumpy-binary/builds/769/steps/VMTest/logs/stdio
Sep 20, 2011
(No comment was entered for this change.)
Summary:
test harness / kvm flakiness (ssh seems to fail in the VM)
Sep 20, 2011
This or something like it closes the tree not-infrequently. Can someone who understands the parallel VM test harness and/or the AU test harness step up here?
Status:
Available
Owner: --- Cc: sosa@chromium.org dgarrett@chromium.org petkov@chromium.org Labels: -Pri-1 -Area-Build Pri-0 Area-Test Mstone-R16
Sep 20, 2011
Don can you take a look?
Status:
Assigned
Owner: dgarrett@chromium.org
Sep 21, 2011
alex-binary build failed with this symptom in it's log:
http://chromeos-botmaster.mtv.corp.google.com:8026/builders/x86-zgb-binary/builds/1344/steps/VMTest/logs/stdio
Log to full build:
http://chromeos-botmaster.mtv.corp.google.com:8026/builders/x86-zgb-binary/builds/1344
Sep 21, 2011
My belief is that this error is always being seen in a call to image_to_live.sh from some nested part of the test framework, and that we are seeing the problem primarily because of the heavily loads and large number of invocations to image_to_live.
I also believe the problems relate to detecting the reboot of the VM and if it came up afterwards. If I'm right, this is probably the relevant code....
# Wait function given an additional timeout argument.
# $1 - pid to wait on.
# $2 - timeout to wait for.
function wait_with_timeout() {
local pid=$1
local timeout=$2
local -r TIMEOUT_INC=1
local current_timeout=0
while is_pid_running ${pid} && [ ${current_timeout} -lt ${timeout} ]; do
sleep ${TIMEOUT_INC}
current_timeout=$((current_timeout + TIMEOUT_INC))
done
! is_pid_running ${pid}
}
# Checks to see if a machine has rebooted using the presence of a tmp file.
function check_if_rebooted() {
local output_file="${TMP}/output"
while true; do
REMOTE_OUT=""
# This may fail while the machine is down so generate output and a
# boolean result to distinguish between down/timeout and real failure
! remote_sh_allow_changed_host_key \
"echo 0; [ -e /tmp/awaiting_reboot ] && echo '1'; true"
echo "${REMOTE_OUT}" > "${output_file}"
if grep -q "0" "${output_file}"; then
if grep -q "1" "${output_file}"; then
info "Not yet rebooted"
sleep .5
else
info "Rebooted and responding"
break
fi
fi
done
}
function remote_reboot() {
info "Rebooting."
remote_sh "touch /tmp/awaiting_reboot; reboot"
while true; do
check_if_rebooted &
local pid=$!
wait_with_timeout ${pid} 30 && break
! kill -9 ${pid} 2> /dev/null
done
}
Sep 21, 2011
This is output from a successful image_to_live.sh which matches the above logs. Perhaps that killed line is normal? ssh: connect to host 172.22.75.45 port 22: Connection refused ssh: connect to host 172.22.75.45 port 22: Connection refused ssh: connect to host 172.22.75.45 port 22: Connection refused ssh: connect to host 172.22.75.45 port 22: Connection refused /usr/local/google/home/dgarrett/chromiumos/src/platform/dev/host/../../../scripts/remote_access.sh: line 79: 22999 Killed check_if_rebooted ssh: connect to host 172.22.75.45 port 22: Connection refused ssh: connect to host 172.22.75.45 port 22: Connection refused ssh: connect to host 172.22.75.45 port 22: Connection refused ssh: connect to host 172.22.75.45 port 22: Connection refused ssh: connect to host 172.22.75.45 port 22: Connection refused ssh: connect to host 172.22.75.45 port 22: Connection refused ssh: connect to host 172.22.75.45 port 22: Connection refused Warning: Permanently added '172.22.75.45' (RSA) to the list of known hosts. INFO : Rebooted and responding INFO : New updated in /home/dgarrett/.ssh/known_hosts, backup made.
Sep 21, 2011
Okay, given the way sshd is configured on CROS machines, if you have 10 sessions that are in the middle of authentication, further connection attempts are closed with this error message. dgarrett$ssh chronos@172.22.75.45 ssh_exchange_identification: Connection closed by remote host
Sep 22, 2011
An alternative way to gather logs which might help would be to attempt to ssh into a KVM with "-v". I believe that we can do this remotely as long as you know what port the KVM is on. I expect this to fail, but want to see the verbose output. Hopefully, it'll tell us why. ssh -v chronos@buildbot-yada -p 92xx
Sep 24, 2011
20799 may be a duplicate of this issue.
Sep 26, 2011
The kill line is normal AFAIK because the update engine status is checked in a seperate thread and killed when we receive a completed update_engine_client --update return
Sep 26, 2011
(No comment was entered for this change.)
Owner:
sosa@chromium.org
Cc: -sosa@chromium.org
Sep 26, 2011
(No comment was entered for this change.)
Status:
Started
Sep 26, 2011
Issue 20318 has been merged into this issue.
Cc:
i...@chromium.org r...@chromium.org bdavi...@chromium.org
Sep 26, 2011
(No comment was entered for this change.)
Labels:
Iteration-39
Sep 26, 2011
Commit: faeee5f47cac2bfac8b8fbfa45a606b7c8d8d221 Email: sosa@chromium.org Add more debugging by re-running ssh commands that failed verbosely. This CL changes remote_ssh to re-run the command verbosely if the command failed with an SSH error (code 255). BUG=chromium-os:19198 TEST=Ran it. Simulated SSH error by closing KVM instance in cros_run_vm_update while it was intiializing. Saw re-run verbosely run correctly. Change-Id: I9f95aac3fb2d00bc7d26b0b9298409d8ce2db594 Reviewed-on: http://gerrit.chromium.org/gerrit/8313 Commit-Ready: Chris Sosa <sosa@chromium.org> Reviewed-by: Chris Sosa <sosa@chromium.org> Tested-by: Chris Sosa <sosa@chromium.org> M remote_access.sh
Sep 27, 2011
Commit: 10af6d3cb034f5c6f1f8d9e709df57ed1c637df7 Email: sosa@chromium.org We should not update the hostkey for vm testing. BUG=chromium-os:19198 TEST=Ran Change-Id: I40dceb66f14036b22e5ccc8fe665aad4a049ca64 Reviewed-on: http://gerrit.chromium.org/gerrit/8401 Reviewed-by: Chris Sosa <sosa@chromium.org> Tested-by: Chris Sosa <sosa@chromium.org> M cros_run_vm_update
Sep 28, 2011
Fix submitted in cros_run_vm_update. If you see again, please re-open
Status:
Fixed
Oct 10, 2011
(No comment was entered for this change.)
Status:
Verified
Nov 8, 2011
(No comment was entered for this change.)
Labels:
FixedIn-1098.0.0
Jan 9, 2012
(No comment was entered for this change.)
Labels:
-FixedIn-1098.0.0 FixedIn-1099.0.0
Jan 20, 2012
(No comment was entered for this change.)
Labels:
FixedInIndex-25
Mar 6, 2013
(No comment was entered for this change.)
Labels:
OS-Chrome
Mar 9, 2013
(No comment was entered for this change.)
Labels:
-Area-Test -TreeCloser -Mstone-R16 Cr-Test Hotlist-TreeCloser M-16
|
||||||||||||
| ► Sign in to add a comment | |||||||||||||