Issue 203394: Test Timeout on zgb canary
OK
Process Pool Active: Waiting on 4/4 jobs to complete
Process Pool Active: Waiting on 3/4 jobs to complete
Process Pool Active: Waiting on 2/4 jobs to complete
Process Pool Active: Waiting on 2/4 jobs to complete
Process Pool Active: Waiting on 2/4 jobs to complete
Process Pool Active: Waiting on 2/4 jobs to complete
Process Pool Active: Waiting on 1/4 jobs to complete
Process Pool Active: Waiting on 1/4 jobs to complete
Process Pool Active: Waiting on 1/4 jobs to complete
Process Pool Active: Waiting on 1/4 jobs to complete
Process Pool Active: Waiting on 1/4 jobs to complete
Process Pool Active: Waiting on 1/4 jobs to complete
Process Pool Active: Waiting on 1/4 jobs to complete
Traceback (most recent call last):
  File "bin/cros_au_test_harness", line 307, in <module>
    main()
  File "bin/cros_au_test_harness", line 294, in main
    _RunTestsInParallel(options)
  File "bin/cros_au_test_harness", line 181, in _RunTestsInParallel
    results = parallel_test_job.RunParallelJobs(options.jobs, threads, args)
  File "/b/cbuild/x86-zgb-canary-master/src/platform/crostestutils/au_test_harness/parallel_test_job.py", line 149, in RunParallelJobs
    ParallelJob.WaitUntilJobsComplete(parallel_jobs)
  File "/b/cbuild/x86-zgb-canary-master/src/platform/crostestutils/au_test_harness/parallel_test_job.py", line 91, in WaitUntilJobsComplete
    'job completion.' % cls.MAX_TIMEOUT_SECONDS)
parallel_test_job.ParallelJobTimeoutError: Exceeded max time of 2400 seconds to wait for job completion.
Traceback (most recent call last):
  File "bin/ctest", line 363, in <module>
    main()
  File "bin/ctest", line 359, in main
    options.test_results_root)
  File "bin/ctest", line 322, in RunAUTestHarness
    cros_lib.RunCommand(cmd, cwd=crosutils_root)
  File "/b/cbuild/x86-zgb-canary-master/src/platform/crostestutils/ctest/../../../../chromite/lib/cros_build_lib.py", line 177, in RunCommand
    raise RunCommandError(msg, cmd, proc.returncode)
chromite.lib.cros_build_lib.RunCommandError: ('Command "[\'bin/cros_au_test_harness\', \'--base_image=/b/cbuild/x86-zgb-canary-master/src/scripts/latest_download/chromiumos_test_image.bin\', \'--target_image=/b/cbuild/x86-zgb-canary-master/src/scripts/ctest_target/chromiumos_test_image.bin\', \'--board=x86-zgb\', \'--type=vm\', \'--remote=0.0.0.0\', \'--test_results_root=/b/cbuild/x86-zgb-canary-master/chroot/tmp/cbuildbotJzifjK/test_harness\', \'--no_graphics\', \'--clean\', \'--private_key=/b/cbuild/x86-zgb-canary-master/src/platform/crostestutils/ctest/../../../../src/scripts/../platform/update_engine/unittest_key.pem\', \'--public_key=public_key.pem\']" with extra env None\n', ['bin/cros_au_test_harness', '--base_image=/b/cbuild/x86-zgb-canary-master/src/scripts/latest_download/chromiumos_test_image.bin', '--target_image=/b/cbuild/x86-zgb-canary-master/src/scripts/ctest_target/chromiumos_test_image.bin', '--board=x86-zgb', '--type=vm', '--remote=0.0.0.0', '--test_results_root=/b/cbuild/x86-zgb-canary-master/chroot/tmp/cbuildbotJzifjK/test_harness', '--no_graphics', '--clean', '--private_key=/b/cbuild/x86-zgb-canary-master/src/platform/crostestutils/ctest/../../../../src/scripts/../platform/update_engine/unittest_key.pem', '--public_key=public_key.pem'], 1)
INFO: PROGRAM(cbuildbot) -> RunCommand: ['tar', 'czf', '/b/cbuild/x86-zgb-canary-master/test_results.tgz', '--directory=/b/cbuild/x86-zgb-canary-master/chroot/tmp/cbuildbotJzifjK', '.'] in dir None
@@@STEP_FAILURE@@@
Traceback (most recent call last):
  File "/b/build/slave/x86-zgb-canary-master/build/chromite/buildbot/cbuildbot_stages.py", line 252, in Run
    self._PerformStage()
  File "/b/build/slave/x86-zgb-canary-master/build/chromite/buildbot/cbuildbot_stages.py", line 780, in _PerformStage
    full=(not self._build_config['quick_vm']))
  File "/b/build/slave/x86-zgb-canary-master/build/chromite/buildbot/cbuildbot_commands.py", line 338, in RunTestSuite
    cros_lib.OldRunCommand(cmd, cwd=cwd, error_ok=False)
  File "/b/build/slave/x86-zgb-canary-master/build/chromite/lib/cros_build_lib.py", line 669, in OldRunCommand
    raise e
RunCommandException: ('Command "[\'bin/ctest\', \'--board=x86-zgb\', \'--channel=dev-channel\', \'--zipbase=http://chromeos-images.corp.google.com\' \'--type=vm\', \'--no_graphics\', \'--target_image=/b/cbuild/x86-zgb-canary-master/src/build/images/x86-zgb/latest-cbuildbot/chromiumos_test_image.bin\', \'--test_results_root=/b/cbuild/x86-zgb-canary-master/chroot/tmp/cbuildbotJzifjK/test_harness\']" failed.\n', ['bin/ctest', '--board=x86-zgb', '--channel=dev-channel', '--zipbase=http://chromeos-images.corp.google.com' '--type=vm', '--no_graphics', '--target_image=/b/cbuild/x86-zgb-canary-master/src/build/images/x86-zgb/latest-cbuildbot/chromiumos_test_image.bin', '--test_results_root=/b/cbuild/x86-zgb-canary-master/chroot/tmp/cbuildbotJzifjK/test_harness'])
************************************************************
** Finished Stage VMTest - 19:17:32
************************************************************
Traceback (most recent call last):
  File "/b/build/slave/x86-zgb-canary-master/build/chromite/buildbot/cbuildbot_background.py", line 96, in run
    step()
  File "/b/build/slave/x86-zgb-canary-master/build/chromite/buildbot/cbuildbot_stages.py", line 256, in Run
    raise BuildException()
BuildException
--------------------------------------------------------------------------------
started: Sun Sep 11 16:30:41 2011
ended: Sun Sep 11 19:15:52 2011
duration: 2 hrs, 45 mins, 11 secs
Mar 16, 2013
#1
bugdroid1@chromium.org
Blockedon:
chromium:98257 chromium-os:20955
Sep 13, 2011
It's hard to tell from the output what process did not respond within the timeout period.
Owner:
davidjames@chromium.org
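The comment above points at the core observability gap: WaitUntilJobsComplete raises after MAX_TIMEOUT_SECONDS without saying which job hung. A minimal sketch of a wait loop that names the stragglers in the exception (names and API here are hypothetical, not the actual parallel_test_job.py code):

```python
import time

class ParallelJobTimeoutError(Exception):
    pass

def wait_until_jobs_complete(jobs, max_timeout=2400, poll_interval=5):
    """Poll jobs until all finish; on timeout, name the ones still running.

    `jobs` is any list of objects with a .name and an .is_alive() method
    (e.g. thin wrappers around multiprocessing.Process).
    """
    deadline = time.time() + max_timeout
    while time.time() < deadline:
        pending = [j for j in jobs if j.is_alive()]
        if not pending:
            return
        time.sleep(poll_interval)
    # Timeout hit: report exactly which jobs never completed.
    stragglers = ', '.join(j.name for j in jobs if j.is_alive())
    raise ParallelJobTimeoutError(
        'Exceeded max time of %s seconds waiting for: %s'
        % (max_timeout, stragglers))
```

With that, a failure would read "waiting for: testUpdateWipeStateful" instead of the bare timeout message, which would have saved some of the log spelunking seen later in this thread.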
Sep 19, 2011
Another failure:

parallel_test_job.ParallelJobTimeoutError: Exceeded max time of 2400 seconds to wait for job completion.

From test_harness/testUpdateWipeStateful/1_update/update.log in http://chromeos-botmaster.mtv.corp.google.com:8026/builders/x86-alex-binary/builds/1425:

INFO : Starting update using args --update --omaha_url http://172.22.15.176:8080/update/cache/2ef034326c8cae81a7e386a11c4258af_df0e900d406494bc7191ae0d27e7a06b+207bb906877a2667a7c736e1e4cad4b8
INFO : Devserver handling ping. Check dev_server.log for more info.
[0919/172247:INFO:update_engine_client.cc(280)] Forcing an update by setting app_version to ForcedUpdate.
[0919/172247:INFO:update_engine_client.cc(282)] Initiating update check and install.
[0919/172247:INFO:update_engine_client.cc(288)] Waiting for update to complete.
INFO : State of updater has changed to: UPDATE_STATUS_UPDATE_AVAILABLE
........%
Sep 20, 2011
Same failure: http://chromeos-botmaster.mtv.corp.google.com:8026/builders/x86-alex-binary/builds/1432/steps/VMTest/logs/stdio

parallel_test_job.ParallelJobTimeoutError: Exceeded max time of 2400 seconds to wait for job completion.

From [test_results.tgz]/test_harness/testUpdateKeepStateful/1_update/update.log:

Connection OK
INFO : Forcing update independent of the current version
INFO : Waiting to initiate contact with the update_engine.
INFO : Starting update using args --update --omaha_url http://172.22.15.176:8080/update/cache/b6e4bfb2353bee938db25ec4259589b0_cc7a8a1be5451004e72d3362fd4ea78f+207bb906877a2667a7c736e1e4cad4b8
INFO : Devserver handling ping. Check dev_server.log for more info.
[0920/050730:INFO:update_engine_client.cc(280)] Forcing an update by setting app_version to ForcedUpdate.
[0920/050730:INFO:update_engine_client.cc(282)] Initiating update check and install.
[0920/050730:INFO:update_engine_client.cc(288)] Waiting for update to complete.
INFO : State of updater has changed to: UPDATE_STATUS_UPDATE_AVAILABLE
......
Sep 20, 2011
VMTests failed due to a timeout in testUpdateWipeStateful. Update log is:

Starting a KVM instance
pci_add_option_rom: failed to find romfile "pxe-virtio.bin"
INFO : KVM started with pid stored in /tmp/kvm.9223
Initiating first contact with remote host
Warning: Permanently added '[127.0.0.1]:9223' (RSA) to the list of known hosts.
Connection OK
WARNING : Verify is not compatible with setting an update url.
Initiating first contact with remote host
Warning: Permanently added '[127.0.0.1]:9223' (RSA) to the list of known hosts.
Connection OK
INFO : Forcing update independent of the current version
INFO : Waiting to initiate contact with the update_engine.
INFO : Starting update using args --update --omaha_url http://172.22.15.182:8080/update/cache/7ef8cdeb11b4ab586fefffa02016e7c3_a11c5e3410d1d7352ae972b5d4d7682b+207bb906877a2667a7c736e1e4cad4b8
INFO : Devserver handling ping. Check dev_server.log for more info.
[0920/085950:INFO:update_engine_client.cc(280)] Forcing an update by setting app_version to ForcedUpdate.
[0920/085950:INFO:update_engine_client.cc(282)] Initiating update check and install.
[0920/085950:INFO:update_engine_client.cc(288)] Waiting for update to complete.
INFO : State of updater has changed to: UPDATE_STATUS_UPDATE_AVAILABLE
.........[0920/090146:INFO:update_engine_client.cc(104)] Error getting status: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
INFO : State of updater has changed to: (null)
INFO : State of updater has changed to: UPDATE_STATUS_UPDATE_AVAILABLE
...INFO : State of updater has changed to: UPDATE_STATUS_DOWNLOADING
Download progress 0.224391
Download progress 0.254483
Download progress 0.314925
Download progress 0.642296
Download progress 0.915595
[0920/090404:INFO:update_engine_client.cc(104)] Error getting status: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
INFO : State of updater has changed to: UPDATE_STATUS_FINALIZING
......

Logs at http://chromeos-botmaster.mtv.corp.google.com:8026/builders/stumpy-binary/builds/769
Sep 20, 2011
Issue 20626 has been merged into this issue.
Sep 20, 2011
We're hoping to reduce the incidence of this by turning off parallelism, and that doing so won't nuke cycle time.
Status:
Assigned
Labels: -Area-Undefined Area-Test Mstone-R16
Sep 20, 2011
Commit: fc2179d01f943a1d89f1b3a3740b82a19c4bb0e2
Email: davidjames@chromium.org

Reserve more RAM for background processes to be more conservative.

We've seen a lot of timeouts during tests on our buildbots. These timeouts might be related to our buildbots being overloaded by so many jobs running in parallel. Let's try allocating more RAM to background processes in hopes of reducing flaky failures.

BUG=chromium-os:20305
TEST=Verify this limits jobs to 2 jobs on machines with 12GB RAM. On machines with 64GB RAM we can run more jobs without slowing the tests.
Change-Id: I806f692351e75b592484050745859541162828fc
Reviewed-on: http://gerrit.chromium.org/gerrit/7991
Reviewed-by: Matt Tennant <mtennant@chromium.org>
Tested-by: David James <davidjames@chromium.org>

M au_test_harness/cros_au_test_harness.py
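The commit above caps VM test parallelism by machine RAM (two jobs on a 12GB builder). The sizing logic can be sketched like this; the per-job and reserved figures below are assumptions for illustration, not the constants in cros_au_test_harness.py:

```python
def max_parallel_vm_jobs(total_ram_gb, cpu_count,
                         ram_per_job_gb=4, reserved_ram_gb=4):
    """Cap parallel VM test jobs by available RAM.

    Each KVM instance plus its test harness needs a RAM budget; reserving
    headroom for background processes keeps a loaded builder from slowing
    down so much that job timeouts fire. Figures here are illustrative.
    """
    by_ram = max(1, (total_ram_gb - reserved_ram_gb) // ram_per_job_gb)
    return min(cpu_count, by_ram)
```

With these example figures, a 12GB/8-core builder gets 2 jobs while a 64GB machine can run many more, matching the shape of the behavior described in the commit message.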
Sep 21, 2011
http://chromeos-botmaster.mtv.corp.google.com:8026/builders/stumpy%20canary/builds/351/steps/VMTest/logs/stdio

Process Pool Active: Waiting on 2/4 jobs to complete
Process Pool Active: Waiting on 2/4 jobs to complete
Process Pool Active: Waiting on 2/4 jobs to complete
Process Pool Active: Waiting on 2/4 jobs to complete
Process Pool Active: Waiting on 2/4 jobs to complete
Process Pool Active: Waiting on 2/4 jobs to complete
Process Pool Active: Waiting on 2/4 jobs to complete
Process Pool Active: Waiting on 1/4 jobs to complete
Process Pool Active: Waiting on 1/4 jobs to complete
Process Pool Active: Waiting on 1/4 jobs to complete
Process Pool Active: Waiting on 1/4 jobs to complete
Process Pool Active: Waiting on 1/4 jobs to complete
Process Pool Active: Waiting on 1/4 jobs to complete
Traceback (most recent call last):
  File "bin/cros_au_test_harness", line 323, in <module>
    main()
  File "bin/cros_au_test_harness", line 310, in main
    _RunTestsInParallel(options)
  File "bin/cros_au_test_harness", line 181, in _RunTestsInParallel
    results = parallel_test_job.RunParallelJobs(options.jobs, threads, args)
  File "/b/cbuild/stumpy-canary-master/src/platform/crostestutils/au_test_harness/parallel_test_job.py", line 149, in RunParallelJobs
    ParallelJob.WaitUntilJobsComplete(parallel_jobs)
  File "/b/cbuild/stumpy-canary-master/src/platform/crostestutils/au_test_harness/parallel_test_job.py", line 91, in WaitUntilJobsComplete
    'job completion.' % cls.MAX_TIMEOUT_SECONDS)
parallel_test_job.ParallelJobTimeoutError: Exceeded max time of 2400 seconds to wait for job completion.
Sep 21, 2011
P0 until we at least get the de-parallelizing change in.
Labels:
-Pri-1 Pri-0
Sep 21, 2011
So the CL to reduce parallelism to 2 was pushed yesterday. We could reduce jobs further, but based on logs from stumpy canary and stumpy binary, I don't think the machine is running out of RAM or being overloaded anymore with two jobs. Two jobs seems decent for our hardware, and the jobs are completing at a reasonably fast pace. We don't see any more errors where the connection times out during banner exchange.

We do see the following error:

[0920/090146:INFO:update_engine_client.cc(104)] Error getting status: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.

adlr, could you investigate the above error? One bug I noticed in the update engine: the GetStatus function in update_engine_client.cc seems to return success even if it fails to connect.
Owner:
adlr@chromium.org
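The suspected GetStatus bug described above is a classic swallowed-error pattern: the status query catches the failed IPC call and falls through to default field values, so a dead or hung update_engine looks like a healthy idle one. A simplified Python stand-in for illustration (the real code is C++ over D-Bus):

```python
class IPCError(Exception):
    """Stand-in for a failed D-Bus call ('Did not receive a reply')."""

def get_status_buggy(query):
    # Bug pattern: the error is swallowed and defaults are returned, so
    # callers see CURRENT_OP=UPDATE_STATUS_IDLE instead of a failure.
    try:
        return True, query()
    except IPCError:
        return True, {'CURRENT_OP': 'UPDATE_STATUS_IDLE'}

def get_status_fixed(query):
    # Fix sketch: surface the failure so the harness can retry or abort
    # instead of treating an unreachable daemon as idle.
    try:
        return True, query()
    except IPCError:
        return False, None
```

This matches the symptom seen later in the thread, where a freshly rebooted VM reports CURRENT_OP=UPDATE_STATUS_IDLE while the harness keeps waiting for the update to finish.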
Sep 21, 2011
Commit: f975b7fc8e60a55ec733d0947975669637ae8b3b
Email: adlr@chromium.org

image_to_live.sh: get updater log in real time

We are seeing hangs of image_to_live.sh, which may be due to update_engine hangs. We currently don't get the update engine log if this process is killed, so this CL changes that. We now use tail -f to get the log in real time.

BUG=chromium-os:20305
TEST=tested image_to_live.sh by hand
Change-Id: I1a0e635df3c7a61a75146a97c42b95b4a9313218
Reviewed-on: http://gerrit.chromium.org/gerrit/8083
Reviewed-by: David James <davidjames@chromium.org>
Tested-by: Andrew de los Reyes <adlr@chromium.org>

M host/image_to_live.sh
Sep 21, 2011
Commit: e08639b8597b4de81e65eee46e595a52637f08fe
Email: adlr@chromium.org

remote_access.sh: add remote_sh_raw for a non-buffered ssh connection.

BUG=chromium-os:20305
TEST=manual test of image_to_live.sh
Change-Id: I964088673d0dc2867e911368d9cc4ec73085a4f4
Reviewed-on: http://gerrit.chromium.org/gerrit/8082
Reviewed-by: David James <davidjames@chromium.org>
Tested-by: Andrew de los Reyes <adlr@chromium.org>

M remote_access.sh
Sep 23, 2011
Saw one again: http://chromeos-botmaster.mtv.corp.google.com:8026/builders/x86-zgb%20canary/builds/830/steps/VMTest/logs/stdio

This one didn't have logs because the ssh logging process wasn't interrupted properly when the builder got killed. adlr is working on a fix.
Sep 25, 2011
Another one on alex-binary: http://chromeos-botmaster.mtv.corp.google.com:8026/builders/x86-alex%20canary/builds/1047/steps/VMTest/logs/stdio

Process Pool Active: Waiting on 2/4 jobs to complete
Process Pool Active: Waiting on 2/4 jobs to complete
Process Pool Active: Waiting on 2/4 jobs to complete
Process Pool Active: Waiting on 1/4 jobs to complete
Process Pool Active: Waiting on 1/4 jobs to complete
Process Pool Active: Waiting on 1/4 jobs to complete
Process Pool Active: Waiting on 1/4 jobs to complete
Process Pool Active: Waiting on 1/4 jobs to complete
Process Pool Active: Waiting on 1/4 jobs to complete
Process Pool Active: Waiting on 1/4 jobs to complete
Process Pool Active: Waiting on 1/4 jobs to complete
Process Pool Active: Waiting on 1/4 jobs to complete
Traceback (most recent call last):
  File "bin/cros_au_test_harness", line 323, in <module>
    main()
  File "bin/cros_au_test_harness", line 310, in main
    _RunTestsInParallel(options)
  File "bin/cros_au_test_harness", line 181, in _RunTestsInParallel
    results = parallel_test_job.RunParallelJobs(options.jobs, threads, args)
  File "/b/cbuild/x86-alex-canary-master/src/platform/crostestutils/au_test_harness/parallel_test_job.py", line 149, in RunParallelJobs
    ParallelJob.WaitUntilJobsComplete(parallel_jobs)
  File "/b/cbuild/x86-alex-canary-master/src/platform/crostestutils/au_test_harness/parallel_test_job.py", line 91, in WaitUntilJobsComplete
    'job completion.' % cls.MAX_TIMEOUT_SECONDS)
parallel_test_job.ParallelJobTimeoutError: Exceeded max time of 2400 seconds to wait for job completion.
Sep 26, 2011
Hm, it appears the updater is stuck waiting for /bin/sync to run. I'm not sure why this would be problematic...
Sep 27, 2011
Another case on the commit queue bot http://chromegw.corp.google.com/i/chromiumos/builders/x86%20generic%20commit%20queue/builds/6670/steps/VMTest/logs/stdio Full logs: http://www/~nkostylev/no_crawl/cros/20305/
Sep 27, 2011
Update.log attached. This looks like it may be a different problem from the earlier one where the updater was stuck on /bin/sync. Note that the last update to update.log is at 3:01am, but the job was not killed until 04:38am.
Sep 27, 2011
(No comment was entered for this change.)
Blockedon:
chromium:98257
Sep 27, 2011
Commit: 8e93647f050f78ffa8a8aa674694269aefbb12c5
Email: adlr@chromium.org

kvm: set cache type to unsafe if possible, else writeback

We try to set the cache type to unsafe, which is a new kvm feature, else we set the cache type to writeback. The hope is that these cache types will perform better on the bots, especially in the case of the guest trying to flush data to disk. The CL includes code to determine which version of kvm is running, and the cache type is chosen based on that.

BUG=chromium-os:20305
TEST=trybot
Change-Id: Icf7dbb8f0056a330282ebeb81d0902b3d54f34c5
Reviewed-on: http://gerrit.chromium.org/gerrit/8360
Reviewed-by: David James <davidjames@chromium.org>
Tested-by: Andrew de los Reyes <adlr@chromium.org>

M lib/cros_vm_lib.sh
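The cache-type selection in the commit above hinges on a version check, since cache=unsafe is only understood by newer kvm/qemu. A sketch of that decision (the actual CL is shell in cros_vm_lib.sh; the 0.14 threshold is an assumption based on this thread's later mention of qemu 0.14.1):

```python
def kvm_cache_type(version_str, min_unsafe=(0, 14)):
    """Pick a -drive cache= value based on the kvm/qemu version string.

    'unsafe' ignores guest flush requests entirely, which avoids stalls
    when the guest syncs to disk on a loaded builder; older qemu falls
    back to 'writeback'. The threshold here is illustrative.
    """
    parts = tuple(int(p) for p in version_str.split('.')[:2])
    return 'unsafe' if parts >= min_unsafe else 'writeback'
```

The chosen value would then be spliced into the kvm command line, e.g. `-drive file=...,cache=unsafe`, as seen in the ps output later in this thread.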
Sep 28, 2011
Thanks to quick work by adlr and pschmidt, this bug should be fixed on external bots. Internal bots may still suffer from the problem because they don't have qemu 0.14.1 yet. I've filed Bug 20955 for fixing the internal bots and hence am marking it as a blocker.
Blockedon:
chromium-os:20955
Sep 29, 2011
Issue 20276 has been merged into this issue.
Cc:
r...@chromium.org davidjames@chromium.org
Sep 29, 2011
Even with adlr's fix, this problem is still happening on internal bots thanks to Bug 20955 (kvm has not been updated yet). Example failure: http://chromeos-botmaster.mtv.corp.google.com:8026/builders/stumpy%20canary/builds/378

INFO : State of updater has changed to: UPDATE_STATUS_UPDATE_AVAILABLE
.......[0928/012521:INFO:update_engine_client.cc(104)] Error getting status: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
INFO : State of updater has changed to: (null)
[0928/012546:INFO:update_engine_client.cc(104)] Error getting status: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
.[0928/012611:INFO:update_engine_client.cc(104)] Error getting status: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
[0928/012636:INFO:update_engine_client.cc(104)] Error getting status: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
.[0928/012701:INFO:update_engine_client.cc(104)] Error getting status: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
Sep 29, 2011
This is still happening even on external bots. See http://build.chromium.org/p/chromiumos/builders/x86%20generic%20commit%20queue/builds/6969/steps/VMTest/logs/stdio

Here is the log directory for the test_results.tgz in case the above gets wiped: https://sandbox.google.com/storage/chromeos-image-archive/x86-generic-commit-queue/R16-1103.0.0-a1-b6969

INFO : State of updater has changed to: UPDATE_STATUS_DOWNLOADING
Download progress 0.110595
Download progress 0.201129
Download progress 0.249235
Download progress 0.339991
Download progress 0.420662
Download progress 0.501333
Download progress 0.602173
Download progress 0.682844
[0929/111156:INFO:update_engine_client.cc(104)] Error getting status: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
INFO : State of updater has changed to: (null)
[0929/111221:INFO:update_engine_client.cc(104)] Error getting status: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
.[0929/111246:INFO:update_engine_client.cc(104)] Error getting status: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
.[0929/111311:INFO:update_engine_client.cc(104)] Error getting status: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
.[0929/111336:INFO:update_engine_client.cc(104)] Error getting status: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
.[0929/111401:INFO:update_engine_client.cc(104)] Error getting status: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
[0929/111426:INFO:update_engine_client.cc(104)] Error getting status: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
.[0929/111451:INFO:update_engine_client.cc(104)] Error getting status: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
.[0929/111516:INFO:update_engine_client.cc(104)] Error getting status: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
Sep 29, 2011
Caught this hang in action on the commit queue: http://build.chromium.org/p/chromiumos/builders/x86%20generic%20commit%20queue/builds/6971

I stopped the processes in question so I could grab a snapshot. pstree attached as update-pstree.txt.

$ ps -eo pid,ppid,etime,cmd | grep -E 'ssh -p|kvm ' | grep -v 'color=auto'
22842 22841 01:15:17 ssh -p 9226 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/tmp/image_to_live.NXnn/known_hosts -o ConnectTimeout=120 -i /tmp/image_to_live.NXnn/private_key root@127.0.0.1 /usr/bin/update_engine_client --status 2> /dev/null | grep CURRENT_OP | cut -f 2 -d =
26720 1 01:18:59 kvm -m 1024 -vga std -pidfile /tmp/kvm.9226 -daemonize -net nic,model=virtio -nographic -serial none -snapshot -net user,hostfwd=tcp::9226-:22 -drive file=/b/cbuild/src/build/images/x86-generic/latest-cbuildbot/chromiumos_qemu_image.bin,index=0,media=disk,cache=unsafe
31520 30825 01:18:38 ssh -p 9226 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/tmp/image_to_live.NXnn/known_hosts -o ConnectTimeout=120 -i /tmp/image_to_live.NXnn/private_key root@127.0.0.1 /usr/bin/update_engine_client --update --omaha_url http://192.168.120.13:8080/update/cache/588b8f5fe617a7e1799d2435a2fc8813
31522 31519 01:18:38 ssh -p 9226 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/tmp/image_to_live.NXnn/known_hosts -o ConnectTimeout=120 -i /tmp/image_to_live.NXnn/private_key -t -t root@127.0.0.1 tail -n +0 -f /var/log/update_engine.log

I ssh'd into the kvm in question and found that it had only been up for 1h15m, which suggests that the update_engine_client was still hanging even after the machine had rebooted. Running update_engine_client on the machine, I get the following:

$ /usr/bin/update_engine_client --status
[0929/151438:INFO:update_engine_client.cc(253)] Querying Update Engine status...
LAST_CHECKED_TIME=0
PROGRESS=0.000000
CURRENT_OP=UPDATE_STATUS_IDLE
NEW_VERSION=0.0.0.0
NEW_SIZE=0
Sep 30, 2011
Commit: f5850903d901db2ffe66dee7244465f845a0b2c2
Email: davidjames@chromium.org

Tweak ssh settings to detect when remote server dies.

Right now, it seems that we don't consistently detect when the kvm is rebooted. To improve this, I've added ServerKeepAlive messages, such that our connection will be dropped if the target host is unreachable for 15 seconds. Further, I've also tightened the ConnectTimeout from 120 seconds to 30 seconds so that we detect timeouts earlier, but added 4 retries, so that all connections are robust to temporary unresponsiveness (e.g. ssh server restarting).

BUG=chromium-os:21082, chromium-os:20305
TEST=alex-binary trybot run.
Change-Id: Ibd000bc3158138ee9c617e56bd7b11d4bd8bcb61
Reviewed-on: http://gerrit.chromium.org/gerrit/8574
Reviewed-by: Chris Sosa <sosa@chromium.org>
Tested-by: David James <davidjames@chromium.org>
Commit-Ready: David James <davidjames@chromium.org>

M remote_access.sh
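The ssh hardening in the commit above combines keepalives (drop dead connections quickly) with a shorter connect timeout plus caller-side retries (tolerate transient unreachability). A sketch of the resulting option set; the helper name and exact values are illustrative, not the literal remote_access.sh flags:

```python
def ssh_base_args(port, known_hosts, key,
                  connect_timeout=30, alive_interval=15, alive_count=1):
    """Build ssh args that notice a rebooting or hung VM quickly.

    ServerAliveInterval/ServerAliveCountMax drop the session if the VM
    stops answering (e.g. mid-reboot) instead of hanging forever; the
    caller is expected to retry the whole command a few times.
    """
    return [
        'ssh', '-p', str(port),
        '-o', 'StrictHostKeyChecking=no',
        '-o', 'UserKnownHostsFile=%s' % known_hosts,
        '-o', 'ConnectTimeout=%d' % connect_timeout,
        '-o', 'ServerAliveInterval=%d' % alive_interval,
        '-o', 'ServerAliveCountMax=%d' % alive_count,
        '-i', key,
    ]
```

Compare with the ps snapshot earlier in the thread, where the hanging sessions ran with ConnectTimeout=120 and no keepalive options at all.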
Oct 7, 2011
Commit: e59822eb23bfa5741d4b0b9bd478fb1355ed7ffc
Email: davidjames@chromium.org

Reserve more RAM for background processes to be more conservative again.

On machines with 12GB RAM or less, we need to limit the number of parallel VMs to two to prevent the machine from becoming so slow that timeouts occur. This CL was pushed on Sep 20, and seemed to fix the banner timeout issues. Then, this CL was accidentally reverted on Oct 3 and the timeouts started occurring again.

BUG=chromium-os:20305, chromium-os:21351, chromium-os:21389
TEST=Verify this limits jobs to 2 jobs on machines with 12GB RAM. On machines with more RAM we can run more jobs without slowing the tests.
Change-Id: I437ffa97847f6acb693e7b4295f85ef2af1d366a
Reviewed-on: http://gerrit.chromium.org/gerrit/9734
Reviewed-by: Darin Petkov <petkov@chromium.org>
Tested-by: David James <davidjames@chromium.org>

M lib/test_helper.py
Oct 11, 2011
(No comment was entered for this change.)
Status:
Fixed
Oct 17, 2011
Claiming work for iteration-40
Labels:
Iteration-40
Oct 19, 2011
(No comment was entered for this change.)
Labels:
VerifyIn-R16
Oct 19, 2011
(No comment was entered for this change.)
Labels:
-VerifyIn-R16
Oct 24, 2011
(No comment was entered for this change.)
Status:
Verified
Nov 8, 2011
(No comment was entered for this change.)
Labels:
FixedIn-1111.0.0
Nov 17, 2011
Looks like we have hit this again on x86-alex-binary:
From http://chromegw/i/chromeos/builders/x86-alex-binary/builds/2028/steps/cbuildbot/logs/stdio :
Ran 1 test in 801.039s
OK
Process Pool Active: Waiting on 3/6 jobs to complete
Process Pool Active: Waiting on 3/6 jobs to complete
Process Pool Active: Waiting on 3/6 jobs to complete
Process Pool Active: Waiting on 3/6 jobs to complete
Process Pool Active: Waiting on 1/6 jobs to complete
Process Pool Active: Waiting on 1/6 jobs to complete
Process Pool Active: Waiting on 1/6 jobs to complete
Process Pool Active: Waiting on 1/6 jobs to complete
Process Pool Active: Waiting on 1/6 jobs to complete
Process Pool Active: Waiting on 1/6 jobs to complete
Process Pool Active: Waiting on 1/6 jobs to complete
Process Pool Active: Waiting on 1/6 jobs to complete
Traceback (most recent call last):
  File "bin/cros_au_test_harness", line 207, in <module>
    main()
  File "bin/cros_au_test_harness", line 194, in main
    _RunTestsInParallel(options)
  File "bin/cros_au_test_harness", line 92, in _RunTestsInParallel
    results = parallel_test_job.RunParallelJobs(options.jobs, threads, args)
  File "/b/cbuild/x86-alex-private-bin/src/platform/crostestutils/au_test_harness/../../crostestutils/lib/parallel_test_job.py", line 149, in RunParallelJobs
    ParallelJob.WaitUntilJobsComplete(parallel_jobs)
  File "/b/cbuild/x86-alex-private-bin/src/platform/crostestutils/au_test_harness/../../crostestutils/lib/parallel_test_job.py", line 91, in WaitUntilJobsComplete
    'job completion.' % cls.MAX_TIMEOUT_SECONDS)
crostestutils.lib.parallel_test_job.ParallelJobTimeoutError: Exceeded max time of 2400 seconds to wait for job completion.
@@@STEP_FAILURE@@@
Jan 20, 2012
(No comment was entered for this change.)
Labels:
FixedInIndex-26
Mar 6, 2013
(No comment was entered for this change.)
Labels:
OS-Chrome
Blockedon: -chromium:98257 -chromium-os:20955 chromium:98257 chromium-os:20955
Mar 9, 2013
(No comment was entered for this change.)
Labels:
-Area-Test -TreeCloser -Mstone-R16 Cr-Test Hotlist-TreeCloser M-16