Issue 203394: Test Timeout on zgb canary.
4 people starred this issue and may be notified of changes.
 
Project Member Reported by petermayo@chromium.org, Sep 11, 2011


OK
Process Pool Active: Waiting on 4/4 jobs to complete
Process Pool Active: Waiting on 3/4 jobs to complete
Process Pool Active: Waiting on 2/4 jobs to complete
Process Pool Active: Waiting on 2/4 jobs to complete
Process Pool Active: Waiting on 2/4 jobs to complete
Process Pool Active: Waiting on 2/4 jobs to complete
Process Pool Active: Waiting on 1/4 jobs to complete
Process Pool Active: Waiting on 1/4 jobs to complete
Process Pool Active: Waiting on 1/4 jobs to complete
Process Pool Active: Waiting on 1/4 jobs to complete
Process Pool Active: Waiting on 1/4 jobs to complete
Process Pool Active: Waiting on 1/4 jobs to complete
Process Pool Active: Waiting on 1/4 jobs to complete
Traceback (most recent call last):
  File "bin/cros_au_test_harness", line 307, in <module>
    main()
  File "bin/cros_au_test_harness", line 294, in main
    _RunTestsInParallel(options)
  File "bin/cros_au_test_harness", line 181, in _RunTestsInParallel
    results = parallel_test_job.RunParallelJobs(options.jobs, threads, args)
  File "/b/cbuild/x86-zgb-canary-master/src/platform/crostestutils/au_test_harness/parallel_test_job.py", line 149, in RunParallelJobs
    ParallelJob.WaitUntilJobsComplete(parallel_jobs)
  File "/b/cbuild/x86-zgb-canary-master/src/platform/crostestutils/au_test_harness/parallel_test_job.py", line 91, in WaitUntilJobsComplete
    'job completion.' % cls.MAX_TIMEOUT_SECONDS)
parallel_test_job.ParallelJobTimeoutError: Exceeded max time of 2400 seconds to wait for job completion.
Traceback (most recent call last):
  File "bin/ctest", line 363, in <module>
    main()
  File "bin/ctest", line 359, in main
    options.test_results_root)
  File "bin/ctest", line 322, in RunAUTestHarness
    cros_lib.RunCommand(cmd, cwd=crosutils_root)
  File "/b/cbuild/x86-zgb-canary-master/src/platform/crostestutils/ctest/../../../../chromite/lib/cros_build_lib.py", line 177, in RunCommand
    raise RunCommandError(msg, cmd, proc.returncode)
chromite.lib.cros_build_lib.RunCommandError: ('Command "[\'bin/cros_au_test_harness\', \'--base_image=/b/cbuild/x86-zgb-canary-master/src/scripts/latest_download/chromiumos_test_image.bin\', \'--target_image=/b/cbuild/x86-zgb-canary-master/src/scripts/ctest_target/chromiumos_test_image.bin\', \'--board=x86-zgb\', \'--type=vm\', \'--remote=0.0.0.0\', \'--test_results_root=/b/cbuild/x86-zgb-canary-master/chroot/tmp/cbuildbotJzifjK/test_harness\', \'--no_graphics\', \'--clean\', \'--private_key=/b/cbuild/x86-zgb-canary-master/src/platform/crostestutils/ctest/../../../../src/scripts/../platform/update_engine/unittest_key.pem\', \'--public_key=public_key.pem\']" with extra env None\n', ['bin/cros_au_test_harness', '--base_image=/b/cbuild/x86-zgb-canary-master/src/scripts/latest_download/chromiumos_test_image.bin', '--target_image=/b/cbuild/x86-zgb-canary-master/src/scripts/ctest_target/chromiumos_test_image.bin', '--board=x86-zgb', '--type=vm', '--remote=0.0.0.0', '--test_results_root=/b/cbuild/x86-zgb-canary-master/chroot/tmp/cbuildbotJzifjK/test_harness', '--no_graphics', '--clean', '--private_key=/b/cbuild/x86-zgb-canary-master/src/platform/crostestutils/ctest/../../../../src/scripts/../platform/update_engine/unittest_key.pem', '--public_key=public_key.pem'], 1)

INFO: PROGRAM(cbuildbot) -> RunCommand: ['tar', 'czf', '/b/cbuild/x86-zgb-canary-master/test_results.tgz', '--directory=/b/cbuild/x86-zgb-canary-master/chroot/tmp/cbuildbotJzifjK', '.'] in dir None
@@@STEP_FAILURE@@@
Traceback (most recent call last):
  File "/b/build/slave/x86-zgb-canary-master/build/chromite/buildbot/cbuildbot_stages.py", line 252, in Run
    self._PerformStage()
  File "/b/build/slave/x86-zgb-canary-master/build/chromite/buildbot/cbuildbot_stages.py", line 780, in _PerformStage
    full=(not self._build_config['quick_vm']))
  File "/b/build/slave/x86-zgb-canary-master/build/chromite/buildbot/cbuildbot_commands.py", line 338, in RunTestSuite
    cros_lib.OldRunCommand(cmd, cwd=cwd, error_ok=False)
  File "/b/build/slave/x86-zgb-canary-master/build/chromite/lib/cros_build_lib.py", line 669, in OldRunCommand
    raise e
RunCommandException: ('Command "[\'bin/ctest\', \'--board=x86-zgb\', \'--channel=dev-channel\', \'--zipbase=http://chromeos-images.corp.google.com\' \'--type=vm\', \'--no_graphics\', \'--target_image=/b/cbuild/x86-zgb-canary-master/src/build/images/x86-zgb/latest-cbuildbot/chromiumos_test_image.bin\', \'--test_results_root=/b/cbuild/x86-zgb-canary-master/chroot/tmp/cbuildbotJzifjK/test_harness\']" failed.\n', ['bin/ctest', '--board=x86-zgb', '--channel=dev-channel', '--zipbase=http://chromeos-images.corp.google.com' '--type=vm', '--no_graphics', '--target_image=/b/cbuild/x86-zgb-canary-master/src/build/images/x86-zgb/latest-cbuildbot/chromiumos_test_image.bin', '--test_results_root=/b/cbuild/x86-zgb-canary-master/chroot/tmp/cbuildbotJzifjK/test_harness'])

************************************************************
** Finished Stage VMTest - 19:17:32
************************************************************
Traceback (most recent call last):
  File "/b/build/slave/x86-zgb-canary-master/build/chromite/buildbot/cbuildbot_background.py", line 96, in run
    step()
  File "/b/build/slave/x86-zgb-canary-master/build/chromite/buildbot/cbuildbot_stages.py", line 256, in Run
    raise BuildException()
BuildException


--------------------------------------------------------------------------------
started: Sun Sep 11 16:30:41 2011
ended: Sun Sep 11 19:15:52 2011
duration: 2 hrs, 45 mins, 11 secs
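The ParallelJobTimeoutError in the tracebacks above is raised by a polling loop in parallel_test_job.py that gives up after MAX_TIMEOUT_SECONDS (2400 s here). A minimal Python sketch of that pattern follows; the job objects and function names are hypothetical, not the actual harness code:

```python
import time


class ParallelJobTimeoutError(Exception):
    """Raised when jobs do not finish within the allowed window."""


def wait_until_jobs_complete(jobs, max_timeout_seconds=2400, poll_interval=1):
    """Poll a list of job objects until all report done, or time out.

    Each job is assumed to expose an is_done() method; the progress
    message mirrors the 'Waiting on N/M jobs to complete' lines in the
    log above.
    """
    deadline = time.time() + max_timeout_seconds
    while True:
        remaining = [j for j in jobs if not j.is_done()]
        if not remaining:
            return
        if time.time() > deadline:
            raise ParallelJobTimeoutError(
                'Exceeded max time of %d seconds to wait for '
                'job completion.' % max_timeout_seconds)
        print('Process Pool Active: Waiting on %d/%d jobs to complete'
              % (len(remaining), len(jobs)))
        time.sleep(poll_interval)
```

Note that a loop like this only knows that *some* job is stuck, not which one — which is exactly the diagnosis problem raised in comment #2 below.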


Attachment: test_results (1).tgz (4.5 MB)
Mar 16, 2013
#1 bugdroid1@chromium.org
(No comment was entered for this change.)
Blockedon: chromium:98257 chromium-os:20955
Sep 13, 2011
#2 thieule@chromium.org
It's hard to tell from the output what process did not respond within the timeout period.
Owner: davidjames@chromium.org
Sep 19, 2011
#3 davidjames@google.com
Another failure.

parallel_test_job.ParallelJobTimeoutError: Exceeded max time of 2400 seconds to wait for job completion.

From test_harness/testUpdateWipeStateful/1_update/update.log in http://chromeos-botmaster.mtv.corp.google.com:8026/builders/x86-alex-binary/builds/1425:


INFO    : Starting update using args --update --omaha_url http://172.22.15.176:8080/update/cache/2ef034326c8cae81a7e386a11c4258af_df0e900d406494bc7191ae0d27e7a06b+207bb906877a2667a7c736e1e4cad4b8
INFO    : Devserver handling ping.  Check dev_server.log for more info.
[0919/172247:INFO:update_engine_client.cc(280)] Forcing an update by setting app_version to ForcedUpdate.
[0919/172247:INFO:update_engine_client.cc(282)] Initiating update check and install.
[0919/172247:INFO:update_engine_client.cc(288)] Waiting for update to complete.
INFO    : State of updater has changed to: UPDATE_STATUS_UPDATE_AVAILABLE
........%
Sep 20, 2011
#4 rongch...@chromium.org

Same failure: http://chromeos-botmaster.mtv.corp.google.com:8026/builders/x86-alex-binary/builds/1432/steps/VMTest/logs/stdio

parallel_test_job.ParallelJobTimeoutError: Exceeded max time of 2400 seconds to wait for job completion.

from [test_results.tgz]/test_harness/testUpdateKeepStateful/1_update/update.log

Connection OK
INFO    : Forcing update independent of the current version
INFO    : Waiting to initiate contact with the update_engine.
INFO    : Starting update using args --update --omaha_url http://172.22.15.176:8080/update/cache/b6e4bfb2353bee938db25ec4259589b0_cc7a8a1be5451004e72d3362fd4ea78f+207bb906877a2667a7c736e1e4cad4b8
INFO    : Devserver handling ping.  Check dev_server.log for more info.
[0920/050730:INFO:update_engine_client.cc(280)] Forcing an update by setting app_version to ForcedUpdate.
[0920/050730:INFO:update_engine_client.cc(282)] Initiating update check and install.
[0920/050730:INFO:update_engine_client.cc(288)] Waiting for update to complete.
INFO    : State of updater has changed to: UPDATE_STATUS_UPDATE_AVAILABLE
......
Sep 20, 2011
#5 cmasone@chromium.org
VM tests failed due to a timeout in testUpdateWipeStateful. The update log is:

Starting a KVM instance
pci_add_option_rom: failed to find romfile "pxe-virtio.bin"
INFO    : KVM started with pid stored in /tmp/kvm.9223
Initiating first contact with remote host
Warning: Permanently added '[127.0.0.1]:9223' (RSA) to the list of known hosts.
Connection OK
WARNING : Verify is not compatible with setting an update url.
Initiating first contact with remote host
Warning: Permanently added '[127.0.0.1]:9223' (RSA) to the list of known hosts.
Connection OK
INFO    : Forcing update independent of the current version
INFO    : Waiting to initiate contact with the update_engine.
INFO    : Starting update using args --update --omaha_url http://172.22.15.182:8080/update/cache/7ef8cdeb11b4ab586fefffa02016e7c3_a11c5e3410d1d7352ae972b5d4d7682b+207bb906877a2667a7c736e1e4cad4b8
INFO    : Devserver handling ping.  Check dev_server.log for more info.
[0920/085950:INFO:update_engine_client.cc(280)] Forcing an update by setting app_version to ForcedUpdate.
[0920/085950:INFO:update_engine_client.cc(282)] Initiating update check and install.
[0920/085950:INFO:update_engine_client.cc(288)] Waiting for update to complete.
INFO    : State of updater has changed to: UPDATE_STATUS_UPDATE_AVAILABLE
.........[0920/090146:INFO:update_engine_client.cc(104)] Error getting status: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
INFO    : State of updater has changed to: (null)
INFO    : State of updater has changed to: UPDATE_STATUS_UPDATE_AVAILABLE
...INFO    : State of updater has changed to: UPDATE_STATUS_DOWNLOADING
Download progress 0.224391
Download progress 0.254483
Download progress 0.314925
Download progress 0.642296
Download progress 0.915595
[0920/090404:INFO:update_engine_client.cc(104)] Error getting status: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
INFO    : State of updater has changed to: UPDATE_STATUS_FINALIZING
......


Logs at http://chromeos-botmaster.mtv.corp.google.com:8026/builders/stumpy-binary/builds/769
Sep 20, 2011
#6 cmasone@chromium.org
Issue 20626 has been merged into this issue.
Sep 20, 2011
#7 cmasone@chromium.org
We're hoping to reduce the incidence of this by turning off parallelism, while hoping it won't nuke cycle time.
Status: Assigned
Labels: -Area-Undefined Area-Test Mstone-R16
Sep 20, 2011
#8 bugdroid1@chromium.org
Commit: fc2179d01f943a1d89f1b3a3740b82a19c4bb0e2
 Email: davidjames@chromium.org

Reserve more RAM for background processes to be more conservative.

We've seen a lot of timeouts during tests on our buildbots. These timeouts
might be related to our buildbots being overloaded by so many jobs running
in parallel. Let's try allocating more RAM to background processes in hopes
of reducing flaky failures.

BUG=chromium-os:20305
TEST=Verify this limits jobs to 2 jobs on machines with 12GB RAM. On machines
     with 64GB RAM we can run more jobs without slowing the tests.

Change-Id: I806f692351e75b592484050745859541162828fc
Reviewed-on: http://gerrit.chromium.org/gerrit/7991
Reviewed-by: Matt Tennant <mtennant@chromium.org>
Tested-by: David James <davidjames@chromium.org>

M	au_test_harness/cros_au_test_harness.py
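The commit above throttles test parallelism based on machine RAM. A hedged sketch of such a policy in Python — the reserve and per-VM budget numbers are illustrative guesses, chosen only so that a 12 GB machine gets 2 jobs and a 64 GB machine gets more, matching the TEST= line; this is not the actual cros_au_test_harness code:

```python
def max_parallel_jobs(total_ram_gb, reserved_gb=4, ram_per_vm_gb=4):
    """Pick how many VMs to run in parallel based on total RAM.

    Reserve some RAM for background processes, then budget a fixed
    amount per VM, always allowing at least one job. With the default
    (assumed) numbers, a 12 GB builder gets (12 - 4) // 4 = 2 jobs,
    while a 64 GB builder gets 15.
    """
    usable = max(total_ram_gb - reserved_gb, 0)
    return max(usable // ram_per_vm_gb, 1)
```

The point of the change is conservatism: over-committing RAM makes every VM slow enough that otherwise-healthy jobs blow the 2400 s timeout.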
Sep 21, 2011
#9 cmasone@chromium.org
http://chromeos-botmaster.mtv.corp.google.com:8026/builders/stumpy%20canary/builds/351/steps/VMTest/logs/stdio

Process Pool Active: Waiting on 2/4 jobs to complete
Process Pool Active: Waiting on 2/4 jobs to complete
Process Pool Active: Waiting on 2/4 jobs to complete
Process Pool Active: Waiting on 2/4 jobs to complete
Process Pool Active: Waiting on 2/4 jobs to complete
Process Pool Active: Waiting on 2/4 jobs to complete
Process Pool Active: Waiting on 2/4 jobs to complete
Process Pool Active: Waiting on 1/4 jobs to complete
Process Pool Active: Waiting on 1/4 jobs to complete
Process Pool Active: Waiting on 1/4 jobs to complete
Process Pool Active: Waiting on 1/4 jobs to complete
Process Pool Active: Waiting on 1/4 jobs to complete
Process Pool Active: Waiting on 1/4 jobs to complete
Traceback (most recent call last):
  File "bin/cros_au_test_harness", line 323, in <module>
    main()
  File "bin/cros_au_test_harness", line 310, in main
    _RunTestsInParallel(options)
  File "bin/cros_au_test_harness", line 181, in _RunTestsInParallel
    results = parallel_test_job.RunParallelJobs(options.jobs, threads, args)
  File "/b/cbuild/stumpy-canary-master/src/platform/crostestutils/au_test_harness/parallel_test_job.py", line 149, in RunParallelJobs
    ParallelJob.WaitUntilJobsComplete(parallel_jobs)
  File "/b/cbuild/stumpy-canary-master/src/platform/crostestutils/au_test_harness/parallel_test_job.py", line 91, in WaitUntilJobsComplete
    'job completion.' % cls.MAX_TIMEOUT_SECONDS)
parallel_test_job.ParallelJobTimeoutError: Exceeded max time of 2400 seconds to wait for job completion.

Sep 21, 2011
#10 cmasone@chromium.org
P0 until we at least get the de-parallelizing change in.
Labels: -Pri-1 Pri-0
Sep 21, 2011
#11 davidjames@google.com
So the CL to reduce parallelism to 2 was pushed yesterday. We could reduce jobs further, but based on logs from stumpy canary and stumpy binary, I don't think the machine is running out of RAM or being overloaded anymore with two jobs. Two jobs seems decent for our hardware and the jobs are completing at a reasonably fast pace. We don't see any more errors where the connection times out during banner exchange.

We do see the following error:

[0920/090146:INFO:update_engine_client.cc(104)] Error getting status: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.

adlr, could you investigate the above error?

One bug I noticed in the update engine: the GetStatus function in update_engine_client.cc seems to return success even if it fails to connect.
Owner: adlr@chromium.org
Sep 21, 2011
#12 bugdroid1@chromium.org
Commit: f975b7fc8e60a55ec733d0947975669637ae8b3b
 Email: adlr@chromium.org

image_to_live.sh: get updater log in real time

We are seeing hangs of image_to_live.sh, which may be due to
update_engine hangs. We currently don't get the update engine log if
this process is killed, so this CL changes that. We now use tail -f to
get the log in real time.

BUG=chromium-os:20305
TEST=tested image_to_live.sh by hand

Change-Id: I1a0e635df3c7a61a75146a97c42b95b4a9313218
Reviewed-on: http://gerrit.chromium.org/gerrit/8083
Reviewed-by: David James <davidjames@chromium.org>
Tested-by: Andrew de los Reyes <adlr@chromium.org>

M	host/image_to_live.sh
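The CL above switches to `tail -f` so the updater log is captured in real time rather than lost when the harness is killed. A minimal Python sketch of consuming such a stream line by line; the helper name is illustrative, and `max_lines` exists because `tail -f` never exits on its own:

```python
import subprocess


def stream_lines(cmd, max_lines=None):
    """Run cmd and yield its stdout line by line as it is produced.

    With a command like ['tail', '-n', '+0', '-f', logfile] this mirrors
    the real-time log tailing the commit above adds; the process is
    terminated when the caller stops reading.
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
    try:
        count = 0
        for line in proc.stdout:
            yield line.rstrip('\n')
            count += 1
            if max_lines is not None and count >= max_lines:
                break
    finally:
        proc.terminate()
        proc.wait()
```

Streaming matters here because a buffered or after-the-fact log fetch produces nothing at all if the harness process is killed by the timeout, which is exactly what had been happening.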
Sep 21, 2011
#13 bugdroid1@chromium.org
Commit: e08639b8597b4de81e65eee46e595a52637f08fe
 Email: adlr@chromium.org

remote_access.sh: add remote_sh_raw for a non-buffered ssh connection.

BUG=chromium-os:20305
TEST=manual test of image_to_live.sh

Change-Id: I964088673d0dc2867e911368d9cc4ec73085a4f4
Reviewed-on: http://gerrit.chromium.org/gerrit/8082
Reviewed-by: David James <davidjames@chromium.org>
Tested-by: Andrew de los Reyes <adlr@chromium.org>

M	remote_access.sh
Sep 23, 2011
#14 davidjames@google.com
Saw one again. http://chromeos-botmaster.mtv.corp.google.com:8026/builders/x86-zgb%20canary/builds/830/steps/VMTest/logs/stdio

This didn't have logs because the ssh logging process wasn't interrupted properly when the builder got killed. adlr is working on a fix.
Sep 25, 2011
#15 gauravsh@chromium.org
Another one on alex-binary: 

http://chromeos-botmaster.mtv.corp.google.com:8026/builders/x86-alex%20canary/builds/1047/steps/VMTest/logs/stdio

Process Pool Active: Waiting on 2/4 jobs to complete
Process Pool Active: Waiting on 2/4 jobs to complete
Process Pool Active: Waiting on 2/4 jobs to complete
Process Pool Active: Waiting on 1/4 jobs to complete
Process Pool Active: Waiting on 1/4 jobs to complete
Process Pool Active: Waiting on 1/4 jobs to complete
Process Pool Active: Waiting on 1/4 jobs to complete
Process Pool Active: Waiting on 1/4 jobs to complete
Process Pool Active: Waiting on 1/4 jobs to complete
Process Pool Active: Waiting on 1/4 jobs to complete
Process Pool Active: Waiting on 1/4 jobs to complete
Process Pool Active: Waiting on 1/4 jobs to complete
Traceback (most recent call last):
  File "bin/cros_au_test_harness", line 323, in <module>
    main()
  File "bin/cros_au_test_harness", line 310, in main
    _RunTestsInParallel(options)
  File "bin/cros_au_test_harness", line 181, in _RunTestsInParallel
    results = parallel_test_job.RunParallelJobs(options.jobs, threads, args)
  File "/b/cbuild/x86-alex-canary-master/src/platform/crostestutils/au_test_harness/parallel_test_job.py", line 149, in RunParallelJobs
    ParallelJob.WaitUntilJobsComplete(parallel_jobs)
  File "/b/cbuild/x86-alex-canary-master/src/platform/crostestutils/au_test_harness/parallel_test_job.py", line 91, in WaitUntilJobsComplete
    'job completion.' % cls.MAX_TIMEOUT_SECONDS)
parallel_test_job.ParallelJobTimeoutError: Exceeded max time of 2400 seconds to wait for job completion.

Sep 26, 2011
#16 adlr@chromium.org
Hm, it appears the updater is stuck waiting for /bin/sync to run. I'm not sure why this would be problematic...
Sep 27, 2011
#17 nkostylev@chromium.org
Another case on the commit queue bot

http://chromegw.corp.google.com/i/chromiumos/builders/x86%20generic%20commit%20queue/builds/6670/steps/VMTest/logs/stdio

Full logs:
http://www/~nkostylev/no_crawl/cros/20305/
Sep 27, 2011
#18 davidjames@google.com
update.log attached. This looks like it may be a different issue from the earlier one, where the updater was stuck on /bin/sync.

Note that the last entry in update.log is at 3:01 am, but the job was not killed until 4:38 am.
Attachment: update.log (11.9 KB)
Sep 27, 2011
#19 davidjames@google.com
(No comment was entered for this change.)
Blockedon: chromium:98257
Sep 27, 2011
#20 bugdroid1@chromium.org
Commit: 8e93647f050f78ffa8a8aa674694269aefbb12c5
 Email: adlr@chromium.org

kvm: set cache type to unsafe if possible, else writeback

We try to set the cache type to unsafe, which is a new kvm feature,
else we set the cache type to writeback. The hope is that these cache
types will perform better on the bots, especially in the case of the
guest trying to flush data to disk.

The CL includes code to determine which version of kvm is running, and
the cache type is chosen based on that.

BUG=chromium-os:20305
TEST=trybot

Change-Id: Icf7dbb8f0056a330282ebeb81d0902b3d54f34c5
Reviewed-on: http://gerrit.chromium.org/gerrit/8360
Reviewed-by: David James <davidjames@chromium.org>
Tested-by: Andrew de los Reyes <adlr@chromium.org>

M	lib/cros_vm_lib.sh
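The commit above chooses the KVM disk cache mode by probing the installed kvm version, since cache=unsafe only exists in newer qemu/kvm (the thread later notes the internal bots lacked qemu 0.14.1). A hedged Python sketch of such a version check; the 0.14 threshold and the version-string parsing are assumptions, not the actual cros_vm_lib.sh logic:

```python
import re


def pick_cache_type(version_string):
    """Choose a -drive cache= mode from a kvm/qemu version string.

    Newer builds support cache=unsafe, which skips guest flush requests
    and performs much better on loaded bots; older builds fall back to
    cache=writeback. Parsing here just pulls the first dotted version
    number out of the string and defaults to the safe fallback.
    """
    match = re.search(r'(\d+)\.(\d+)', version_string)
    if not match:
        return 'writeback'
    major, minor = int(match.group(1)), int(match.group(2))
    if (major, minor) >= (0, 14):
        return 'unsafe'
    return 'writeback'
```

This matches the failure mode in the logs: guests hanging in /bin/sync suggest the VM's flush-to-disk path was the bottleneck, which is precisely what cache=unsafe sidesteps.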
Sep 28, 2011
#21 davidjames@chromium.org
Thanks to quick work by adlr and pschmidt, this bug should be fixed on external bots. Internal bots may still suffer from the problem because they don't have qemu 0.14.1 yet.

I've filed Bug 20955 for fixing the internal bots and hence am marking it as a blocker.
Blockedon: chromium-os:20955
Sep 29, 2011
#22 davidjames@chromium.org
Issue 20276 has been merged into this issue.
Cc: r...@chromium.org davidjames@chromium.org
Sep 29, 2011
#23 davidjames@chromium.org
Even with adlr's fix, this problem is still happening on internal bots because of Bug 20955 (kvm has not been updated yet). Example failure: http://chromeos-botmaster.mtv.corp.google.com:8026/builders/stumpy%20canary/builds/378

INFO    : State of updater has changed to: UPDATE_STATUS_UPDATE_AVAILABLE
.......[0928/012521:INFO:update_engine_client.cc(104)] Error getting status: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
INFO    : State of updater has changed to: (null)
[0928/012546:INFO:update_engine_client.cc(104)] Error getting status: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
.[0928/012611:INFO:update_engine_client.cc(104)] Error getting status: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
[0928/012636:INFO:update_engine_client.cc(104)] Error getting status: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
.[0928/012701:INFO:update_engine_client.cc(104)] Error getting status: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.




Sep 29, 2011
#24 davidjames@chromium.org
This is still happening even on external bots. See http://build.chromium.org/p/chromiumos/builders/x86%20generic%20commit%20queue/builds/6969/steps/VMTest/logs/stdio

Here is the log directory for the test_results.tgz in case the above gets wiped:
https://sandbox.google.com/storage/chromeos-image-archive/x86-generic-commit-queue/R16-1103.0.0-a1-b6969

INFO    : State of updater has changed to: UPDATE_STATUS_DOWNLOADING
Download progress 0.110595
Download progress 0.201129
Download progress 0.249235
Download progress 0.339991
Download progress 0.420662
Download progress 0.501333
Download progress 0.602173
Download progress 0.682844
[0929/111156:INFO:update_engine_client.cc(104)] Error getting status: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
INFO    : State of updater has changed to: (null)
[0929/111221:INFO:update_engine_client.cc(104)] Error getting status: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
.[0929/111246:INFO:update_engine_client.cc(104)] Error getting status: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
.[0929/111311:INFO:update_engine_client.cc(104)] Error getting status: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
.[0929/111336:INFO:update_engine_client.cc(104)] Error getting status: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
.[0929/111401:INFO:update_engine_client.cc(104)] Error getting status: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
[0929/111426:INFO:update_engine_client.cc(104)] Error getting status: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
.[0929/111451:INFO:update_engine_client.cc(104)] Error getting status: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.
.[0929/111516:INFO:update_engine_client.cc(104)] Error getting status: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.

Sep 29, 2011
#25 davidjames@chromium.org
Caught this hang in action on the commit queue: http://build.chromium.org/p/chromiumos/builders/x86%20generic%20commit%20queue/builds/6971

I stopped the processes in question so I could grab a snapshot. pstree attached as update-pstree.txt

$ ps -eo pid,ppid,etime,cmd | grep -E 'ssh -p|kvm ' | grep -v 'color=auto'
22842 22841    01:15:17 ssh -p 9226 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/tmp/image_to_live.NXnn/known_hosts -o ConnectTimeout=120 -i /tmp/image_to_live.NXnn/private_key root@127.0.0.1 /usr/bin/update_engine_client --status 2> /dev/null |      grep CURRENT_OP |      cut -f 2 -d =
26720     1    01:18:59 kvm -m 1024 -vga std -pidfile /tmp/kvm.9226 -daemonize -net nic,model=virtio -nographic -serial none -snapshot -net user,hostfwd=tcp::9226-:22 -drive file=/b/cbuild/src/build/images/x86-generic/latest-cbuildbot/chromiumos_qemu_image.bin,index=0,media=disk,cache=unsafe
31520 30825    01:18:38 ssh -p 9226 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/tmp/image_to_live.NXnn/known_hosts -o ConnectTimeout=120 -i /tmp/image_to_live.NXnn/private_key root@127.0.0.1 /usr/bin/update_engine_client --update --omaha_url http://192.168.120.13:8080/update/cache/588b8f5fe617a7e1799d2435a2fc8813
31522 31519    01:18:38 ssh -p 9226 -o StrictHostKeyChecking=no -o UserKnownHostsFile=/tmp/image_to_live.NXnn/known_hosts -o ConnectTimeout=120 -i /tmp/image_to_live.NXnn/private_key -t -t root@127.0.0.1 tail -n +0 -f /var/log/update_engine.log

I ssh'd into the kvm in question and found that it had only been up for 1h15m, which suggests that update_engine_client was still hanging even after the machine had rebooted.

Running update_engine_client on the machine, I get the following:

$ /usr/bin/update_engine_client --status
[0929/151438:INFO:update_engine_client.cc(253)] Querying Update Engine status...
LAST_CHECKED_TIME=0
PROGRESS=0.000000
CURRENT_OP=UPDATE_STATUS_IDLE
NEW_VERSION=0.0.0.0
NEW_SIZE=0



Attachment: update-pstree.txt (3.4 KB)
Sep 30, 2011
#26 bugdroid1@chromium.org
Commit: f5850903d901db2ffe66dee7244465f845a0b2c2
 Email: davidjames@chromium.org

Tweak ssh settings to detect when remote server dies.

Right now, it seems that we don't consistently detect when the kvm is rebooted.
To improve this, I've added ServerKeepAlive messages, such that our connection
will be dropped if the target host is unreachable for 15 seconds. Further, I've
also tightened the ConnectTimeout from 120 seconds to 30 seconds so that we
detect timeouts earlier, but added 4 retries, so that all connections are
robust to temporary unresponsiveness (e.g. ssh server restarting).

BUG=chromium-os:21082, chromium-os:20305
TEST=alex-binary trybot run.

Change-Id: Ibd000bc3158138ee9c617e56bd7b11d4bd8bcb61
Reviewed-on: http://gerrit.chromium.org/gerrit/8574
Reviewed-by: Chris Sosa <sosa@chromium.org>
Tested-by: David James <davidjames@chromium.org>
Commit-Ready: David James <davidjames@chromium.org>

M	remote_access.sh
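The hardening in this commit maps onto standard OpenSSH client options: ServerAliveInterval/ServerAliveCountMax drop a connection once the server stops answering keepalives, while ConnectTimeout plus ConnectionAttempts make the initial connect fail faster but retry. A sketch of assembling such a command in Python; the exact interval/count split (3 s x 5 = 15 s of unreachability) is an assumption consistent with the commit message, not the actual remote_access.sh values:

```python
def build_ssh_cmd(host, port, remote_cmd=()):
    """Build an ssh command with keepalive and retry settings.

    ServerAliveInterval=3 with ServerAliveCountMax=5 drops the
    connection after ~15 seconds of an unresponsive host (e.g. a
    rebooted kvm), and ConnectTimeout=30 with ConnectionAttempts=4
    tolerates a temporarily restarting ssh server.
    """
    return ['ssh',
            '-p', str(port),
            '-o', 'ServerAliveInterval=3',
            '-o', 'ServerAliveCountMax=5',
            '-o', 'ConnectTimeout=30',
            '-o', 'ConnectionAttempts=4',
            '-o', 'StrictHostKeyChecking=no',
            'root@%s' % host] + list(remote_cmd)
```

Without server keepalives, the ps listing in comment #25 is the expected symptom: ssh sessions outliving the guest they were talking to and hanging the harness until the 2400 s pool timeout fires.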
Oct 7, 2011
#27 bugdroid1@chromium.org
Commit: e59822eb23bfa5741d4b0b9bd478fb1355ed7ffc
 Email: davidjames@chromium.org

Reserve more RAM for background processes to be more conservative again.

On machines with 12GB RAM or less, we need to limit the number of parallel
VMs to two to prevent the machine from becoming so slow that timeouts occur.

This CL was pushed on Sep 20, and seemed to fix the banner timeout issues.
Then, this CL was accidentally reverted on Oct 3 and the timeouts started
occurring again.

BUG=chromium-os:20305, chromium-os:21351, chromium-os:21389
TEST=Verify this limits jobs to 2 jobs on machines with 12GB RAM. On machines
	 with more RAM we can run more jobs without slowing the tests.

Change-Id: I437ffa97847f6acb693e7b4295f85ef2af1d366a
Reviewed-on: http://gerrit.chromium.org/gerrit/9734
Reviewed-by: Darin Petkov <petkov@chromium.org>
Tested-by: David James <davidjames@chromium.org>

M	lib/test_helper.py
Oct 11, 2011
#28 davidjames@google.com
(No comment was entered for this change.)
Status: Fixed
Oct 17, 2011
#29 or...@chromium.org
Claiming work for iteration-40
Labels: Iteration-40
Oct 19, 2011
#30 kr...@chromium.org
(No comment was entered for this change.)
Labels: VerifyIn-R16
Oct 19, 2011
#31 kr...@chromium.org
(No comment was entered for this change.)
Labels: -VerifyIn-R16
Oct 24, 2011
#32 kr...@chromium.org
(No comment was entered for this change.)
Status: Verified
Nov 8, 2011
#33 chromeos...@chromium.org
(No comment was entered for this change.)
Labels: FixedIn-1111.0.0
Nov 17, 2011
#34 dhend...@chromium.org
Looks like we have hit this again on x86-alex-binary:

From http://chromegw/i/chromeos/builders/x86-alex-binary/builds/2028/steps/cbuildbot/logs/stdio :

Ran 1 test in 801.039s

OK
Process Pool Active: Waiting on 3/6 jobs to complete
Process Pool Active: Waiting on 3/6 jobs to complete
Process Pool Active: Waiting on 3/6 jobs to complete
Process Pool Active: Waiting on 3/6 jobs to complete
Process Pool Active: Waiting on 1/6 jobs to complete
Process Pool Active: Waiting on 1/6 jobs to complete
Process Pool Active: Waiting on 1/6 jobs to complete
Process Pool Active: Waiting on 1/6 jobs to complete
Process Pool Active: Waiting on 1/6 jobs to complete
Process Pool Active: Waiting on 1/6 jobs to complete
Process Pool Active: Waiting on 1/6 jobs to complete
Process Pool Active: Waiting on 1/6 jobs to complete
Traceback (most recent call last):
  File "bin/cros_au_test_harness", line 207, in <module>
    main()
  File "bin/cros_au_test_harness", line 194, in main
    _RunTestsInParallel(options)
  File "bin/cros_au_test_harness", line 92, in _RunTestsInParallel
    results = parallel_test_job.RunParallelJobs(options.jobs, threads, args)
  File "/b/cbuild/x86-alex-private-bin/src/platform/crostestutils/au_test_harness/../../crostestutils/lib/parallel_test_job.py", line 149, in RunParallelJobs
    ParallelJob.WaitUntilJobsComplete(parallel_jobs)
  File "/b/cbuild/x86-alex-private-bin/src/platform/crostestutils/au_test_harness/../../crostestutils/lib/parallel_test_job.py", line 91, in WaitUntilJobsComplete
    'job completion.' % cls.MAX_TIMEOUT_SECONDS)
crostestutils.lib.parallel_test_job.ParallelJobTimeoutError: Exceeded max time of 2400 seconds to wait for job completion.

@@@STEP_FAILURE@@@
Jan 20, 2012
#35 chromeos...@chromium.org
(No comment was entered for this change.)
Labels: FixedInIndex-26
Mar 6, 2013
#36 bugdroid1@chromium.org
(No comment was entered for this change.)
Labels: OS-Chrome
Blockedon: -chromium:98257 -chromium-os:20955 chromium:98257 chromium-os:20955
Mar 9, 2013
#37 bugdroid1@chromium.org
(No comment was entered for this change.)
Labels: -Area-Test -TreeCloser -Mstone-R16 Cr-Test Hotlist-TreeCloser M-16