Issue 205059: Commands run by test harness become unresponsive and won't respond to signals
1 person starred this issue and may be notified of changes.
 
Project Member Reported by dianders@chromium.org, Oct 24, 2011
As you can see in the stumpy 493 canary <http://chromeos-botmaster.mtv.corp.google.com:8026/builders/stumpy%20canary/builds/493>, we're getting "Timed out waiting to revert DNS."

I'm sorta expecting that this is a "flaky" failure since nothing else is showing it, but time will tell (need to wait many hours for the next build to finish).

This may be a dupe of issue 18823, but I'm opening a new bug since the code in that area has changed quite a bit (the code added in http://gerrit.chromium.org/gerrit/5484 is no longer there and new code is in its place).

---

Here's the bit from the build log:

  20:35:35 INFO | 	START	login_CryptohomeMounted	login_CryptohomeMounted	timestamp=1319488513	localtime=Oct 24 20:35:13	
  20:35:35 INFO | Bundling /build/stumpy/usr/local/autotest/client/site_tests/login_CryptohomeMounted into test-login_CryptohomeMounted.tar.bz2
  20:37:53 INFO | 		ERROR	login_CryptohomeMounted	login_CryptohomeMounted	timestamp=1319488651	localtime=Oct 24 20:37:31	Timed out waiting to revert DNS.
  20:37:53 INFO | 	END ERROR	login_CryptohomeMounted	login_CryptohomeMounted	timestamp=1319488651	localtime=Oct 24 20:37:31	

Here's the bit from the build artifacts (test_harness/testUpdateWipeStateful/2_verify/suite_Smoke/login_CryptohomeMounted/debug/login_CryptohomeMounted.DEBUG):

  10/24 13:35:40 DEBUG|base_utils:0081| Running 'logger "cros_ui.py: Attempting StopSession..."'
  10/24 13:35:40 DEBUG|base_utils:0081| Running 'restart ui'
  10/24 13:35:41 DEBUG|base_utils:0113| [stdout] ui start/running, process 10217
  10/24 13:35:42 DEBUG|miniFakeDn:0088| Response to ('127.0.0.1', 36360): www.google.com. -> 127.0.0.1
  10/24 13:35:42 DEBUG|miniFakeDn:0088| Response to ('127.0.0.1', 51050): www.google.com. -> 127.0.0.1
  10/24 13:35:42 DEBUG|miniFakeDn:0088| Response to ('127.0.0.1', 51050): www.google.com. -> 127.0.0.1
  10/24 13:35:42 DEBUG|base_utils:0081| Running 'logger "cros_ui.py: StopSession complete."'
  10/24 13:35:42 INFO |cros_ui_te:0455| Erasing fake owner state.
  10/24 13:35:42 DEBUG|cros_ui_te:0104| Attempting to resolve www.google.com. to 127.0.0.1
  10/24 13:35:42 DEBUG|base_utils:0081| Running '/usr/bin/python -c "import socket; print socket.gethostbyname('www.google.com.')"'
  10/24 13:35:44 WARNI|base_utils:0819| run process timeout (2) fired on: /usr/bin/python -c "import socket; print socket.gethostbyname('www.google.com.')"
  10/24 13:35:54 WARNI|cros_ui_te:0098| Command </usr/bin/python -c "import socket; print socket.gethostbyname('www.google.com.')"> failed and is not responding to signals, Command(s) did not complete within 2 seconds
  * Command: 
      /usr/bin/python -c "import socket; print
      socket.gethostbyname('www.google.com.')"
  Exit status: None
  Duration: 12.0159299374
  
  10/24 13:35:54 DEBUG|cros_ui_te:0107| Resolve attempt for www.google.com. got None
  10/24 13:35:54 ERROR|site_utils:0078| Timed out waiting to revert DNS.
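For context, the failing step above boils down to a common pattern: spawn a throwaway resolver process and give up after a short timeout. A minimal sketch of that pattern (not autotest code; modernized to Python 3 and resolving localhost instead of www.google.com.):

```python
import subprocess
import sys

# Run a one-liner resolver in a child process with a 2-second deadline,
# mirroring the harness's check above. socket.gethostbyname() itself has
# no timeout, so if the lookup blocks inside the kernel the child simply
# hangs -- which is what the log suggests happened here.
cmd = [sys.executable, '-c',
       "import socket; print(socket.gethostbyname('localhost'))"]
try:
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=2)
    print(result.stdout.strip())
except subprocess.TimeoutExpired:
    print('resolver timed out')  # analogous to the WARNI line above
```

On a healthy host the child returns almost instantly; the bug is about the case where it never returns at all.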

---

Chris: since you've worked on the revert_dns() function a lot recently (and fixed 18823), I'm assigning you as a preliminary owner.  If you'd rather not own this, let me know and I'll dig up someone else to look at it.
Oct 24, 2011
#1 cmasone@chromium.org
Any ideas, Paul?
Summary: Commands run by test harness become unresponsive and won't respond to signals
Owner: ---
Cc: pstew@chromium.org
Oct 24, 2011
#2 dianders@chromium.org
Chris,

It's possible that the "not responding to signals" is a red herring and that this was just a simple timeout.  It looks like the message is printed whenever "result_obj.exit_status is None".  I'm continuing to dig through the autotest code (though any ideas Paul has would be appreciated, too).
Oct 24, 2011
#3 dianders@chromium.org
I take it back--probably not a red herring.  I see now where it must come from: "bg_job.result.exit_status = bg_job.sp.poll()", so if that is None, it means the process never exited.

...kinda curious where the 10 seconds comes from (from 13:35:44 to 13:35:54).  Maybe the process_output() or cleanup() functions in join_bg_jobs()?

In any case, you guys know this better than I.  ;)  We can also continue to see how often this happens, since this was only the first instance of this bug.  Maybe it's an isolated incident?
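The poll() behavior described above is easy to see in isolation. A minimal sketch with plain subprocess (not autotest's BgJob machinery):

```python
import subprocess
import time

# Start a child that will still be running when we check on it.
proc = subprocess.Popen(['sleep', '5'])
time.sleep(0.1)

# While the child runs, poll() returns None -- the same value
# bg_job.sp.poll() yields, which autotest stores as exit_status
# and then reports as "not responding to signals".
assert proc.poll() is None

proc.kill()
proc.wait()
assert proc.poll() is not None  # a real exit status once it's reaped
```

So exit_status of None really does mean "the process never exited", not just "the timeout fired".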
Oct 25, 2011
#5 cmasone@chromium.org

10/25 15:14:13 DEBUG|base_utils:0081| Running '/usr/bin/python -c "import socket; print socket.gethostbyname('www.google.com.')"'
10/25 15:14:15 WARNI|base_utils:0819| run process timeout (2) fired on: /usr/bin/python -c "import socket; print socket.gethostbyname('www.google.com.')"
10/25 15:14:25 WARNI|cros_ui_te:0098| Command </usr/bin/python -c "import socket; print socket.gethostbyname('www.google.com.')"> failed and is not responding to signals, Command(s) did not complete within 2 seconds
* Command: 
    /usr/bin/python -c "import socket; print
    socket.gethostbyname('www.google.com.')"
Exit status: None
Duration: 12.0145931244

10/25 15:14:25 DEBUG|cros_ui_te:0107| Resolve attempt for www.google.com. got None
10/25 15:14:25 ERROR|site_utils:0078| Timed out waiting to revert DNS.
Oct 25, 2011
#6 cmasone@chromium.org
The magic 10s comes from the use of nuke_subprocess() to kill the resolution attempt, which in turn uses signal_pid():


def signal_pid(pid, sig):
    """
    Sends a signal to a process id. Returns True if the process terminated
    successfully, False otherwise.
    """
    try:
        os.kill(pid, sig)
    except OSError:
        # The process may have died before we could kill it.
        pass

    for i in range(5):
        if not pid_is_alive(pid):
            return True
        time.sleep(1)

    # The process is still alive
    return False


So, it sends one SIGTERM and waits up to 5 seconds for the process to die, then one SIGKILL and waits up to 5 more, polling once per second.  Hence, 10.  So that's good to know.  It's also good to know that this process is apparently not even responding to SIGKILL.
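nuke_subprocess() itself isn't quoted here, but assuming it calls signal_pid() once with SIGTERM and once with SIGKILL, the escalation (and the 10-second worst case) can be sketched as follows. The pid_is_alive() helper is a Linux-only stand-in that treats zombies as dead, since they no longer respond to signals:

```python
import os
import signal
import subprocess
import time

def pid_is_alive(pid):
    # Linux-only sketch: read the state field from /proc/<pid>/stat.
    # A zombie ('Z') can't be killed again, so count it as dead.
    try:
        with open('/proc/%d/stat' % pid) as f:
            state = f.read().rsplit(')', 1)[1].split()[0]
        return state != 'Z'
    except (IOError, OSError):
        return False

def signal_pid(pid, sig):
    """Send sig, then poll for up to 5 seconds for the process to die."""
    try:
        os.kill(pid, sig)
    except OSError:
        pass  # the process may have died before we could kill it
    for _ in range(5):
        if not pid_is_alive(pid):
            return True
        time.sleep(1)
    return False

def nuke_pid(pid):
    # SIGTERM with a 5 s grace period, then SIGKILL with another 5 s.
    # A process stuck in uninterruptible kernel sleep survives both,
    # which is where the ~10 s delay in the logs comes from.
    return signal_pid(pid, signal.SIGTERM) or signal_pid(pid, signal.SIGKILL)

proc = subprocess.Popen(['sleep', '60'])
assert nuke_pid(proc.pid)  # a normal sleep dies on the first SIGTERM
proc.wait()
```

A well-behaved process falls to the SIGTERM within a second; only a task wedged in the kernel burns the full 10 seconds and still survives.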
Oct 27, 2011
#7 bugdroid1@chromium.org
Commit: f85c56e9b4d22becc658c84bc17fb8770a3578a6
 Email: cmasone@chromium.org

[autotest] On DNS revert timeout, grab kernel stack of resolver process

It seems that these DNS timeouts are the result of the resolver process
we use getting stuck down in the kernel, as evidenced by the fact that
they will not respond to SIGKILL.

Add some temporary logging to grab the pid of the process and log the
kernel stack.

BUG=chromium-os:22019
TEST=manual; add a process that sleeps for 5s and use __log_kernel_stack('time.sleep') to get its info

Change-Id: I43ea1a09170ca66fdaa6715f7257dab036f0c415
Reviewed-on: https://gerrit.chromium.org/gerrit/10797
Commit-Ready: Chris Masone <cmasone@chromium.org>
Tested-by: Chris Masone <cmasone@chromium.org>
Reviewed-by: Paul Stewart <pstew@chromium.org>

M	client/cros/cros_ui_test.py
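The logging code that commit adds isn't quoted here, but on Linux the kernel stack of a stuck task can be read from /proc/<pid>/stack (readable only by root on most systems). A hypothetical helper along those lines:

```python
import os

def read_kernel_stack(pid):
    """Return the kernel-mode stack of a task, or None if unreadable.

    /proc/<pid>/stack shows where a task in uninterruptible sleep is
    blocked inside the kernel -- exactly the evidence the commit above
    sets out to collect. Reading it usually requires root, so callers
    must tolerate None.
    """
    try:
        with open('/proc/%d/stack' % pid) as f:
            return f.read()
    except (IOError, OSError):
        return None

# Example: probe our own (non-stuck) process; without root this is None.
stack = read_kernel_stack(os.getpid())
```

The point of the instrumentation is to capture this output at the moment the SIGKILL fails, before the stuck process disappears.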
Nov 2, 2011
#8 cmasone@chromium.org
http://build.chromium.org/buildbot/chromiumos/builders/x86%20pineview%20full/builds/670

And, sadly, we weren't able to find the process and get a kernel stack trace.  I'll talk to scottz and see if I can find a way to instrument the autotest command-running code directly.
Status: Started
Owner: cmasone@chromium.org
Nov 2, 2011
#9 bugdroid1@chromium.org
Commit: fa5dbf681d63e8787bf00c4aa86871c3ea0a00bd
 Email: cmasone@chromium.org

[autotest] Move kernel stack logging down into utils

Instead of trying to use shell commands to find the pid
of the unkillable process whose stack we want to dump,
just instrument the code down in common_lib/base_utils.py

BUG=chromium-os:22019
TEST=suite_Smoke should still run

Change-Id: I74256a63ad74b811548b964e8f845ea05b0d2772
Reviewed-on: https://gerrit.chromium.org/gerrit/11087
Reviewed-by: Scott Zawalski <scottz@chromium.org>
Commit-Ready: Chris Masone <cmasone@chromium.org>
Tested-by: Chris Masone <cmasone@chromium.org>

M	client/common_lib/base_utils.py
M	client/cros/cros_ui_test.py
Nov 14, 2011
#10 nsand...@chromium.org
This is marked as a TreeCloser. Please confirm that it should be a TreeCloser or remove the label. Please reassign as necessary.
Labels: Pri-0
Nov 14, 2011
#11 cmasone@chromium.org
This happens rather infrequently now; hoping the logging can catch it in the act.
Labels: -TreeCloser -Pri-0 Pri-2
Jan 31, 2012
#12 cmasone@chromium.org
Wontfix as these don't seem to happen often enough to worry about.
Status: WontFix
May 15, 2012
#13 bugdroid1@chromium.org
Commit: d4e72c40c906ae602a0979fc3c8db6bf720a6a54
 Email: cmasone@chromium.org

[autotest] Remove obsolete TODO()

To help diagnose http://crosbug.com/22019, we added some kernel stack
logging code on failed process kill.  We actually think that's a good
idea, so we decided to leave it and just take out the TODO telling us
to remove it.

BUG=chromium-os:22019
TEST=suite:bvt should still run

Change-Id: Ibfce0a05009c007fb5c833ae88150ea845978600
Reviewed-on: https://gerrit.chromium.org/gerrit/22611
Tested-by: Chris Masone <cmasone@chromium.org>
Reviewed-by: Scott Zawalski <scottz@chromium.org>
Commit-Ready: Chris Masone <cmasone@chromium.org>

M	client/common_lib/base_utils.py
Sep 20, 2012
#14 thieule@chromium.org
Looks like we're hitting this bug again, reopening.

http://chromegw/i/chromeos/builders/mario%20incremental/builds/7137
Status: Untriaged
Sep 20, 2012
#15 cmasone@chromium.org
We have a different kernel and a different connection manager now.  The odds that this is the same bug are pretty small.  Please don't re-open really old bugs because of similar surface symptoms.

Please open a new bug and start a new investigation.
Status: WontFix
Mar 6, 2013
#16 lafo...@google.com
(No comment was entered for this change.)
Labels: OS-Chrome
Mar 9, 2013
#17 bugdroid1@chromium.org
(No comment was entered for this change.)
Labels: -Area-Test Cr-Test