Issue 205059: Commands run by test harness become unresponsive and won't respond to signals
As you can see in the stumpy 493 canary <http://chromeos-botmaster.mtv.corp.google.com:8026/builders/stumpy%20canary/builds/493>, we're getting "Timed out waiting to revert DNS." I'm sorta expecting that this is a "flaky" failure, since nothing else is showing it, but time will tell (we need to wait many hours for the next build to finish).

This may be a dupe of issue 18823, but I'm opening a new bug since the code in that area has changed quite a bit (the code added in http://gerrit.chromium.org/gerrit/5484 is no longer there and new code is in its place).

---

Here's the bit from the build log:

20:35:35 INFO | START login_CryptohomeMounted login_CryptohomeMounted timestamp=1319488513 localtime=Oct 24 20:35:13
20:35:35 INFO | Bundling /build/stumpy/usr/local/autotest/client/site_tests/login_CryptohomeMounted into test-login_CryptohomeMounted.tar.bz2
20:37:53 INFO | ERROR login_CryptohomeMounted login_CryptohomeMounted timestamp=1319488651 localtime=Oct 24 20:37:31 Timed out waiting to revert DNS.
20:37:53 INFO | END ERROR login_CryptohomeMounted login_CryptohomeMounted timestamp=1319488651 localtime=Oct 24 20:37:31

Here's the bit from the build artifacts (test_harness/testUpdateWipeStateful/2_verify/suite_Smoke/login_CryptohomeMounted/debug/login_CryptohomeMounted.DEBUG):

10/24 13:35:40 DEBUG|base_utils:0081| Running 'logger "cros_ui.py: Attempting StopSession..."'
10/24 13:35:40 DEBUG|base_utils:0081| Running 'restart ui'
10/24 13:35:41 DEBUG|base_utils:0113| [stdout] ui start/running, process 10217
10/24 13:35:42 DEBUG|miniFakeDn:0088| Response to ('127.0.0.1', 36360): www.google.com. -> 127.0.0.1
10/24 13:35:42 DEBUG|miniFakeDn:0088| Response to ('127.0.0.1', 51050): www.google.com. -> 127.0.0.1
10/24 13:35:42 DEBUG|miniFakeDn:0088| Response to ('127.0.0.1', 51050): www.google.com. -> 127.0.0.1
10/24 13:35:42 DEBUG|base_utils:0081| Running 'logger "cros_ui.py: StopSession complete."'
10/24 13:35:42 INFO |cros_ui_te:0455| Erasing fake owner state.
10/24 13:35:42 DEBUG|cros_ui_te:0104| Attempting to resolve www.google.com. to 127.0.0.1
10/24 13:35:42 DEBUG|base_utils:0081| Running '/usr/bin/python -c "import socket; print socket.gethostbyname('www.google.com.')"'
10/24 13:35:44 WARNI|base_utils:0819| run process timeout (2) fired on: /usr/bin/python -c "import socket; print socket.gethostbyname('www.google.com.')"
10/24 13:35:54 WARNI|cros_ui_te:0098| Command </usr/bin/python -c "import socket; print socket.gethostbyname('www.google.com.')"> failed and is not responding to signals, Command(s) did not complete within 2 seconds
* Command: /usr/bin/python -c "import socket; print socket.gethostbyname('www.google.com.')"
Exit status: None
Duration: 12.0159299374
10/24 13:35:54 DEBUG|cros_ui_te:0107| Resolve attempt for www.google.com. got None
10/24 13:35:54 ERROR|site_utils:0078| Timed out waiting to revert DNS.

---

Chris: since you've worked on the revert_dns() function a lot recently (and fixed issue 18823), I'm assigning you as a preliminary owner. If you'd rather not own this, let me know and I'll dig for someone else to look at it.
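For reference, here's a minimal sketch (modern Python, not the actual autotest code; resolve_with_timeout is an illustrative name) of what the harness is doing at this point: resolving a hostname in a child process and giving up after a 2-second timeout.

```python
import subprocess
import sys

def resolve_with_timeout(hostname, timeout_s=2):
    """Resolve `hostname` in a child process; return the IP, or None if the
    child did not finish within `timeout_s` seconds (the harness uses 2)."""
    cmd = [sys.executable, '-c',
           'import socket; print(socket.gethostbyname(%r))' % hostname]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True,
                                timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Corresponds to the "run process timeout (2) fired" warning above.
        return None
    return result.stdout.strip() or None
```

The difference here is that subprocess.run() with a timeout only raises; the autotest helper additionally tries to kill the stuck child, which is where things go wrong below.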
Oct 24, 2011
Chris, it's possible that the "not responding to signals" is a red herring and that this was just a simple timeout. It looks like that message is printed whenever "result_obj.exit_status is None". I'm continuing to dig through the autotest code (though any ideas Paul has would be appreciated, too).
Oct 24, 2011
I take it back; probably not a red herring. I see now where it must be set: "bg_job.result.exit_status = bg_job.sp.poll()", so if that is None, it means the process never exited. I'm kinda curious where the 10 seconds comes from (from 13:35:44 to 13:35:54), though. Maybe the process_output() or cleanup() functions in join_bg_jobs()? In any case, you guys know this better than I. ;) We can also keep watching how often this happens, since this is only the first instance of the bug. Maybe it's an isolated incident?
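To illustrate the point above: Popen.poll() returns None while the child is still running and only returns an exit status once the child has exited, so an exit_status of None after the kill attempts means the process never actually went away.

```python
import subprocess
import sys

# Child that stays alive long enough for us to observe it still running.
child = subprocess.Popen([sys.executable, '-c', 'import time; time.sleep(2)'])

print(child.poll())   # None: still running, no exit status yet
child.wait()          # block until the child exits
print(child.poll())   # 0: exit status is now available
```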
Oct 25, 2011
10/25 15:14:13 DEBUG|base_utils:0081| Running '/usr/bin/python -c "import socket; print socket.gethostbyname('www.google.com.')"'
10/25 15:14:15 WARNI|base_utils:0819| run process timeout (2) fired on: /usr/bin/python -c "import socket; print socket.gethostbyname('www.google.com.')"
10/25 15:14:25 WARNI|cros_ui_te:0098| Command </usr/bin/python -c "import socket; print socket.gethostbyname('www.google.com.')"> failed and is not responding to signals, Command(s) did not complete within 2 seconds
* Command: /usr/bin/python -c "import socket; print socket.gethostbyname('www.google.com.')"
Exit status: None
Duration: 12.0145931244
10/25 15:14:25 DEBUG|cros_ui_te:0107| Resolve attempt for www.google.com. got None
10/25 15:14:25 ERROR|site_utils:0078| Timed out waiting to revert DNS.
Oct 25, 2011
The magic 10 seconds comes from the use of nuke_subprocess() to kill the resolution attempt, which uses signal_pid():
def signal_pid(pid, sig):
    """
    Sends a signal to a process id. Returns True if the process terminated
    successfully, False otherwise.
    """
    try:
        os.kill(pid, sig)
    except OSError:
        # The process may have died before we could kill it.
        pass

    for i in range(5):
        if not pid_is_alive(pid):
            return True
        time.sleep(1)

    # The process is still alive
    return False
So it sends a SIGTERM and polls for up to 5 seconds, then a SIGKILL and polls for up to 5 more, at a rate of 1 poll per second. Hence, 10. So that's good to know. It's also good to know that this process is apparently not listening to SIGKILL.
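To make the arithmetic concrete, here's a hedged sketch of the sequence I believe nuke_subprocess() goes through (the helper names mirror autotest's, but these bodies are illustrative, and pid_is_alive here is a Linux /proc check, treating zombies as dead):

```python
import os
import signal
import time

def pid_is_alive(pid):
    """True if `pid` exists and is not a zombie (Linux /proc check)."""
    try:
        with open('/proc/%d/stat' % pid) as f:
            # The field after the ')' closing the comm name is the state.
            state = f.read().rsplit(')', 1)[1].split()[0]
    except IOError:
        return False
    return state != 'Z'

def signal_pid(pid, sig, polls=5):
    """Send one signal, then poll liveness for up to `polls` seconds."""
    try:
        os.kill(pid, sig)
    except OSError:
        # The process may have died before we could kill it.
        pass
    for _ in range(polls):
        if not pid_is_alive(pid):
            return True
        time.sleep(1)
    return False

def nuke_pid(pid):
    # Worst case: 5s of polling after the SIGTERM plus 5s after the SIGKILL,
    # which is the 10-second gap between 13:35:44 and 13:35:54 in the log.
    return signal_pid(pid, signal.SIGTERM) or signal_pid(pid, signal.SIGKILL)
```

A process that survives the second leg is almost certainly in uninterruptible sleep (state 'D'), since SIGKILL cannot be blocked or ignored from userspace.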
Oct 27, 2011
Commit: f85c56e9b4d22becc658c84bc17fb8770a3578a6
Email: cmasone@chromium.org

[autotest] On DNS revert timeout, grab kernel stack of resolver process

It seems that these DNS timeouts are the result of the resolver process we use getting stuck down in the kernel, as evidenced by the fact that they will not respond to SIGKILL. Add some temporary logging to grab the pid of the process and log the kernel stack.

BUG=chromium-os:22019
TEST=manual; add a process that sleeps for 5s and use __log_kernel_stack('time.sleep') to get its info

Change-Id: I43ea1a09170ca66fdaa6715f7257dab036f0c415
Reviewed-on: https://gerrit.chromium.org/gerrit/10797
Commit-Ready: Chris Masone <cmasone@chromium.org>
Tested-by: Chris Masone <cmasone@chromium.org>
Reviewed-by: Paul Stewart <pstew@chromium.org>

M client/cros/cros_ui_test.py
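For the record: on Linux, the kernel stack of a stuck process is readable at /proc/<pid>/stack (usually only as root). A rough sketch of the idea behind that logging follows; this is illustrative code, not the actual patch, and log_kernel_stack is a made-up name standing in for the commit's __log_kernel_stack.

```python
import subprocess

def log_kernel_stack(pattern):
    """Dump the kernel stack of every process whose command line matches
    `pattern`. Reading /proc/<pid>/stack typically requires root."""
    pgrep = subprocess.run(['pgrep', '-f', pattern],
                           capture_output=True, text=True)
    for pid in pgrep.stdout.split():
        try:
            with open('/proc/%s/stack' % pid) as f:
                print('kernel stack of pid %s:\n%s' % (pid, f.read()))
        except IOError as e:
            print('could not read stack of pid %s: %s' % (pid, e))
```

A process blocked in uninterruptible sleep will show the kernel function it is stuck in (e.g. somewhere under the network stack for a wedged resolver), which is exactly the evidence this logging was meant to capture.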
Nov 2, 2011
This happened again in http://build.chromium.org/buildbot/chromiumos/builders/x86%20pineview%20full/builds/670 and, sadly, we weren't able to find the process and get a kernel stack trace. I'll talk to scottz and see if I can find a way to instrument the autotest command-running code directly.
Status:
Started
Owner: cmasone@chromium.org
Nov 2, 2011
Commit: fa5dbf681d63e8787bf00c4aa86871c3ea0a00bd
Email: cmasone@chromium.org

[autotest] Move kernel stack logging down into utils

Instead of trying to use shell commands to find the pid of the unkillable process whose stack we want to dump, just instrument the code down in common_lib/base_utils.py

BUG=chromium-os:22019
TEST=suite_Smoke should still run

Change-Id: I74256a63ad74b811548b964e8f845ea05b0d2772
Reviewed-on: https://gerrit.chromium.org/gerrit/11087
Reviewed-by: Scott Zawalski <scottz@chromium.org>
Commit-Ready: Chris Masone <cmasone@chromium.org>
Tested-by: Chris Masone <cmasone@chromium.org>

M client/common_lib/base_utils.py
M client/cros/cros_ui_test.py
Nov 14, 2011
This is marked as a TreeCloser. Please confirm that it should be a TreeCloser or remove the label. Please reassign as necessary.
Labels:
Pri-0
Nov 14, 2011
This happens rather infrequently now; hoping the logging can catch it in the act.
Labels:
-TreeCloser -Pri-0 Pri-2
Jan 31, 2012
Wontfix as these don't seem to happen often enough to worry about.
Status:
WontFix
May 15, 2012
Commit: d4e72c40c906ae602a0979fc3c8db6bf720a6a54
Email: cmasone@chromium.org

[autotest] Remove obsolete TODO()

To help diagnose http://crosbug.com/22019, we added some kernel stack logging code on failed process kill. We actually think that's a good idea, so we decided to leave it and just take out the TODO telling us to remove it.

BUG=chromium-os:22019
TEST=suite:bvt should still run

Change-Id: Ibfce0a05009c007fb5c833ae88150ea845978600
Reviewed-on: https://gerrit.chromium.org/gerrit/22611
Tested-by: Chris Masone <cmasone@chromium.org>
Reviewed-by: Scott Zawalski <scottz@chromium.org>
Commit-Ready: Chris Masone <cmasone@chromium.org>

M client/common_lib/base_utils.py
Sep 20, 2012
Looks like we're hitting this bug again; reopening. http://chromegw/i/chromeos/builders/mario%20incremental/builds/7137
Status:
Untriaged
Sep 20, 2012
We have a different kernel and a different connection manager now. The odds that this is the same bug are pretty small. Please don't re-open really old bugs because of similar surface symptoms; open a new bug and start a new investigation.
Status:
WontFix
Mar 6, 2013
(No comment was entered for this change.)
Labels:
OS-Chrome
Mar 9, 2013
(No comment was entered for this change.)
Labels:
-Area-Test Cr-Test
Owner: ---
Cc: pstew@chromium.org