My favorites | Sign in
Project Home Downloads Wiki Issues Code Search
New issue   Search
for
  Advanced search   Search tips   Subscriptions
Issue 395914: maps_pixel_test failing frequently on win_gpu_triggered_tests try server
3 people starred this issue and may be notified of changes. Back to list
 
Project Member Reported by kbr@chromium.org, Jul 21, 2014
http://build.chromium.org/p/tryserver.chromium.gpu/builders/win_gpu_triggered_tests?numbuilds=200

maps_pixel_test is timing out often -- rough estimate 25% of runs. This is an unacceptably high failure rate.

Here are some example failing builds:

http://build.chromium.org/p/tryserver.chromium.gpu/builders/win_gpu_triggered_tests/builds/30551
http://build.chromium.org/p/tryserver.chromium.gpu/builders/win_gpu_triggered_tests/builds/30554
http://build.chromium.org/p/tryserver.chromium.gpu/builders/win_gpu_triggered_tests/builds/30555
http://build.chromium.org/p/tryserver.chromium.gpu/builders/win_gpu_triggered_tests/builds/30557
http://build.chromium.org/p/tryserver.chromium.gpu/builders/win_gpu_triggered_tests/builds/30561
http://build.chromium.org/p/tryserver.chromium.gpu/builders/win_gpu_triggered_tests/builds/30570
http://build.chromium.org/p/tryserver.chromium.gpu/builders/win_gpu_triggered_tests/builds/30577
http://build.chromium.org/p/tryserver.chromium.gpu/builders/win_gpu_triggered_tests/builds/30578
http://build.chromium.org/p/tryserver.chromium.gpu/builders/win_gpu_triggered_tests/builds/30586
http://build.chromium.org/p/tryserver.chromium.gpu/builders/win_gpu_triggered_tests/builds/30588
http://build.chromium.org/p/tryserver.chromium.gpu/builders/win_gpu_triggered_tests/builds/30589
http://build.chromium.org/p/tryserver.chromium.gpu/builders/win_gpu_triggered_tests/builds/30592

The failures aren't localized to any particular machine -- they seem to be happening on many different machines.

I'm attempting to reproduce locally but have not yet been able to. There's a possibility that this may have been caused by recent enabling of release mode asserts in Blink. Another likely possibility is a V8 roll since this is one of the tests that runs the most JavaScript on any of the waterfalls.

The failure mode is the same everywhere:

INFO:root:  rasterization       : unavailable_software
INFO:root:  threaded_rasterization: disabled_off_ok
INFO:root:  video_decode        : enabled
INFO:root:  video_encode        : enabled
INFO:root:  webgl               : enabled
INFO:root:Driver Bug Workarounds:
INFO:root:  clear_uniforms_before_first_program_use
INFO:root:  exit_on_context_lost
INFO:root:  texsubimage2d_faster_than_teximage2d
INFO:root:  use_current_program_after_successful_link
INFO:root:Running http://localhost:10020/tracker.html
Traceback (most recent call last):
File "..\content\test\gpu\..\..\..\tools\telemetry\telemetry\page\page_runner.py", line 531, in _RunPage
page_state.ImplicitPageNavigation(test)
File "..\content\test\gpu\..\..\..\tools\telemetry\telemetry\page\page_runner.py", line 175, in ImplicitPageNavigation
test.RunNavigateSteps(self.page, self.tab)
File "..\content\test\gpu\..\..\..\tools\telemetry\telemetry\page\page_test.py", line 249, in RunNavigateSteps
page.RunNavigateSteps(action_runner)
File "c:\users\chrome~1.chr\appdata\local\temp\run_tha_test1wnp2p\content\test\gpu\gpu_tests\maps.py", line 86, in RunNavigateSteps
'window.testDone', timeout_in_seconds=180)
File "..\content\test\gpu\..\..\..\tools\telemetry\telemetry\page\actions\action_runner.py", line 167, in WaitForJavaScriptCondition
self._tab.WaitForJavaScriptExpression(condition, timeout_in_seconds)
File "..\content\test\gpu\..\..\..\tools\telemetry\telemetry\core\web_contents.py", line 69, in WaitForJavaScriptExpression
"""))
TimeoutException: Timed out while waiting 180s for IsJavaScriptExpressionTrue.

It doesn't look like the tab crashes, but that it fails to complete the test.

Jul 22, 2014
#1 kbr@chromium.org
This just failed on the Win7 Debug (NVIDIA) bot on the main waterfall:

http://build.chromium.org/p/chromium.gpu/builders/Win7%20Debug%20%28NVIDIA%29/builds/20622

Possibly because it's a Debug-only bot, it printed more information to the console than Release builds:

INFO:root:  video_encode        : enabled
INFO:root:  webgl               : enabled
INFO:root:Driver Bug Workarounds:
INFO:root:  clear_uniforms_before_first_program_use
INFO:root:  exit_on_context_lost
INFO:root:  texsubimage2d_faster_than_teximage2d
INFO:root:  use_current_program_after_successful_link
INFO:root:Running http://localhost:10020/tracker.html
[3868:2244:0722/122000:INFO:CONSOLE(32)] "Uncaught ReferenceError: vectortown is not defined", source: http://localhost:10020/tracker.html (32)
[3868:2244:0722/122303:INFO:CONSOLE(0)] "'window.webkitStorageInfo' is deprecated. Please use 'navigator.webkitTemporaryStorage' or 'navigator.webkitPersistentStorage' instead.", source:  (0)

The uncaught ReferenceError sounds strongly like an intermittent V8 bug to me. 
Cc: hpa...@chromium.org tit...@chromium.org jkummerow@chromium.org mst....@gmail.com
Jul 22, 2014
#2 kbr@chromium.org
CC'ing several V8 team members. I am attempting to reproduce this on the bots. I ask that a counterpart from the V8 team attempt to reproduce this problem locally as well. The steps to reproduce would be:

python.exe content\test\gpu\run_gpu_test maps --browser=[release,debug]

Cc: -mst....@gmail.com mstarzinger@chromium.org yangguo@chromium.org dslomov@chromium.org
Jul 22, 2014
#3 kbr@chromium.org
Note that you'll have to run "gsutil config" and authenticate with an @google.com email address in order to download the Maps WPR archive.

Jul 22, 2014
#4 kbr@chromium.org
I've caught a failing run in Telemetry on one of the bots. When the problem happens, "Uncaught ReferenceError: vectortown is not defined" is printed in the JavaScript console, just as the Debug bot reported.

Some race condition has been introduced recently causing this failure. Possibilities include V8, DevTools, or Telemetry.

Cc: d...@chromium.org tonyg@chromium.org pfeld...@chromium.org yurys@chromium.org
Labels: Cr-Platform-DevTools Cr-Tests-Telemetry
Jul 22, 2014
#5 kbr@chromium.org
I'm able to reproduce the failure on build161-a4 (one of the slaves backing win_gpu_triggered_tests) by running the isolate from this whitespace CL:

http://build.chromium.org/p/tryserver.chromium.gpu/builders/win_gpu_triggered_tests/builds/30840

python.exe swarming.client\run_isolated.py -H 4478b577fa9509ded96e8e952a153e21ecf834b0 -I https://isolateserver.appspot.com -- maps --browser=release

Currently building my own isolate to try reproducing on this machine. At that point will start reverting likely changes and seeing whether they fix the flakiness.

Jul 22, 2014
#6 kbr@chromium.org
Here's an interesting failure on Mac

http://build.chromium.org/p/tryserver.chromium.gpu/builders/mac_gpu_triggered_tests?numbuilds=200

http://build.chromium.org/p/tryserver.chromium.gpu/builders/mac_gpu_triggered_tests/builds/26817

INFO:root:  validate_multisample_buffer_allocation
[804:1799:0722/134601:FATAL:thread_restrictions.cc(72)] Waiting is not allowed to be used on this thread to preventjank and deadlock.
** Unknown exception behavior: 0

The test run timed out after 20 minutes.

Cc: ccameron@chromium.org
Labels: OS-Mac
Jul 22, 2014
#7 kbr@chromium.org
Reproduced the "Uncaught ReferenceError: vectortown is not defined" error on build161-a4 with binaries built on my machine.

Cc: wfh@chromium.org
Jul 22, 2014
#8 kbr@chromium.org
It doesn't look like recent changes to V8 are at fault. I built and pushed binaries reverting V8 to 3.28.25 (r22148) and the failure still occurred.

Jul 22, 2014
#9 kbr@chromium.org
I'm seeing the following behavior:

- When running a particular telemetry_gpu_test.isolate hosted on the isolate server via swarming.client\run_isolated.py -H [hash] -I https://isolateserver.appspot.com [args], the failure occurs reliably on the affected machine.

- When downloading that isolate to the local disk on the affected machine and running content\test\gpu\run_gpu_test.py maps [args], the failure does not occur.

I noticed this because I attempted to download a continuous build and wasn't able to reproduce the failure when running a previously failing isolate's run_gpu_test.py against that continuous build.

Did something change recently in run_isolated.py's setup of the temporary directory for running the test?

Cc: maruel@chromium.org vadimsh@chromium.org csharp@chromium.org
Jul 22, 2014
#10 kbr@chromium.org
Also, this failure happened on the Linux Debug (NVIDIA) bot as well:

http://build.chromium.org/p/chromium.gpu/builders/Linux%20Debug%20%28NVIDIA%29/builds/31185

INFO:root:Running http://localhost:10020/tracker.html
[2997:3033:0722/184937:WARNING:raw_channel_posix.cc(214)] recvmsg: Connection reset by peer
[2997:2997:0722/184937:INFO:CONSOLE(32)] "Uncaught ReferenceError: vectortown is not defined", source: http://localhost:10020/tracker.html (32)
[2997:3023:0722/184946:ERROR:cert_verify_proc_nss.cc(880)] CERT_PKIXVerifyCert for ssl.gstatic.com failed err=-8172
[2997:2997:0722/185240:INFO:CONSOLE(0)] "'window.webkitStorageInfo' is deprecated. Please use 'navigator.webkitTemporaryStorage' or 'navigator.webkitPersistentStorage' instead.", source:  (0)
Traceback (most recent call last):
File "../content/test/gpu/../../../tools/telemetry/telemetry/page/page_runner.py", line 531, in _RunPage
page_state.ImplicitPageNavigation(test)
File "../content/test/gpu/../../../tools/telemetry/telemetry/page/page_runner.py", line 175, in ImplicitPageNavigation
test.RunNavigateSteps(self.page, self.tab)
File "../content/test/gpu/../../../tools/telemetry/telemetry/page/page_test.py", line 249, in RunNavigateSteps
page.RunNavigateSteps(action_runner)
File "/tmp/run_tha_testZRby7h/content/test/gpu/gpu_tests/maps.py", line 86, in RunNavigateSteps
'window.testDone', timeout_in_seconds=180)
File "../content/test/gpu/../../../tools/telemetry/telemetry/page/actions/action_runner.py", line 167, in WaitForJavaScriptCondition
self._tab.WaitForJavaScriptExpression(condition, timeout_in_seconds)
File "../content/test/gpu/../../../tools/telemetry/telemetry/core/web_contents.py", line 69, in WaitForJavaScriptExpression
"""))
TimeoutException: Timed out while waiting 180s for IsJavaScriptExpressionTrue.

Labels: -OS-Windows -OS-Mac OS-All
Jul 22, 2014
#11 maruel@chromium.org
I'll speculate the fact it occurs on one host not another one is likely related to the memory layout, e.g. rebooting would cause an ASLR shuffle which would alter reproducibility. I'd be quite surprised if run_isolated is at fault.
Jul 22, 2014
#12 kbr@chromium.org
I modified run_isolated.py locally to save off the temporary directory it creates. Invoking the test via the isolate hung, presumably because of the "Uncaught ReferenceError: vectortown is not defined" error. Subsequently invoking run_gpu_test.py out of the temporary directory succeeded.

Still trying to figure out where the race condition is. It's certainly possible that the first invocation via run_isolated.py looks like a cold start of the browser and subsequent invocations from the local disk look like warm start, masking the race. The failure happens reliably on the GPU try servers and has been seen on all platforms now.

Jul 22, 2014
#13 wfh@chromium.org
+scottmg purely because aslr was mentioned.
Cc: scottmg@chromium.org
Jul 22, 2014
#14 kbr@chromium.org
The problem is 99% certainly a code change in Chromium or subordinate workspaces.

My telemetry_gpu_test isolate build at Chromium r284848 / Blink r178681 reliably hangs.

My telemetry_gpu_test isolate build at Chromium r281891 / Blink r177644 works. (Ran 5 times.)

Given this the problem can be bisected. Will work on this tomorrow.

I had no success reproducing the hangs using isolates from http://build.chromium.org/p/chromium.gpu/builders/GPU%20Win%20Builder . They always ran successfully. I don't know what's the essential difference between builds coming from there and from http://build.chromium.org/p/tryserver.chromium.gpu/builders/win_gpu . Also, as mentioned before, was not able to reproduce the hang using Chromium binaries from the continuous build archive at https://commondatastorage.googleapis.com/chromium-browser-continuous/index.html .

Jul 22, 2014
#15 kbr@chromium.org
Note that in both of the builds above I removed the dcheck_always_on=1 and blink_asserts_always_on=1 GYP_DEFINES. They are Release mode builds without asserts.

Jul 23, 2014
#16 phajdan.jr@chromium.org
(No comment was entered for this change.)
Labels: Infra Build-CommitQueue
Jul 23, 2014
#17 bugdro...@chromium.org
------------------------------------------------------------------
r284937 | kbr@chromium.org | 2014-07-23T16:06:29.108378Z

Changed paths:
   M http://src.chromium.org/viewvc/chrome/trunk/tools/build/scripts/slave/recipes/gpu/build_and_test.expected/win_debug_tryserver.json?r1=284937&r2=284936&pathrev=284937
   M http://src.chromium.org/viewvc/chrome/trunk/tools/build/scripts/slave/recipe_modules/gpu/api.py?r1=284937&r2=284936&pathrev=284937
   M http://src.chromium.org/viewvc/chrome/trunk/tools/build/scripts/slave/recipes/gpu/build_and_test.expected/win_release_tryserver.json?r1=284937&r2=284936&pathrev=284937

Temporarily disable maps_pixel_test on win_gpu_triggered_tests.

Causing CQ backlogs while root cause is being investigated.

BUG=395914

Review URL: https://codereview.chromium.org/413753002
-----------------------------------------------------------------
Jul 23, 2014
#18 kbr@chromium.org
Issue 396273 has been merged into this issue.
Jul 23, 2014
#19 kbr@chromium.org
(No comment was entered for this change.)
Cc: vmp...@chromium.org
Jul 23, 2014
#20 kbr@chromium.org
(No comment was entered for this change.)
Status: Started
Jul 23, 2014
#21 bugdro...@chromium.org
------------------------------------------------------------------
r285048 | kbr@chromium.org | 2014-07-23T21:40:41.145420Z

Changed paths:
   M http://src.chromium.org/viewvc/chrome/trunk/tools/build/scripts/slave/gatekeeper.json?r1=285048&r2=285047&pathrev=285048

Temporarily ignore maps_pixel_test failures while investigating root cause of bug.

BUG=395914

Review URL: https://codereview.chromium.org/418553004
-----------------------------------------------------------------
Jul 23, 2014
#22 kbr@chromium.org
The flakiness was introduced in this Blink roll:

commit a6283237b69e4eedd3c95988883140113e3b30fd
Author: chrome-admin@google.com <chrome-admin@google.com@0039d316-1c4b-4281-b951-d872f2087c98>
Date:   Tue Jul 22 18:37:12 2014 +0000

    Update .DEPS.git
    
    git-svn-id: svn://svn.chromium.org/chrome/trunk/src@284740 0039d316-1c4b-4281-b951-d872f2087c98


Continuing to bisect into that roll.

Cc: mkwst@chromium.org simonha...@chromium.org
Jul 23, 2014
#23 kbr@chromium.org
Sorry, to be clearer:

commit 6be076326b470b9873732cdaa90a78fc98f497e0
Author: blink-deps-roller@chromium.org <blink-deps-roller@chromium.org@0039d316-1c4b-4281-b951-d872f2087c98>
Date:   Tue Jul 22 18:36:27 2014 +0000

    Blink roll 178550:178666
    
    http://build.chromium.org/f/chromium/perf/dashboard/ui/changelog_blink.html?url=/trunk&range=178551:178666&mode=html
    TBR=johnme@chromium.org,pkasting@chromium.org
    BUG=
    
    Review URL: https://codereview.chromium.org/407143002
    
    git-svn-id: svn://svn.chromium.org/chrome/trunk/src@284738 0039d316-1c4b-4281-b951-d872f2087c98

Jul 23, 2014
#24 kbr@chromium.org
After a tedious manual bisect the root cause was:

commit 5d7c2576351c3333f74065a5c92f150abcacc173
Author: mkwst@chromium.org <mkwst@chromium.org@bbb929c8-8fbe-4397-9dbb-9b2b20218538>
Date:   Mon Jul 21 17:42:46 2014 +0000

    Teach ContentSecurityPolicy about WebURLRequest::RequestContext.
    
    Rather than using `Resource::Type` to make decisions about what CSP
    directive to apply to a resource request, we should be using
    WebURLRequest::RequestContext. This patch teaches ContentSecurityPolicy
    to correctly compare a URL against the right policy directive when given
    a 'ResourceContext', and pipes the value through
    'ResourceFetcher::canRequest' to make sure its available when it needs
    to be.
    
    BUG=390497
    
    Review URL: https://codereview.chromium.org/398313002
    
    git-svn-id: svn://svn.chromium.org/blink/trunk@178571 bbb929c8-8fbe-4397-9dbb-9b2b20218538


This CL was already discovered as the cause of  Issue 396582  and has already been reverted on Blink top of tree.

Cc: haraken@chromium.org
Jul 23, 2014
#25 kbr@chromium.org
(No comment was entered for this change.)
Blocking: chromium:390497
Jul 23, 2014
#26 tonyg@chromium.org
Do we understand why the symptoms showed up in this way? I ask because I'm interested whether there's anything Telemetry could have done differently to make this diagnosable without a bisect.
Jul 23, 2014
#27 kbr@chromium.org
No, I don't personally understand. The side-effect looked like a race condition in navigation, but studying https://codereview.chromium.org/398313002 it doesn't look to me like the flow of control changed.

Jul 24, 2014
#28 mkwst@chromium.org
Apologies for the breakage. A race-condition in navigation is a very unexpected side-effect of the change that was reverted; it's not clear to me why that would happen, but I'll investigate today.
Jul 24, 2014
#30 kbr@chromium.org
The flakiness has been addressed and maps_pixel_test has been re-enabled.

Closing as duplicate of  Issue 396582  under which the CL was reverted.

Status: Duplicate
Mergedinto: 396582
Sign in to add a comment

Powered by Google Project Hosting