| Issue 41755: | Race condition in NDK's native event handling causes ANR | |
| 26 people starred this issue and may be notified of changes. | Back to list |
Sign in to add a comment
|
There is an issue in NDKr8b which is easily demonstrated with the native_plasma example that comes with the NDK.
When generating events from different sources/devices, the app freezes causing an ANR.
STEPS TO REPRODUCE:
1. Build native_plasma example.
2. Connect a gamepad to Android device.
3. Run native_plasma.
4. While creating touch events, jiggle a joystick on the device
NOTE: I've also been able to produce the freeze without a gamepad, and by using the hardware volume controls while swiping. However, it is a lot harder to reproduce this way, compared to jiggling joysticks.
EXPECTED RESULTS:
Continued operation of the app.
OBSERVED RESULTS:
App freezes, system creates ANR.
ADDITIONAL INFORMATION:
From the logs:
I/InputDispatcher( 281): Application is not responding: AppWindowToken{41621830 token=Token{416e2f10 ActivityRecord{412eaf18 com.example.native_plasma/android.app.NativeActivity}}} - Window{4137bce0 com.example.native_plasma/android.app.NativeActivity paused=false}. It has been 5000.9ms since event, 5000.4ms since wait started. Reason: Waiting because the focused window has not finished processing the input events that were previously delivered to it.
I/WindowManager( 281): Input event dispatching timed out sending to com.example.native_plasma/android.app.NativeActivity
I/Process ( 281): Sending signal. PID: 28191 SIG: 3
I/dalvikvm(28191): threadid=3: reacting to signal 3
I/dalvikvm(28191): Wrote stack traces to '/data/anr/traces.txt'
I/Process ( 281): Sending signal. PID: 281 SIG: 3
I/dalvikvm( 281): threadid=3: reacting to signal 3
E/SensorsHal( 281): poll() failed (Interrupted system call)
I/dalvikvm( 281): Wrote stack traces to '/data/anr/traces.txt'
I/Process ( 281): Sending signal. PID: 439 SIG: 3
I/dalvikvm( 439): threadid=3: reacting to signal 3
I/dalvikvm( 439): Wrote stack traces to '/data/anr/traces.txt'
I/Process ( 281): Sending signal. PID: 517 SIG: 3
I/dalvikvm( 517): threadid=3: reacting to signal 3
I/dalvikvm( 517): Wrote stack traces to '/data/anr/traces.txt'
D/dalvikvm( 281): GC_CONCURRENT freed 2155K, 37% free 14878K/23367K, paused 2ms+7ms, total 66ms
D/dalvikvm( 281): WAIT_FOR_CONCURRENT_GC blocked 33ms
D/dalvikvm( 281): WAIT_FOR_CONCURRENT_GC blocked 0ms
D/dalvikvm( 281): GC_EXPLICIT freed 441K, 37% free 14874K/23367K, paused 4ms+4ms, total 69ms
I/libplasma(28191): frame/s (avg,min,max) = (13.8,10.4,16.3) render time ms (avg,min,max) = (72.3,61.1,95.7)
E/Lights ( 281): write_int failed to open sys/class/leds/rk29_key_led/brightness
E/ActivityManager( 281): ANR in com.example.native_plasma (com.example.native_plasma/android.app.NativeActivity)
E/ActivityManager( 281): Reason: keyDispatchingTimedOut
E/ActivityManager( 281): Load: 0.37 / 0.34 / 0.24
E/ActivityManager( 281): CPU usage from 6219ms to 576ms ago:
E/ActivityManager( 281): 99% 28191/com.example.native_plasma: 93% user + 5.4% kernel
E/ActivityManager( 281): 2.8% 281/system_server: 1.5% user + 1.2% kernel / faults: 93 minor
E/ActivityManager( 281): 1.5% 96/surfaceflinger: 0.5% user + 1% kernel
E/ActivityManager( 281): 0.3% 1709/adbd: 0.3% user + 0% kernel
E/ActivityManager( 281): 0.1% 64/rknand_buffer: 0% user + 0.1% kernel
E/ActivityManager( 281): 0.1% 99/mediaserver: 0.1% user + 0% kernel
E/ActivityManager( 281): 0.1% 27657/kworker/u:0: 0% user + 0.1% kernel
E/ActivityManager( 281): 0.1% 27844/kworker/u:1: 0% user + 0.1% kernel
E/ActivityManager( 281): 0.1% 28018/kworker/0:3: 0% user + 0.1% kernel
E/ActivityManager( 281): 0% 28233/logcat: 0% user + 0% kernel / faults: 1 minor
E/ActivityManager( 281): 50% TOTAL: 47% user + 3.4% kernel
E/ActivityManager( 281): CPU usage from 640ms to 1155ms later with 99% awake:
E/ActivityManager( 281): 96% 28191/com.example.native_plasma: 86% user + 9.8% kernel
E/ActivityManager( 281): 96% 28262/e.native_plasma: 86% user + 9.8% kernel
E/ActivityManager( 281): 2.7% 439/com.android.systemui: 2.7% user + 0% kernel / faults: 2 minor
E/ActivityManager( 281): 1.3% 439/ndroid.systemui: 1.3% user + 0% kernel
E/ActivityManager( 281): 1.3% 96/surfaceflinger: 0% user + 1.3% kernel
E/ActivityManager( 281): 1.3% 246/SurfaceFlinger: 1.3% user + 0% kernel
E/ActivityManager( 281): 1.3% 251/EventThread: 0% user + 1.3% kernel
E/ActivityManager( 281): 53% TOTAL: 47% user + 6.7% kernel
Dec 21, 2012
#1
b.st...@gmail.com
Feb 8, 2013
I was able to reproduce this by running monkey on my native app: http://developer.android.com/tools/help/monkey.html If I disable trackball events using "--pct-trackball 0", the problem does not occur.
Mar 10, 2013
And to add more on ^ comment, if we make trackball events 100%["--pct-trackball 100"], then also problem does not occur.
Mar 21, 2013
For the record, a potential fix is at https://android-review.googlesource.com/#/c/54180/
Cc:
di...@android.com andrewhs...@google.com
Mar 21, 2013
For Native Activity, No ANR captured on monkey with the above FIX.
Mar 22, 2013
As this change is in NDK layer, it will resolve ANR only for Native Activity App But, this issue also happens sometime with normal Activities like in Launcher, Gallery3d etc. I've done little bit more analysis on this issue and come to know that the issue is with InputTransport.cpp which is responsible for sending event-message to the Looper of respective App Overall code flow: Monkey -> Input Dispatcher -> Input Transport ->...sendMessage()...-> Looper of respective App 1) Input Dispatcher publishes the event via Input Transporter and adds event to the waitQueue*. 2) But sometimes Input Transporter is not able to process the MOTION batch event if there is any next message ready from the Input Dispatcher and thus the previous batch event is ignored. 3) So the ignored MOTION batch events will not respond forever and Input Dispatcher will fire ANR after 5 sec based on waitQueue entry. (*waitQueue: Queue of events that have been published to the connection but that have not yet received a "finished" response from the application.) Change that would resolve ANR on both native & normal Activities is at https://android-review.googlesource.com/#/c/54490/
Mar 28, 2013
Hey guys not a developer but a user of custom AOSP builds on my GNex Toro... It seems with this "Fix" merged it caused severe lag in some apps. One example of this is timed word completion app called "Ruzzle" on play store. With this commit the game has serious lag completing words, and the logcat shows really long wait times to process input events. Just figured I would throw this out there for some devs to see and possibly check out.
Mar 28, 2013
Sorry my above comment only pertains to this patch - https://android-review.googlesource.com/#/c/54490/ From #6 not the one from #4
Apr 18, 2013
I'm currently developing a game that has a similar problem with InputDispatcher and MotionEvents on a Nexus 7 (4.2.2) tablet. The game locks up for up to 10 seconds and sometimes there's also the typical ANR dialog. This problem is very rare, though but during my extensive tests it happens at least once a week. I do not know what may be causing it because everything is running in separate threads and the UI thread is not bothered at all. The UI thread is only used for quickly accepting MotionEvents and putting the data in a thread safe queue that is automatically processed by a separate game loop thread. I do not have that random lock up problem on other test devices so I'm a bit lost here. What I also regularly observe in the system log is that other stuff in different processes is often going on shortly before that problem happens. Three different processes (I could not identify yet) cause heavy garbage collection. The duration of the garbage collection seems to correlate with the duration of that input dispatcher block. But how on earth is my game's app process held responsible for that, stalled and finally punished with an ANR dialog that looks like it's broken? If anybody has any information about this issue - possible causes and so on, please drop me an email!
Apr 18, 2013
(No comment was entered for this change.)
Cc:
michae...@android.com
Apr 22, 2013
Is there any fix for this issue? The fix provided works for native app only. Is there anything for the JAVA apps like com.android.gallery3d which are facing the same issue? Thanks Uday
Apr 30, 2013
If you use the patch in your input event to process the surplus events you can avoid the ANR as a workaround. Adapted from the NativeActivity example. http://pastebin.com/HATaYDG9 Works without patching the NDK.
Jun 21, 2013
Fixed by this change: https://android-review.googlesource.com/#/c/54180/1
Status:
FutureRelease
Jun 27, 2013
Hello I want to know it is https://android-review.googlesource.com/#/c/54490/ fixed the anr or https://android-review.googlesource.com/#/c/54180/1
Jul 8, 2013
AFAIK https://android-review.googlesource.com/#/c/54180/1 fix the native anr.
Jul 22, 2013
I'm getting lot of ANRs in Launcher, https://android-review.googlesource.com/#/c/54180/1 provides fix only for native applications, when Google is providing fix for non-native apps?
Jul 22, 2013
Thanks for the feedback. Please open a separate bug to track ANR issue in non-native apps. Thanks!
Jul 28, 2013
(No comment was entered for this change.)
Status:
Released
Sep 5, 2013
Hi, I was the dev that proposed the work around for this issue by doing the 'while loop'. This 'fix' made it in the official tree for the NDK, but I have to report that this case needs more examination. Although the repeated event reading does indeed dramatically lower the chances of a hang, it will not prevent it. The analytics from Developer Console still show plenty occurrences in my app. I strongly suggest pursuing the avenue laid out by comment #6, or else seriously recheck Android's event handling system for race conditions, as I am pretty sure there is a race condition in the system somewhere. As of now, I regret proposing the work-around. All it managed to do is mask a serious problem. Because occurrences go down, the real issue is less likely to get examined.
Dec 9, 2013
Hi, Have the concerns expressed on #20 by the dev that proposed the 'while loop' been addressed? Three months have passed and no response to that post...
Dec 9, 2013
This is indeed still a severe issue that is now neglected by the reduced occurrence. Please revisit. I get 40 crash reports a week in my developer console. Here is another report I received today from one of my customers who was proficient with adb, so could get me the console output. Also revisit the proposed patch by #6, or better yet: completely overhaul the event handling system. The current code is a liability.
Dec 9, 2013
> As of now, I regret proposing the work-around. That isn't a workaround, it was a bug in the NDK. Events can be sent in batches, so it's necessary to completely handle all of the events that are sent rather than just the first. In more modern versions of Android the old code would still work since the event notification mechanism is now based around epoll (i.e. is level triggered), but it would add unnecessary overhead to the processing step. >I get 40 crash reports a week in my developer console. Here is another report I received today from one of my customers who was proficient with adb, so could get me the console output. Unfortunately, this snippet leaves out a ton of information necessary to even begin diagnosing any possible issues. The state of the input system, other threads in your process, etc. The input system seems to think you haven't finished processing your last input event though (>5s of work before calling finishEvent, or there's a code path in which you do not call finishEvent at all). Having said all of that, just because you see ANRs in your dev console doesn't mean that there is currently an issue in the input system. The NDK is essentially just a thin wrapper around platform provided functionality, meaning it differs from device to device, and from Android version to Android version. It's entirely possible you're seeing a bug in some manufacturers modification to the input system. Or maybe you're seeing one that existed in 4.0 but was fixed in 4.1. Without knowing the details of the metrics, or of your application, it's impossible to really form any sort of coherent opinion on where the issue might be. Personally, I think we'd see a lot more reports of spurious ANRs from developers, including plenty from Google, if this were truly a systemic issue. For what it's worth, the native input system *was* rewritten in 4.3 in order to utilize the same code paths that the standard input system goes through. This was largely to avoid duplicating work, however.
Feb 21, 2014
I see 11 cases of ANR reported over the last day or so, on Google Play, for my app. In all these cases, my app's mainloop was identical. Looking at the callstack, three things strike me: (1) libmali.so is always involved. Could it be that it is a mali specific bug? Or is it just coincidence as Mali is the most prominent gpu in android land? (2) the main thread is waiting on a mutex for the gl driver. So it seems that it's not something bad that I do to the main thread. The key dispatch timed out not because of a blocked thread I think, rather because it somehow got lost? (3) I get zero hits in google on shim_glBindFramebuffer. What does the shim_ prefix mean? shm is a common prefix for shared memory related names. But shim_ ? "Thread-515" prio=5 tid=10 NATIVE | group="main" sCount=1 dsCount=0 obj=0x41f22ff0 self=0x40f11758 | sysTid=14087 nice=0 sched=0/0 cgrp=apps handle=1550005552 | schedstat=( 0 0 0 ) utm=52 stm=13 core=1 #00 pc 0000dce0 /system/lib/libc.so (__futex_syscall3+8) #01 pc 000121bc /system/lib/libc.so #02 pc 000589e8 /system/lib/libMali.so #03 pc 00066b9c /system/lib/libMali.so #04 pc 00019464 /system/lib/libMali.so #05 pc 00019810 /system/lib/libMali.so #06 pc 0003b2c0 /system/lib/libMali.so (shim_glBindFramebuffer+72) #07 pc 00033ed3 /data/data/com.steenriver.littlecrane/lib/liblittlecrane.so (android_main+178) #08 pc 00034327 /data/data/com.steenriver.littlecrane/lib/liblittlecrane.so #09 pc 00012e60 /system/lib/libc.so (__thread_entry+48) #10 pc 000125b8 /system/lib/libc.so (pthread_create+172) #11 pc 000004a4 <unknown> at dalvik.system.NativeStart.run(Native Method)
Feb 24, 2014
It's entirely possible that there's a graphic driver bug that's causing 5+ second long hangs, which would trigger an ANR. This has nothing to do with the NDK pipeline though; events aren't getting lost somewhere along the way, you're just hung and never get around to handling the events (so as far as the input pipeline and the user is concerned, you're not responsive). You should probably report this to the device manufacturer and / or ARM (since they make libmali), but there's not much to be done here from our side.
Jun 16, 2014
Hello all, I saw the link https://android-review.googlesource.com/#/c/54180/1 and it is said to be the change in NDK to fix this issue, but as far as i know, the code runs on the device based on the NDK that device framework has right, not with the NDK that we are compiling with, so even I make changes in my NDK file, i dont think it will effect the output on my device. This is the biggest issue i have and this is the biggest doubt i have. Please clarify i am stuck with this issue from the past few weeks. Waiting for your reply
Sep 16, 2014
We got lots of such ANR reports from android 4.4, BTW it's a non-native app. Where can I find the relative fix for non-native app, thanks.
Sep 16, 2014
(No comment was entered for this change.)
Cc:
jeffbr...@android.com
Sep 16, 2014
#26: You're kind of right, but you're looking at the wrong boundary. The code inside of AInputQueue_getEvent (and any of the similar ones) is actually what you're linking against and thus is device dependent. All of the code around it is part of your app. #27: This is strictly for ANRs seen in the NDK. At this point we're not aware of any pervasive system issues causing ANRs, so its more likely that your app has a bug in it, or specific vendor has an issue with their device. Regardless, please file another bug if you think there really is a platform bug causing it.
Feb 20, 2015
Even I am also facing same issue on 4.1.2. #00 pc 0000da70 /system/lib/libc.so (epoll_wait+12) #01 pc 000147f9 /system/lib/libutils.so (android::Looper::pollInner(int)+96) #02 pc 00014a61 /system/lib/libutils.so (android::Looper::pollOnce(int, int*, int*, void**)+104) #03 pc 00063d6f /system/lib/libandroid_runtime.so (android::NativeMessageQueue::pollOnce(_JNIEnv*, int)+22) #04 pc 0001f370 /system/lib/libdvm.so (dvmPlatformInvoke+112) #05 pc 0004e0b9 /system/lib/libdvm.so (dvmCallJNIMethod(unsigned int const*, JValue*, Method const*, Thread*)+360) #06 pc 00000214 /dev/ashmem/dalvik-jit-code-cache (deleted) at android.os.MessageQueue.nativePollOnce(Native Method) at android.os.MessageQueue.next(MessageQueue.java:125) at android.os.Looper.loop(Looper.java:124) at android.app.ActivityThread.main(ActivityThread.java:4967) at java.lang.reflect.Method.invokeNative(Native Method) at java.lang.reflect.Method.invoke(Method.java:511) at com.android.internal.os.ZygoteInit$MethodAndArgsCaller.run(ZygoteInit.java:1011) at com.android.internal.os.ZygoteInit.main(ZygoteInit.java:778) at dalvik.system.NativeStart.main(Native Method) "AudioRecord" prio=10 tid=27 NATIVE | group="main" sCount=1 dsCount=0 obj=0x4233dd90 self=0x60941890 | sysTid=11107 nice=-16 sched=0/0 cgrp=apps handle=1720559608 | schedstat=( 0 0 0 ) utm=15 stm=0 core=0 #00 pc 0000d320 /system/lib/libc.so (nanosleep+12) #01 pc 0001a9ef /system/lib/libc.so (usleep+30) #02 pc 0004cad9 /system/lib/libmedia.so (android::AudioRecord::processAudioBuffer(android::sp<android::AudioRecord::ClientRecordThread> const&)+196) #03 pc 0004cb75 /system/lib/libmedia.so (android::AudioRecord::ClientRecordThread::threadLoop()+28) #04 pc 0001106b /system/lib/libutils.so (android::Thread::_threadLoop(void*)+94) #05 pc 0004c7f5 /system/lib/libandroid_runtime.so (android::AndroidRuntime::javaThreadShell(void*)+44) #06 pc 00010be5 /system/lib/libutils.so #07 pc 00012e70 /system/lib/libc.so (__thread_entry+48) #08 pc 000125c8 /system/lib/libc.so (pthread_create+172) at dalvik.system.NativeStart.run(Native Method) Any fix on this issue.Thanks in Advance.. |
||||||||||
| ► Sign in to add a comment | |||||||||||