-
Notifications
You must be signed in to change notification settings - Fork 17.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
runtime: windows/386 SEH corruption #7325
Labels
Milestone
Comments
One of the things that divmod tests is that a division zero causes a panic, a panic that can be caught by a deferred call to recover. Evidently that is not working on Windows. I think it's supposed to work by having runtime·minit call runtime·install_exception_handler which installs an SEH handler that calls runtime·sigtramp. |
I just checked recent history, and I can see everything works fine up until (including): changeset: 19088:19f24bb3c8f8 user: Nicolas Owens <mischief@offblast.org> date: Thu Feb 13 10:26:16 2014 -0500 summary: net: only return unique hosts during hostname lookup on plan 9 version from changeset: 19089:a7a59cfa61d2 user: David du Colombier <0intro@gmail.com> date: Thu Feb 13 16:35:51 2014 +0100 summary: cmd/gc: catch notes on Plan 9 and up until (including) changeset: 19091:223d1c05e7c2 user: Russ Cox <rsc@golang.org> date: Thu Feb 13 11:10:31 2014 -0500 summary: runtime: introduce MSpan.needzero instead of writing to span data don't build on windows. But next version after changeset: 19092:c3319d654b61 user: Dmitriy Vyukov <dvyukov@google.com> date: Thu Feb 13 20:15:19 2014 +0400 summary: cmd/gc: fix windows build already has this defect. Alex |
I applied changeset: 19092:c3319d654b61 user: Dmitriy Vyukov <dvyukov@google.com> date: Thu Feb 13 20:15:19 2014 +0400 summary: cmd/gc: fix windows build right onto changeset: 19089:a7a59cfa61d2 user: David du Colombier <0intro@gmail.com> date: Thu Feb 13 16:35:51 2014 +0100 summary: cmd/gc: catch notes on Plan 9 to fix the build. And I can see now that this issue is started with: changeset: 19090:a8b97f205736 user: Dmitriy Vyukov <dvyukov@google.com> date: Thu Feb 13 19:36:45 2014 +0400 summary: runtime: fix concurrent GC sweep Alex |
I noticed our windows-amd64 builder crashes with different error. I can reproduce it here on my windows-amd64 pc: c:\go\root\test>divmod.exe runtime: newstack framesize=0x0 argsize=0x18 sp=0x28fb50 stack=[0x290000, 0x292fa0] morebuf={pc:0x421074 sp:0x28fb60 lr:0x0} sched={pc:0x410f20 sp:0x28fb58 lr:0x0 ctxt:0x291508} runtime: split stack overflow: 0x28fb50 < 0x290000 fatal error: runtime: split stack overflow runtime stack: runtime.throw(0x467589) c:/go/root/src/pkg/runtime/panic.c:464 +0x74 runtime.newstack() c:/go/root/src/pkg/runtime/stack.c:247 +0x1ff runtime.morestack() c:/go/root/src/pkg/runtime/asm_amd64.s:228 +0x58 goroutine 16 [stack split]: runtime.sighandler(0x2907a0, 0x2902b0, 0xc082012000) c:/go/root/src/pkg/runtime/os_windows_amd64.c:36 fp=0x28fb60 runtime: unexpected return pc for runtime.sigtramp called from 0x2907a0 runtime.sigtramp() ?:0 +0x54 fp=0x28fb68 created by _rt0_go c:/go/root/src/pkg/runtime/asm_amd64.s:97 +0x121 goroutine 17 [syscall]: runtime.notetsleepg(0x2bff60, 0xdf8475800) c:/go/root/src/pkg/runtime/lock_sema.c:263 +0x83 runtime.MHeap_Scavenger() c:/go/root/src/pkg/runtime/mheap.c:531 +0xac runtime.goexit() c:/go/root/src/pkg/runtime/proc.c:1438 created by runtime.main c:/go/root/src/pkg/runtime/proc.c:191 goroutine 18 [runnable]: runtime.gosched() c:/go/root/src/pkg/runtime/proc.c:1412 +0x2a bgsweep() c:/go/root/src/pkg/runtime/mgc0.c:1903 +0x46 runtime.goexit() c:/go/root/src/pkg/runtime/proc.c:1438 created by runtime.gc c:/go/root/src/pkg/runtime/mgc0.c:2187 It looks like stack is suddenly empty. How can that be? And why it only happens on windows? The only windows difference I can think of, is StackSystem (extra room for windows exception handler code)? Is it possible that some of our new code does not account for that? Another possibility is memory corruption? I'm stumped. Looking for suggestions. Alex |
I can reproduced it. http://go-gyazo.appspot.com/c013d0a359029305.png |
I can reproduce it. http://go-gyazo.appspot.com/c013d0a359029305.png |
I think this CL should be fixed now by: changeset: 19326:2750cd9fc49b tag: tip user: Russ Cox <rsc@golang.org> date: Mon Mar 03 19:55:40 2014 -0500 files: src/cmd/gc/typecheck.c test/fixedbugs/issue7310.go description: cmd/gc: fix internal crash TBR=ken2 CC=golang-codereviews https://golang.org/cl/70200053 Status changed to Fixed. |
My mistake. I was meant to say: fixed by runtime: fix traceback on Windows The exception handler runs on the ordinary g stack, and the stack copier is now trying to do a traceback across it. That's never been needed before, so it was unimplemented. Implement it, in all its ugliness. Fixes windows/amd64 build. TBR=khr CC=golang-codereviews https://golang.org/cl/71030043 |
[Hi Hector! Question for you at the bottom.] This is still happening on the windows/386 builder. I was able to connect to a broken divmod with gdb and take a look. For future reference, when you're looking for thread-local storage on 32-bit windows, it seems that the thread structures are allocated 0x7ffde000 moving downward: 0x7ffde000 is thread 1, 0x7ffdd000 is thread 2, and so on: 0x7ffdb000 is thread 4 is as far as I got in this program. Or maybe gdb numbers the threads in reverse and the thread structures are actually allocated upward. Anyway, there is a divide by zero happening (as expected) but the SEH handler for the Go thread has been removed from the SEH chain. That is, the chain should look like 0(FS) = m->seh = 0x31ff78 seh at 0x31ff78 = {0x31ffc4, 0x41ffb0} seh at 0x31ffc4 = {0x31ffe4, 0x77459ac2} seh at 0x31ffe4 = {0xffffffff, 0x774f7b76} The 0x41ffb0 is our handler. the others are windows handlers earlier in the chain. In the broken divmod program, 0(FS) is 0x31ffc4: the go SEH block has been removed. But who removed it? The only code in Go that removes a SEH block is used during calls from C back into Go, and that is not happening here. So it seems like Windows removed it. Why? The only time Windows is supposed to remove a SEH structure is if it declines to handle an exception and then an outer SEH does handle it. The assumption is that the code is returning to something near the outer SEH, so the inner code is gone and the inner SEH should be gone too. The only thing runtime.sighandler passes on is EXCEPTION_BREAKPOINT: switch(info->ExceptionCode) { case EXCEPTION_BREAKPOINT: r->Eip--; // because 8l generates 2 bytes for INT3 return 1; } If Windows sent an EXCEPTION_BREAKPOINT, then Go's handler would pass, an outer handler would handle it, and Go's handler would be removed. But I do not understand why Windows would send an EXCEPTION_BREAKPOINT. But also if this didn't happen in practice it's hard to believe the code would be here at all. Hector, this code was in the very first implementation of Windows exception handling, in https://golang.org/cl/4079041 aka b897d583e187. Clearly you had hit this, because otherwise you wouldn't have thought to do the r->Eip--. How did this come up? Can it come up if a debugger is not involved and the program is not crashing? If it's not the breakpoint explanation, then I have no idea why Windows would remove the SEH block. Maybe there is just a Windows kernel bug, but that's not encouraging. I ran the test (divmod.exe) over and over and over by itself and couldn't make it fail. But it fails somewhat regularly (maybe 1/4 of the time) during the builder all.bat tests. Thread 1 (the main thread, I can tell) is not running Go code at the moment but also has a broken SEH. It is in Windows code that may have pushed its own SEH, but the SEH link pointer it has doesn't make sense, and it does not lead back to the Go SEH frame for that m. It is hard to see what Go could be doing wrong here other than the breakpoint thing, but it is also hard to see how the breakpoint thing could happen. If it's really a Windows bug, our options for working around this seem limited. One possibility would be to always update the SEH head pointer at 0(FS) during runtime.gogo, so that on entry to any goroutine, we know we have the right SEH setting. That's the best I can come up with right now. Anyone else? Hector? Thanks. Russ Status changed to Accepted. |
This issue was updated by revision b2fa6f4. LGTM=bradfitz R=golang-codereviews, bradfitz CC=golang-codereviews https://golang.org/cl/72590044 |
I use Linux these days (and no Windows on my computer), so it's hard for me to do any investigation. And besides I'd have to pick it all up again. But the breakpoint handler is just so that when a debugger is attached and a breakpoint is set in the code, it will break into the debugger. The IP correction is so that the debugger points at the right location (otherwise it'd be off-by-one). |
Russ, I saw this bug on my windows XP. Also see comments in this issue from others. Also our builder is not windows 2000. So it is not specific to windows 2000. I didn't see this bug since your last fix, but I didn't try hard enough. I will try again, but I have no competes here. I will try next week. Alex |
Russ, I also think we need to consider scenario of exception happens while running go exception handler. Out exception handler can do anything our normal program does. None of exception handler functions are marked as no-grow-stack. Perhaps it is good idea to rewrite handler to do very minimum. Alex |
I've looked at a handful of these on the builder. The typical failure trace looks like this: (gdb) thread 3 [Switching to thread 3 (Thread 664.0x9a4)] #0 0x774b5f4f in ntdll!_chkstk () from C:\Windows\system32\ntdll.dll (gdb) where #0 0x774b5f4f in ntdll!_chkstk () from C:\Windows\system32\ntdll.dll #1 0x75f60262 in OpenProfileUserMapping () from C:\Windows\system32\kernel32.dll #2 0x003a06dc in ?? () #3 0x75f3fa12 in KERNEL32!GetApplicationRestartSettings () from C:\Windows\system32\kernel32.dll #4 0x75f5e080 in OpenProfileUserMapping () from C:\Windows\system32\kernel32.dll #5 0xffffffff in ?? () #6 0x003a0708 in ?? () #7 0x75f4042e in UnhandledExceptionFilter () from C:\Windows\system32\kernel32.dll #8 0x003a07e4 in ?? () #9 0x774c80f9 in ntdll!EtwpNotificationThread () from C:\Windows\system32\ntdll.dll #10 0x77491696 in ntdll!RtlInitializeNtUserPfn () from C:\Windows\system32\ntdll.dll #11 0x00000000 in ?? () (gdb) Many of these symbols are bogus: gdb only knows about symbols in the dynamic link tables, so many of these are just the last exported symbol before a large chunk of unexported symbols. However, ntdll!_chkstk is accurate, as is KERNEL32!GetApplicationRestartSettings. ntdll!_chkstk is what Windows calls from any function that allocates a large stack frame (>4kB). It works by touching one word per new page created, so that if you have just a single 4kB guard page for your stack, you can't skip over it. The problem here is that this code is running on a Go stack, presumably in response to the divide-by-zero. The function mislabeled as OpenProfileUserMapping is trying to allocate a 4300-byte stack frame, but Go only leaves 2 kB for Windows (StackSystem in runtime.h). The hope was that 2k is "big enough" but apparently it is not. This happens in divmod with some regularity because I wrote divmod to generate test cases using a deeply recursive function. Eventually this check ends up running near the bottom of a stack that happens to have nothing mapped just before it in memory, and _chkstk works as intended and faults. The new copying stacks exacerbate the problem, but it could happen in the old stacks too. I believe it started happening reliably when we reenabled concurrent sweep only because concurrent sweep created a new couple stack frames and therefore disrupted the usual allocation patterns. I would like to see a disassembly of the KiUserExceptionDispatcher implementation [1] along with the things it uses, but I have been unable to produce one. Gdb seems not to know enough about the symbols, and windbg is inscrutable. I have also observed that the SEH information is often corrupt in the thread with the problem. It is possible that this stack overflow is a distraction and only happens because the Go SEH registration has been lost and we're stuck in Windows SEH routines. It is also possible that this stack overflow - happening on some other thread, earlier, undetected - is why the SEH registrations are being lost. I think that's not too likely because it makes the most sense for GetApplicationRestartSettings to be called during a crash, and we are trying to understand how the crash came to be, not bugs revealed during its handling. Of course, it could be some other stack overflow. What else might Windows be running at the bottom of the goroutine stacks that we're unaware of? On the SEH registration problem, one theory is that when Windows code runs on the Go stack, it registers and then deregisters its own SEH handlers. Normally we make sure that the SEH chain is on the m (aka g0, aka OS) stack, but this Windows code is executing on an arbitrary Go stack. It may be confused by seeing SEH registrations not on the OS stack. In particular, the thread information block contains two words giving the bounds of the OS stack, and some sources online claim that if Windows sees SEH entries outside those bounds it will ignore/reject them somehow [2]: The OS is pretty paranoid about corrupt stacks during this chain traversal. It checks that all chain entries are within the bounds of the stack. (These bounds are also recorded in the TEB). The OS also checks that all entries are in ascending order on the stack. If you violate these rules, the OS will consider the stack to be corrupt and will be unable to process exceptions. This is one of the reasons that a Win32 application cannot break its stack into multiple disjoint segments as an innovative technique for dealing with stack overflow. We have certainly not observed that, however, and I think if it were true we'd have seen more problems. I wonder if the deregistration of handlers during an exception is based on stack address. If so, suppose that the Go SEH handler is registered on the OS stack at address 0x6fff0, and then we make a Go stack at 0x1010000, and then Windows code running on that Go stack pushes a SEH handler or two around 0x100fff0, and then an exception happens. It is possible that the unwind of the SEH stack during the exception is a while loop that waits until it finds an entry larger than the stack address being returned to. (That's similar to what we do in panic.) Windows would see the 0x6fff0 address as "lower" on the stack than 0x100fff0, instead of realizing that they are two different stacks entirely, and incorrectly remove the top-level SEH handler we've registered along with the others. The next exception would not be caught by the now-removed Go SEH, and we end up seeing the failure we see today. It is possible that we could bypass some of this by switching to Vectored Exception Handling [3]. VEH is nice because it applies to the whole program, not just one thread, so they are not constantly being registered and deregistered. We can register it once and not worry about the handler not being called or being overridden by Windows code due to confusion about stack addresses. Go's VEH handler could pick off just the entries with a PC in the Go text segment and let the others fall through to the other handlers and then eventually to SEH. This is similar to what we do on windows/amd64, except that on amd64 the handler information is in the executable header instead of registered at run time. If the VEH implementation is low-level enough and doesn't call too many other functions, it might fit in the 2k we are currently reserving. Using VEH would exclude Windows 2000 (requiring Windows XP/Windows Server 2003 or later). Another possibility is to have a separate SEH chain for every goroutine stack, and make the goroutine switch update all three words in the thread information block: stack bottom, stack top, stack, SEH chain. This would mean that the Windows per-thread SEH handlers would be excluded from executing (ever), which might break debug breakpoints (the only time Go's SEH handler ever declines an opportunity to handle an exception). It is also possible that we are causing this problem ourselves, similar to issue #7470. However, we do next to no manipulation of the SEH ourselves, and the code that was buggy in 7470 is not involved at all in test/divmod.go (no cgo). I really don't know what to do here. None of the options are great. Using VEH is probably simplest, if we can drop win2k and if it helps, but there's no guarantee it will fix the general problem. It's easy to try though, I guess. [1] http://www.microsoft.com/msj/0197/exception/exception.aspx [2] http://blogs.msdn.com/b/cbrumme/archive/2003/10/01/51524.aspx [3] http://msdn.microsoft.com/en-us/magazine/cc301714.aspx (gdb) x/20i 0x75f60253 0x75f60253 <OpenProfileUserMapping+20435>: mov %edi,%edi 0x75f60255 <OpenProfileUserMapping+20437>: push %ebp 0x75f60256 <OpenProfileUserMapping+20438>: mov %esp,%ebp 0x75f60258 <OpenProfileUserMapping+20440>: mov $0x10cc,%eax 0x75f6025d <OpenProfileUserMapping+20445>: call 0x75f60fc5 <OpenProfileUserMapping+23877> 0x75f60262 <OpenProfileUserMapping+20450>: mov 0x75f6e4ac,%eax 0x75f60267 <OpenProfileUserMapping+20455>: xor %ebp,%eax 0x75f60269 <OpenProfileUserMapping+20457>: mov %eax,-0x4(%ebp) 0x75f6026c <OpenProfileUserMapping+20460>: push %ebx 0x75f6026d <OpenProfileUserMapping+20461>: mov 0x8(%ebp),%ebx NOTE: 75f6e4ac is a PLT entry jumping to 0x774b5f28 (gdb) x/30i 'ntdll!_chkstk' 0x774b5f28 <ntdll!_chkstk>: push %ecx 0x774b5f29 <ntdll!_chkstk+1>: lea 0x4(%esp),%ecx 0x774b5f2d <ntdll!_chkstk+5>: sub %eax,%ecx 0x774b5f2f <ntdll!_chkstk+7>: sbb %eax,%eax 0x774b5f31 <ntdll!_chkstk+9>: not %eax 0x774b5f33 <ntdll!_chkstk+11>: and %eax,%ecx 0x774b5f35 <ntdll!_chkstk+13>: mov %esp,%eax 0x774b5f37 <ntdll!_chkstk+15>: and $0xfffff000,%eax 0x774b5f3c <ntdll!_chkstk+20>: cmp %eax,%ecx 0x774b5f3e <ntdll!_chkstk+22>: jb 0x774b5f4a <ntdll!_chkstk+34> 0x774b5f40 <ntdll!_chkstk+24>: mov %ecx,%eax 0x774b5f42 <ntdll!_chkstk+26>: pop %ecx 0x774b5f43 <ntdll!_chkstk+27>: xchg %eax,%esp 0x774b5f44 <ntdll!_chkstk+28>: mov (%eax),%eax 0x774b5f46 <ntdll!_chkstk+30>: mov %eax,(%esp) 0x774b5f49 <ntdll!_chkstk+33>: ret 0x774b5f4a <ntdll!_chkstk+34>: sub $0x1000,%eax 0x774b5f4f <ntdll!_chkstk+39>: test %eax,(%eax) 0x774b5f51 <ntdll!_chkstk+41>: jmp 0x774b5f3c <ntdll!_chkstk+20> 0x774b5f53 <ntdll!_chkstk+43>: nop 0x774b5f54 <ntdll!_chkstk+44>: nop 0x774b5f55 <ntdll!_chkstk+45>: nop 0x774b5f56 <ntdll!_chkstk+46>: nop 0x774b5f57 <ntdll!_chkstk+47>: nop 0x774b5f58 <ntdll!_chkstk+48>: nop 0x774b5f59 <ntdll!_chkstk+49>: nop |
Another link talking about SEH double-checking the stack pointer. http://sysexit.wordpress.com/2014/02/12/a-brief-reverse-engineering-note-on-structured-exception-handling-after-stack-pivoting/ |
Thanks for the pointer to setstacklimits. I was trying to find it earlier but misremembered it being in the 386 port. I tried setting the bounds in install_exception_handler and that produced reliable crashes much earlier than before. I think the problem is that Windows uses those numbers to decide when it is okay to allocate more stack on demand, so if they are lying and Windows faults on the OS stack, Windows thinks something terrible happened instead of just growing the stack. We could flip them back when running on the OS stack, of course, but that's a more involved change to try. If we find something like this that fixes the divmod problem, then maybe we just back away slowly and call it fixed, but I'm still a bit worried about the inversion of memory addresses along the SEH chain. I expect that is involved somehow, but perhaps I'm wrong. This is one of those times when on every other system you can just go to lxr/fxr and look at the source code. Oh well. |
[2] http://blogs.msdn.com/b/cbrumme/archive/2003/10/01/51524.aspx looks like what we need. We might even use it for both 386 / amd64. Is it something we want to try and do now? Alex |
Sorry, Russ, I meant to say we should use AddVectoredExceptionHandler just like in [3] http://msdn.microsoft.com/en-us/magazine/cc301714.aspx We should be able to use it for for both 386 and amd64. Define handler once per program: will work for all threads and callbacks. Hopefully we don't have to worry about AddVectoredExceptionHandler stack, like we do here. Maybe we can even leave most of current SEH code alone. It can get called for win2k. It is broken, but it was broken before. Alex |
Using AddVectoredExceptionHandler sounds good to me. If it works I'd like to remove the SEH code instead of leaving it for win2k. It won't be tested and it has been a source of problems in the past (and continues to be). Much better to pull it out and stop wasting our development resources maintaining code for systems that even Microsoft doesn't support anymore. |
A diff related to https://golang.org/cl/74790043/ Alex Attachments:
|
This issue was closed by revision 3750904. Status changed to Fixed. |
This issue was closed.
Sign up for free
to subscribe to this conversation on GitHub.
Already have an account?
Sign in.
Attachments:
The text was updated successfully, but these errors were encountered: