UberBug
Discussion page for the current bug.

Problem
With timer interrupts enabled, registers r0 and r3 occasionally become clobbered with the return code from the previous system call.

To reproduce
Run param_test.elf with a terminal at 115.2 kbps connected to COM1 (instead of the train).

Details
Based on our debugging efforts so far, the following appear to be true:

Theories
evan.stratford
From the above, this is a likely control flow of the bug:
1. ParamTest() system call executes successfully and jumps to after swi
2. r0 contains return value for ParamTest()
3. mov r3, r0 is executed
4. timer interrupt
5. TimerHandler() wakes up ClockNotifier()
6. ClockNotifier() notifies ClockServer() and waits for next interrupt
7. ClockServer() registers tick and waits for another notification
8. // stuff happens...
9. user program returns from timer interrupt at top of ParamTest() // at this point, r3 and r0 contain the previous return value
10. software interrupt from ParamTest()
11. Handle() calls SysParamTest()
12. SysParamTest() notices corrupted parameters, prints error

It is unclear what happens afterwards: why does it sometimes hang, and why does it sometimes loop forever inside of ParamTest()? There are a few plausible explanations:

Post-Mortem
We tracked down this bug to the following snippet within context_switch.s:

KernelEnter:
str lr, shared_lr @ shared_lr gets lr_svc
msr cpsr_c, #0x1f @ switch to system mode
ldr lr, shared_lr @ ERROR! lr_usr gets lr_svc
stmfd sp!, {r0-r12,lr} @ push user state onto user stack

You read that right: we were overwriting the user mode link register with the supervisor mode link register, which holds the address of the user mode instruction after the interrupt (or that address plus 4, in the case of a hardware interrupt).

So why did it work at all until now? GCC does some special stack magic at the top and bottom of each function call:

mov ip, sp
@ ERROR: critical location!
stmfd sp!, {fp, ip, lr, pc}
@ rest of function code
ldmfd sp, {fp, sp, pc}

Since GCC saves the link register to the stack, this problem remained hidden for the system call assignments. Enter hardware interrupts. If a hardware interrupt occurs before the store-multiple instruction, the stack copy of the link register is corrupted as well: the function then returns into itself instead of to its caller, and you enter a very strange kind of infinite loop. To make matters worse, attempting to debug the problem with terminal output decreases the probability that a timer interrupt will happen at the critical location above. Yes, this is a Heisenbug.

What did we learn in debugging this? Our initial efforts were all over the map: we checked the name server, the I/O interrupt logic, even the various kernel data structures. After two or three days of poking around, we came to the conclusion that should have been immediate: nondeterministic problems suggest low-level causes. (Remember that neither of us has prior hardware experience!) Let this be a lesson to aspiring Real-Timers everywhere.
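
To make the failure mode from the post-mortem concrete, here is a minimal sketch in C. It is a toy model, not our kernel code: the Cpu struct, kernel_round_trip(), return_target(), and all of the addresses are hypothetical names invented for this illustration. It tracks only the user mode link register and asks one question: where does the function's epilogue jump to, depending on whether the interrupt lands before or after the prologue's stmfd?

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical addresses, chosen only for illustration. */
#define CALLER_RETURN 0x1000u /* instruction after the caller's bl */
#define FUNC_ENTRY    0x2000u /* mov ip, sp */
#define FUNC_STMFD    0x2004u /* stmfd sp!, {fp, ip, lr, pc} */
#define FUNC_BODY     0x2008u /* rest of function code */

/* Toy CPU: only the user mode link register matters here. */
typedef struct {
    uint32_t lr_usr;
} Cpu;

/*
 * Model one trip through the kernel and back. resume_addr is where the
 * user task continues afterwards (the value in the banked lr on entry).
 */
static void kernel_round_trip(Cpu *cpu, uint32_t resume_addr, bool buggy)
{
    assert(cpu != NULL);
    uint32_t shared_lr = resume_addr;   /* str lr, shared_lr */
    if (buggy) {
        cpu->lr_usr = shared_lr;        /* ldr lr, shared_lr: clobbers lr_usr */
    }
    uint32_t saved_lr = cpu->lr_usr;    /* stmfd sp!, {r0-r12,lr} */
    cpu->lr_usr = saved_lr;             /* restored when the task resumes */
}

/* Returns the address that the function's epilogue jumps to. */
uint32_t return_target(bool buggy, bool irq_before_stmfd)
{
    Cpu cpu = { .lr_usr = CALLER_RETURN };  /* set by the caller's bl */

    if (irq_before_stmfd) {
        /* Interrupt lands after mov ip, sp; the task resumes at the stmfd. */
        kernel_round_trip(&cpu, FUNC_STMFD, buggy);
    }

    uint32_t stacked_lr = cpu.lr_usr;       /* prologue stmfd saves lr */

    if (!irq_before_stmfd) {
        /* Interrupt lands in the body; the stack copy is already safe. */
        kernel_round_trip(&cpu, FUNC_BODY, buggy);
    }

    return stacked_lr;                      /* epilogue loads pc from stack */
}
```

In this model, return_target(true, true) yields FUNC_STMFD, so the function returns into its own prologue and loops. The other three combinations yield CALLER_RETURN, which matches why the bug stayed hidden while only system calls entered the kernel.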