psych-os - UberBug.wiki

Problem

With timer interrupts enabled, registers r0 and r3 are occasionally clobbered with the return code from the previous system call.

To reproduce

Run param_test.elf with a terminal at 115.2 kbps connected to COM1 (instead of the train).

Details

Based on our debugging efforts so far, the following appear to be true:

Turning off timer interrupts causes the problem to disappear.
Speeding up the timer interrupts by writing a lower value to timer3->load causes the problem to happen sooner. This strongly suggests that timer interrupts are somehow responsible.
We've fixed a couple of issues relating to overwriting registers during an IRQ mode context switch. This does not fix the problem.
The registers are clobbered before we enter the kernel; from the kernel's perspective, we are getting the correct parameters.
Link registers, etc. are being preserved properly across the context switch. Along with the previous item, this suggests that the context switch is working as expected.
When the system freezes, timer3->value is usually near the top of its countdown (23969 in all runs, compared to a maximum of 25400).
When the system freezes, the previous timer interrupt seems to have the top of ParamTest() as the link register.
The system does not always freeze. When it does not, the previous return value is repeatedly written into r0 and r3. The return value is correct, yet the program fails to exit the loop in param_test.elf; this suggests that it never returns from ParamTest().
Explicitly checking vic2->irq_status and calling TimerHandler() within the kernel does not fix the problem.
If we remove the three "compiler warning fix" lines below swi in the system call, the bug changes: it sometimes runs for a while, then aborts. When we remove these lines, the next assembly instruction decrements the frame pointer; it is likely (though not investigated) that this is running off to 0, causing the abort.
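
The countdown figures in the list above can be turned into a quick estimate of how long after a timer reload the freeze occurs. The sketch below assumes timer3 counts down from the quoted maximum and reloads when it fires; the function names are made up for illustration and are not part of the kernel.

```c
#include <stdint.h>

/* Values quoted in the list above. */
#define TIMER_RELOAD    25400u  /* maximum of the countdown */
#define TIMER_AT_FREEZE 23969u  /* timer3->value observed at the freeze */

/* Ticks elapsed since the countdown last reloaded (assumes a down-counter). */
uint32_t ticks_since_reload(uint32_t reload, uint32_t value) {
    return reload - value;
}

/* Elapsed share of the timer period in parts per thousand, rounded down. */
uint32_t elapsed_permille(uint32_t reload, uint32_t value) {
    return (ticks_since_reload(reload, value) * 1000u) / reload;
}
```

With the observed values, ticks_since_reload(TIMER_RELOAD, TIMER_AT_FREEZE) gives 1431 ticks, or roughly 5.6% of the period. Under the stated assumption, the freeze happens shortly after a timer interrupt has fired.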

Theories

evan.stratford

From the above, this is a likely control flow of the bug:

    ParamTest() system call executes successfully and jumps to after swi
    r0 contains return value for ParamTest()
    mov r3, r0 is executed
    timer interrupt
    TimerHandler() wakes up ClockNotifier()
    ClockNotifier() notifies ClockServer() and waits for next interrupt
    ClockServer() registers tick and waits for another notification
    // stuff happens...
    user program returns from timer interrupt at top of ParamTest()
    // at this point, r3 and r0 contain the previous return value
    software interrupt from ParamTest()
    Handle() calls SysParamTest()
    SysParamTest() notices corrupted parameters, prints error

It is unclear what happens afterwards: why does it sometimes hang, and why does it sometimes loop forever inside of ParamTest()? There are a few plausible explanations:

Some part of the kernel is not handling interrupts correctly, and an interrupt is occurring when it should not.
The context switch is somehow either corrupting the link register or receiving the wrong link register to begin with.

Post-Mortem

We tracked down this bug to the following snippet within context_switch.s:

    KernelEnter:
        str lr, shared_lr           @ shared_lr gets lr_svc
        msr cpsr_c, #0x1f           @ switch to system mode
        ldr lr, shared_lr           @ ERROR! lr_usr gets lr_svc
        stmfd sp!, {r0-r12,lr}      @ push user state onto user stack
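
To make the register banking in the snippet above easier to follow, here is a minimal C model of the two link registers involved. Everything in it (the struct names, the helper, the sample addresses) is hypothetical and exists only for illustration. kernel_enter_fixed shows one possible repair under the assumption that the resume address can be carried separately; it is not necessarily the fix that went into context_switch.s.

```c
#include <stdint.h>

/* Hypothetical model of the two processor modes used by KernelEnter. */
enum cpu_mode { MODE_SVC, MODE_SYS };

struct cpu {
    enum cpu_mode mode;
    uint32_t lr_usr;   /* link register shared by user and system mode */
    uint32_t lr_svc;   /* banked link register for supervisor mode */
};

/* What gets recorded for the interrupted task in this model. */
struct saved_state {
    uint32_t lr;         /* the task's own link register */
    uint32_t resume_pc;  /* where the task should resume */
};

/* Return a pointer to whichever lr the current mode sees. */
static uint32_t *current_lr(struct cpu *c) {
    return c->mode == MODE_SVC ? &c->lr_svc : &c->lr_usr;
}

/* Mirrors the snippet above, including the faulty load. */
struct saved_state kernel_enter_buggy(struct cpu *c) {
    uint32_t shared_lr = *current_lr(c);   /* str lr, shared_lr */
    c->mode = MODE_SYS;                    /* msr cpsr_c, #0x1f */
    *current_lr(c) = shared_lr;            /* ldr lr, shared_lr: clobbers lr_usr */
    struct saved_state s = { *current_lr(c), shared_lr };
    return s;
}

/* Assumed repair: leave lr_usr alone and keep the resume address apart. */
struct saved_state kernel_enter_fixed(struct cpu *c) {
    uint32_t shared_lr = *current_lr(c);   /* str lr, shared_lr */
    c->mode = MODE_SYS;                    /* msr cpsr_c, #0x1f */
    struct saved_state s = { *current_lr(c), shared_lr };
    return s;
}
```

Starting from lr_usr = 0x1000 and lr_svc = 0x2004, the buggy version saves 0x2004 as the task's link register and loses 0x1000 entirely, while the fixed version saves 0x1000 and keeps 0x2004 as the resume address.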

You read that right: we were overwriting the user mode link register with the supervisor mode link register, which holds the address of the user mode instruction after the interrupt (or this instruction plus 4, in the case of a hardware interrupt).

So why did it work at all until now? GCC does some special stack magic at the top and bottom of each function call:

    mov ip, sp                      @ ERROR: critical location!
    stmfd sp!, {fp, ip, lr, pc}
    @ rest of function code
    ldmfd sp, {fp, sp, pc}

Since GCC saves the link register to the stack, this problem remained hidden for the system call assignments. Enter hardware interrupts. If a hardware interrupt occurs before the store-multiple instruction, the stack copy of the link register is corrupted as well and you enter a very strange kind of infinite loop. To make matters worse, attempting to debug the problem with terminal output decreases the probability that a timer interrupt will happen at the critical location above - yes, this is a Heisenbug.
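
The effect of the interrupt's position relative to the store-multiple can be reproduced with a toy simulation. The sketch below is a simplification with hypothetical addresses: it models only pc, lr, and the stacked copy of lr, fires a single interrupt, and assumes the faulty KernelEnter leaves lr holding the address at which the task resumes.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical addresses for a four-instruction function and its caller. */
enum {
    CALL_SITE = 0x100,   /* bl func */
    RETURN_TO = 0x104,   /* instruction after the bl */
    FUNC_MOV  = 0x200,   /* mov ip, sp */
    FUNC_PUSH = 0x204,   /* stmfd sp!, {fp, ip, lr, pc} */
    FUNC_BODY = 0x208,   /* rest of function code */
    FUNC_POP  = 0x20c    /* ldmfd sp, {fp, sp, pc} */
};

/*
 * Run one call to the function, firing a single interrupt just before the
 * instruction at irq_at executes. Returns true if control gets back to the
 * caller within max_steps, false if it is still looping or went astray.
 */
bool call_returns(uint32_t irq_at, int max_steps) {
    uint32_t pc = CALL_SITE, lr = 0, saved_lr = 0;
    bool irq_pending = true;

    for (int step = 0; step < max_steps; step++) {
        if (irq_pending && pc == irq_at) {
            lr = pc;              /* bug: lr_usr overwritten with resume address */
            irq_pending = false;
        }
        switch (pc) {
        case CALL_SITE: lr = RETURN_TO; pc = FUNC_MOV; break;
        case FUNC_MOV:  pc = FUNC_PUSH; break;
        case FUNC_PUSH: saved_lr = lr; pc = FUNC_BODY; break;
        case FUNC_BODY: pc = FUNC_POP; break;
        case FUNC_POP:  pc = saved_lr; break;   /* return via the stacked lr */
        case RETURN_TO: return true;
        default:        return false;           /* jumped somewhere unexpected */
        }
    }
    return false;   /* step budget exhausted: treated as an infinite loop */
}
```

In this model, call_returns(FUNC_MOV, 1000) and call_returns(FUNC_PUSH, 1000) never return, because the clobbered lr is what gets stacked and the function keeps jumping back into its own prologue. An interrupt at FUNC_BODY or FUNC_POP still clobbers lr, but the stacked copy is already safe, so the call returns normally. The hang window is only two instructions wide, which is consistent with extra debugging output making the bug harder to hit.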

What did we learn in debugging this? Our initial efforts were all over the map: we checked the name server, the I/O interrupt logic, even the various kernel data structures. After two or three days of poking around, we came to the conclusion that should have been immediate: nondeterministic problems suggest low-level causes. (Remember that neither of us has prior hardware experience!) Let this be a lesson to aspiring Real-Timers everywhere.

Code