Random SIGILL on ARM board (Odroid-XU4) with GDB/GDBServer

This bug has plagued my development efforts on ARM since the beginning. Basically, GDBServer will randomly fail while running a test, with a SIGILL.

This makes regression testing very hard, as the error generates a lot of false positives.

I posted to the GDB mailing list about it a while ago; see: https://www.sourceware.org/ml/gdb/2015-11/msg00030.html

For a long time I had no idea what the problem could be. My intuition was that there was some kind of memory race, but the details were beyond my expertise.

Then one day, at a workshop, I discussed the issue a bit with Mathieu Desnoyers (LTTng maintainer), and he hinted that this might be due to ARM having separate instruction and data caches, and that there could be some inconsistencies between them.

Now knowing a bit more about what to look for, I searched for issues in this area and found this entry: https://community.arm.com/groups/processors/blog/2010/02/17/caches-and-self-modifying-code It explained the problem well for self-modifying code.

But my code was modified by ptrace, so I started to look at flush_ptrace_access in the kernel.

With more research around this, I found this patch: http://lists.infradead.org/pipermail/linux-arm-kernel/2010-July/020949.html It was quite similar to my problem!

Yet I understood very little about the issue, so I decided to contact the author of the patch, Will Deacon.

I explained the problem to him and luckily he identified the cause quite quickly!!

I tested the solution and it fixed the problem on my Odroid XU4. So it turns out that these SIGILLs are due to an SoC hardware bug, which Will described as follows:

“So the problem is that A15 has 64-byte I-cache lines and A7 has 32-byte I-cache lines. That means that if the kernel is doing cache maintenance on the A15, it will issue an invalidation for each 64 bytes of the region it wants to invalidate. The A7 will then receive these invalidation messages, but only invalidate 32 bytes for each one, leaving a bunch of 32-byte holes that didn’t get invalidated.

This is an SoC hardware bug. The two cores should report the same line size (as I mentioned, there’s a tie-off on the A15 to make it report 32-byte cachelines).”

The tie-off documentation can be found in the Technical Reference Manual under “Configuration signals” as IMINLN.
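To make the mismatch Will describes more concrete, here is a small Python sketch. It is only a toy model, not kernel code: the function name stale_ranges and its parameters are made up for illustration, and it assumes the region starts on a line boundary. It walks a region with the stride used by the core issuing the maintenance, and marks as invalidated only the line size of the core receiving it.

```python
def stale_ranges(start, end, issue_stride, local_line):
    """Toy model: byte ranges in [start, end) that stay stale when
    invalidations are issued every `issue_stride` bytes but the receiving
    core only drops `local_line` bytes per invalidation.
    All names here are hypothetical, for illustration only."""
    covered = set()
    for addr in range(start, end, issue_stride):
        base = addr - addr % local_line          # align down to the local line
        covered.update(range(base, base + local_line))

    ranges = []
    for addr in range(start, end):
        if addr in covered:
            continue
        if ranges and ranges[-1][1] == addr:     # extend the current hole
            ranges[-1][1] = addr + 1
        else:                                    # start a new hole
            ranges.append([addr, addr + 1])
    return [(lo, hi) for lo, hi in ranges]


# A15 issues one invalidation per 64 bytes, A7 drops only 32 bytes each time.
print(stale_ranges(0, 256, issue_stride=64, local_line=32))
# Both cores agreeing on 32 bytes leaves no holes.
print(stale_ranges(0, 256, issue_stride=32, local_line=32))
```

In this model, the 64/32 case leaves a 32-byte hole after every invalidated line, which is where a stale instruction could still be fetched from and raise the SIGILL.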

I’ve reported the issue to the HardKernel team via their forum. I hope they can fix it in their kernel, and in their next hardware board…

Here’s the patch:

diff --git a/arch/arm/mm/proc-macros.S b/arch/arm/mm/proc-macros.S
index ee1d805..573db9b 100644
--- a/arch/arm/mm/proc-macros.S
+++ b/arch/arm/mm/proc-macros.S
@@ -82,10 +82,7 @@
  * on ARMv7.
        .macro  icache_line_size, reg, tmp
-       mrc     p15, 0, \tmp, c0, c0, 1         @ read ctr
-       and     \tmp, \tmp, #0xf                @ cache line size encoding
-       mov     \reg, #4                        @ bytes per word
-       mov     \reg, \reg, lsl \tmp            @ actual cache line size
+       mov     \reg, #32                       @ hack
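For readers not fluent in ARM assembly, the removed lines can be sketched in Python as follows. This mirrors only the arithmetic of the macro (mask the low four bits of the cache type register, then shift 4 bytes per word left by that amount). The function name and the sample register values are illustrative assumptions, not values read from real hardware.

```python
def icache_line_size(ctr):
    """Mirror of the removed macro body: decode the minimum I-cache line
    size in bytes from a cache type register (CTR) value.
    The low 4 bits hold log2 of the line size in words."""
    encoding = ctr & 0xF        # and \tmp, \tmp, #0xf
    return 4 << encoding        # mov \reg, #4 ; mov \reg, \reg, lsl \tmp


# Illustrative encodings only (assumed, not dumped from a board):
print(icache_line_size(0x4))    # encoding 4 gives 64-byte lines
print(icache_line_size(0x3))    # encoding 3 gives 32-byte lines
```

The patch simply replaces this computation with the constant 32, so every core walks the region in 32-byte steps regardless of what its own register reports.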

So I can now do proper regression testing with buildbot!!

A very big thanks to Will for helping me out with this!