Write to system register S3_6_C15_C1_5 (aka SPRR_PERM_EL0) in pthread_jit_write_protect_np can rarely fail

Number:rdar://FB10500605 Date Originated:06/29/2022
Status:Fixed Resolved:
Product:macOS Product Version:
Classification: Reproducible:
We observe a rare crash (SIGTRAP) in pthread_jit_write_protect_np() in our CI setup for TruffleRuby [1]. TruffleRuby runs on HotSpot (OpenJDK) there. This is what the stack trace looks like:

    Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
    0   libsystem_pthread.dylib         0x000000018515c6f0 pthread_jit_write_protect_np + 516
    1   libjvm.dylib                    0x000000010119e394 Threads::create_vm(JavaVMInitArgs*, bool*) + 140
    2   libjvm.dylib                    0x0000000100d5aa04 JNI_CreateJavaVM + 120
    3   ruby                            0x0000000100699260 main + 892
    4   libdyld.dylib                   0x0000000185179430 start + 4

Disassembling pthread_jit_write_protect_np() we see the following instruction at offset 516:

    (lldb) dis -n pthread_jit_write_protect_np
    0x19bb34f5c <+0>:   pacibsp
    0x19bb34f60 <+4>:   stp    x29, x30, [sp, #-0x10]!
    0x19bb34f64 <+8>:   mov    x29, sp
    0x19bb35160 <+516>: brk    #0x1

The brk instruction is what raises the SIGTRAP. So how do we get here? One path to <+516> is the following sequence (unrelated instructions omitted):

    0x19bb34fe4 <+136>: movk   x0, #0xc110
    0x19bb34fe8 <+140>: movk   x0, #0xffff, lsl #16
    0x19bb34fec <+144>: movk   x0, #0xf, lsl #32
    0x19bb34ff0 <+148>: movk   x0, #0x0, lsl #48
    0x19bb34ff4 <+152>: ldr    x0, [x0]
    0x19bb34ff8 <+156>: msr    S3_6_C15_C1_5, x0
    0x19bb34ffc <+160>: isb
    0x19bb35000 <+164>: movk   x1, #0xc110
    0x19bb35004 <+168>: movk   x1, #0xffff, lsl #16
    0x19bb35008 <+172>: movk   x1, #0xf, lsl #32
    0x19bb3500c <+176>: movk   x1, #0x0, lsl #48
    0x19bb35010 <+180>: ldr    x9, [x1]
    0x19bb35014 <+184>: mrs    x10, S3_6_C15_C1_5
    0x19bb35018 <+188>: b      0x19bb350ac               ; <+336>
    0x19bb350ac <+336>: cmp    x9, x10
    0x19bb350b0 <+340>: b.ne   0x19bb35160               ; <+516>
    0x19bb35160 <+516>: brk    #0x1

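The code loads a value from address 0xfffffc110, writes it to S3_6_C15_C1_5, then reads the register back and compares it against the value at the same memory location. In the crashing case the two values differ, so the verification fails and the b.ne at <+340> branches to the brk.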
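The four movk instructions at <+136> through <+148> each fill one 16-bit lane of x0, so the load address can be worked out by hand. The C sketch below emulates that sequence. The helper names movk and jit_config_address are made up for this illustration and are not part of libsystem_pthread. The resulting address looks like it lies in the commpage that the kernel maps into every process, but that reading is an assumption on my part.

```c
#include <stdint.h>

/* Emulates AArch64 MOVK: replace one 16-bit lane of reg, keep the others. */
static uint64_t movk(uint64_t reg, uint16_t imm, unsigned shift)
{
    uint64_t lane_mask = (uint64_t)0xffff << shift;
    return (reg & ~lane_mask) | ((uint64_t)imm << shift);
}

/* Address assembled by the instructions at <+136>..<+148> (hypothetical helper). */
static uint64_t jit_config_address(void)
{
    uint64_t x0 = 0;            /* initial value is irrelevant: all four lanes are overwritten */
    x0 = movk(x0, 0xc110, 0);   /* movk x0, #0xc110          */
    x0 = movk(x0, 0xffff, 16);  /* movk x0, #0xffff, lsl #16 */
    x0 = movk(x0, 0x000f, 32);  /* movk x0, #0xf, lsl #32    */
    x0 = movk(x0, 0x0000, 48);  /* movk x0, #0x0, lsl #48    */
    return x0;                  /* 0x0000000FFFFFC110 */
}
```

The same address is built again in x1 at <+164> through <+176>, which is why the value compared against the register read-back comes from the same memory location as the value that was written.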

We managed to replace pthread_jit_write_protect_np() with a custom implementation that retries the write to S3_6_C15_C1_5 until it succeeds. However, it looks like a context switch must happen between retries; my guess is that whatever the kernel does during the switch "fixes up" the register state.

Here is our workaround for HotSpot with more details on the issue: https://gist.githubusercontent.com/lewurm/3ae189f55de13621708aefb52d12fe1d/raw/09f70b66d91c7961b7229f9be3f76ac355c95bf4/jit-protect.patch
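The retry idea can be sketched roughly as follows. This is not the code from the patch above, only a minimal model of the write, read back, and retry loop. The register access is hidden behind hypothetical callbacks (on real hardware they would wrap the msr/isb and mrs instructions from the disassembly), the fake_* functions are a simulated register so the loop can be exercised anywhere, and sched_yield() is my stand-in for "let a context switch happen"; the actual patch may use a different mechanism.

```c
#include <sched.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical accessors standing in for msr/mrs on S3_6_C15_C1_5. */
typedef void     (*reg_write_fn)(uint64_t value);
typedef uint64_t (*reg_read_fn)(void);

/*
 * Write `wanted`, read it back, and retry until both agree or
 * `max_attempts` is used up. Returns true once the read-back matches.
 */
static bool write_reg_with_retry(reg_write_fn write_reg, reg_read_fn read_reg,
                                 uint64_t wanted, unsigned max_attempts)
{
    for (unsigned attempt = 0; attempt < max_attempts; attempt++) {
        write_reg(wanted);
        if (read_reg() == wanted) {
            return true;
        }
        /* Offer the CPU to the scheduler so a context switch can occur. */
        sched_yield();
    }
    return false;
}

/* Simulated register (assumption for illustration): ignores the first few writes. */
static uint64_t fake_reg;
static unsigned fake_writes_to_drop;

static void fake_write(uint64_t value)
{
    if (fake_writes_to_drop > 0) {
        fake_writes_to_drop--;  /* model a write that does not stick */
        return;
    }
    fake_reg = value;
}

static uint64_t fake_read(void)
{
    return fake_reg;
}
```

With fake_writes_to_drop set to 2, a call such as write_reg_with_retry(fake_write, fake_read, 0x1234, 5) succeeds on the third attempt. Bounding the number of attempts keeps a persistent failure from turning into an endless loop, at the cost of having to handle the false return.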

Unfortunately we have not been able to come up with a reproducer outside of our CI setup.

We are observing this on Macmini9,1 machines.

[1] https://github.com/oracle/truffleruby/


This was fixed in macOS 13.0 Beta 3 (22A5295i). The cause was a race in the kernel's process setup.

Some hints can be found in an XNU source bump: https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b9f69dbd3c8c3fd30a/bsd/kern/kern_exec.c#L4081-L4099

This is also tracked as rdar://96307913

I managed to come up with a reproducer that also works on macOS 12.4 on an M1 Pro: https://gist.github.com/lewurm/40fb8f7edb81f5e715ee6c7217feed32

Crash report: https://gist.github.com/lewurm/74f5a8c2b2291756e64b49070aea68d7

