Write to system register S3_6_C15_C1_5 (aka SPRR_PERM_EL0) in pthread_jit_write_protect_np can rarely fail
Originator: lewurm
Number: rdar://FB10500605
Date Originated: 06/29/2022
Status: Fixed
Resolved:
Product: macOS
Product Version:
Classification:
Reproducible:
We observe a rare crash (SIGTRAP) in pthread_jit_write_protect_np() in our CI setup for TruffleRuby [1]. It uses HotSpot (OpenJDK). This is what the stack trace looks like:

    Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
    0 libsystem_pthread.dylib 0x000000018515c6f0 pthread_jit_write_protect_np + 516
    1 libjvm.dylib 0x000000010119e394 Threads::create_vm(JavaVMInitArgs*, bool*) + 140
    2 libjvm.dylib 0x0000000100d5aa04 JNI_CreateJavaVM + 120
    3 ruby 0x0000000100699260 main + 892
    4 libdyld.dylib 0x0000000185179430 start + 4

Disassembling pthread_jit_write_protect_np(), we see the following instruction at offset 516:

    (lldb) dis -n pthread_jit_write_protect_np
    libsystem_pthread.dylib`pthread_jit_write_protect_np:
    0x19bb34f5c <+0>: pacibsp
    0x19bb34f60 <+4>: stp x29, x30, [sp, #-0x10]!
    0x19bb34f64 <+8>: mov x29, sp
    [...]
    0x19bb35160 <+516>: brk #0x1

where brk causes a SIGTRAP. So how do we get here?

    0x19bb34fe4 <+136>: movk x0, #0xc110
    0x19bb34fe8 <+140>: movk x0, #0xffff, lsl #16
    0x19bb34fec <+144>: movk x0, #0xf, lsl #32
    0x19bb34ff0 <+148>: movk x0, #0x0, lsl #48
    0x19bb34ff4 <+152>: ldr x0, [x0]
    0x19bb34ff8 <+156>: msr S3_6_C15_C1_5, x0
    0x19bb34ffc <+160>: isb
    0x19bb35000 <+164>: movk x1, #0xc110
    0x19bb35004 <+168>: movk x1, #0xffff, lsl #16
    0x19bb35008 <+172>: movk x1, #0xf, lsl #32
    0x19bb3500c <+176>: movk x1, #0x0, lsl #48
    0x19bb35010 <+180>: ldr x9, [x1]
    0x19bb35014 <+184>: mrs x10, S3_6_C15_C1_5
    0x19bb35018 <+188>: b 0x19bb350ac ; <+336>
    [...]
    0x19bb350ac <+336>: cmp x9, x10
    0x19bb350b0 <+340>: b.ne 0x19bb35160 ; <+516>
    [...]
    0x19bb35160 <+516>: brk #0x1

Here x9 holds the value that was just written to S3_6_C15_C1_5 (loaded again from the same address, 0xfffffc110), and x10 holds the value read back from the register. The two differ, so the verification fails and execution branches to brk.

We managed to replace pthread_jit_write_protect_np() with a custom implementation that retries writing to S3_6_C15_C1_5 until successful. However, it looks like a context switch must happen between retries; I guess whatever the kernel does during the switch helps to "fix up" the situation.
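The retry idea described above can be sketched roughly as follows. This is not the code from our HotSpot patch: the `perm_reg_ops` abstraction, the `write_perm_with_retry` name, the attempt limit, and the use of `sched_yield()` to encourage a context switch are all assumptions made for illustration. The register access is hidden behind function pointers, and a simulated register (`fake_reg`, also hypothetical) stands in for the rare lost write so the logic can be exercised on any machine.

```c
#include <assert.h>
#include <sched.h>
#include <stdint.h>

/* Hypothetical abstraction over the permission register. On Apple Silicon
 * the callbacks would presumably wrap `msr S3_6_C15_C1_5, x0; isb` and
 * `mrs x0, S3_6_C15_C1_5`; here they are plain function pointers. */
typedef struct {
    void (*write)(void *ctx, uint64_t value);
    uint64_t (*read)(void *ctx);
    void *ctx;
} perm_reg_ops;

/* Write `wanted`, read the register back, and retry until both agree or
 * `max_attempts` is exhausted. Returns the number of attempts used, or -1
 * if the value never stuck. */
int write_perm_with_retry(const perm_reg_ops *reg, uint64_t wanted,
                          int max_attempts)
{
    assert(reg != NULL && reg->write != NULL && reg->read != NULL);
    for (int attempt = 1; attempt <= max_attempts; attempt++) {
        reg->write(reg->ctx, wanted);
        if (reg->read(reg->ctx) == wanted)
            return attempt;
        /* Assumption: give the scheduler a chance to switch us out.
         * sched_yield() does not guarantee that a switch happens. */
        sched_yield();
    }
    return -1;
}

/* Simulated register that silently drops the first `drops_left` writes,
 * standing in for the rare failure seen in CI. */
typedef struct {
    uint64_t value;
    int drops_left;
} fake_reg;

void fake_write(void *ctx, uint64_t value)
{
    fake_reg *r = ctx;
    if (r->drops_left > 0) {
        r->drops_left--;   /* lose this write */
        return;
    }
    r->value = value;
}

uint64_t fake_read(void *ctx)
{
    const fake_reg *r = ctx;
    return r->value;
}
```

With a `fake_reg` that drops two writes, `write_perm_with_retry` succeeds on the third attempt; with one that drops every write, it gives up and returns -1 instead of trapping. Whether yielding is sufficient on an affected machine is not something this sketch can show.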
Here is our workaround for HotSpot, with more details on the issue:
https://gist.githubusercontent.com/lewurm/3ae189f55de13621708aefb52d12fe1d/raw/09f70b66d91c7961b7229f9be3f76ac355c95bf4/jit-protect.patch

Unfortunately, we have not been able to come up with a reproducer outside of our CI setup. We are observing this on Macmini9,1 machines.

[1] https://github.com/oracle/truffleruby/
Comments
This was fixed in macOS 13.0 Beta 3 (22A5295i). The cause was a race in the kernel's process setup.
Some hints can be found in an XNU source bump: https://github.com/apple-oss-distributions/xnu/blob/1031c584a5e37aff177559b9f69dbd3c8c3fd30a/bsd/kern/kern_exec.c#L4081-L4099
This is also tracked as rdar://96307913
I managed to come up with a reproducer that also works on macOS 12.4 on an M1 Pro: https://gist.github.com/lewurm/40fb8f7edb81f5e715ee6c7217feed32
Crash report: https://gist.github.com/lewurm/74f5a8c2b2291756e64b49070aea68d7