Assertion failure in libGFXShared.dylib crashes host app purely depending on timing

Originator: phil
Number: rdar://32676454
Date Originated: June 9 2017, 6:42 PM
Status: Open
Resolved:
Product: macOS+SDK/Core Graphics
Product Version: 10.12.5, 10.11.6, probably others
Classification: Crash
Reproducible: always
 
Summary:
We have been seeing crashes of a launch daemon caused by the following failed assertions:

Assertion failed: (gfx_plugin_data.service[i] == gfx_plugin_data.service[i - 1]), function validate_plugin_data, file /Library/Caches/com.apple.xbs/Sources/OpenGL/OpenGL-14.0.16/GFXShared/gfx_plugin.c, line 1493.

Assertion failed: (gfx_plugin_data.fbindex[i] == gfx_plugin_data.fbindex[i - 1] + 1), function validate_plugin_data, file /Library/Caches/com.apple.xbs/Sources/OpenGL/OpenGL-12.1/GFXShared/gfx_plugin.c, line 1417.

It turns out the calling code has no influence on whether these crashes occur - it depends purely on system activity outside the process. Unfortunately, it still takes down our process.

The assertion failures occur deep below an innocuous call to clGetDeviceIDs(), although I'm told they've also turned up in OpenGL code:
0   libsystem_kernel.dylib              0x00007fffaae62d42 __pthread_kill + 10
1   libsystem_pthread.dylib             0x00007fffaaf50457 pthread_kill + 90
2   libsystem_c.dylib                   0x00007fffaadc8420 abort + 129
3   libsystem_c.dylib                   0x00007fffaad8f893 __assert_rtn + 320
4   libGFXShared.dylib                  0x00007fff99638e64 gfxLoadPluginData + 1160
5   com.apple.opencl                    0x00007fff995d66ff 0x7fff995b6000 + 132863
6   libsystem_pthread.dylib             0x00007fffaaf4db8c __pthread_once_handler + 65
7   libsystem_platform.dylib            0x00007fffaaf42ac1 _os_once + 36
8   libsystem_pthread.dylib             0x00007fffaaf4db2b pthread_once + 57
9   com.apple.opencl                    0x00007fff995d64da 0x7fff995b6000 + 132314
10  com.apple.opencl                    0x00007fff995d6b66 clGetDeviceIDs + 182

It's taken me a while to work out a way to reproduce the problem, but I've now narrowed it down to display configuration changes. It seems that the function gfxLoadPluginData() in /System/Library/Frameworks/OpenGL.framework/Versions/A/Libraries/libGFXShared.dylib enumerates/queries various IORegistry nodes (IOFramebuffer, IOAccelerator subclass objects?) associated with active displays. It looks like the display IDs of active displays might be used to find these. Display reconfigurations (hot-plug, hot-unplug of displays) change the set of valid display IDs. So if a display configuration change occurs precisely while this function executes, it causes an internal inconsistency, triggering the assert.
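
To illustrate what the two assertions appear to check, here is a minimal C sketch. The struct layout and every name in it are hypothetical, since the real gfx_plugin_data is private to libGFXShared.dylib; only the two comparisons are taken from the assertion messages above, read literally.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the library's internal table; the real layout
 * of gfx_plugin_data is not public. */
#define MAX_ENTRIES 8

struct plugin_snapshot {
    size_t   count;                 /* number of framebuffer entries      */
    uint32_t service[MAX_ENTRIES];  /* accelerator service for each entry */
    uint32_t fbindex[MAX_ENTRIES];  /* framebuffer index for each entry   */
};

/* Returns true when every pair of neighbouring entries satisfies both
 * asserted conditions: same service, and framebuffer indices that
 * increase by exactly one. */
static bool snapshot_is_consistent(const struct plugin_snapshot *s)
{
    for (size_t i = 1; i < s->count; i++) {
        if (s->service[i] != s->service[i - 1])
            return false;
        if (s->fbindex[i] != s->fbindex[i - 1] + 1)
            return false;
    }
    return true;
}
```

A snapshot gathered while a display is being attached or removed could plausibly end up with a skipped framebuffer index or a mismatched service, which is the kind of state this check rejects. The library asserts instead of returning an error, so the host process aborts.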

Working around this is difficult, as the circumstances are entirely outside the control of our code, and trying to keep the process alive despite the failed assertion would probably just lead to other gnarly state issues.


Steps to Reproduce:
1. Prepare a spare external display, in addition to a Mac with at least 1 display attached (this can be the built-in display of a MacBook or iMac).
2. Extract the attached zip file on the Mac
3. Open the Xcode project, build and run the sole CLI application target it contains (gfxpluginassert)
4. While the gfxpluginassert process is running, plug the external display into the Mac.


Expected Results:
The process should exit cleanly after printing some number of "reconfigured" lines.


Observed Results:
About 4 times out of 5, the process will crash with one of the aforementioned failed assertions within clGetDeviceIDs().


Version:
10.12.5 (16F73) and 10.11.6 (15G1217) are confirmed to be affected. Other versions are likely affected as well.


Notes:
I realise the repro code is contrived. The obvious fix there is: don't call clGetDeviceIDs() inside the CGDisplay reconfiguration callback. However, the launch daemon where we're seeing this in practice is launched through I/O Kit XPC events, which correlate with display attachments. The other option would be to have the launch daemon always loaded on boot, not just when the device it drives is plugged in. This is not just inelegant and wasteful of system resources; it also poses the problem that GPU acceleration is unavailable during early boot.

Assuming my understanding of the underlying problem is correct, the gfxLoadPluginData() function should probably just start over with whatever it does if it detects changes to the display configuration during its runtime.
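
As a rough illustration of the "start over" idea, here is a minimal sketch in portable C. Everything in it is hypothetical (config_generation, note_display_reconfigured, load_with_retry, demo_scan); I don't know how gfxLoadPluginData() is actually structured, so this only shows the general retry-on-change pattern.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical counter, bumped by whatever observes a display
 * reconfiguration. */
static atomic_uint config_generation;

void note_display_reconfigured(void)
{
    atomic_fetch_add(&config_generation, 1);
}

/* Hypothetical scan callback: fills `out`, returns false on hard failure. */
typedef bool (*scan_fn)(void *out);

/* Re-runs scan() until one pass completes without the generation counter
 * changing, or until max_attempts passes have been tried. */
bool load_with_retry(scan_fn scan, void *out, int max_attempts)
{
    for (int attempt = 0; attempt < max_attempts; attempt++) {
        unsigned before = atomic_load(&config_generation);
        if (!scan(out))
            return false;
        if (atomic_load(&config_generation) == before)
            return true;  /* configuration was stable for the whole pass */
        /* Configuration changed mid-scan: discard the result, start over. */
    }
    return false;
}

/* Demo scan: simulates a hot-plug landing during the first pass only. */
static int demo_passes;

static bool demo_scan(void *out)
{
    demo_passes++;
    if (demo_passes == 1)
        note_display_reconfigured();
    *(int *)out = demo_passes;
    return true;
}
```

Calling load_with_retry(demo_scan, &result, 5) with an int result discards the first pass and returns true after the second. A bounded retry count lets the caller get an error back rather than an abort if the configuration never settles.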

Let me know if there's any other information that would be of use, thanks!


Configuration:
As this is a timing-related issue, the repro is probably more reliable on some Macs than others. I can easily reproduce the crash on a late 2012 quad-core 2.6GHz Mac Mini running 10.12.5, and an early 2015 13" Retina MacBook Pro, 3.1GHz running 10.11.6. The QA engineers who originally found the problem did so on a MacBookAir5,1.
