Crash in malloc after fork

Originator:bauerb
Number:rdar://9148548 Date Originated:17-Mar-2011 04:48 PM
Status:Open Resolved:
Product: Product Version:
Classification: Reproducible:
 
http://crbug.com/60426 is a bug happening on the Chromium build
waterfall.  In the course of the fork() implementation, the child
process crashes, even though the parent process never has a problem.
The specific cause is that the metadata which describes the length of
a freelist block is set to zero, which should not be possible.
Extensive prodding has been unable to find a proven malloc corruption
in the parent process to explain this.  In trying to trace backwards
to find where the zero is coming from, it APPEARS that the parent
process is coalescing a free'd block with the freelist, and when it
clears the metadata using small_meta_header_set_middle(), the child
process sees the cleared data.

That is clearly impossible!  And yet it moves, and I'm having troubles
seeing how my case is not demonstrating exactly what I describe above
happening.  I've also been unable to reduce to a simpler repro as of
yet.

Our waterfall uses Mac Mini machines.  I've been repro'ing this on one
like what is currently available at the Apple store (Core 2 Duo, 8G
RAM, 500G hard drive), but I think it was being seen on older models,
too.  It repros on 10.5 or 10.6, but 10.6 gives a better stack
backtrace.  See the bug for more color, if you like that kind of
thing.

shess@chromium.org

#################################################################

# Setup to build Chromium, per
# http://code.google.com/p/chromium/wiki/MacBuildInstructions

# Build unit_tests using Xcode, or something like this (I usually
# ssh to my captive trybot):
cd .../src
gclient sync --jobs 8
xcodebuild -project chrome/chrome.xcodeproj \
  -target unit_tests -parallelizeTargets -configuration Debug

# Demonstrate that the bug happens to you.
./repeat_default.sh
# No _child found
# No _child found
# ...
# No _child found
#   8   libSystem.B.dylib                   0x98cd69e5 pthread_workqueue_atfork_child + 48
#   9   libSystem.B.dylib                   0x98cbbcc5 _cthread_fork_child + 176
#   8   libSystem.B.dylib                   0x98cd69e5 pthread_workqueue_atfork_child + 48
#   9   libSystem.B.dylib                   0x98cbbcc5 _cthread_fork_child + 176
# Found a crash on iteration 26
#### Usually takes between 20 and 100 iterations.  The pthread/cthread
#### lines are from the stack backtrace, and are different on 10.5, so
#### my script won't work there.

# Once proven, set out to build an interpose library to add logging.

# These are the versions I happened to repro with.  I don't believe
# there are substantive changes in the other nearby versions.
LIBC=Libc-594.9.1
XNU=xnu-1504.7.4

# Get Libc and unpack it.
wget http://www.opensource.apple.com/tarballs/Libc/${LIBC}.tar.gz
tar -xvzf ${LIBC}.tar.gz

# xnu needed for dark magic.
wget http://www.opensource.apple.com/tarballs/xnu/${XNU}.tar.gz
tar -xvzf ${XNU}.tar.gz

# Apply some patches.  The patches were generated by git format-patch,
# in case anyone wants to see things in isolation to verify my work.
cd ${LIBC}
patch -p1 <../0001-Add-my_magazine_malloc.c.patch 
patch -p1 <../0002-Don-t-inline-small_malloc_should_clear.patch 
patch -p1 <../0003-Interpose-scalable_zone-interface.patch
patch -p1 <../0004-eprintf-to-help-with-logging.patch
patch -p1 <../0005-small_ptr_index-adds-a-logging-function-which-increm.patch 
patch -p1 <../0006-Add-logging-to-check-the-last-free-ptr-in-various-pl.patch 

# Hack the build environment to match what make would have.
mkdir out
ln -s ../../${XNU}/osfmk out/System
dtrace -o out/magmallocProvider.h -C -h -s gen/magmallocProvider.d

# Build a dylib.  my_magazine_malloc.c includes malloc.c and
# magazine_malloc.c, and interposes all over them.
# TODO(shess): I don't think it messes with malloc.c any longer, but
# I'm too lazy to fix that up right now.
gcc -Iout -Ipthreads -DPRIVATE -arch i386 -arch x86_64 -dynamiclib -std=gnu99 \
    -current_version 1.0 -compatibility_version 1.0 -fvisibility=hidden \
    ./gen/my_magazine_malloc.c -o /tmp/libMagazineMalloc.dylib

# Run things until failure, interposing the above dylib.
./repeat_interpose.sh

# The additional log lines in unit_tests_interpose.log are like:
#   pid[cpu#]/seq# msg
# The sequence number is two parts.  n%100 is from an incrementing
# global static, while n/100 is from an incrementing item tucked into
# the pad area of the small region the function sending the log is
# working with.

# The interesting part (just before the pthread/cthread logging) goes
# something like:
76440[0]/22457 last fp 0xd058000
# The following are the parent process preparing to fork by acquiring
# all the malloc locks.
76440[0] szone_force_lock
76440[0] szone_force_locked
76440[0] szone_force_lock
76440[0] szone_force_locked
76440[0] szone_force_lock
76440[0] szone_force_locked
# The following are in the parent post-fork, releasing those locks.
76440[0] szone_force_unlock
76440[0] szone_force_unlock
76440[0] szone_force_unlock
# The parent continues on, with both sequence numbers incrementing.
76440[0]/22558 coalesce 0xd04f800 next 0xd058000 delta 34816
76440[0]/22659 last fp 0xd058000
76440[0]/22760 sflrp 0xd058000
76440[0]/22861 last fp 0xd125400
76440[0]/22962 SMALL_PTR_SIZE(0xd058000) == 0 at 3929
76440[0]/23063 last fp 0xd125400
76440[0]/23164 sflap 0xd04f800
76440[0]/23265 last fp 0xd04f800
76440[0]/23366 last fp 0xd04f800
# The child comes back from the fork and releases the locks which were
# acquired by the parent in setting up the fork.
76442[0] szone_force_unlock
76442[0] szone_force_unlock
76442[0] szone_force_unlock
# Note that the n/100 sequence number continues from where the PARENT
# is at, while the n%100 sequence number is doing the right thing
# (continuing from where it was pre-fork).
76442[0]/23458 SMALL_PTR_SIZE(0xd058000) == 0 at 4493
76442[0]/23559 split 0xd058000, 0xd05ee00
# The invalid metadata is detected, the crash is coming.
pid 76442 soon crashing, ptr 0xd058000 meta 0xd7f8594
# Stack backtrace ...

Comments


Please note: Reports posted here will not necessarily be seen by Apple. All problems should be submitted at bugreport.apple.com before they are posted here. Please only post information for Radars that you have filed yourself, and please do not include Apple confidential information in your posts. Thank you!