PTHREAD_CANCEL_DISABLE does not prevent pthread_cancel from cancelling thread

Originator:Per.Mildner.usenet
Number:rdar://9137682 Date Originated:15-Mar-2011 10:28 PM
Status:Open Resolved:
Product:Mac OS X Product Version:10.8 (12A269), 10.6.6 10J567
Classification:Crash/Hang/Data Loss Reproducible:Always
 
### Added new information November 3, 2016
Still happens in macOS 10.12. As far as I understand, this means that macOS 10.12 does not conform to the Single UNIX Specification.

### Added new information 24-Oct-2013 00:44 AM
Still happens in OS X 10.9.

(foo.c in the transcript is the program in this bug report)
bash-3.2$ cc -g -Wall -Werror -pthread foo.c && ./a.out 100000000 10
Starting test, 100000000 iterations, sleep interval 10ms
foo.c:46: ERROR cancelled while PTHREAD_CANCEL_DISABLE
Illegal instruction: 4
bash-3.2$ cc --version
Apple LLVM version 5.0 (clang-500.2.78) (based on LLVM 3.3svn)
Target: x86_64-apple-darwin13.0.0
Thread model: posix
bash-3.2$ file ./a.out
./a.out: Mach-O 64-bit executable x86_64
bash-3.2$ uname -a
Darwin Ps-iMac.local 13.0.0 Darwin Kernel Version 13.0.0: Thu Sep 19 22:22:27 PDT 2013; root:xnu-2422.1.72~6/RELEASE_X86_64 x86_64

### Original report

Summary:
A thread that has disabled cancellation with PTHREAD_CANCEL_DISABLE can still get cancelled when another thread calls pthread_cancel().

Steps to Reproduce:
See the transcript below. It may take hours for the problem to appear with this code, but it happens far too often in our real code.

Expected Results:
A call to write() should not be cancelled while PTHREAD_CANCEL_DISABLE is in effect.

Actual Results:

The cleanup handler is called, despite PTHREAD_CANCEL_DISABLE being in effect (and it intentionally calls abort() to expose the bug).

Regression:

Notes: The following transcript shows the code, how it was run, and some information about the machine. This problem has been present in earlier versions of 10.6.x too, and on other hardware. (I updated Xcode to 3.2.6 today, before collecting the System Profiler report, so the Xcode version used in the transcript was the one preceding 3.2.6.)

bash$ gcc -g -Wall -Werror -pthread cancel_leak.c && ./a.out 100000000 10
Starting test, 100000000 iterations, sleep interval 10ms
cancel_leak.c:46: ERROR cancelled while PTHREAD_CANCEL_DISABLE
Abort trap
bash$ uname -a
Darwin big-brothers-imac.local 10.6.0 Darwin Kernel Version 10.6.0: Wed Nov 10 18:13:17 PST 2010; root:xnu-1504.9.26~3/RELEASE_I386 i386
bash$ sw_vers
ProductName:	Mac OS X
ProductVersion:	10.6.6
BuildVersion:	10J567
bash$ gcc --version
i686-apple-darwin10-gcc-4.2.1 (GCC) 4.2.1 (Apple Inc. build 5664)
Copyright (C) 2007 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

bash$ cat cancel_leak.c
/*
 gcc -g -Wall -Werror -pthread cancel_leak.c && ./a.out 100000000 10
*/

#include <unistd.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>

#define PTHREAD_CHECK(X) do{                                            \
   int pthread_result = (X);                                           \
   if (pthread_result != 0)                                            \
     {                                                                 \
       fprintf(stderr, "%s:%d: ERROR code %d from pthread operation\n", \
               __FILE__, (int)__LINE__,                                \
               (int)pthread_result);                                   \
       fflush(stderr);                                                 \
       abort();                                                        \
     }                                                                 \
 } while(0)

#define CLIB_CHECK(X) do{                                               \
   int clib_result = (X);                                              \
   if (clib_result == -1)                                              \
     {                                                                 \
       int errno_value = errno;                                        \
       fprintf(stderr, "%s:%d: ERROR errno %d from clib operation\n",  \
               __FILE__, (int)__LINE__,                                \
               (int)errno_value);                                      \
       fflush(stderr);                                                 \
       abort();                                                        \
     }                                                                 \
 } while(0)

struct thread_cookie {
 int wait_fd;
 int write_fd;
};

/* Should never be called on a POSIX compliant system */
static void cleanup_routine(void *arg)
{
 (void)arg;
 fprintf(stderr, "%s:%d: ERROR cancelled while PTHREAD_CANCEL_DISABLE\n",
         __FILE__, (int)__LINE__);
 fflush(stderr);
 abort();
}

static size_t cookie_wait_fd_read_attempts = 0;
static size_t cookie_wait_fd_read = 0;
static size_t cookie_write_fd_written= 0;
static size_t pipefd_a_written = 0;

/* RUN IN SEPARATE THREAD */
static void *pthread_start_routine(void *vcookie)
{
 struct thread_cookie *pcookie = (struct thread_cookie*)vcookie;
 int oldstate;
 char buf[1];
 {
   ssize_t result;
   PTHREAD_CHECK(pthread_setcancelstate(PTHREAD_CANCEL_ENABLE, &oldstate));
   cookie_wait_fd_read_attempts++;
   do {
     CLIB_CHECK(result = read(pcookie->wait_fd, &buf[0], 1));
     if (result < 1)
       {
         fprintf(stderr, "%s:%d ERROR read only %ld bytes\n", __FILE__, (int)__LINE__, (long)result);fflush(stderr);
       }
   } while (result < 1);

   cookie_wait_fd_read++;

   PTHREAD_CHECK(pthread_setcancelstate(PTHREAD_CANCEL_DISABLE, &oldstate));
   /* PTHREAD_CANCEL_DISABLE should prevent cancel here, but it does not. */
   pthread_cleanup_push(cleanup_routine, NULL);
   do {
     CLIB_CHECK(result = write(pcookie->write_fd, &buf[0], 1));
     if (result < 1)
       {
         fprintf(stderr, "%s:%d ERROR wrote only %ld bytes\n", __FILE__, (int)__LINE__, (long)result);fflush(stderr);
       }
   } while (result < 1);
   cookie_write_fd_written++;
   pthread_cleanup_pop(0);     /* uninstall and do not call cleanup handler */
 }




 return NULL;
}

static void bad_usage(void)
{
 fprintf(stderr, "Usage: a.out ITERATIONS [MS SLEEP]\n");fflush(stderr);
 exit(EXIT_FAILURE);
}

int main(int argc, char **argv)
{
 pthread_t thread;
 void *thread_return = NULL;
 struct timespec nanosleep_interval;
 struct timespec nanosleep_rem;
 long ms_interval = 200;
 struct thread_cookie cookie;
 int pipefd_a[2];
 int pipefd_b[2];
 long iterations;
 (void)argc; (void)argv;
 if (argc != 2 && argc != 3)
   {
     bad_usage();
     exit(EXIT_FAILURE);
   }
 iterations = strtol(argv[1], NULL, 10);
 if (iterations <= 0)
   {
     bad_usage();
   }
 if (argc == 3)
   {
     ms_interval = strtol(argv[2], NULL, 10);
     if (ms_interval < 0)
       {
         bad_usage();
       }
   }

 fprintf(stderr, "Starting test, %ld iterations, sleep interval %ldms\n", (long)iterations, (long)ms_interval);fflush(stderr);
 nanosleep_interval.tv_sec = ms_interval / 1000;
 nanosleep_interval.tv_nsec = 1000*1000*(ms_interval % 1000); /* tv_nsec must stay below 1000000000 */
 while (iterations-- > 0)
   {
     CLIB_CHECK(pipe(pipefd_a));
     CLIB_CHECK(pipe(pipefd_b));
     cookie.wait_fd = pipefd_a[0]; /* input */
     cookie.write_fd = pipefd_b[1]; /* output */
     PTHREAD_CHECK(pthread_create(&thread,NULL,&pthread_start_routine, &cookie));
     if (ms_interval > 0)
       {
         CLIB_CHECK(nanosleep(&nanosleep_interval, &nanosleep_rem));
       }
     /* Expects thread to wait for wait_fd now */
     {
       char data[1] = { 'X' };
       ssize_t result;
       do {
         CLIB_CHECK(result = write(pipefd_a[1], &data[0], 1));
         if (result < 1)
           {
             fprintf(stderr, "%s:%d ERROR wrote only %ld bytes\n", __FILE__, (int)__LINE__, (long)result);fflush(stderr);
           }
       } while (result < 1);
       pipefd_a_written++;

       /* Expects thread to wake up soon and write to cookie.write_fd */
     }
     /* This may (and is allowed to) cancel thread while it waits for
        cookie.wait_fd. The bug manifests itself if this cancels the
        thread while it writes to cookie.write_fd. */
     PTHREAD_CHECK(pthread_cancel(thread));
     PTHREAD_CHECK(pthread_join(thread, &thread_return));
     CLIB_CHECK(close(pipefd_a[0]));
     CLIB_CHECK(close(pipefd_a[1]));
     CLIB_CHECK(close(pipefd_b[0]));
     CLIB_CHECK(close(pipefd_b[1]));
   }
 exit(EXIT_SUCCESS);
}
bash$

05-Aug-2012 12:01 PM Per Mildner:
Still happens in OS X 10.8 on the same iMac as in my original
bug-report, both 64-bit and 32-bit:

bash$ gcc -g -Wall -Werror -pthread cancel_leak.c && ./a.out 100000000 10
Starting test, 100000000 iterations, sleep interval 10ms
cancel_leak.c:46: ERROR cancelled while PTHREAD_CANCEL_DISABLE
Illegal instruction: 4
bash$ gcc --version
i686-apple-darwin11-llvm-gcc-4.2 (GCC) 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2336.11.00)
Copyright (C) 2007 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

bash$ uname -a
Darwin imac 12.0.0 Darwin Kernel Version 12.0.0: Sun Jun 24 23:00:16 PDT 2012; root:xnu-2050.7.9~1/RELEASE_X86_64 x86_64

Also happens with cc (i.e. clang) on OS X 10.8:

bash$ cc -g -Wall -Werror -pthread cancel_leak.c && ./a.out 100000000 10
Starting test, 100000000 iterations, sleep interval 10ms
cancel_leak.c:46: ERROR cancelled while PTHREAD_CANCEL_DISABLE
Illegal instruction: 4
bash$ cc --version
Apple clang version 4.0 (tags/Apple/clang-421.0.57) (based on LLVM 3.1svn)
Target: x86_64-apple-darwin12.0.0
Thread model: posix

Also happens in 32-bit code on OS X 10.8:

bash$ cc -m32 -g -Wall -Werror -pthread cancel_leak.c && ./a.out 100000000 10
Starting test, 100000000 iterations, sleep interval 10ms
cancel_leak.c:74: ERROR cancelled while PTHREAD_CANCEL_DISABLE
Illegal instruction: 4
bash$ file a.out
a.out: Mach-O executable i386

Comments

Still happens in macOS 10.15 (19A602) Catalina.

It takes anything from less than a second to several minutes for the bug to manifest itself with the above test program and the parameters in the transcript.

By Per.Mildner.usenet at Oct. 21, 2019, 11:37 a.m.

Still happens in macOS 10.13 (High Sierra). As far as I understand, this means that macOS 10.13 does not conform to the Single UNIX Specification.

By Per.Mildner.usenet at Sept. 26, 2017, 6:57 a.m.

Still broken in macOS Sierra 10.12.6

The NDA prevents me from discussing what happens in the High Sierra betas.

By Per.Mildner.usenet at Aug. 14, 2017, 5:03 p.m.
