mbrtowc() improperly converts ill-formed UTF-8 sequences without giving EILSEQ

Originator: winstein
Number: rdar://10952550
Date Originated: 29-Feb-2012 03:00 AM
Status: Open
Resolved:
Product: Mac OS X
Product Version: 10.7
Classification: Security
Reproducible: Always
 
29-Feb-2012 03:00 AM Keith Winstein:
Summary:

In the default en_US.UTF-8 locale, mbrtowc() and similar functions are too liberal in accepting ill-formed UTF-8 sequences. This may have security implications.

Steps to Reproduce:

Using attached file unicode-test2.c, run:

$ cc -std=c99 -g -Wall -o unicode-test2 unicode-test2.c && ./unicode-test2

Expected Results:

The program should exit cleanly with no output and 0 status. (It does this on GNU libc.)

Actual Results:

Assertion failed: ((num == (size_t) -1) && (errno == EILSEQ)), function main, file unicode-test2.c, line 33.
Abort trap: 6

Regression:

Notes:

Unicode 6.0, D92, says that UTF-8 sequences that would otherwise encode surrogate code points are ill-formed. (ISO/IEC 10646 agrees, using the slightly different terminology of that specification.) Therefore, mbrtowc() and other routines that purport to interpret UTF-8 should reject such a sequence by returning (size_t) -1 and setting errno to EILSEQ.
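
To make the arithmetic concrete, here is a minimal sketch that decodes the three bytes from the test case by hand and checks the result against the surrogate range. The helper names decode3 and is_surrogate are hypothetical (they are not part of libc), and decode3 assumes the caller already knows it is looking at a three-byte sequence.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: combine the payload bits of a three-byte UTF-8
   sequence (1110xxxx 10xxxxxx 10xxxxxx) into a code point. It performs
   no validation; it only illustrates the bit arithmetic. */
static uint32_t decode3( unsigned char b0, unsigned char b1, unsigned char b2 )
{
  return ( (uint32_t) ( b0 & 0x0F ) << 12 )
       | ( (uint32_t) ( b1 & 0x3F ) << 6 )
       |   (uint32_t) ( b2 & 0x3F );
}

/* Surrogate code points occupy U+D800..U+DFFF. */
static bool is_surrogate( uint32_t cp )
{
  return cp >= 0xD800 && cp <= 0xDFFF;
}
```

For the bytes used in unicode-test2.c, decode3( 0xED, 0xA0, 0x81 ) gives 0xD801, which is_surrogate() reports as a surrogate, so a conforming UTF-8 decoder has to treat the sequence as ill-formed.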

Instead, this implementation of mbrtowc() produces an invalid wchar_t that does not contain a Unicode scalar value, and does not signal an invalid sequence.
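
Until the library is fixed, a caller could guard against this behavior itself. The following is only a sketch of one possible workaround, not Apple's or anyone else's implementation: strict_mbrtowc is a hypothetical wrapper name, and it assumes wchar_t holds Unicode code points, as it does in a UTF-8 locale on Mac OS X and GNU libc.

```c
#include <errno.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical wrapper, not a libc function: call mbrtowc() and then
   reject any converted character in the surrogate range U+D800..U+DFFF.
   Assumption: wchar_t values are Unicode code points. */
static size_t strict_mbrtowc( wchar_t *pwc, const char *s, size_t n, mbstate_t *ps )
{
  wchar_t wc = 0;
  size_t num = mbrtowc( &wc, s, n, ps );

  /* (size_t) -1 is an error and (size_t) -2 an incomplete sequence;
     in both cases no character was produced, so pass them through. */
  if ( num == (size_t) -1 || num == (size_t) -2 ) {
    return num;
  }

  /* A complete character was produced: refuse surrogates. */
  if ( wc >= 0xD800 && wc <= 0xDFFF ) {
    errno = EILSEQ;
    return (size_t) -1;
  }

  if ( pwc != NULL ) {
    *pwc = wc;
  }
  return num;
}
```

With a wrapper along these lines, the call in unicode-test2.c would be expected to report EILSEQ even where the underlying mbrtowc() accepts the sequence. It is a stopgap only; it does not cover the other conversion functions.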

Such invalid characters (isolated surrogate code points) may have security implications: an attacker who can get two adjacent surrogate code points through the decoder could have them later combined into a real Unicode character, evading input filters that examined the data before that point.
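
As an illustration of that concern, the sketch below shows the standard UTF-16 arithmetic by which a high and a low surrogate combine into one supplementary code point. The function name combine_surrogates is hypothetical, and the scenario (a later stage that treats the decoded values as UTF-16) is an assumption made for the example, not something observed in a particular program.

```c
#include <stdint.h>

/* Hypothetical helper: combine a high surrogate (U+D800..U+DBFF) and a
   low surrogate (U+DC00..U+DFFF) into the code point they represent in
   UTF-16. Returns 0 if the inputs are not a valid surrogate pair. */
static uint32_t combine_surrogates( uint32_t high, uint32_t low )
{
  if ( high < 0xD800 || high > 0xDBFF || low < 0xDC00 || low > 0xDFFF ) {
    return 0;
  }
  return 0x10000 + ( ( high - 0xD800 ) << 10 ) + ( low - 0xDC00 );
}
```

For example, the ill-formed byte sequence ED A0 BD ED B8 80 decodes under a too-liberal decoder to U+D83D followed by U+DE00, and combine_surrogates( 0xD83D, 0xDE00 ) yields 0x1F600. A filter that looked for the well-formed UTF-8 encoding of that character would not have matched the original bytes.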


29-Feb-2012 03:00 AM Keith Winstein:
'unicode-test2.c' was successfully uploaded

#include <wchar.h>
#include <stdio.h>
#include <locale.h>
#include <string.h>
#include <langinfo.h>
#include <stdlib.h>
#include <assert.h>
#include <errno.h>

int main( void )
{
  /* Adopt native locale */
  if ( NULL == setlocale( LC_ALL, "" ) ) {
    perror( "setlocale" );
    return 1;
  }

  /* Verify that locale calls for UTF-8 */
  if ( strcmp( nl_langinfo( CODESET ), "UTF-8" ) != 0 ) {
    fprintf( stderr, "mosh requires a UTF-8 locale.\n" );
    return 1;
  }

  /* Encodes surrogate code point U+D801; ill-formed per Unicode 6.0, D92. */
  const char s[ 3 ] = { '\xED', '\xA0', '\x81' };

  mbstate_t ps;
  memset( &ps, 0, sizeof( ps ) );

  wchar_t y = 0;
  size_t num = mbrtowc( &y, s, 3, &ps );

  assert( (num == (size_t) -1) && (errno == EILSEQ) );

  return 0;
}
