String conversion can return nil when canBeConverted returns true

Originator: robnapier
Number: rdar://29340256
Date Originated: 18-Nov-2016 06:21 PM
Status: Open
Resolved:
Product: iOS SDK
Product Version: Xcode 8.1
Classification:
Reproducible:
 
Summary:
String.canBeConverted returning true is expected to always correspond to -[NSString dataUsingEncoding:] returning non-nil, but there are cases where it does not. If you construct a UTF-16 string containing only a lone low (trailing) surrogate, such as U+DC00, and try to convert it to UTF-8, String.canBeConverted will be true, but -dataUsingEncoding: will be nil. This is a bug in one method or the other (and possibly also in the initializer that allowed this string to be created; I'm not certain the string is valid).

Furthermore, string.data(using: .utf8) returns nil even though string.utf8 yields bytes (the replacement character). This is surprising, to say the least.

Steps to Reproduce:
        const unsigned char bytes[] = {0xDC, 0};
        NSData *data = [NSData dataWithBytes:bytes length:2];
        NSString *string = [[NSString alloc] initWithData:data encoding: NSUTF16StringEncoding];
        NSLog(@"%@", string); // \udc00
        NSLog(@"%d", [string canBeConvertedToEncoding:NSUTF8StringEncoding]); // 1
        NSData *utf8 = [string dataUsingEncoding:NSUTF8StringEncoding];
        NSLog(@"%@", utf8); // (null)
        NSLog(@"%s", string.UTF8String); // ""
        NSLog(@"%d", string.UTF8String == nil); // 1

Note that the Swift version is also incorrect, but in a different way:

let lowSurrogateFirst = Data([0xDC, 0])
let string = String(bytes: lowSurrogateFirst, encoding: .utf16BigEndian)! // replacement-character
string.canBeConverted(to: .utf8) // true
string.data(using: .utf8) // nil
string.utf8 // replacement character
Data(string.utf8) as NSData // <efbfbd> (replacement character)


Expected Results:
At a minimum, canBeConverted should always be false if data(using:) will return nil.
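
The expectation above can be written as a small check. This is only a sketch: conversionIsConsistent is a hypothetical helper name, not a Foundation API, and it assumes nothing beyond the two calls already used in this report.

```swift
import Foundation

/// Hypothetical helper (not part of Foundation): returns true when the two
/// APIs agree, i.e. canBeConverted(to:) being true implies that
/// data(using:) is non-nil.
func conversionIsConsistent(_ string: String, to encoding: String.Encoding) -> Bool {
    let claimsConvertible = string.canBeConverted(to: encoding)
    let converted = string.data(using: encoding)
    // The expectation is an implication:
    // claimsConvertible => converted != nil.
    return !claimsConvertible || converted != nil
}
```

Passing the string from the steps above with .utf8 is the case this report is about: given the results shown there (true from canBeConverted, nil from data(using:)), the check would return false.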

It's unclear whether NSString or String is more correct in how they accept this string of bytes. Since this sequence is not valid UTF-16, I would expect init to fail rather than injecting a replacement character.
