Characters with umlauts are copied incorrectly from PDF documents

Originator: nico.rohrbach
Number: rdar://28243946
Date Originated: 11-Sep-2016
Status: Duplicate/23343765
Resolved:
Product: OS X
Product Version: OS X 10.11.6 (Build 15G1004) and macOS 10.12 GM (Build 16A320)
Classification: Serious Bug
Reproducible: Always
 
Summary:
If you open a PDF document in Preview and copy text that contains characters with umlauts, for example "ä", the character is not copied to the clipboard as the single precomposed Unicode Character 'LATIN SMALL LETTER A WITH DIAERESIS' (ID 228, U+00E4) but as the two Unicode Characters 'LATIN SMALL LETTER A' (ID 97, U+0061) and 'COMBINING DIAERESIS' (ID 776, U+0308).

This causes problems like the one shown in the attached HTML file (demo.html): in a text editor such as Xcode, which you might use to create a web page, you cannot see the difference between "ID 228" and "IDs 97 and 776", but a browser will render the two differently.
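
Because the two forms look identical in an editor, it can help to list the code points directly. The following is a minimal Python sketch, not part of the original report; the helper name describe is hypothetical, and the two sample strings are built by hand to match the IDs given above rather than read from the clipboard.

```python
import unicodedata

def describe(text):
    """Return (decimal code point, Unicode name) for each character in text."""
    return [(ord(ch), unicodedata.name(ch)) for ch in text]

# Built by hand to match the IDs in this report (assumption: not read
# from a real clipboard).
precomposed = "\u00e4"   # single character, ID 228
decomposed = "a\u0308"   # two characters, IDs 97 and 776

print(describe(precomposed))
print(describe(decomposed))

# The strings look the same when rendered, but they compare as different
# and have different lengths.
print(precomposed == decomposed)
print(len(precomposed), len(decomposed))
```

Running a check like this on text pasted from Preview should show whether the pasted character arrived as one code point or two.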

Steps to Reproduce:
1. Create a new empty document in TextEdit.
2. Type in the letter "ä" (Unicode Character 'LATIN SMALL LETTER A WITH DIAERESIS' (ID 228)).
3. Export the document as PDF and open it in Preview.app.
4. Select the letter "ä" and copy it to the clipboard.

Expected Results:
The clipboard should contain the Unicode Character 'LATIN SMALL LETTER A WITH DIAERESIS' (ID 228).

Actual Results:
The clipboard actually contains the Unicode Characters 'LATIN SMALL LETTER A' (ID 97) and 'COMBINING DIAERESIS' (ID 776).

Version:
OS X 10.11.6 (Build 15G1004)
macOS 10.12 GM (Build 16A320)

Notes:
If you open the generated PDF document in Adobe Reader and copy the letter "ä", it will be copied to the clipboard correctly as the Unicode Character 'LATIN SMALL LETTER A WITH DIAERESIS' (ID 228).
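
Until the copy behavior changes, one possible workaround is to normalize pasted text to Unicode Normalization Form C (NFC), which recombines 'LATIN SMALL LETTER A' plus 'COMBINING DIAERESIS' into the single precomposed character. The sketch below is an illustration only and assumes the pasted text is already available as a Python string; the function names to_precomposed and has_combining_marks are hypothetical.

```python
import unicodedata

def has_combining_marks(text):
    """Return True if text contains any combining character, such as ID 776."""
    # unicodedata.combining() returns a non-zero class for combining marks.
    return any(unicodedata.combining(ch) != 0 for ch in text)

def to_precomposed(text):
    """Normalize text to NFC so that 'a' + COMBINING DIAERESIS becomes 'ä'."""
    return unicodedata.normalize("NFC", text)

# Sample built by hand to match the Actual Results in this report.
pasted = "a\u0308"
fixed = to_precomposed(pasted)
print(has_combining_marks(pasted), has_combining_marks(fixed))
print([ord(ch) for ch in fixed])
```

NFC only recombines sequences that have a precomposed equivalent in Unicode, so has_combining_marks can still return True after normalization for other base-and-mark combinations.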
