NSLinguisticTagger fails to separate numbers from nouns

Originator: phase.of.matter
Number: rdar://14190417    Date Originated: 18-Jun-2013 01:34 PM
Status: Open    Resolved:
Product: 10.9    Product Version: 13A476u
Classification: Other Bug    Reproducible: Always
 
Summary:
In the first Mavericks beta, NSLinguisticTagger fails to separate a number from the noun preceding it in certain sentences (this did not happen in 10.8). In the following sentences the chunk "Creatinine 95" (and in some cases even "GFR 32") comes back as one token instead of two:

"GFR, 32 years, Creatinine 95, male, caucasian."
"GFR 32 years, Creatinine 95, male, caucasian."
"32 years, Creatinine 95, male, caucasian."

In the following sentences, "Creatinine" and "95" come back as their own tokens as expected:

"GFR, 32 years Creatinine 95, male, caucasian."
"GFR 32 years Creatinine 95, male, caucasian."
"GFR, Creatinine 95, male, caucasian."
"Creatinine 95, male, caucasian."

Expected Results:
If the "NSLinguisticTaggerJoinNames" option is not given, no name/noun joining should be attempted.

Actual Results:
Even if the "NSLinguisticTaggerJoinNames" option is not given, the tagger seems to apply heuristics that prompt it to join tokens into nouns that it thinks might be names.
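
A minimal sketch for reproducing this, written in Swift against the Foundation API. The report does not say which tag scheme or enumeration options were used, so the lexical-class scheme and the omit-whitespace/omit-punctuation options here are assumptions, and `tokens(in:)` is a hypothetical helper name. The join-names option is deliberately left out. Output is not shown because it depends on the OS version.

```swift
import Foundation

// Hypothetical helper: returns the tokens NSLinguisticTagger reports for `text`.
// Assumption: lexical-class scheme; the report does not name the scheme used.
func tokens(in text: String) -> [String] {
    let tagger = NSLinguisticTagger(tagSchemes: [.lexicalClass], options: 0)
    tagger.string = text

    let nsText = text as NSString
    let fullRange = NSRange(location: 0, length: nsText.length)

    // Skip whitespace and punctuation; .joinNames is intentionally NOT passed.
    let options: NSLinguisticTagger.Options = [.omitWhitespace, .omitPunctuation]

    var result: [String] = []
    tagger.enumerateTags(in: fullRange, scheme: .lexicalClass, options: options) { _, tokenRange, _, _ in
        // Collect the substring of each reported token.
        result.append(nsText.substring(with: tokenRange))
    }
    return result
}

// Sentences taken from the report: the first is reported to join
// "Creatinine 95", the second is reported to tokenize as expected.
let samples = [
    "GFR, 32 years, Creatinine 95, male, caucasian.",
    "GFR, Creatinine 95, male, caucasian."
]

for sample in samples {
    print(tokens(in: sample))
}
```

If the joining occurs, "Creatinine 95" shows up as a single array element for the first sentence, while the second sentence yields "Creatinine" and "95" as separate elements.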

Regression:
This did not happen in 10.8. The problem with the joining is that the tagger does not (and I don't expect it to) correctly parse this "sentence". It is thus important that it does not try to be smart and simply returns the tokens. "95" is obviously a number; in no instance should it be joined to a noun if the join-names option is not given.
