SearchKit fails to respect kSKTermChars when CJK characters are present

Originator:renfei.song
Number:rdar://FB13652520 Date Originated:2/24/2024
Status:Open Resolved:
Product:macOS Product Version:macOS 14
Classification:SearchKit Reproducible:Yes
 
SearchKit appears to sometimes ignore the `kSKTermChars` property during indexing when CJK characters are present in the text. For example, given these index properties:

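    NSDictionary *properties = @{
        (__bridge NSString *)kSKProximityIndexing: @YES,
        // 0 means no limit on the number of terms indexed per document.
        (__bridge NSString *)kSKMaximumTerms: @(0),
        // Treat "." as a valid term character, so "abc.123" stays one term.
        (__bridge NSString *)kSKTermChars: @".",
    };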

Expected: text "测试 abc.123" is indexed as ["abc.123", "测试"]

Actual result: text "测试 abc.123" is indexed as ["123", "abc", "测试"]

Note: For comparison, "test abc.123" is indexed as ["abc.123", "test"], which aligns with expectations.

Impact: The main reason to set `kSKTermChars` is to make these characters searchable, but with this issue searches may not return the expected results. For example, the search query "abc.123" does not match the document "测试 abc.123".

This can be minimally reproduced by building an index and examining the indexed terms with `SKIndexCopyTermStringForTermID`. I have attached a reproducer as well.
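
For reference, a minimal sketch of such a reproduction could look like the following. This is not the attached reproducer. The helper name `IndexedTerms`, the in-memory index, and the placeholder document scheme and name are assumptions made for illustration. It reads the term IDs for the document with `SKIndexCopyTermIDArrayForDocumentID` and resolves each one with `SKIndexCopyTermStringForTermID`.

```objc
#import <Foundation/Foundation.h>
#import <CoreServices/CoreServices.h>

// Hypothetical helper: indexes `text` as a single document in an in-memory
// index and returns the terms SearchKit recorded for it, sorted.
static NSArray<NSString *> *IndexedTerms(NSString *text) {
    NSDictionary *properties = @{
        (__bridge NSString *)kSKProximityIndexing: @YES,
        (__bridge NSString *)kSKMaximumTerms: @(0),
        (__bridge NSString *)kSKTermChars: @".",
    };

    NSMutableData *storage = [NSMutableData data];
    SKIndexRef index = SKIndexCreateWithMutableData(
        (__bridge CFMutableDataRef)storage, NULL, kSKIndexInverted,
        (__bridge CFDictionaryRef)properties);
    if (index == NULL) {
        return @[];
    }

    // Placeholder scheme and name; any unique document identity would do.
    SKDocumentRef document = SKDocumentCreate(CFSTR("repro"), NULL, CFSTR("doc"));
    SKIndexAddDocumentWithText(index, document, (__bridge CFStringRef)text, true);
    SKIndexFlush(index); // commit the added text so its terms can be read back

    SKDocumentID documentID = SKIndexGetDocumentID(index, document);
    NSArray<NSNumber *> *termIDs =
        CFBridgingRelease(SKIndexCopyTermIDArrayForDocumentID(index, documentID));

    NSMutableArray<NSString *> *terms = [NSMutableArray array];
    for (NSNumber *termID in termIDs) {
        NSString *term = CFBridgingRelease(
            SKIndexCopyTermStringForTermID(index, (CFIndex)termID.longValue));
        if (term != nil) {
            [terms addObject:term];
        }
    }

    CFRelease(document);
    SKIndexClose(index);
    return [terms sortedArrayUsingSelector:@selector(compare:)];
}
```

Logging `IndexedTerms(@"test abc.123")` next to `IndexedTerms(@"测试 abc.123")` from `main` should make the difference described above visible. The sketch assumes ARC and can be built with `clang -fobjc-arc repro.m -framework Foundation -framework CoreServices`.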
