Case conversion in StringTools ignores the locale

arysin · December 31, 2024, 7:47pm

An interesting problem popped up in the crh module.
Most locales follow the same rules for case conversion, so our StringTools methods that deal with case don’t take the locale as an argument.
But a few locales have non-standard rules, e.g. tr, tt, kk, az, and crh convert “i” differently:

Lowercase “I” → “ı” (dotless i)

Uppercase “i” → “İ” (dotted İ)
I was able to fix a few places (the tagger and spellchecking recognition) by using the “tr_UA” locale for crh (note: there’s no “crh” locale, unfortunately). But spellchecking suggestions still use locale-less case conversion, so the misspelled İngliiz gets an incorrect suggestion of Іngliz instead of İngliz.
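
To make the difference concrete, here is a minimal Java sketch of how the locale changes the result of the standard `String.toLowerCase(Locale)` and `String.toUpperCase(Locale)` calls. The class and helper names are made up for illustration and are not LT code; the “tr-UA” tag is the same stand-in for crh mentioned above.

```java
import java.util.Locale;

public class TurkicCaseDemo {
    // Stand-in locale for crh (assumption for this demo): Java applies the
    // dotted/dotless i rules based on the language code "tr".
    static final Locale CRH_STAND_IN = Locale.forLanguageTag("tr-UA");

    static String lower(String s, Locale locale) {
        return s.toLowerCase(locale);
    }

    static String upper(String s, Locale locale) {
        return s.toUpperCase(locale);
    }

    public static void main(String[] args) {
        // Locale-neutral conversion: the usual Latin pairs I <-> i.
        System.out.println(lower("I", Locale.ROOT));
        System.out.println(upper("i", Locale.ROOT));
        // Turkic rules: I -> dotless ı, i -> dotted İ.
        System.out.println(lower("I", CRH_STAND_IN));
        System.out.println(upper("i", CRH_STAND_IN));
    }
}
```

Any code path that calls the locale-less variant (or passes a non-Turkic locale) takes the first branch, which is what happens in the suggestion code.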

On one hand this is a very odd case that only affects one language in LT, but on the other hand any character manipulation should use the language locale consistently for all operations. E.g. we pass a locale argument in some calls to toLowerCase() in LT but not in others.

So I am wondering if we should adjust StringTools to use the language locale, and whether other places that call toLowerCase()/toUpperCase() should do so consistently too.
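
A rough sketch of what I have in mind, just to show the shape of the change. The class name, method name, and signatures are assumptions for illustration and may not match what StringTools actually exposes: a locale-aware overload does the work, and the old locale-less form delegates to it with `Locale.ROOT`, so existing callers keep their behaviour.

```java
import java.util.Locale;

public class LocaleAwareCase {
    // Hypothetical helper, not the existing StringTools API:
    // uppercases only the first code point, using the given locale's rules.
    static String uppercaseFirstChar(String s, Locale locale) {
        if (s == null || s.isEmpty()) {
            return s;
        }
        // Step over one full code point so surrogate pairs are not split.
        int end = s.offsetByCodePoints(0, 1);
        return s.substring(0, end).toUpperCase(locale) + s.substring(end);
    }

    // Locale-less variant kept for existing callers; delegates with Locale.ROOT.
    static String uppercaseFirstChar(String s) {
        return uppercaseFirstChar(s, Locale.ROOT);
    }
}
```

With this, the crh code would pass its locale and get İngliz from ingliz, while every other language would keep getting the locale-neutral result.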

Another aspect of this is that we have a locale (optionally) specified in the speller dictionary .info file. It is used by the morfologik speller, but only 2 languages actually specify that locale, and you also specify it in the language class. I wonder if we need to make them more in sync.

LanguageToolSupport · January 6, 2025, 1:13pm

Hi Arysin,

Thank you for reporting this. Can you please specify which languages this applies to?

Thank you!

arysin · January 6, 2025, 3:42pm

In general it applies to these locales: tr, tt, kk, az, crh.
But in LT we only have crh (Crimean Tatar). I was preparing the changes for crh to handle it right and realized I need to adjust StringTools, so I just wanted to make sure everybody is OK with that.