Problems with umlauts in GERMAN_SPELLER_RULE

Vegeeto · December 16, 2019, 1:41pm

As in the linked topic, LanguageTool is flagging all words with umlauts as misspelled.

Example from the API:
Text: Sie können die Police vor dem Fälligkeitsdatum kündigen.
Language: de-DE
Curl: curl -X POST --header 'Content-Type: application/x-www-form-urlencoded' --header 'Accept: application/json' -d 'text=Sie%20ko%CC%88nnen%20die%20Police%20vor%20dem%20Fa%CC%88lligkeitsdatum%20ku%CC%88ndigen.&language=de-DE&enabledOnly=false' 'https://languagetool.org/api/v2/check'

The response shows that the “GERMAN_SPELLER_RULE” (“Möglicher Rechtschreibfehler”) rule treats each word with an umlaut as two words, splitting it at the umlaut character.

Do umlauts need a special encoding?

dnaber · December 16, 2019, 1:42pm

Special characters should be sent to LT in their normal (precomposed) form. Here it looks as if “ö” was sent as “o” followed by a separate combining umlaut character. Instead, use “ö” directly (and URL-encode it).
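To illustrate the difference between the two forms, here is a minimal Python sketch. It is not part of LanguageTool; the helper name `has_combining_marks` is made up for this example, and it assumes the client can normalize the text to Unicode NFC before sending it.

```python
import unicodedata

def has_combining_marks(text: str) -> bool:
    """Return True if any character is a combining mark (e.g. U+0308)."""
    return any(unicodedata.combining(ch) for ch in text)

# "o" followed by U+0308 COMBINING DIAERESIS, as in the failing request
decomposed = "ko\u0308nnen"

# NFC composes "o" + U+0308 into the single character "ö" (U+00F6)
composed = unicodedata.normalize("NFC", decomposed)

print(has_combining_marks(decomposed))  # True
print(has_combining_marks(composed))    # False
print(len(decomposed), len(composed))   # 7 6
```

Both strings usually look identical on screen, which is why the problem is easy to miss.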

Vegeeto · December 16, 2019, 2:58pm

Well, looking at the URL generated by the API, I understand that the text is URL-encoded (“ö” is replaced with “%25C3%25B6”).

dnaber · December 16, 2019, 3:12pm

I don’t think the URL encoding step is the problem, but rather its input. “können” is encoded as ko%CC%88nnen in your original example. It should be k%C3%B6nnen, I think.
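The two encodings mentioned here can be reproduced with Python's standard library. This is only a sketch of the percent-encoding step and makes no request to the API; the variable names are illustrative.

```python
from urllib.parse import quote

decomposed = "ko\u0308nnen"  # "o" + U+0308 COMBINING DIAERESIS
composed = "k\u00f6nnen"     # precomposed "ö" (U+00F6)

# quote() percent-encodes the UTF-8 bytes of each non-ASCII character
print(quote(decomposed))  # ko%CC%88nnen
print(quote(composed))    # k%C3%B6nnen
```

The URL encoding itself is correct in both cases; only the input string differs.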

Vegeeto · December 16, 2019, 3:18pm

Sure, I realised that it is a different character. Thank you very much!