Help Fixing the Khmer Spelling Checker

sungkhum · November 14, 2018, 6:19pm

We never used the Khmer spelling checker for LanguageTool, but I noticed it is broken on the LanguageTool website.

Khmer is written without spaces between words, so in Unicode text we insert a zero-width space (U+200B) between words so that a spelling checker can tell where the word boundaries are (and so that lines break correctly).

Is there a way I can help fix this? The sentence should be broken like this:
ឃ្លា នេះ បង្ហាញ ពី ពី កំហុស វេយ្យាករណ៍ ដើម្បី បញ្ជាក់ ពី ប្រសិទ្ធភាព របស់ កម្មវិធី LanguageTool សំរាប់ ភាសាខ្មែរ។
(but with zero-width spaces, not real ones).
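To illustrate the word-boundary convention described above, here is a minimal Python sketch of splitting text on U+200B. It is not LanguageTool's tokenizer; the function name `split_khmer_words` is hypothetical, and it assumes the input already contains zero-width spaces between words.

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE, the Khmer word separator described above


def split_khmer_words(text: str) -> list[str]:
    """Split text on ordinary whitespace and on U+200B.

    Hypothetical helper for illustration only. Empty pieces (for example
    from two consecutive zero-width spaces) are dropped.
    """
    tokens = []
    # str.split() with no argument splits on whitespace; U+200B is not
    # treated as whitespace by Python, so it is handled separately.
    for chunk in text.split():
        tokens.extend(piece for piece in chunk.split(ZWSP) if piece)
    return tokens


# Build a sample with zero-width spaces instead of real ones.
sample = ZWSP.join(["ឃ្លា", "នេះ", "បង្ហាញ"])
words = split_khmer_words(sample)
```

Punctuation such as the Khmer full stop (។) stays attached to the preceding word in this sketch; a real tokenizer would need to split it off as well.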

dnaber · November 14, 2018, 6:31pm

So it works everywhere else, just on the web page it’s broken? If so, we’re planning to rewrite that part anyway, but it might take 2-3 months.

sungkhum · November 14, 2018, 7:15pm

I believe we had it turned off in the LibreOffice extension since we use Hunspell for Khmer.
But it looks like it doesn’t work in the Java application on a Mac either (in fact, Khmer doesn’t display correctly at all).

l3043Y · April 22, 2021, 8:57am

Could you point me to the Java class that is responsible for the Khmer tokenizer? I have the same issue: LanguageTool reports a handful of errors on correct text. At the moment, the Khmer tokenizer treats incomplete sentences, or clusters of words with missing characters, as a single token.

dnaber · April 22, 2021, 8:59am

This is the Khmer tokenizer for words: languagetool/KhmerWordTokenizer.java at master · languagetool-org/languagetool · GitHub

l3043Y · April 22, 2021, 9:05am

Thank you! I hope to find a way to tokenize Khmer words correctly.