LT spellchecking is not working as expected

vipranarayan14 · February 9, 2024, 9:29pm

Hi all,

I am trying to add support for Sanskrit.

Following the instructions in the “Adding a new language” page, I have forked the LT repo, created the minimum required files/folders for Sanskrit and added the necessary config.

I have created a simple spellchecking dict from a small list of about 20 words, following the steps given in the docs and in a discussion in the forum. I have also added a few rules in the grammar.xml file.

After building (mvn clean package -DskipTests), I am getting support for Sanskrit in the standalone app.

So far so good. No errors (except for the tests failing due to some issue in the language-en module).

But when I tested the spellchecker with various words, I got weird results:

1. रामः (line 1) is a valid word in the dict and it is getting recognised (I think).
2. रामुः (line 3) is an invalid word and is not in the dict. But the spellchecker does not mark it and no suggestions are shown.
3. But if a Latin word character is added to the invalid word (line 5), the word is correctly marked as invalid and suggestions are shown.
4. सीता (line 7) is not in the dict. But the spellchecker doesn’t show any error.
5. The word “example” (line 9) is getting the suggestion “Apple”. I don’t understand where the spellchecker is getting it from. Global dict?
However, my grammar.xml rules are working as expected.

Are the issues in 2 & 3 due to some script problem? Do I have to configure the script for Sanskrit somewhere?
How can I make the spellchecker always mark out-of-vocabulary words (issue in 4) as errors?
How can I make the spellchecker not show suggestions for non-Sanskrit words (issue in 5)?

This is my fork, for your reference. Please let me know if I am doing something wrong.

Thanks in advance.

– Prasanna

dnaber · February 10, 2024, 2:56pm

I can’t answer your questions without spending time to debug this, but I suggest you use Hunspell directly, not Morfologik. That is, use plain text .dic files instead of binary .dict files. Arabic does that, so you can find an example in ArabicHunspellSpellerRule.

vipranarayan14 · February 11, 2024, 6:21am

Thank you so much @dnaber! I replaced the Morfologik dictionary with my Hunspell dictionary and now it’s working as expected:

The invalid words are marked (even without adding Latin characters).
Proper suggestions are shown just like in Hunspell.
The OOV words in Sanskrit are also marked as invalid and suggestions are shown for them.
The non-Sanskrit words are also marked and they do not get any suggestions. (i.e. No “Apple” for “example”).

dnaber wrote: “I can’t answer your questions without spending time to debug this”

I tried to debug it myself. I traced it back to org.languagetool.rules.spelling.morfologik, where the Morfologik speller (morfologik.speller.Speller) itself reports invalid Sanskrit words as correct (isMisspelled returns false) and non-Sanskrit words as misspelled (isMisspelled returns true).

However, I couldn’t debug it any further since I could not edit the Morfologik library files from within the project. So, I think either there is a problem with my .dict file (which did compile successfully) or there is an issue with how Morfologik handles my dictionary (maybe some wrong config).
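One hypothesis that would fit these symptoms, not confirmed anywhere in this thread: Devanagari has no letter case, so a speller that skips "all-uppercase" words (a common way to ignore acronyms) could skip every Sanskrit word if it defines "all uppercase" as "contains no lowercase letter". Adding a single Latin lowercase character would then make the word checkable again, which matches the behaviour described in item 3. The sketch below only illustrates that effect with standard java.lang.Character calls. The class and method names are made up for illustration and are not Morfologik's or LanguageTool's API.

```java
// Hypothetical illustration only: this is NOT Morfologik code.
// It shows how a "no lowercase letter => treat as all-uppercase" heuristic
// behaves on a script without letter case, such as Devanagari.
class UncasedScriptCheck {

    // Returns true when the word contains no lowercase letter at all.
    // A speller using this as its "all uppercase" test would skip such words
    // if it is configured to ignore acronyms.
    static boolean hasNoLowercaseLetter(String word) {
        return word.codePoints()
                .noneMatch(cp -> Character.isLetter(cp) && Character.isLowerCase(cp));
    }

    public static void main(String[] args) {
        String[] samples = {"रामः", "रामुः", "रामुःa", "example", "NASA"};
        for (String word : samples) {
            // Devanagari letters are neither upper- nor lowercase, so the
            // Sanskrit words are reported the same way as "NASA".
            System.out.println(word + " -> no lowercase letter: "
                    + hasNoLowercaseLetter(word));
        }
    }
}
```

If this turns out to be the cause, the dictionary's .info metadata would be the place to look, since Morfologik has a setting for ignoring all-uppercase words. Whether changing it fixes the Sanskrit dictionary has not been verified here.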