The tool misses many spelling mistakes in Persian


(Armin Ghassemi Rudd) #1

Hi,

Just as an example, the Persian text on the home page of https://languagetool.org/ has several spelling mistakes that the tool does not recognize:
لطفا متن خود را اینجا قرار دهید . یا بررسی کنید که این متن را‌ برای دیدن بعضی بعضی از اشکال هایی که ابزار زبان توانسته تشخیس هدد. درباره ی نرم افزارهای بررسی کننده های گرامر چه فکر می کنید؟ لطفا در نظر داشته باشید که آن‌ها بی نقص نمی باشند.‎

The two bolded words in the example (تشخیس هدد) are incorrect; the correct words are: تشخیص دهد

It seems the tool recognizes only grammar mistakes for Persian, not spelling mistakes.
Is it possible to add a real spell checker for Persian as well?

Best,
Armin


(Daniel Naber) #2

There’s currently no maintainer for Persian in LanguageTool, thus basically no work is done for it. If you’d like to become the maintainer, or know someone who would, please let us know.


(Armin Ghassemi Rudd) #3

What exactly should the maintainer do?
I’d be interested in helping. 🙂

Meanwhile, isn’t there an open dictionary for Persian that the spell checker could use? What do Google Chrome and Microsoft Word use to spell-check Persian text?


(Daniel Naber) #4

The maintainer’s task is to improve error detection rules, write new rules, and generally take care of that language in LT. Here’s a more detailed description. Maintainers don’t need to be software developers, although it helps.


(Armin Ghassemi Rudd) #5

Well,
I’ve created about 200 rules and have added them to the previous grammar.xml file for Persian (fa).
How can I send the file to you?


(Daniel Naber) #6

Great, just attach the file here. How have you tested those rules, i.e. did you run them against Persian Wikipedia etc?


(Armin Ghassemi Rudd) #7

The file is attached. I also added the words to the “Replace.txt” file in the same directory, since they are correct but are not picked up in spell checking.
The rules that I have added are under the CAT5 category. (I created this category, and its type is “misspelling”.) This category has more than 350 rules.
I also fixed one or two rules from the previous work.
I created a dictionary in Excel with two columns, one for the misspellings and one for the corrections, and then generated rules from them.
I tested the rules in the standalone software using a long article from Wikipedia. So far everything is fine. I will be in touch if there is any problem, and I will send more rules in the future.
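
For readers who want to try the spreadsheet-to-rules step described above, here is a minimal sketch in Python. It is not the script used for the attached file. It assumes the two columns were exported as CSV (misspelling, correction), and the rule ids (`FA_TYPO_1`, ...) and the message wording are made up for illustration. The element names follow the usual layout of a LanguageTool grammar.xml rule.

```python
import csv
import io
from xml.sax.saxutils import escape, quoteattr


def make_rule(index, wrong, right):
    """Build one <rule> element that flags `wrong` and suggests `right`.

    The id prefix FA_TYPO_ is a hypothetical naming scheme.
    """
    # One <token> per whitespace-separated word of the misspelling.
    tokens = "".join(f"<token>{escape(t)}</token>" for t in wrong.split())
    return (
        f'<rule id="FA_TYPO_{index}" name={quoteattr(wrong)}>\n'
        f"  <pattern>{tokens}</pattern>\n"
        f"  <message>Did you mean <suggestion>{escape(right)}</suggestion>?</message>\n"
        f"  <example correction={quoteattr(right)}>"
        f"<marker>{escape(wrong)}</marker></example>\n"
        f"</rule>"
    )


def rules_from_csv(csv_text):
    """Turn CSV text with (misspelling, correction) rows into rule strings."""
    rules = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2 or not row[0].strip():
            continue  # skip blank or incomplete rows
        rules.append(make_rule(len(rules) + 1, row[0].strip(), row[1].strip()))
    return rules


if __name__ == "__main__":
    # The two misspellings from the first post in this thread.
    sample = "تشخیس,تشخیص\nهدد,دهد\n"
    print("\n".join(rules_from_csv(sample)))
```

The generated rules would still need to be pasted into a category in grammar.xml and checked with the standalone tool.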

grammar.zip (29.7 KB)

Best,
Armin


(Daniel Naber) #8

Thanks, but I think there’s a problem with your approach: there can be any number of spelling errors, and we cannot write rules for all of them. So I think it would be better to search for an Open Source Persian spell checker / dictionary (probably hunspell-based) and see if we can use that. Technical details are documented in the wiki.


(Armin Ghassemi Rudd) #9

The errors I have added are the most common errors in Persian (some were collected by one of the universities in Iran at this page, and some come from this wiki page and this page).
Errors that follow a particular pattern were already covered by rules. The ones I have added are mostly the most common typos in Persian, and they follow no common pattern.


(Ruud Baars) #10

If Persian is Farsi, there is an open public dictionary for Hunspell, e.g.


If you need word frequency lists for Persian, I could supply one.


(Armin Ghassemi Rudd) #11

Thank you so much. So how exactly should I use them to modify LT spell checking for Persian?
I mean, creating rules from a list of incorrect and correct words is very easy for me (it is automated).
Should I use these dictionaries to create rules?


(Ruud Baars) #12

Find a recent Hunspell dictionary and affix file first, maybe from one of the links. Then maybe one of the programming contributors would be kind enough to add the spell checker to LT.

In the meantime, I could teach you some Hunspell tricks. Just contact me directly.
The work will be to find words that are wrong but accepted, as well as words that are correct but not accepted, and to check whether the suggestions provided are good enough. In all three cases, the solution is relatively simple.
The word frequency list I have for you has 1.7 million entries.

For now, you could download www.taaltik.nl/Persian/Persian.zip. This file contains the speller’s .dic and .aff files and a frequency list. It also has the list of words from the frequency list that this speller accepts, as well as the list of words it refuses.

This could give you a good start finding words missing in the speller, as well as words that are accepted, but which you consider incorrect.
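
As a rough illustration of how such accepted and refused lists can be produced, here is a small Python sketch. It is not how the lists in the zip file were made. It only looks at the stems in the .dic file and ignores the .aff rules completely, so inflected forms that Hunspell itself would accept end up in the refused list. The function names and the flag `A` in the sample data are made up for the example.

```python
def load_dic_stems(dic_text):
    """Collect the stems from the text of a Hunspell .dic file.

    The first line of a .dic file is an entry count; every later line is
    a word, optionally followed by "/" and affix flags.
    Simplification: affix rules from the .aff file are NOT applied.
    """
    stems = set()
    for line in dic_text.splitlines()[1:]:
        entry = line.strip()
        if entry:
            stems.add(entry.split("/", 1)[0])
    return stems


def split_by_speller(words, is_accepted):
    """Split `words` into (accepted, refused), keeping the input order.

    `is_accepted` is any function that returns True for a word the
    speller accepts; here a plain set lookup stands in for Hunspell.
    """
    accepted, refused = [], []
    for word in words:
        (accepted if is_accepted(word) else refused).append(word)
    return accepted, refused


if __name__ == "__main__":
    dic_text = "2\nتشخیص/A\nدهد\n"          # tiny made-up dictionary
    frequency_words = ["تشخیص", "تشخیس", "دهد", "هدد"]
    stems = load_dic_stems(dic_text)
    accepted, refused = split_by_speller(frequency_words, stems.__contains__)
    print("accepted:", accepted)
    print("refused:", refused)
```

Because the frequency list is ordered by how often a word occurs, reading the refused list from the top shows the most frequent missing words first.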


(Armin Ghassemi Rudd) #13

Well,
I have contacted one of the developers and just opened an issue on GitHub for this.
I hope we will do the job soon.
Best,
Armin