Testing spell checker

Ruud_Baars · September 26, 2017, 12:33pm

Is there a way to batch-feed words to the Morfologik spell checker and get each word back together with its suggestions? I want to use this to improve the suggestions, either with REP-like replacements or with plain replacements in the text file.

dnaber · September 26, 2017, 1:41pm

Maybe your best bet is to use the command-line options -eo -e ID, with ID being the ID of the spell checker rule.

Ruud_Baars · September 26, 2017, 2:16pm

The output below shows that a simple error of one missing letter yields a lot of suggestions. The suggestions are certainly not ordered by frequency.

ruud@TaalTik:/media/ruud/data2/LanguageTool-current$ java -jar languagetool-commandline.jar -l nl -eo -e MORFOLOGIK_RULE_NL_NL --line-by-line
Expected text language: Dutch
Warning: running in line by line mode. Cross-paragraph checks will not work.

Working on STDIN…
fiet

1.) Line 1, column 1, Rule ID: MORFOLOGIK_RULE_NL_NL
Message: Mogelijke spelfout gevonden
Suggestion: bijt; dijt; fit; mijt; wijt; zijt; Eijt; jijt; rijt; bij; dit; dient; feit; fiets; hij; jij; liet; lijkt; lijn; lijst; mij; mijn; niet; pijn; tijd; uit; vijf; wij; ziet; zij; zijn; zit; Eijk; FIE; Fie; Fiji; … and a lot more

The frequencies are:
fiets 211798
zijt 36350

So I would have expected fiets before zijt, not after.

Any idea what this is caused by?
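For checking many words this way, the text output shown above can be parsed back into suggestion lists. The following is only a sketch: it assumes the output keeps the "N.) Line …, column …, Rule ID: …" and "Suggestion: a; b; c" layout seen in this session, and the name parse_matches is made up for illustration.

```python
import re

# Header line of one match, as it appears in the session output above.
MATCH_HEADER = re.compile(r"^\d+\.\) Line (\d+), column (\d+), Rule ID: (\S+)")


def parse_matches(output):
    """Collect line, column, rule ID and suggestions for each reported match."""
    matches = []
    current = None
    for raw in output.splitlines():
        line = raw.strip()
        header = MATCH_HEADER.match(line)
        if header:
            current = {
                "line": int(header.group(1)),
                "column": int(header.group(2)),
                "rule_id": header.group(3),
                "suggestions": [],
            }
            matches.append(current)
        elif line.startswith("Suggestion:") and current is not None:
            body = line[len("Suggestion:"):]
            # Suggestions are separated by semicolons; keep their order.
            current["suggestions"] = [s.strip() for s in body.split(";") if s.strip()]
    return matches
```

Because the suggestion order is preserved, the position of a word such as fiets in the list can be compared against its frequency afterwards.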

SkyCharger001 · September 26, 2017, 3:35pm

just speculation:

what if they aren’t frequencies but ‘priorities’?
(highest frequency gets priority 1, second highest gets priority 2 … and so on)

It’s not always reliable, but it might have been done to reduce file size.
(instead of a, let’s say, 1024-bit frequency value, only a, let’s say, 32-bit priority number needs to be stored)

Ruud_Baars · September 26, 2017, 6:51pm

Everything has been used as described on the wiki.

dnaber · September 26, 2017, 7:07pm

The frequency is only the second sort criterion I think. The first one is how similar the words are. This can become a bit complicated because the replacement pairs from nl_NL.info are applied first. They might make zijt more similar to fiet than fiets (but I haven’t checked).
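A two-level ordering like this can be sketched as follows. This is not Morfologik's actual code: it uses plain Levenshtein distance as the similarity measure and ignores the replacement pairs entirely, and the function names are invented for the example. The two frequencies are the ones quoted earlier in the thread; words without a known frequency count as 0.

```python
def levenshtein(a, b):
    """Plain edit distance: insertions, deletions and substitutions cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or keep
        prev = cur
    return prev[-1]


def rank(misspelling, candidates, frequencies):
    """Sort by distance first; frequency only breaks ties (higher first)."""
    return sorted(
        candidates,
        key=lambda w: (levenshtein(misspelling, w), -frequencies.get(w, 0)),
    )


frequencies = {"fiets": 211798, "zijt": 36350}
print(rank("fiet", ["zijt", "fit", "fiets"], frequencies))
```

With plain Levenshtein, fiets is one edit away from fiet and zijt is two, so fiets sorts first here. An ordering with zijt ahead of fiets would therefore have to come from something that changes the similarity measure, such as the replacement pairs.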

Ruud_Baars · September 26, 2017, 7:35pm

I will remove the replacements and try again.

Ruud_Baars · September 26, 2017, 7:40pm

Without all those replacements, it is better.
But actually, I assumed the alternatives would be ordered purely by frequency class. Or maybe by a weighted balance between the Levenshtein distance relative to the word length and the frequency class…
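Such a weighted balance could look roughly like the sketch below. Everything in it is an assumption made for illustration: the function name, the weight of 0.7, and the idea that the frequency class is an integer from 0 to max_class with higher meaning more frequent. It does not describe how LT currently scores suggestions.

```python
def weighted_score(distance, word_length, freq_class, max_class=25, weight=0.7):
    """Combine relative edit distance and frequency class; lower is better.

    distance     edit distance between the misspelling and the candidate
    word_length  length of the misspelled word
    freq_class   assumed integer class, 0 (rare) .. max_class (very common)
    weight       share of the score given to the distance term
    """
    relative_distance = distance / max(word_length, 1)
    rarity = 1 - freq_class / max_class
    return weight * relative_distance + (1 - weight) * rarity
```

Dividing the distance by the word length means one edit weighs more in a four-letter word than in a ten-letter word, which matches the intuition that short words need stricter matching.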

Ruud_Baars · September 27, 2017, 6:57am

I will do more tests, but so far it looks like the frequencies are applied too early; maybe that should be the last step.
Some of the replacements are quite short: ij <=> ei, f <=> v, s <=> z. These are completely valid, but their impact is very large in short words.

I am checking all words in order of decreasing frequency with the spell checker without replacements now. Then I will check which replacements are actually needed at the top of the frequency list, and make them as long as possible.
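To see why short pairs hit short words so hard, the following sketch lists every word that results from applying one of the three pairs above once, in either direction. It only illustrates the effect; how Morfologik really applies the pairs from the .info file is not modelled, and the names PAIRS and variants are made up.

```python
# The three short replacement pairs mentioned above.
PAIRS = [("ij", "ei"), ("f", "v"), ("s", "z")]


def variants(word, pairs=PAIRS):
    """Return every word obtained by applying one pair once, in either direction."""
    found = set()
    for left, right in pairs:
        for src, dst in ((left, right), (right, left)):
            start = word.find(src)
            while start != -1:
                found.add(word[:start] + dst + word[start + len(src):])
                start = word.find(src, start + 1)
    return found


print(sorted(variants("zijt")))
print(sorted(variants("fiet")))
```

In a four-letter word such as zijt, the ij <=> ei pair alone rewrites half of the letters, so a single "cheap" replacement can move a short word a long way.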

Ruud_Baars · October 2, 2017, 10:34am

I fed all words to the spell checking rule in order of descending frequency. My conclusions so far are:

LT is much better at suggesting than Hunspell is, especially when multiple letters in different parts of the word have changed.
LT does not do compounding; in compounding languages, the word list needed for the ‘tail’ is enormous.
The Hunspell REPs are not of much use in the .info; too many changes lead to less optimal suggestions. Some are needed, though, for suggestions that are very far from the word.