Testing spell checker

Is there a way to batch-feed words to the Morfologik spell checker and get each word back together with its suggestions? The goal is to improve the suggestions, either with REP-like replacements or with plain replacements in the text file.

Your best bet is probably the command-line options -eo -e ID, where ID is the ID of the spell checker rule.
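A minimal Python sketch of how such a batch run could be scripted. It uses only the options mentioned in this thread (-l, -eo, -e, --line-by-line) and assumes the tool reads STDIN when no file is given and reports each match as a "Line …, column …, Rule ID: …" header followed by a "Suggestion:" line with entries separated by "; ". The function names are made up for illustration.

```python
import re
import subprocess

# Header line of a match, as it appears in the command-line output.
MATCH_HEADER = re.compile(r"^\d+\.\) Line (\d+), column (\d+), Rule ID: (\S+)")


def build_command(jar_path, language, rule_id):
    """Build the command line from the options mentioned in this thread."""
    return ["java", "-jar", jar_path, "-l", language,
            "-eo", "-e", rule_id, "--line-by-line"]


def parse_matches(output):
    """Return one dict per reported match, including its suggestion list."""
    matches = []
    for line in output.splitlines():
        line = line.strip()
        header = MATCH_HEADER.match(line)
        if header:
            matches.append({
                "line": int(header.group(1)),
                "column": int(header.group(2)),
                "rule_id": header.group(3),
                "suggestions": [],
            })
        elif line.startswith("Suggestion:") and matches:
            raw = line[len("Suggestion:"):]
            matches[-1]["suggestions"] = [
                s.strip() for s in raw.split(";") if s.strip()
            ]
    return matches


def check_words(words, jar_path="languagetool-commandline.jar",
                language="nl", rule_id="MORFOLOGIK_RULE_NL_NL"):
    """Feed one word per line on STDIN and parse the output (assumed format)."""
    result = subprocess.run(
        build_command(jar_path, language, rule_id),
        input="\n".join(words) + "\n",
        capture_output=True, text=True, check=False,
    )
    return parse_matches(result.stdout)
```

Something like check_words(["fiet"]) would then return the suggestions in the order the tool printed them. Whether the reported line number always maps back to the input line in line-by-line mode is worth verifying before relying on it.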

The output below shows that a simple error of one missing letter yields a lot of suggestions. The suggestions are clearly not ordered by frequency.


ruud@TaalTik:/media/ruud/data2/LanguageTool-current$ java -jar languagetool-commandline.jar -l nl -eo -e MORFOLOGIK_RULE_NL_NL --line-by-line
Expected text language: Dutch
Warning: running in line by line mode. Cross-paragraph checks will not work.

Working on STDIN…
fiet

1.) Line 1, column 1, Rule ID: MORFOLOGIK_RULE_NL_NL
Message: Mogelijke spelfout gevonden
Suggestion: bijt; dijt; fit; mijt; wijt; zijt; Eijt; jijt; rijt; bij; dit; dient; feit; fiets; hij; jij; liet; lijkt; lijn; lijst; mij; mijn; niet; pijn; tijd; uit; vijf; wij; ziet; zij; zijn; zit; Eijk; FIE; Fie; Fiji; … and a lot more

The frequencies are:
fiets 211798
zijt 36350

So I would have expected fiets before zijt, not after.

Any idea what this is caused by?

just speculation:

what if they aren’t frequencies but ‘priorities’?
(highest frequency gets priority 1, second highest gets priority 2 … and so on)

It’s not always reliable, but it might have been done to reduce file size.
(Instead of a, let’s say, 1024-bit frequency value, only a, let’s say, 32-bit priority number needs to be stored.)

Everything has been used as described on the wiki.

The frequency is only the second sort criterion I think. The first one is how similar the words are. This can become a bit complicated because the replacement pairs from nl_NL.info are applied first. They might make zijt more similar to fiet than fiets (but I haven’t checked).
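To illustrate the idea of "similarity first, frequency second", here is a small Python sketch. It is not LanguageTool's or Morfologik's actual code: it uses plain Levenshtein distance as a stand-in for the similarity measure and ignores the replacement pairs entirely. The two frequencies are the ones quoted earlier in this thread.

```python
def levenshtein(a, b):
    """Plain edit distance: insertions, deletions and substitutions cost 1."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def rank(misspelling, frequencies):
    """Sort candidates by distance first, then by descending frequency."""
    return sorted(
        frequencies,
        key=lambda word: (levenshtein(misspelling, word), -frequencies[word]),
    )


frequencies = {"zijt": 36350, "fiets": 211798}  # values quoted in this thread
print(rank("fiet", frequencies))  # ['fiets', 'zijt']
```

With plain distances, fiets (distance 1) ranks before zijt (distance 2) regardless of frequency. Frequency would only decide between candidates at the same distance, so anything that changes the distances, such as replacement pairs, can reorder the list.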

I will remove the replacements and try again.

Without all those replacements, it is better.
But actually, I assumed the alternatives would be ordered purely by frequency class. Or maybe by a weighted balance between the Levenshtein distance, relative to the word length, and the frequency class…
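A rough Python sketch of what such a weighted balance could look like. Everything here is hypothetical: the formula, the weight, and the definition of frequency class (0 for the most frequent word, plus 1 for each halving of frequency) are illustrative assumptions, not how LanguageTool or Morfologik compute anything.

```python
import math


def frequency_class(frequency, top_frequency):
    """Hypothetical class: 0 for the top word, +1 per halving of frequency."""
    return math.log2(top_frequency / frequency)


def blended_score(distance, word_length, frequency, top_frequency, weight=0.1):
    """Lower is better: length-relative distance plus weighted frequency class."""
    relative_distance = distance / word_length
    return relative_distance + weight * frequency_class(frequency, top_frequency)


# Candidates for the misspelling "fiet" (length 4), with plain edit distances
# and the frequencies quoted in this thread. The top frequency is simply the
# highest one among these candidates, as a stand-in for a corpus-wide maximum.
candidates = {"fiets": (1, 211798), "zijt": (2, 36350)}
top = max(freq for _, freq in candidates.values())
ordered = sorted(
    candidates,
    key=lambda w: blended_score(candidates[w][0], len("fiet"), candidates[w][1], top),
)
print(ordered)
```

Dividing the distance by the word length makes a single edit count more heavily in a short word than in a long one, and the weight controls how much a rare word is penalised relative to a distant one.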

I will do more tests, but so far it looks like the frequencies are applied too early; maybe that should be the last step.
Some of the replacements are quite short: ij <=> ei, f <=> v, s <=> z. These are completely valid, but their impact is very large in short words.

I am now running all words, in order of decreasing frequency, through the spell checker without replacements. Then I will check which replacements are actually needed at the top of the frequency list, and make them as long as possible.
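Preparing the input for such a run could look roughly like this in Python. The sketch assumes a frequency list with one "word count" pair per line, like the two entries quoted earlier in this thread; the function name is made up.

```python
def words_by_descending_frequency(lines):
    """Parse 'word count' lines and return the words, most frequent first."""
    entries = []
    for line in lines:
        parts = line.split()
        if len(parts) != 2 or not parts[1].isdigit():
            continue  # skip blank or malformed lines
        entries.append((parts[0], int(parts[1])))
    entries.sort(key=lambda entry: entry[1], reverse=True)
    return [word for word, _ in entries]


print(words_by_descending_frequency(["zijt 36350", "fiets 211798"]))
# ['fiets', 'zijt']
```

The resulting list can be written out one word per line and fed to the spell checking rule.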

I ran all words, in order of descending frequency, through the spell checking rule. My conclusions so far are:

  • LT is much better at suggesting than Hunspell is, especially when multiple letters in different parts of the word have changed.
  • LT does not do compounding; in compounding languages, the word list needed for the ‘tail’ is enormous.
  • The Hunspell REPs are not of much use in the .info; too many changes lead to less optimal suggestions. Some are needed, though, for suggestions that are very far from the word.