Is there a way to batch-feed the Morfologik spell checker with words and get each word and its suggestions back? The goal is to improve the suggestions with REP-like replacements or plain replacements in the text file.
Maybe your best bet is to use the command-line options -eo -e ID, with ID being the ID of the spell checker rule.
The output below shows that a simple error of one missing letter yields a lot of suggestions. The suggestions are clearly not ordered by frequency.
ruud@TaalTik:/media/ruud/data2/LanguageTool-current$ java -jar languagetool-commandline.jar -l nl -eo -e MORFOLOGIK_RULE_NL_NL --line-by-line
Expected text language: Dutch
Warning: running in line by line mode. Cross-paragraph checks will not work.
Working on STDIN…
1.) Line 1, column 1, Rule ID: MORFOLOGIK_RULE_NL_NL
Message: Mogelijke spelfout gevonden
Suggestion: bijt; dijt; fit; mijt; wijt; zijt; Eijt; jijt; rijt; bij; dit; dient; feit; fiets; hij; jij; liet; lijkt; lijn; lijst; mij; mijn; niet; pijn; tijd; uit; vijf; wij; ziet; zij; zijn; zit; Eijk; FIE; Fie; Fiji; … and a lot more
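To get from this console output to a word-by-word list of suggestions, the text can be parsed afterwards. Below is a minimal sketch in Python that assumes the output keeps the layout shown above (a numbered header line with "Rule ID:", followed by a "Suggestion:" line with semicolon-separated items). The function name parse_suggestions is made up for this example. With --line-by-line and one word per input line, the reported line number should identify the input word, but that is an assumption I have not verified.

```python
import re

# Matches the header line of one match, e.g.
# "1.) Line 1, column 1, Rule ID: MORFOLOGIK_RULE_NL_NL"
HEADER = re.compile(r"^\d+\.\) Line (\d+), column (\d+), Rule ID: (\S+)")


def parse_suggestions(output):
    """Return a list of (line, column, rule_id, suggestions) tuples."""
    results = []
    current = None
    for raw in output.splitlines():
        text = raw.strip()
        header = HEADER.match(text)
        if header:
            # Start a new record; suggestions are filled in when we see them.
            current = [int(header.group(1)), int(header.group(2)), header.group(3), []]
            results.append(current)
        elif text.startswith("Suggestion:") and current is not None:
            items = text[len("Suggestion:"):].split(";")
            current[3] = [s.strip() for s in items if s.strip()]
    return [tuple(r) for r in results]


# Shortened sample in the same layout as the output above.
sample = """1.) Line 1, column 1, Rule ID: MORFOLOGIK_RULE_NL_NL
Message: Mogelijke spelfout gevonden
Suggestion: bijt; dijt; fit; mijt"""

for line, column, rule_id, suggestions in parse_suggestions(sample):
    print(line, column, rule_id, suggestions)
```

Lines that match neither pattern (the language banner, warnings, the message text) are simply skipped.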
The frequencies are:
So I would have expected fiets before zijt, not after.
Any idea what this is caused by?
What if they aren’t frequencies but ‘priorities’?
(The highest frequency gets priority 1, the second highest gets priority 2, and so on.)
It’s not always reliable, but it might have been done to reduce file size.
(Instead of, let’s say, a 1024-bit frequency value, only a 32-bit priority number needs to be stored.)
Everything has been used as described on the wiki.
The frequency is only the second sort criterion, I think. The first one is how similar the words are. This can become a bit complicated because the replacement pairs from nl_NL.info are applied first. They might make zijt more similar to fiets (but I haven’t checked).
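A two-level sort like that would explain the observed order. Here is a minimal sketch, not the actual Morfologik code: it uses plain Levenshtein distance as the similarity measure and ignores the replacement pairs from the .info file. The input word fijt is a guess (the thread does not show what was typed), and the frequency classes are invented for illustration.

```python
def levenshtein(a, b):
    """Plain edit distance: insertions, deletions, substitutions cost 1."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute or match
            ))
        previous = current
    return previous[-1]


def rank(word, candidates):
    """Sort by distance first; the frequency class only breaks ties."""
    return sorted(candidates, key=lambda c: (levenshtein(word, c[0]), -c[1]))


# Hypothetical frequency classes (higher = more common), not the real Dutch data.
candidates = [("zijt", 5), ("fiets", 20), ("bijt", 12), ("wijt", 8)]
print([w for w, _ in rank("fijt", candidates)])
```

With these made-up numbers, fiets has the highest frequency class but still lands last, because it is two edits away while the others are one edit away.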
I will remove the replacements and try again.
Without all those replacements, it is better.
But actually, I assumed the alternatives would be presented purely in order of frequency class. Or maybe by a weighted balance between the Levenshtein distance relative to the word length and the frequency class…
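Such a weighted balance could look roughly like the sketch below. This is only an illustration of the idea, not something LT does: the weight of 0.5, the maximum frequency class of 30, and the frequency classes themselves are all assumptions, and the edit distances are entered by hand for a hypothetical four-letter input.

```python
def score(distance, word_length, freq_class, max_class=30, weight=0.5):
    """Lower is better. Both terms are scaled to roughly 0..1."""
    relative_distance = distance / max(word_length, 1)
    rarity = 1 - freq_class / max_class
    return weight * relative_distance + (1 - weight) * rarity


# (candidate, edit distance to the 4-letter input, hypothetical frequency class)
candidates = [("zijt", 1, 5), ("fiets", 2, 20), ("bijt", 1, 12)]
ranked = sorted(candidates, key=lambda c: score(c[1], 4, c[2]))
print([c[0] for c in ranked])
```

With these numbers a frequent word that is two edits away can overtake a rare word that is one edit away; shifting the weight towards 1 brings back the distance-first order.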
I will do more tests, but so far it looks like the use of the frequencies is too early; maybe it should be the last thing to do.
Some of the replacements are quite short: ij <=> ei, f <=> v, s <=> z. These are completely valid, but their impact on short words is very large.
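To make the impact concrete, the sketch below applies each of those three pairs once, in either direction, and reports which share of the word gets rewritten. It does not model how Morfologik actually weighs replacement pairs; the function name variants and the test word fijt are just for illustration.

```python
# The three short pairs mentioned above; each is applied in both directions.
PAIRS = [("ij", "ei"), ("f", "v"), ("s", "z")]


def variants(word):
    """Yield (variant, share of the word rewritten) for one pair applied once."""
    for a, b in PAIRS:
        for src, dst in ((a, b), (b, a)):
            start = word.find(src)
            while start != -1:
                yield word[:start] + dst + word[start + len(src):], len(src) / len(word)
                start = word.find(src, start + 1)


for variant, share in variants("fijt"):
    print(variant, share)
```

In a four-letter word the ij <=> ei pair rewrites half of the letters in a single step, which is why it can pull in candidates that look quite different from what was typed.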
I am now running all words, in order of decreasing frequency, through the spell checker without replacements. Then I will check which replacements are actually needed at the top of the frequency list, and make them as long as possible.
I fed all words, in order of descending frequency, to the spell-checking rule. My conclusions so far are:
- LT is much better at suggesting than Hunspell is, especially when multiple letters in different parts of the word have changed.
- LT does not do compounding; in compounding languages, the word list needed for the ‘tail’ is enormous.
- The Hunspell REPs are not of much use in the .info; too many changes lead to less optimal suggestions. Some are needed, though, for suggestions very far from the word.