The Esperanto spell checker in LanguageTool fairly often raises false alarms on ordinal numbers such as “123-a” or “123-an”, which are correct (they are like “123rd” in English).
For example with the correct sentence “La 20-a jarcento.” (= the 20th century), LanguageTool wrongly underlines “a” in “20-a”.
I’m not sure yet how to fix it. I suppose I should be able to create a patch for the Hunspell eo.aff file (languagetool-language-modules/eo/src/main/resources/org/languagetool/resource/eo/hunspell/eo.aff), but I have never become familiar enough with the Hunspell syntax.
Using <disambig action="ignore_spelling"/> looked promising, but I could not make it work. I suppose that it does not work because LT and Hunspell tokenize differently:
$ echo "2-a" | java -jar /home/pel/sb/languagetool/languagetool-standalone/target/LanguageTool-5.6-SNAPSHOT/LanguageTool-5.6-SNAPSHOT/languagetool-commandline.jar -l eo -v
Expected text language: Esperanto
Working on STDIN...
<S> 2-a[2-a/A nak np,</S>]<P/>
Disambiguator log:
431 rules activated for language Esperanto
1.) Line 1, column 3, Rule ID: HUNSPELL_RULE
Message: Ebla mistajpaĵo trovita
Suggestion: s; ia; al; la; ja
2-a
^
Notice that for LT, “2-a” is a single token, whereas Hunspell flags only “a” as the typo (the hyphen splits words in Hunspell). So I think I have to tweak the Hunspell *.aff file with COMPOUNDRULE, but the Hunspell documentation is not great, and so far I have not understood it well enough to do that.
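To make the mismatch concrete, here is a toy model in Python. It is only a sketch: the two functions are hypothetical stand-ins and do not reproduce the real tokenizers of LT or Hunspell. They only mimic the behaviour visible in the log above, where LT keeps “2-a” together while Hunspell treats the hyphen as a word boundary.

```python
import re

def lt_like_tokens(text):
    # Toy model (assumption): hyphenated strings such as "2-a" stay in one token.
    return re.findall(r"[\w-]+", text)

def hunspell_like_tokens(text):
    # Toy model (assumption): the hyphen acts as a word boundary.
    return re.findall(r"\w+", text)

sentence = "La 20-a jarcento."
print(lt_like_tokens(sentence))        # ['La', '20-a', 'jarcento']
print(hunspell_like_tokens(sentence))  # ['La', '20', 'a', 'jarcento']
```

In the second list, “a” stands alone, which matches the part that gets underlined.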
You could start by adding - as a word character in the aff file, but beware of the consequences…
‘Words’ that are built using - will no longer be accepted (like scot-free in English), and the same could happen for the verb extensions some languages have.
The best approach is to take a huge list of Esperanto words (from books, for example), run it through
hunspell -d eo-EO -a with each of the two aff files, and compare the outputs.
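A rough Python sketch of that comparison follows. It makes some assumptions: it uses hunspell's -l option (print only the misspelled words) instead of -a because the output is easier to diff, the dictionary names passed in are placeholders, and the helper names flagged_words and compare_flagged are made up for this example. Input encoding may also need attention depending on how the dictionary is encoded.

```python
import subprocess

def flagged_words(dictionary, words):
    """Run `hunspell -d DICTIONARY -l` and return the words it reports as misspelled.

    `dictionary` is whatever name or path you would pass to -d (placeholder here).
    """
    result = subprocess.run(
        ["hunspell", "-d", dictionary, "-l"],
        input="\n".join(words) + "\n",
        capture_output=True,
        text=True,
        check=True,
    )
    return set(result.stdout.split())

def compare_flagged(before, after):
    """Compare the flagged words obtained with the old and the new aff file."""
    before, after = set(before), set(after)
    return {
        "newly_flagged": sorted(after - before),   # regressions to look at
        "newly_accepted": sorted(before - after),  # false alarms that went away
    }
```

Usage would be something like compare_flagged(flagged_words("eo-old", words), flagged_words("eo-new", words)), where "eo-old" and "eo-new" are hypothetical dictionary names for the two variants. Anything in newly_flagged deserves a manual look before the change is accepted.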
Yes, I thought of that, but it would introduce other false positives, since hyphens can be used in Esperanto to create words. As explained in PMEG: Helposignoj, in Esperanto you can create words with hyphens to improve readability, like “kafo-tablo”, which is the same as “kafotablo” (= coffee table). So:
for spelling, it’s best to split with hyphens to reduce false spelling errors
for grammar, it’s best to consider “kafo-tablo” as a single word
But then <disambig action="ignore_spelling"/> does not seem to work.
The same is true for Dutch. If it is a common construction, it could be added to the compounding rules. If not, LT could be adjusted with this disambig rule, and maybe another one that matches three tokens: number, dash and extension.
There are many examples for that kind of rule in the Dutch disambiguation file.
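Before writing the actual rule, the number + dash + extension pattern can be tried out in plain Python. This is only a sketch and not LT rule syntax. The endings -a and -an come from the examples in this thread; the plural forms -aj and -ajn are added here as an assumption, and the function names are made up.

```python
import re

# Digits, a hyphen, then the adjective ending -a with optional plural -j and accusative -n.
ORDINAL = re.compile(r"\d+-aj?n?")

def is_ordinal(token):
    return ORDINAL.fullmatch(token) is not None

def parts_to_spellcheck(token):
    """Return the pieces that should still go to the spell checker."""
    if is_ordinal(token):
        return []  # ordinals like "20-a" are skipped entirely
    # Other hyphenated words are split so that each part is checked on its own.
    return [part for part in token.split("-") if part]

for token in ["20-a", "123-an", "kafo-tablo"]:
    print(token, is_ordinal(token), parts_to_spellcheck(token))
```

This keeps the behaviour described earlier for “kafo-tablo” (each part is checked separately) while leaving ordinals alone.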