[eo] Frequent false positives with Esperanto ordinal numbers such as "123-a" or "123-an"

Dominique_PELLE · November 12, 2021, 3:23am

The Esperanto spell checker in LanguageTool fairly often gives false positives for ordinal numbers such as “123-a” or “123-an”, which are correct (they are like “123rd” in English).

For example with the correct sentence “La 20-a jarcento.” (= the 20th century), LanguageTool wrongly underlines “a” in “20-a”.

I’m not sure yet how to fix it. I suppose I should be able to create a patch for the Hunspell eo.aff file (languagetool-language-modules/eo/src/main/resources/org/languagetool/resource/eo/hunspell/eo.aff) but I never got familiar enough with the Hunspell syntax.

I see COMPOUNDRULE being used for English ordinal numbers in compoundrule4.aff in the hunspell/hunspell repository on GitHub, but it’s not clear to me yet how it works.

Any idea how it could be fixed?

jaumeortola · November 12, 2021, 7:05am

A possible solution is a disambiguation rule with <disambig action="ignore_spelling"/>.

Dominique_PELLE · November 14, 2021, 11:35pm

Using <disambig action="ignore_spelling"/> looked promising, but I could not make it work. I suppose that it does not work because LT and Hunspell tokenize differently:

$ echo "2-a" | java -jar /home/pel/sb/languagetool/languagetool-standalone/target/LanguageTool-5.6-SNAPSHOT/LanguageTool-5.6-SNAPSHOT/languagetool-commandline.jar -l eo -v
Expected text language: Esperanto
Working on STDIN...
<S> 2-a[2-a/A nak np,</S>]<P/> 
Disambiguator log: 

431 rules activated for language Esperanto
1.) Line 1, column 3, Rule ID: HUNSPELL_RULE
Message: Ebla mistajpaĵo trovita
Suggestion: s; ia; al; la; ja
2-a 
  ^

Notice that for LT, “2-a” is a single token, whereas Hunspell only flags “a” as the typo (the hyphen splits words in Hunspell). So I think I have to tweak the Hunspell *.aff file with COMPOUNDRULE, but the Hunspell documentation is not that great, and so far I have not understood it well enough to do that.
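To make the mismatch concrete, here is a minimal Python sketch of a speller that checks each hyphen-separated part on its own. It is not Hunspell code: the tiny word list and the rule that purely numeric parts are accepted are assumptions chosen to mirror the log above, where only “a” is reported.

```python
# Toy word list standing in for the real Esperanto dictionary (assumption).
TOY_DICTIONARY = {"la", "jarcento", "kafo", "tablo"}


def check_part(part):
    """Accept plain numbers and words found in the toy dictionary."""
    return part.isdigit() or part.lower() in TOY_DICTIONARY


def flagged_parts(token):
    """Split a token on hyphens and return the parts that fail the check."""
    return [p for p in token.split("-") if p and not check_part(p)]
```

With this sketch, “20-a” is flagged only because of its “a” part, while “kafo-tablo” passes because both halves are known words. That matches the behaviour described above: the whole token is visible to LT, but the speller only ever sees the pieces.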

Ruud_Baars · November 15, 2021, 7:09am

I can help with either. If LT itself has this issue, check out the tokenisation on the community site. That will help fix it for LT.

After that, we could look at Hunspell, but that is not trivial. Please contact me directly for that. (Will be away for a few days)

Ruud_Baars · November 15, 2021, 7:11am

You could start by adding - as a word char in the aff file, but beware of the consequences…
‘Words’ that are built using - will no longer be accepted (like scot-free in English), and the same could happen for the verb extensions some languages have.

The best way to test is to take a huge list of Esperanto words (e.g. from books), run it through
hunspell -d eo-EO -a once with each of the two aff files, and compare the two outputs.
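The comparison step could be sketched in Python as follows. This assumes the output of both runs has already been captured as text, and that it follows the ispell-style pipe format in which a line starting with “&” (miss with suggestions) or “#” (miss without suggestions) carries the rejected word as its second field. The function names are made up for this example.

```python
def misspelled_words(pipe_output):
    """Collect the words reported as misses in `hunspell -a` style output."""
    misses = set()
    for line in pipe_output.splitlines():
        # '&' = miss with suggestions, '#' = miss without suggestions.
        if line.startswith(("&", "#")):
            fields = line.split()
            if len(fields) >= 2:
                misses.add(fields[1])
    return misses


def compare_runs(old_output, new_output):
    """Report which words changed status between the two aff files."""
    old = misspelled_words(old_output)
    new = misspelled_words(new_output)
    return {
        "newly_rejected": sorted(new - old),
        "newly_accepted": sorted(old - new),
    }
```

Words under "newly_rejected" are the regressions to watch for after changing the aff file; words under "newly_accepted" are the false positives that went away.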

Dominique_PELLE · November 15, 2021, 7:46am

Yes, I thought of that, but it would introduce other false positives, as hyphens can be used in Esperanto to create words. As explained in PMEG: Helposignoj, in Esperanto you can create words with hyphens to improve readability, like “kafo-tablo”, which is the same as “kafotablo” (= coffee table). So:

for spelling, it’s best to split with hyphens to reduce false spelling errors
for grammar, it’s best to consider “kafo-tablo” as a single word

But then <disambig action="ignore_spelling"/> does not seem to work.

Ruud_Baars · November 15, 2021, 8:25am

The same is true for Dutch. If it is a common construction, it could be added to the compounding rules. If not, LT could be adjusted with this disambig rule, and maybe another one that takes 3 tokens: number, dash and extension.
There are many examples for that kind of rule in the Dutch disambiguation file.

jaumeortola · November 15, 2021, 8:58pm

Another possible solution: extend the Hunspell Java rule for Esperanto and add exceptions there.
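The logic of such an exception could look like the Python sketch below. It is not LanguageTool code and the names are hypothetical; the real change would live in the Java rule. Only “-a” and “-an” come from this thread. Accepting the plural endings “-aj” and “-ajn” as well is an assumption based on the regular Esperanto adjective endings.

```python
import re

# Digits, a hyphen, then an adjective ending: a, aj, an or ajn.
# The plural forms are an assumption, not something confirmed in this thread.
ORDINAL = re.compile(r"\d+-aj?n?")


def is_ordinal(token):
    """Return True for tokens such as '20-a' or '123-an'."""
    return ORDINAL.fullmatch(token) is not None


def tokens_to_spellcheck(tokens):
    """Drop ordinal tokens so they never reach the speller."""
    return [t for t in tokens if not is_ordinal(t)]
```

Because the pattern requires digits before the hyphen, ordinary hyphenated words such as “kafo-tablo” are still passed on to the speller.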