Disambiguation, modification before retrieval

Ruud_Baars · June 15, 2019, 3:10pm

Is it possible to change a word before its postag is retrieved? E.g. change géén into geen? Optional accents are quite normal in Dutch. Removing áá éé óó úú íé etc. before postag retrieval would make the postag list a bit more efficient.
The same goes for the optional hyphen (-): if no postag is found with it, check without.

jaumeortola · June 17, 2019, 2:32pm

Yes, it is possible to do it adding a few lines in the tagger. Something similar is done in the Catalan tagger.
I can do it for you. Tell me exactly all the possible replacements.

Ruud_Baars · June 17, 2019, 4:43pm

The replacements are:

á => a
é => e
ú => u
í => i
ú => u
On the condition that the lookup with the modified word is only done when the lookup with accents does not return any postag.
It can be as simple as just dropping all these accents (no others), no need to check all permutations.

jaumeortola · June 17, 2019, 9:23pm

Done here.
Only for lower case characters (I guess).
áéíóú => aeiou.
Is this correct? You wrote two ú.
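
For readers following along, the fallback discussed here can be sketched roughly as below. This is not the LanguageTool tagger code; it is a minimal Python illustration in which `tag_with_fallback` and the toy `lexicon` dictionary are hypothetical names, and only the lowercase mapping áéíóú => aeiou comes from this thread.

```python
# Lowercase accented vowels and their plain counterparts, as listed in the thread.
ACCENT_MAP = str.maketrans("áéíóú", "aeiou")


def tag_with_fallback(word, lexicon):
    """Return the tags for `word`; retry without accents only if none are found.

    `lexicon` is a hypothetical stand-in for the tagger dictionary:
    a plain dict mapping a word form to a list of postags.
    """
    tags = lexicon.get(word)
    if tags:
        # The accented form is known, so the fallback is never tried.
        return tags
    stripped = word.translate(ACCENT_MAP)
    if stripped != word:
        return lexicon.get(stripped, [])
    return []


# Toy data for illustration only.
lexicon = {"deur": ["ZNW:EKV:DE_"]}
print(tag_with_fallback("déúr", lexicon))
print(tag_with_fallback("xyz", lexicon))
```

Because the accented form is looked up first, a word that already has an entry of its own never reaches the accent-stripping step.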

Ruud_Baars · June 18, 2019, 5:34am

This looks perfect. Thanks.
I will not change the data until after the upcoming release, however, to avoid having to make hasty corrections for unexpected consequences. Just to be safe.

I would need almost the same for spell checking. A bit more complicated, however. The replacement will have to be a regexp replace there, because of the spelling rules.
áá => aa, éé => ee, óó => oo, úú => uu, íé => ie
(and more 2 letter accent combinations)
and
([^aeiou])(á)([^aeiou]) => $1a$3
(for á, é í ó and ú …

Is that also doable?
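
A rough Python sketch of the two replacement steps listed above, for illustration only. The function name `normalize_for_spelling` is hypothetical, the two-letter table contains only the combinations named in this post (the full list is not given), and it assumes óó is meant to become oo.

```python
import re

# Two-letter accented combinations named in the post; the real list is longer.
DOUBLE = {"áá": "aa", "éé": "ee", "óó": "oo", "úú": "uu", "íé": "ie"}
SINGLE = {"á": "a", "é": "e", "í": "i", "ó": "o", "ú": "u"}

DOUBLE_RE = re.compile("|".join(re.escape(pair) for pair in DOUBLE))
# ([^aeiou])(á)([^aeiou]) => $1a$3, generalised to all five accented vowels.
SINGLE_RE = re.compile(r"([^aeiou])([áéíóú])([^aeiou])")


def normalize_for_spelling(word):
    """Apply the two-letter replacements first, then the single-vowel rule."""
    word = DOUBLE_RE.sub(lambda m: DOUBLE[m.group(0)], word)
    return SINGLE_RE.sub(
        lambda m: m.group(1) + SINGLE[m.group(2)] + m.group(3), word
    )


for example in ("háár", "wéé", "kómen", "kán"):
    print(example, "=>", normalize_for_spelling(example))
```

Note that the single-vowel pattern as written requires a character on both sides of the accented vowel, so an accent on the first or last letter of a word is left untouched by this sketch.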

jaumeortola · June 18, 2019, 3:35pm

It can be done in the tagger itself.
Please, provide examples for tagging and spelling, so we can test it properly.

Do you want to keep this code for the upcoming release or do you want me to comment it out?

jaumeortola · June 18, 2019, 3:36pm

íé => ie or íí => ii?

Ruud_Baars · June 19, 2019, 6:50am

You can leave it in, since it works. Checked it on community.languagetool.org using the word ‘déúr’, which is not seen as correct (since it is not in the speller), but tagged correctly.
déúr deur ZNW:EKV:DE_

But it should be accepted by the spellchecker, shouldn’t it?

Examples to test with:

déúr (postag and spellcheck ok)
kómen (postag and spellcheck ok)
háár (postag and spellcheck ok)
kán (postag and spellcheck ok)
ín (postag and spellcheck ok)
wéé (postag and spellcheck ok)

(none of these words are actually in the postag and speller with these accents.)
By the way, there is a ( missing in the regexp replace in the code, I think to create the first group in the regexp…

Ruud_Baars · June 19, 2019, 6:53am

íí is wrong, since there is no vowel ii .

jaumeortola · June 19, 2019, 7:40am

Added tests here.

Note that “wéé” already has a tag in the dictionary (UNSPECIFIED), so no other tag is added.

Ruud_Baars · June 19, 2019, 8:07am

Ah. That is one of the unforeseen consequences…
I am not a fan of tests that include database data in the code. It makes things very inflexible.

And there is still the issue that the tagging database has to be updated to be consistent with all of this.

I will take care of that today.

Ruud_Baars · June 19, 2019, 8:32am

What could I do for you in return for this favor? This means a large reduction of work on the dictionaries for me. Thanks a lot!

Ruud_Baars · June 19, 2019, 8:35am

The next step will be to do (almost) the same for optionally hyphenated words (keuken-deur, where keukendeur is the normal form). I can adapt the code for that myself. (After the freeze…)

Actually, I already added the single line for this. The database will be updated after the freeze (otherwise there will be too many updates for the binaries).

The code is (too) simple, however, since it just removes all dashes. In practice some dashes in a word may be optional, while others are required.
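
One possible way to handle a mix of optional and required dashes, sketched in Python purely as an illustration (this is not the line that was added to the tagger). The helper names and the toy `lexicon` are hypothetical, and the tag for keukendeur is made up for the example. The idea is to try every combination of keeping or dropping each dash, starting with the word as written, instead of removing them all at once.

```python
from itertools import product


def hyphen_variants(word):
    """List every form of `word` with each dash either kept or removed.

    The first variant is the word as written, the last has no dashes at all.
    """
    parts = word.split("-")
    variants = []
    for choice in product(("-", ""), repeat=len(parts) - 1):
        variant = parts[0]
        for separator, part in zip(choice, parts[1:]):
            variant += separator + part
        variants.append(variant)
    return variants


def tag_hyphenated(word, lexicon):
    """Return the tags of the first variant found in the hypothetical lexicon."""
    for variant in hyphen_variants(word):
        tags = lexicon.get(variant)
        if tags:
            return tags
    return []


# Toy data for illustration only; the tag is invented.
lexicon = {"keukendeur": ["ZNW:EKV:DE_"]}
print(hyphen_variants("keuken-deur"))
print(tag_hyphenated("keuken-deur", lexicon))
```

The number of variants doubles with each dash, which should be harmless for ordinary words but is worth keeping in mind for long compounds.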