
Disambiguation, modification before retrieval

Is it possible to change a word before its postag is retrieved? E.g. change géén into geen? Optional accents are quite normal in Dutch. Removing áá éé óó úú íé etc. before postag retrieval would make the postag list a bit more efficient.
The same goes for the optional hyphen (-): if no postag is found with it, check without.

Yes, it is possible to do it by adding a few lines in the tagger. Something similar is done in the Catalan tagger.
I can do it for you. Tell me exactly all the possible replacements.

The replacements are:

á => a
é => e
ú => u
í => i
ú => u
On the condition that the postag is looked up with the modified word only when the lookup with accents does not return any postag.
It can be as simple as just dropping all these accents (no others); there is no need to check all permutations.
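
To make the intended lookup order concrete, here is a minimal sketch in Python. It is not the actual tagger code: the `lexicon` dict, the function name and the sample entry are made up for illustration, and I am assuming the set of accents is the five lower-case vowels á é í ó ú.

```python
# Hypothetical sketch of "tag with accents first, strip accents only if nothing is found".
# Assumption: only the lower-case vowels á é í ó ú are stripped, nothing else.
ACCENTS = str.maketrans("áéíóú", "aeiou")

def tag_with_fallback(word, lexicon):
    """Return (lookup form, tags); the stripped form is used only as a fallback."""
    tags = lexicon.get(word)
    if tags:
        # The word as written is known, so it is never modified.
        return word, tags
    stripped = word.translate(ACCENTS)
    if stripped != word:
        return stripped, lexicon.get(stripped, [])
    return word, []

# Sample data for illustration only.
lexicon = {"deur": ["ZNW:EKV:DE_"]}
print(tag_with_fallback("déúr", lexicon))  # ('deur', ['ZNW:EKV:DE_'])
```

Because the stripped form is tried only after the original lookup fails, a word that is in the postag list with its accents keeps whatever tag it already has.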

Done here.
Only for lower case characters (I guess).
áéíóú => aeiou.
Is this correct? You wrote two ú.

This looks perfect. Thanks.
I will not change the data until after the upcoming release, however, so that I do not have to make hasty corrections for unexpected consequences. Just to be safe.

I would need almost the same for spell checking. It is a bit more complicated, however. The replacement will have to be a regexp replace there, because of the spelling rules.
áá => aa, éé => ee, óó => oo, úú => uu, íé => ie
(and more 2 letter accent combinations)
and
([^aeiou])(á)([^aeiou]) => $1a$3
(for á, é í ó and ú …

Is that also doable?
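
For reference, a rough Python sketch of the two replacement steps as I read them above. This is only an illustration under assumptions, not speller code: the names are invented, the pair list contains only the five pairs written out above (the real list would be longer), and óó is assumed to map to oo in line with the other pairs.

```python
import re

# Two-letter combinations spelled out in the post; more would be needed in practice.
# Assumption: óó -> oo, following the pattern of the other pairs.
PAIRS = {"áá": "aa", "éé": "ee", "óó": "oo", "úú": "uu", "íé": "ie"}
SINGLE = {"á": "a", "é": "e", "í": "i", "ó": "o", "ú": "u"}

PAIR_RE = re.compile("|".join(map(re.escape, PAIRS)))
# A single accented vowel with a non-vowel character on both sides,
# i.e. ([^aeiou])(á)([^aeiou]) generalised to all five vowels.
SINGLE_RE = re.compile(r"([^aeiou])([áéíóú])([^aeiou])")

def strip_emphasis_accents(word):
    """Apply the pair replacements first, then the single-vowel rule."""
    word = PAIR_RE.sub(lambda m: PAIRS[m.group(0)], word)
    return SINGLE_RE.sub(
        lambda m: m.group(1) + SINGLE[m.group(2)] + m.group(3), word
    )

for w in ["háár", "wéé", "kán", "kómen", "ín"]:
    print(w, "=>", strip_emphasis_accents(w))
```

One thing the sketch makes visible: written this way, the single-vowel rule needs a character on both sides of the accented vowel, so a word that starts with the accent (ín) is left unchanged unless the pattern also allows the start and end of the word.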

It can be done in the tagger itself.
Please, provide examples for tagging and spelling, so we can test it properly.

Do you want to keep this code for the upcoming release or do you want me to comment it out?

íé => ie or íí => ii?

You can leave it in, since it works. Checked it on community.languagetool.org using the word ‘déúr’, which is not seen as correct (since it is not in the speller), but tagged correctly.
déúr deur ZNW:EKV:DE_

But it should be accepted by the spellchecker, shouldn’t it?

Examples to test with:

déúr (postag and spellcheck ok)
kómen (postag and spellcheck ok)
háár (postag and spellcheck ok)
kán (postag and spellcheck ok)
ín (postag and spellcheck ok)
wéé (postag and spellcheck ok)

(none of these words are actually in the postag and speller with these accents.)
By the way, there is a ( missing in the regexp replace in the code, I think to create the first group in the regexp…

íí is wrong, since there is no vowel ii .

Added tests here.

Note that “wéé” already has a tag in the dictionary (UNSPECIFIED), so no other tag is added.

Ah. That is one of the unforeseen consequences…
I am not a fan of those tests that include database data in the code. It makes things very inflexible.

And there is still the issue that the tagging database has to be updated to be consistent with all of this.

I will take care of that today.

What could I do for you in return for this favor? This means a large reduction of work on the dictionaries for me. Thanks a lot!

The next step will be to do (almost) the same for optionally hyphenated words (keuken-deur, where keukendeur is the normal form). I can adapt the code for that myself. (After the freeze…)

Actually, I already added the single line for this. The database will be updated after the freeze (otherwise there will be too many updates for the binaries).

The code is (too) simple, however, since it just removes all dashes. In practice some dashes may be optional, while others are required.
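
One possible way around that, sketched in Python purely as an idea and not as what the tagger currently does: instead of removing all dashes at once, try every combination of keeping or dropping each dash and take the first form that has a postag. The function names, the `lexicon` dict and the tags below are hypothetical sample data.

```python
from itertools import product

def hyphen_variants(word):
    """Yield every spelling obtained by keeping or dropping each hyphen.

    The original spelling comes first, the fully de-hyphenated one last.
    """
    parts = word.split("-")
    gaps = len(parts) - 1
    for choice in product(("-", ""), repeat=gaps):
        yield parts[0] + "".join(sep + part for sep, part in zip(choice, parts[1:]))

def tag_with_optional_hyphens(word, lexicon):
    """Return (lookup form, tags) for the first variant found in the lexicon."""
    for candidate in hyphen_variants(word):
        tags = lexicon.get(candidate)
        if tags:
            return candidate, tags
    return word, []

# Sample data for illustration only; the tags are placeholders.
lexicon = {"keukendeur": ["ZNW:EKV:DE_"], "zwart-witfoto": ["ZNW:EKV:DE_"]}
print(tag_with_optional_hyphens("keuken-deur", lexicon))
print(tag_with_optional_hyphens("zwart-wit-foto", lexicon))
```

The number of variants doubles with every dash, which should be fine for ordinary words with one or two dashes but may need a cap for very long compounds.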