Changes in Dutch tagger/synthesizer

dnaber · March 29, 2018, 8:35am

Hi @Ruud_Baars, your recent changes to the tagger made some tests fail, so I’ll comment out the failing tests.

For example, expanding “doorseinen” with tag WKW.* has a different result now:

OLD: doorseine, doorseinenden, doorseinend, doorseinende, doorsein, doorseint, doorseinen, doorseinde, doorseinden, doorgeseind, doorgeseinde

NEW: doorseinend, doorseinende, doorsein, doorseint, doorseinen, doorseinde, doorseinden, doorgeseind, doorgeseinde

Is this expected or is this a bug? Also, “Afro-Surinamer” expanded by ZNW:MRV:DE_ used to be “Afro-Surinamers”, now it doesn’t expand anymore. Have the tags changed so that this is expected?
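To make the question concrete, here is a minimal sketch of what "expanding a lemma by a tag pattern" means. It is not the LanguageTool synthesizer: the class `SynthesizerSketch`, the method `synthesize`, and the in-memory table are hypothetical, and the few rows are taken from the tagger output quoted in this thread. It only shows why a lemma stops expanding once no dictionary row carries a tag matching the requested pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical toy, NOT the LanguageTool synthesizer.
class SynthesizerSketch {

    // Each row: inflected form, lemma, POS tag (rows taken from tagger output in this thread).
    private static final String[][] DICT = {
        {"testen", "test",   "ZNW:MRV:DE_"},
        {"testen", "testen", "WKW:TGW:INF"},
        {"is",     "zijn",   "WKW:TGW:3EP"},
        {"zin",    "zinnen", "WKW:TGW:1EP"},
        {"zin",    "zin",    "ZNW:EKV:DE_"},
    };

    // Return all forms of the lemma whose tag fully matches the regular expression.
    static List<String> synthesize(String lemma, String tagRegex) {
        Pattern tagPattern = Pattern.compile(tagRegex);
        List<String> forms = new ArrayList<>();
        for (String[] row : DICT) {
            if (row[1].equals(lemma) && tagPattern.matcher(row[2]).matches()) {
                forms.add(row[0]);
            }
        }
        return forms;
    }

    public static void main(String[] args) {
        System.out.println(synthesize("test", "ZNW:MRV:DE_"));
        System.out.println(synthesize("zinnen", "WKW.*"));
        // No row with this lemma and tag in the toy table, so the result is empty.
        System.out.println(synthesize("Afro-Surinamer", "ZNW:MRV:DE_"));
    }
}
```

In this toy model, a renamed or removed tag in the dictionary has the same visible effect as a missing word: the list of expanded forms shrinks or becomes empty.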

dnaber · March 29, 2018, 8:48am

One more change:

OLD: Dit/[null]null – is/[is]ZNW:EKV|is/[zijn]WKW:TGW:3EP – een/[een]GET|een/[een]ZNW:EKV:DE_ – Nederlandse/[Nederlandse]ZNW:EKV – zin/[zin]ZNW:EKV:DE_|zin/[zinnen]WKW:TGW:1EP – om/[om]VRZ – het/[null]null – programma/[programma]ZNW:EKV:HET – tje/[null]null – te/[te]VRZ – testen/[test]ZNW:MRV:DE_|testen/[testen]WKW:TGW:INF

NEW: Dit/[null]null – is/[is]ZNW:EKV|is/[zijn]WKW:TGW:3EP – een/[een]GET|een/[een]ZNW:EKV:DE_ – Nederlandse/[Nederlands]BNW:STL:VRB – zin/[zin]ZNW:EKV:DE_|zin/[zinnen]WKW:TGW:1EP – om/[om]VRZ – het/[null]null – programma/[programma]ZNW:EKV:HET – tje/[null]null – te/[te]VRZ – testen/[test]ZNW:MRV:DE_|testen/[testen]WKW:TGW:INF

Ruud_Baars · March 29, 2018, 8:50am

I don’t get those errors…
Are these from the code?
These replacements are a pain in the ~.

Some tags were changed, yes.

dnaber · March 29, 2018, 8:53am

I get them when I run mvn --projects languagetool-language-modules/nl --also-make test. This is more extensive than just testrules.sh. Anyway, are these changes expected because the POS dict has changed or do they point out bugs in the new POS dict?

Ruud_Baars · March 29, 2018, 8:59am

I will have to look into it. In general, the new postag dict is better, and these two examples also indicate an improvement. Nevertheless, there is a structural issue between nouns and adjectives: many adjectives can be used as nouns, just like in German.

These sentences are also very sensitive to improvements in disambiguation. I guess it is better to choose only unambiguous words there.

There is also something strange going on. The tags delivered by the command line tagger are better than those from the test run!

For “programma’s”, I get two tokens in the Java program and just one on the command line, which is much better. Isn’t the Dutch tokenizer used in the program? The tokens also appear to be different!

The command line states:
De[De/null] programmeur[programmeur/ZNW:EKV:DE_] vervaardigt[vervaardigen/WKW:TGW:3EP] het[het/null] programma[programma/ZNW:EKV:HET].
The test reports:

expected:<De[[De/null] – programmeur[programmeur/ZNW:EKV:DE_] – vervaardigt[vervaardigen/WKW:TGW:3EP] – het[het/null] programma[programma/ZNW:EKV:HET]]>

but was:
<De[/[null]null – programmeur/[programmeur]ZNW:EKV:DE_ – vervaardigt/[vervaardigen]WKW:TGW:3EP – het/[null]null – programma/[programma]ZNW:EKV:HET]>

Anyway, I would advise using a different sentence there: De programmeur vervaardigt het programma.
Apparently, there is something going on here that I don’t understand, so I would not know how to adjust it.
Maybe it is even safer not to test a complete sentence…

dnaber · March 29, 2018, 9:43am

I don’t see any difference here (only the format looks a bit different). “programma” is always tagged as ZNW:EKV:HET. I have updated the tests to reflect the current POS tags, so everything should be okay again.
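One way to check that the two outputs only differ in format is to reduce both to token/tag pairs and compare those. The following is a rough sketch under that assumption. The class `TagFormatSketch` and its methods are hypothetical helpers, not part of LanguageTool, and the patterns only cover the single-reading case shown in the quoted sentence (no `|`-separated alternatives).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, NOT part of LanguageTool.
class TagFormatSketch {

    // Command-line style: token[lemma/TAG]
    private static final Pattern CLI =
        Pattern.compile("([^\\s\\[\\]]+)\\[([^/\\]]*)/([^\\]]*)\\]");

    // Test style: token/[lemma]TAG
    private static final Pattern TEST =
        Pattern.compile("([^\\s/\\[\\]]+)/\\[([^\\]]*)\\]([^\\s|]+)");

    static List<String> fromCommandLine(String line) {
        return tokenTagPairs(CLI, line);
    }

    static List<String> fromTest(String line) {
        return tokenTagPairs(TEST, line);
    }

    // Keep only token and tag; the lemma (group 2) is ignored because untagged
    // words are printed differently in the two formats.
    private static List<String> tokenTagPairs(Pattern pattern, String line) {
        List<String> pairs = new ArrayList<>();
        Matcher m = pattern.matcher(line);
        while (m.find()) {
            pairs.add(m.group(1) + "/" + m.group(3));
        }
        return pairs;
    }

    public static void main(String[] args) {
        String cli = "De[De/null] programmeur[programmeur/ZNW:EKV:DE_] "
            + "vervaardigt[vervaardigen/WKW:TGW:3EP] het[het/null] "
            + "programma[programma/ZNW:EKV:HET].";
        // \u2013 is the en dash used as separator in the test output.
        String test = "De/[null]null \u2013 programmeur/[programmeur]ZNW:EKV:DE_ \u2013 "
            + "vervaardigt/[vervaardigen]WKW:TGW:3EP \u2013 het/[null]null \u2013 "
            + "programma/[programma]ZNW:EKV:HET";
        System.out.println(fromCommandLine(cli).equals(fromTest(test)));
    }
}
```

The lemma is left out of the comparison on purpose: for words without a tag, the command line prints the word itself as lemma while the test output prints null, so comparing lemmas would report a difference that is only a formatting detail.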

Ruud_Baars · March 29, 2018, 9:46am

It seems to work fine, but the test routine uses import org.languagetool.tokenizers.WordTokenizer; rather than the Dutch tokenizer. So there are differences when a word like “programma’s” is used, since the default tokenizer splits it into programma and s, while the Dutch tokenizer keeps it together as one word.
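The effect can be illustrated with a small self-contained sketch. These are hypothetical stand-ins, not the LanguageTool tokenizer classes: the only assumption is that the generic tokenizer treats the apostrophe as a delimiter and the Dutch-aware one does not.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical stand-ins, NOT the LanguageTool tokenizer classes.
class TokenizerSketch {

    // Generic splitting: whitespace, basic punctuation and both apostrophe
    // forms (' and \u2019) act as delimiters.
    static List<String> genericTokenize(String text) {
        return split(text, " .,!?'\u2019");
    }

    // Dutch-aware splitting: apostrophes are not delimiters,
    // so a plural like "programma's" stays one token.
    static List<String> dutchTokenize(String text) {
        return split(text, " .,!?");
    }

    private static List<String> split(String text, String delimiters) {
        List<String> tokens = new ArrayList<>();
        // returnDelims = true: delimiters are returned as tokens of their own.
        StringTokenizer st = new StringTokenizer(text, delimiters, true);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "De programma's";
        System.out.println(genericTokenize(text)); // programma, ' and s become separate tokens
        System.out.println(dutchTokenize(text));   // programma's stays together
    }
}
```

Since a tagger looks up whole tokens, the split version can never find an entry for “programma’s” in the dictionary, which would explain why the test and the command line disagreed.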

dnaber · March 29, 2018, 9:59am

I see - I have just changed that.