The rule below is quite simple. It tries to match a noun postag on the token between two common words, a position that usually contains a singular or plural noun. It works fine for words that have one of those tags. But when I feed it a word that only has other tags (which happens in texts in the wild), the result is a strange tag:
<rule name="de_ZNW_van_2" id="DE_ZNW_VAN_2">
  <pattern>
    <token>de</token>
    <marker>
      <token></token>
    </marker>
    <token>van</token>
  </pattern>
  <disambig><match no="1" postag="ZNW.*DE_.*" postag_regexp="yes" /></disambig>
</rule>
java -jar languagetool-commandline.jar -l nl -t
Expected text language: Dutch
Dit is de kindje van mij.
Working on STDIN…
Dit[Dit/null] is[is/ZNW:EKV,zijn/WKW:TGW:3EP] de[de/null] kindje[kind/ZNW.DE_.] van[van/VRZ,van/ZNW:EKV] mij[mij/null].
ZNW.DE_. was the filter, and it has now ended up as the tag, even though the word had different tags assigned to it. I think the result should be null/UNKNOWN.
Don’t you think so?
Somehow, a filterall rule behaves differently.
<rule id="A" name="a">
  <pattern>
    <marker>
      <token>de</token>
      <token postag="ZNW.*DE_.*" postag_regexp="yes"/>
      <token>van</token>
    </marker>
  </pattern>
  <disambig action="filterall"/>
</rule>
This should do the same, I guess. But it does not:
Dit[Dit/null] is[is/ZNW:EKV,zijn/WKW:TGW:3EP] de[de/null] kindje[kind/ZNW:EKV:VRK:HET] van[van/VRZ,van/ZNW:EKV] a[a/ZNW:EKV:DE_]
It leaves the existing tag in place, even though that tag does not match the pattern.
Furthermore, when I add a postag to a word (with or without a lemma) that already has this tag, the word ends up with two identical tags. This is not a real problem, but it is not correct either, is it?
What am I doing wrong, or what is the disambiguation doing wrong?
What I am actually looking for is to drop all tags and add a new one on a word within a pattern.
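To make that goal concrete, here is roughly the kind of rule I have in mind. It is an untested sketch: it assumes that action="replace" with a postag attribute on disambig is the right mechanism for this, and the rule id, name, and the tag ZNW:EKV:DE_ are only placeholders for illustration.

```xml
<!-- Sketch only: id, name and the replacement tag are placeholders. -->
<rule id="DE_ZNW_VAN_REPLACE" name="de_ZNW_van_replace">
  <pattern>
    <token>de</token>
    <marker>
      <!-- any token between "de" and "van" -->
      <token></token>
    </marker>
    <token>van</token>
  </pattern>
  <!-- intended effect: drop every existing reading of the marked token
       and leave only this single tag -->
  <disambig action="replace" postag="ZNW:EKV:DE_"/>
</rule>
```

With the input "Dit is de kindje van mij." I would hope to see only the one new tag on "kindje", whatever tags it had before.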