Disambiguator

Ruud_Baars · August 12, 2021, 1:38pm

The postag combination
WKW:TGW:1EP followed by WKW:TGW:1EP
is very unlikely.
When both words also have another postag, I want to remove this combination, i.e. drop the WKW:TGW:1EP reading from both words.

Can that be done?

jaumeortola · August 12, 2021, 2:44pm

It is possible with two rules:

<!-- Rule 1: remove WKW:TGW:1EP from the first word of the pair -->
<rule>
    <pattern>
        <marker>
            <and>
                <token postag="WKW:TGW:1EP"/>
                <token postag="WKW:TGW:1EP" negate_pos="yes"/>
            </and>
        </marker>
        <and>
            <token postag="WKW:TGW:1EP"/>
            <token postag="WKW:TGW:1EP" negate_pos="yes"/>
        </and>
    </pattern>
    <disambig action="remove" postag="WKW:TGW:1EP"/>
</rule>
<!-- Rule 2: remove WKW:TGW:1EP from the second word of the pair -->
<rule>
    <pattern>
        <and>
            <token postag="WKW:TGW:1EP"/>
            <token postag="WKW:TGW:1EP" negate_pos="yes"/>
        </and>
        <marker>
            <and>
                <token postag="WKW:TGW:1EP"/>
                <token postag="WKW:TGW:1EP" negate_pos="yes"/>
            </and>
        </marker>
    </pattern>
    <disambig action="remove" postag="WKW:TGW:1EP"/>
</rule>

Ruud_Baars · August 12, 2021, 3:01pm

I think this is not the same. The first rule could remove the postag that is tested for in the second one… (I guess)
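
To make the concern concrete, here is a minimal Python sketch (not LanguageTool code). It assumes the rules run one after another in file order, each seeing the output of the previous one. The readings `OTHER_A` and `OTHER_B` and the helper names are made up for illustration.

```python
TAG = "WKW:TGW:1EP"

def has_tag_and_other(readings):
    # Mirrors the <and> block: the word has TAG and at least one other reading.
    return TAG in readings and len(readings) > 1

def apply_rule(tokens, offset):
    # Hypothetical model of one rule: when two adjacent words both match,
    # drop TAG from the word at `offset` (0 = first word, 1 = second word).
    tokens = [set(t) for t in tokens]  # work on a copy
    for i in range(len(tokens) - 1):
        if has_tag_and_other(tokens[i]) and has_tag_and_other(tokens[i + 1]):
            tokens[i + offset].discard(TAG)
    return tokens

# Two adjacent words, each with TAG plus a made-up alternative reading.
sentence = [{TAG, "OTHER_A"}, {TAG, "OTHER_B"}]

after_rule_1 = apply_rule(sentence, 0)      # first word loses TAG
after_rule_2 = apply_rule(after_rule_1, 1)  # no match any more: first word lacks TAG

print(TAG in after_rule_2[0], TAG in after_rule_2[1])  # False True
```

Under these assumptions the second word keeps its WKW:TGW:1EP reading, so the pair is only half cleaned up.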

Ruud_Baars · August 12, 2021, 3:06pm

It would be great to be able to filter out any unlikely sequence of postags.
Something like:

    <rule>
      <pattern>
        <token postag="pos1"/>
        <token postag="pos2"/>
        <token postag="pos3"/>
      </pattern>
      <disambig action="remove"><wd pos="pos1"/><wd pos="pos2"/><wd pos="pos3"/></disambig>
    </rule>

(or better still, in shorthand in a file ‘unlikelypostagarrays.txt’)
pos1 pos2 pos3
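
A shorthand file like this could be expanded mechanically into rules of the form sketched above. The following Python sketch shows one way to do it. The function names are hypothetical, the format of one whitespace-separated sequence per line is an assumption, and the generated `<wd pos="..."/>` form is only the syntax proposed here, not something confirmed to work in LanguageTool.

```python
from xml.sax.saxutils import quoteattr

def rule_from_line(line):
    # One line = one unlikely sequence of postags, separated by whitespace (assumed format).
    tags = line.split()
    tokens = "\n".join(f"    <token postag={quoteattr(tag)}/>" for tag in tags)
    wds = "".join(f"<wd pos={quoteattr(tag)}/>" for tag in tags)
    return (
        "<rule>\n"
        "  <pattern>\n"
        f"{tokens}\n"
        "  </pattern>\n"
        f'  <disambig action="remove">{wds}</disambig>\n'
        "</rule>"
    )

def rules_from_text(text):
    # Skip blank lines; every other line becomes one rule.
    return [rule_from_line(line) for line in text.splitlines() if line.strip()]

for rule in rules_from_text("pos1 pos2 pos3\n"):
    print(rule)
```

Keeping the sequences in a plain text file would make the list easy to review and extend without touching the XML by hand.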

jaumeortola · August 12, 2021, 3:10pm

You are right. This is a problem.

Ruud_Baars · August 12, 2021, 3:22pm

I can add values to the example:

<antipattern><token postag="WKW:TGW:1EP"/><token postag="WKW:TGW:1EP"/></antipattern> <!-- score: 35785.81888886987871956080 -->
    <!-- onderzoek verkloot --> : neither is WKW:TGW:1EP
    <!-- word belast --> : grammar mistake
    <!-- beter begrijp --> : first is not WKW:TGW:1EP
    <!-- verwacht haar --> : second is not WKW:TGW:1EP
    <!-- vlak veel --> : et cetera...
    <!-- fruit haar -->
    <!-- woon jij -->
    <!-- open water -->
    <!-- veel vet -->
    <!-- welk gebaar -->
    <!-- uitstel net -->
    <!-- win jij -->
    <!-- heel ijl -->
    <!-- proces mag -->
    <!-- jong paar -->
    <!-- beroep lang -->
    <!-- nuttig gevoel -->
    <!-- jouw potlood -->
    <!-- boek zie -->
    <!-- stuur kun -->

Ruud_Baars · August 12, 2021, 4:44pm

Maybe removing all of them is a bit too crude. I will have to think this over a bit more. Maybe a filterall with longer, better patterns is a better method.