Again, reverse lookup for postags

Ruud_Baars · September 13, 2018, 2:33pm

Could someone please describe how LT does the reverse lookup for a matching tag? I keep having trouble getting some rules to work.

1. A word is matched.
2. Its postag is stored.
3. The replacement word is stored.
4. The root word is fetched from the postag dictionary.
5. The alternative word is looked up in the postag dictionary.
6. The child words of this root word are filtered by the regexp, and the items matching the stored postag are offered.

Is this a correct interpretation?
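To make the question concrete, here is a minimal Python sketch of the steps as listed above. It models only this description, not LT's actual implementation; the names `ENTRIES`, `tags_of`, `forms_of` and `suggest` are hypothetical, and the mini-dictionary is a hand-picked sample in the usual "form lemma postag" layout.

```python
import re

# Hypothetical mini-dictionary: (word form, root/lemma, postag).
ENTRIES = [
    ("gebeurt", "gebeuren", "WKW:TGW:3EP"),
    ("gebeurde", "gebeuren", "WKW:VLT:1EP"),
    ("gebeuren", "gebeuren", "WKW:TGW:INF"),
    ("geschiedt", "geschieden", "WKW:TGW:3EP"),
    ("geschiedde", "geschieden", "WKW:VLT:1EP"),
]


def tags_of(word):
    """All postags the dictionary lists for a word form."""
    return [tag for form, _, tag in ENTRIES if form == word]


def forms_of(lemma):
    """All (form, postag) pairs that belong to a root word."""
    return [(form, tag) for form, root, tag in ENTRIES if root == lemma]


def suggest(matched_word, replacement_lemma, postag_pattern):
    # Steps 1-2: take the matched word and keep its tags that fit the regexp.
    wanted = [t for t in tags_of(matched_word) if re.fullmatch(postag_pattern, t)]
    # Steps 3-6: look up the forms of the replacement root and offer
    # those whose tag equals one of the stored tags.
    return [form for form, tag in forms_of(replacement_lemma) if tag in wanted]


print(suggest("geschiedt", "gebeuren", r"WKW:.*"))  # ['gebeurt']
```

In this model a word with exactly one matching tag gives exactly one suggestion; what happens with several matching tags is left open here.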

dnaber · September 13, 2018, 6:20pm

It might be easier to help if you describe a simple example of a case that doesn’t work.

Ruud_Baars · September 14, 2018, 5:30am

This is the rule:

    <rulegroup id="GESCHIEDEN" name="geschieden">
        <rule>
            <antipattern><token>uw</token><token>wil</token><token>geschiede</token></antipattern>
            <pattern>
                <token inflected="yes">geschieden</token>
            </pattern>
            <message>De tekst wordt vlotter als je deze herschrijft met <suggestion><match no="1" postag="WKW:.*" postag_regexp="yes">gebeuren</match></suggestion>.</message>
            <url>https://onzetaal.nl/taaladvies/ouderwets-taalgebruik</url>
            <example correction="gebeurde">En zo <marker>geschiedde</marker>, want voor Jan is een woord een woord.</example>
            <example correction="gebeuren">Ook dit helpen dient beheerst te <marker>geschieden</marker>.</example>
            <example correction="gebeurt">Openen <marker>geschiedt</marker> door het duwen tegen de horizontale balk.</example>
            <example correction="gebeurde">De controle <marker>geschiedde</marker> door de pastoor.</example>
        </rule>
    </rulegroup>

And the dictionary entries for both roots are:

    gebeur gebeuren WKW:TGW:1EP
    gebeurde gebeuren WKW:VLT:1EP
    gebeurden gebeuren WKW:VLT:INF
    gebeuren gebeuren WKW:TGW:INF
    gebeurend gebeuren WKW:ODW:ONV
    gebeurende gebeuren WKW:ODW:VRB
    gebeurt gebeuren WKW:TGW:3EP
    gebeurd gebeuren WKW:VTD:ONV
    gebeurde gebeuren WKW:VTD:VRB
    gebeurden gebeuren WKW:VTD:MRV:DE_

    geschied geschieden WKW:TGW:1EP
    geschied geschieden WKW:VTD:ONV
    geschiedde geschieden WKW:VLT:1EP
    geschiedden geschieden WKW:VLT:INF
    geschiede geschieden WKW:AWS
    geschiede geschieden WKW:VTD:VRB
    geschieden geschieden WKW:VTD:MRV:DE_
    geschieden geschieden WKW:TGW:INF
    geschiedend geschieden WKW:ODW:ONV
    geschiedende geschieden WKW:ODW:VRB
    geschiedt geschieden WKW:TGW:3EP

I don’t see what is wrong.

This rule will be commented out in the grammar.xml I will be committing today, together with the revised dictionaries.

jaumeortola · September 14, 2018, 7:11am

You have to synthesize the suggestion this way:

    <suggestion><match no="1" postag="(WKW:.*)" postag_regexp="yes" postag_replace="$1">gebeuren</match></suggestion>

In postag="(WKW:.*)" you select the token you want from the original sentence (there can be more than one postag), and in postag_replace="$1" you state what you want to synthesize. In this case you keep the whole original postag, but you can also change it completely or partially.
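The select-and-replace idea can be sketched in Python as an ordinary regex substitution on the tag string. This is only an illustration of the mechanism described above, not LT code: `replace_postag` is a hypothetical helper, and the second call (turning a `VLT` tag into a `TGW` tag) is a made-up example of a partial change.

```python
import re


def replace_postag(tag, pattern, replacement):
    """Return the tag to synthesize, or None if the tag does not match.

    `replacement` uses $1-style group references as in the rule XML;
    they are converted to Python's \\g<1> syntax before expanding.
    """
    match = re.fullmatch(pattern, tag)
    if match is None:
        return None
    template = re.sub(r"\$(\d+)", lambda m: rf"\g<{m.group(1)}>", replacement)
    return match.expand(template)


# Keep the whole original tag.
print(replace_postag("WKW:VLT:1EP", r"(WKW:.*)", "$1"))
# Change the tag partially (hypothetical: keep the ending, swap the tense part).
print(replace_postag("WKW:VLT:1EP", r"WKW:VLT:(.*)", "WKW:TGW:$1"))
```

A tag that does not match the pattern yields `None` in this sketch, which corresponds to a reading that is simply not selected.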

Ruud_Baars · September 14, 2018, 7:29am

I tried this, but it does not help. There must be more. Feel free to edit and test…

Strangely enough, it works for the other forms of ‘geschieden’, like ‘geschiedt’.

@dnaber: I think this could be a bug; geschieden is in the dictionary as a WKW.* more than once. This is not very common, but also not rare. Could it be that LT is not picking the correct root?

I guess something is needed to make sure the right root is selected; the tag of the matched word is the one to be used (after the regexp replace) to get the right derivative.

Or am I overlooking something?
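The ambiguity itself can be shown with a small Python sketch over a subset of the dictionary lines from this thread. It does not claim anything about how LT picks a reading; `load` and `candidates` are hypothetical helpers that just list, per tag of the matched word, which forms of the replacement root carry the same tag.

```python
from collections import defaultdict

# Subset of the dictionary entries quoted earlier in the thread.
DICTIONARY = """\
gebeurden gebeuren WKW:VLT:INF
gebeuren gebeuren WKW:TGW:INF
gebeurt gebeuren WKW:TGW:3EP
gebeurden gebeuren WKW:VTD:MRV:DE_
geschieden geschieden WKW:VTD:MRV:DE_
geschieden geschieden WKW:TGW:INF
geschiedt geschieden WKW:TGW:3EP
"""


def load(text):
    readings = defaultdict(list)  # form -> [(root, tag), ...]
    forms = defaultdict(list)     # (root, tag) -> [form, ...]
    for line in text.splitlines():
        form, root, tag = line.split()
        readings[form].append((root, tag))
        forms[(root, tag)].append(form)
    return readings, forms


def candidates(word, new_root, readings, forms):
    """Map each tag of `word` to the forms of `new_root` with that tag."""
    return {tag: forms.get((new_root, tag), []) for _, tag in readings[word]}


readings, forms = load(DICTIONARY)
print(candidates("geschiedt", "gebeuren", readings, forms))
print(candidates("geschieden", "gebeuren", readings, forms))
```

For ‘geschiedt’ there is one tag and one candidate (‘gebeurt’). For ‘geschieden’ there are two tags, leading to ‘gebeurden’ and ‘gebeuren’ respectively, so the outcome depends on which reading is used.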