[pt] Creating disambiguator rules

marcoagpinto · September 22, 2024, 12:42pm

You are the right person to help me with this, since you have been working on the disambiguator for a very long time.

After the official release of LanguageTool on Friday, I want to start improving the disambiguator as much as possible.

My trick is to use “dummy” rules, placed in my local style.xml, to check the results of the code I am going to implement in the disambiguator.

This will be my first attempt at a high-impact rule:

<rulegroup id='MARCOAGPINTO_DIS_UNIVERSIDADES' name="Disambiguator: Universidades">

    <rule>
        <pattern>
            <token case_sensitive='yes' regexp='yes'>Universidade|Instituto|Escola|Academia</token>
            <token min='0' case_sensitive='yes' regexp='yes'>Estadual|Estatal|Federal|Militar|Politécnic[ao]|Profissional|Superior|Técnic[ao]|Universitári[ao]|Veterinári[ao]</token>
            <token regexp='yes'>d[ao]s?|de</token>
            <token postag='NP.+' postag_regexp='yes'>
                <exception scope='next' postag_regexp='yes' postag='NP.+'/> <!-- More than one NP.+ must be handled via multiwords.txt -->
            </token>
        </pattern>
        <message>Regra a testar o Disambiguator.</message>
        <short>Testar o Disambiguator.</short>
        <example correction=''><marker>Universidade Superior de Lisboa</marker></example>
    </rule>

</rulegroup>

and here are the results:

Portuguese (Portugal): 898 total matches
Portuguese (Portugal): 854003 total sentences considered

4.txt (422.1 KB)

My biggest problem is that I don’t know how to do it properly yet.

I was looking at the code and found a rule created by Tiago many years ago:

    <rule>
      <pattern>
        <token inflected='yes'>levar</token>
        <token>a</token>
        <marker>
          <token>cabo</token>
        </marker>
      </pattern>
      <disambig action="replace" postag="RG"/>
    </rule>

But how do I make it replace the readings of the whole matched sequence with a single proper-noun (NP) tag that takes its gender and number from the first token, based on my dummy rule?

I wanted to do something like this:
<match no="1" postag="NC(..)000" postag_replace='NP$1000' postag_regexp="yes"/>

Can this be done in the disambiguator?
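To make the question concrete, here is a rough, untested sketch of the kind of disambiguator rule I have in mind. It assumes that <disambig> accepts a <match> child with postag_replace, the same way grammar rules do in suggestions. The rule id and name are placeholders I made up, and I have not checked how tokens are numbered for `no` in the disambiguator. As written, it would only retag the first token, not the whole sequence.

```xml
<!-- Untested sketch: id and name are hypothetical placeholders -->
<rule id="DIS_UNIVERSIDADES_SKETCH" name="Universidades: first token as NP">
    <pattern>
        <!-- Only the marked token would get the new tag -->
        <marker>
            <token case_sensitive="yes" regexp="yes">Universidade|Instituto|Escola|Academia</token>
        </marker>
        <token min="0" case_sensitive="yes" regexp="yes">Estadual|Estatal|Federal|Militar|Politécnic[ao]|Profissional|Superior|Técnic[ao]|Universitári[ao]|Veterinári[ao]</token>
        <token regexp="yes">d[ao]s?|de</token>
        <token postag="NP.+" postag_regexp="yes">
            <exception scope="next" postag_regexp="yes" postag="NP.+"/>
        </token>
    </pattern>
    <!-- Assumption: copy gender and number from the NC tag into an NP tag -->
    <disambig>
        <match no="1" postag="NC(..)000" postag_regexp="yes" postag_replace="NP$1000"/>
    </disambig>
</rule>
```

If something along these lines is valid, the remaining question is how to extend the retagging from the first token to the whole matched sequence.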

My idea is to have most universities tagged as NP, which would free up the multiwords.txt file and at the same time produce far more valid results than multiwords.txt alone.

I restricted the rule to only accept one NP (exception scope=next) to make it more accurate.

Thanks!