Heya, @jaumeortola,
You are the right person to help me with this, since you have been working on the disambiguator for a very long time.
After the official release of LanguageTool on Friday, I want to start improving the disambiguator as much as possible.
My trick is to first write “dummy” rules in my local style.xml and check what they match, before implementing the same patterns in the disambiguator.
This will be my first attempt at a high-impact rule:
<rulegroup id='MARCOAGPINTO_DIS_UNIVERSIDADES' name="Disambiguator: Universidades">
  <rule>
    <pattern>
      <token case_sensitive='yes' regexp='yes'>Universidade|Instituto|Escola|Academia</token>
      <token min='0' case_sensitive='yes' regexp='yes'>Estadual|Estatal|Federal|Militar|Politécnic[ao]|Profissional|Superior|Técnic[ao]|Universitári[ao]|Veterinári[ao]</token>
      <token regexp='yes'>d[ao]s?|de</token>
      <token postag='NP.+' postag_regexp='yes'>
        <exception scope='next' postag_regexp='yes' postag='NP.+'/> <!-- More than one NP.+ must be handled via multiwords.txt -->
      </token>
    </pattern>
    <message>Regra a testar o Disambiguator.</message>
    <short>Testar o Disambiguator.</short>
    <example correction=''><marker>Universidade Superior de Lisboa</marker></example>
  </rule>
</rulegroup>
And here are the results:
Portuguese (Portugal): 898 total matches
Portuguese (Portugal): 854003 total sentences considered
4.txt (422.1 KB)
My biggest problem is that I don’t yet know how to turn this into a proper disambiguator rule.
I was looking at the code and found a rule created by Tiago many years ago:
<rule>
  <pattern>
    <token inflected='yes'>levar</token>
    <token>a</token>
    <marker>
      <token>cabo</token>
    </marker>
  </pattern>
  <disambig action="replace" postag="RG"/>
</rule>
But how do I make it replace the tags of the whole matched phrase with a single proper-noun tag that keeps the gender and number of the first token in my dummy rule?
I wanted to do something like this:
<match no="1" postag="NC(..)000" postag_replace='NP$1000' postag_regexp="yes"/>
Can this be done in the disambiguator?
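To make the question concrete, here is an untested sketch of the kind of disambiguator rule I have in mind. It rests on several assumptions: that <disambig> accepts a <match> child with postag_replace (which is exactly what I am asking), that no="1" is the right index for the first token, and that the marker limits the retagging to that token. The rulegroup id and name are placeholders. Note that it would only retag the head noun (e.g. NCFS000 to NPFS000), not merge the whole phrase into one NP.
<rulegroup id="DIS_UNIVERSIDADES_SKETCH" name="Universidades as NP (sketch, placeholder id)">
  <rule>
    <pattern>
      <marker>
        <token case_sensitive='yes' regexp='yes'>Universidade|Instituto|Escola|Academia</token>
      </marker>
      <token min='0' case_sensitive='yes' regexp='yes'>Estadual|Estatal|Federal|Militar|Politécnic[ao]|Profissional|Superior|Técnic[ao]|Universitári[ao]|Veterinári[ao]</token>
      <token regexp='yes'>d[ao]s?|de</token>
      <token postag='NP.+' postag_regexp='yes'>
        <exception scope='next' postag_regexp='yes' postag='NP.+'/>
      </token>
    </pattern>
    <!-- Assumption: <match> is allowed here and copies gender/number via $1 -->
    <disambig><match no="1" postag="NC(..)000" postag_regexp="yes" postag_replace="NP$1000"/></disambig>
  </rule>
</rulegroup>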
My idea is to have most universities tagged as NP, which would free up the multiwords.txt file and at the same time catch far more valid cases than multiwords.txt alone.
I restricted the rule to accept only one NP (via the exception with scope='next') to make it more accurate.
Thanks!