[pt] Disambiguation improvements

Hello @jaumeortola,

I have been trying to fix an issue with letter openings, so that LanguageTool doesn’t suggest adding “o” at the start of salutations like these:

Meus amigos,
Meu querido amigo,
Meus queridos amigos,
Meu querido amigo Luís,

I have coded:

  <antipattern>
    <token postag='SENT_START'/>
    <token postag='DP.+' postag_regexp='yes'/>
    <token min="0" max="1" postag='AQ.+' postag_regexp='yes'/>
    <token min="1" max="2" postag='NC.+|NP.+' postag_regexp='yes'/>
    <token regexp='yes'>[,]</token>
  </antipattern>

However, it triggers a false positive:

Meus queridos amigos,
Meu querido amigo,

The two sentences get different POS tags:
https://community.languagetool.org/analysis/index?lang=pt

Is it possible for you to improve the disambiguation, or is it too hard?

Thanks!

EDIT:
I have just improved the rule slightly:
<token postag='DP1.+' postag_regexp='yes'/>
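
With that change applied, the full antipattern would read roughly as below. This is only a sketch: the second token is the sole difference from the version above, and reading DP1 as "first-person possessive determiner" is an assumption about the Portuguese tagset, not something confirmed in this thread.

```xml
<antipattern>
  <token postag='SENT_START'/>
  <!-- Assumption: DP1.+ limits the match to first-person possessives (meu, minha, ...) -->
  <token postag='DP1.+' postag_regexp='yes'/>
  <!-- Optional adjective, e.g. "querido" -->
  <token min="0" max="1" postag='AQ.+' postag_regexp='yes'/>
  <!-- One or two nouns / proper nouns, e.g. "amigo Luís" -->
  <token min="1" max="2" postag='NC.+|NP.+' postag_regexp='yes'/>
  <token regexp='yes'>[,]</token>
</antipattern>
```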

@jaumeortola

Why doesn’t it work with two lines of text?

  <antipattern>
    <token postag='SENT_START'/>
    <token regexp='yes'>meus?|minhas?</token>
    <token min="0" max="1" postag='AQ.+' postag_regexp='yes'/>
    <token min="1" max="2" postag='NC.+|NP.+' postag_regexp='yes'/>
    <token postag='SENT_END'/>
    <token regexp='yes'>[,]</token>
  </antipattern>

Meu Irmão,

Estou a escrever esta carta para lhe desejar um Feliz Natal.

SENT_END is not a separate token: it is a tag on the last actual token of the sentence, so it belongs on the comma itself: <token postag='SENT_END' regexp='yes'>[,]</token>
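
Applied to the antipattern from the question, that would look something like the following. It is an untested sketch, and it assumes the sentence splitter ends the sentence at the comma because of the line break after the salutation.

```xml
<antipattern>
  <token postag='SENT_START'/>
  <token regexp='yes'>meus?|minhas?</token>
  <token min="0" max="1" postag='AQ.+' postag_regexp='yes'/>
  <token min="1" max="2" postag='NC.+|NP.+' postag_regexp='yes'/>
  <!-- The comma is the last real token, so SENT_END is checked on it
       instead of on a separate token after it -->
  <token postag='SENT_END' regexp='yes'>[,]</token>
</antipattern>
```

The standalone <token postag='SENT_END'/> before the comma is dropped because it would require an extra token between the noun and the comma, which never exists in "Meu Irmão,".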

You can use a more general antipattern:

<antipattern>
    <token postag='SENT_START'/>
    <token postag='DP.+' postag_regexp='yes'/>
    <token min="1" max="3" postag='AQ.+|NC.+|NP.+' postag_regexp='yes'/>
    <token regexp='yes'>[,]</token>
</antipattern>

I think this would be good enough.

I am testing it against a 200 000 corpus to see the results and enhance it a bit.

Thanks!

@jaumeortola

I have implemented it:

I had to make it stricter because of false positives, so it now applies only to the opening salutation of a letter addressed to someone.

Thanks!
