Hello @jaumeortola,
I have been trying to fix an issue with letter openings, so that the rule doesn’t suggest adding “o” at the start of lines such as:
Meus amigos,
Meu querido amigo,
Meus queridos amigos,
Meu querido amigo Luís,
I have coded:
<antipattern>
<token postag='SENT_START'/>
<token postag='DP.+' postag_regexp='yes'/>
<token min="0" max="1" postag='AQ.+' postag_regexp='yes'/>
<token min="1" max="2" postag='NC.+|NP.+' postag_regexp='yes'/>
<token regexp='yes'>[,]</token>
</antipattern>
However, it triggers a false positive:
Meus queridos amigos,
Meu querido amigo,
The two sentences get different POS tags:
https://community.languagetool.org/analysis/index?lang=pt
Is it possible for you to improve the disambiguation, or is it too hard?
Thanks!
EDIT:
I have just improved the rule slightly:
<token postag='DP1.+' postag_regexp='yes'/>
@jaumeortola
Why doesn’t it work with two lines of text?
<antipattern>
<token postag='SENT_START'/>
<token regexp='yes'>meus?|minhas?</token>
<token min="0" max="1" postag='AQ.+' postag_regexp='yes'/>
<token min="1" max="2" postag='NC.+|NP.+' postag_regexp='yes'/>
<token postag='SENT_END'/>
<token regexp='yes'>[,]</token>
</antipattern>
Meu Irmão,
Estou a escrever esta carta para lhe desejar um Feliz Natal.
SENT_END is not a separate token; it is a tag on the last actual token of the sentence: <token postag='SENT_END' regexp='yes'>[,]</token>
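For illustration, here is how the two-line antipattern from the question might look with the SENT_END tag moved onto the comma token. This is only a sketch: it assumes the comma really is the last token of the first sentence when the greeting is followed by a line break, and it has not been run against the Portuguese rule set.

```xml
<antipattern>
  <token postag='SENT_START'/>
  <token regexp='yes'>meus?|minhas?</token>
  <token min="0" max="1" postag='AQ.+' postag_regexp='yes'/>
  <token min="1" max="2" postag='NC.+|NP.+' postag_regexp='yes'/>
  <!-- SENT_END is matched as a tag on the comma itself, not as an extra token -->
  <token postag='SENT_END' regexp='yes'>[,]</token>
</antipattern>
```

With this form, the antipattern should only match when the comma ends the sentence, as in "Meu Irmão," followed by a new line.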
You can use a more general antipattern:
<antipattern>
<token postag='SENT_START'/>
<token postag='DP.+' postag_regexp='yes'/>
<token min="1" max="3" postag='AQ.+|NC.+|NP.+' postag_regexp='yes'/>
<token regexp='yes'>[,]</token>
</antipattern>
I think this would be good enough.
I am testing it against a 200 000 corpus to see the results and enhance it a bit.
Thanks!
@jaumeortola
I have implemented it:
I had to make it stricter because of false positives, so it now applies only to the opening line of a letter addressed to someone.
Thanks!