[pt] POSTAG bug hunting thread

POSTAGs are a feature of the morphological dictionary that catalogues the words by their use in the language and tags them in relation to one another or to a common base form. Some rules are based on these tags instead of words, so they can apply automatically to all words (or word relations) of one group.

All datasets have errors, and this one is no exception. The most widespread POSTAG errors have been found and fixed, but there are still a few hundred left to find. That is not many (considering the hundreds of thousands of POSTAGs), but systematically tracking down every misbehaving item is too daunting a task, so any error report of this kind is welcome.

If you find one, or you have a rule that should be working and is not:

  • Paste the phrase into the text analyser (Text Analysis - LanguageTool);
  • Click ‘Show analysis’;
  • Verify in your dictionary that the form does not exist (or that it is in fact missing);
  • Post here like this: word (token); base form (Lemma); wrong tag (abbreviated under Part-of-speech).

Cheers!

Hello @tiagosantos

Maybe I have found another mistake in the morphological dictionary?
puseste pôr VMIS2P0 VMIS2S0
http://community.languagetool.org/analysis/analyzeText
The analysis tags it as both P0 (plural) and S0 (singular).

Great find Marco.
puseste pôr VMIS2P0 (just the plural, right?)
I will add later in the daily dictionary fix commit, as well as other improvements to yesterday’s rules commit.

Yes, just the plural.

Tiago, how do you convert:
SPS00|VMN0000|VMIP1S0|VMIP2S0|VMIP3S0|VMIP1P0|VMIP2P0|VMIP3P0|VMIS1S0|VMIS2S0|VMIS3S0|VMIS1P0|VMIS2P0|VMIS3P0
into [bla blah][blah blah]?

I tried several approaches but the stand-alone tool doesn’t recognise them:
SPS00|VMN0000|VM[IP1][IP2][IP3]S0|VM[IP1][IP2][IP3]P0|VM[IS1][IS2][IS3]S0|VM[IS1][IS2][IS3]P0
Verb “pôr”.
The first is “para”, “até”, etc.
The second is the infinitive of a verb.
Rule: “ai” > “aí”

Is there a simple way of addressing all forms of a verb without typing each one by hand?

When you write [ ], each character inside the brackets is one option for a single position. While porting Spanish rules I also found that the “.” (which stands for any single character) is a great simplification.
For example, suppose you want all verb forms that are in the indicative and only singular. You can write something like:
V.I..S. (one dot for each position you do not care about) and that would work.
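
To make the difference between brackets and dots concrete, here is a small Python sketch. It is only an illustration: it uses Python's `re` module as a stand-in for whatever regex engine the tool uses, it assumes a tag has to match the pattern as a whole (hence `re.fullmatch`), and the helper name `matching` is made up for this example. The tags are the seven-character indicative forms quoted in the question.

```python
import re

# Tags copied from the question above (present and past indicative forms).
tags = [
    "VMIP1S0", "VMIP2S0", "VMIP3S0", "VMIP1P0", "VMIP2P0", "VMIP3P0",
    "VMIS1S0", "VMIS2S0", "VMIS3S0", "VMIS1P0", "VMIS2P0", "VMIS3P0",
]


def matching(pattern, candidates):
    """Return the candidates that the pattern matches in full (hypothetical helper)."""
    return [t for t in candidates if re.fullmatch(pattern, t)]


# Each [...] stands for exactly ONE character taken from the set, so
# [IP1][IP2][IP3] means "three characters", not "IP1 or IP2 or IP3".
bracket_attempt = matching(r"VM[IP1][IP2][IP3]S0", tags)

# One dot per position we do not care about: type, tense, person, gender.
singular_indicative = matching(r"V.I..S.", tags)

print(bracket_attempt)
print(singular_indicative)
```

Under these assumptions the bracket version only accepts a tag by accident (the third-person present singular), while the dotted version accepts the singular tags and none of the plural ones.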

From your example, you want all prepositions and all verb forms (many POSTAGs are missing from your list, but I believe that is the intent).
That would be just SPS00|V.*
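
A quick way to check which tags that pattern accepts is a sketch like the one below. Again this is Python's `re` standing in for the real engine, with the assumption that the whole tag must match. The sample tags are taken from this thread, plus two noun tags to show what gets rejected.

```python
import re

# "SPS00" literally, or a V followed by anything (any verb tag).
PATTERN = re.compile(r"SPS00|V.*")

samples = ["SPS00", "VMN0000", "VMIP1S0", "VMIS2P0", "NCMS000", "NCFS000"]

# fullmatch: the pattern must cover the entire tag, not just a prefix.
accepted = [t for t in samples if PATTERN.fullmatch(t)]
rejected = [t for t in samples if not PATTERN.fullmatch(t)]

print("accepted:", accepted)
print("rejected:", rejected)
```

Note that V.* is broader than the original list: it lets through every tag that starts with V, not only the infinitive and the indicative forms.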

If anybody notices something wrong in my interpretation, please correct me. I am still figuring out the syntax, and all new tricks are welcome.

Very good!

It has worked!

Thanks, Tiago!

@tiagosantos

What about the word “cosmos”?

It reports an agreement error with:
“o cosmos”.

It seems the word can be used in both the plural and the singular.

I have just found another one:
“dica”
“e concede-lhe uma dica adicional para o estimular”

Another one:
“face”
“era um face à criação.”

And another one (sorry… it is the last one tonight):
“Génesis”
“No segundo capítulo do Génesis”

http://www.priberam.pt/dlpo/génesis

This is what appears in the text analysis:

“cosmos”:
cosmos cosmo cosmos NCMN000 NCMP000

“dica”:
dica NCMS000

“face”:
face NCFS000

“Génesis”:
génesis NCFS000

I believe the morphological speller needs to be improved for these words to work properly.
:stuck_out_tongue:

I will add this later today.

With these contributions it is improving. There are still a few hundred missing.
Thank you for your contribution.