European Portuguese (PT-PT) rule contributions

marcoagpinto · October 18, 2016, 9:29pm

Before adding a lot of rules at once I always like to wait for the nightly diffs so that issues can be fixed before placing more rules that would increase complexity.

Your rules generated TONS of false hits:
http://languagetool.org/regression-tests/20161018/result_pt_20161018.html

Until they are fixed I am not adding any more rules.

tiagosantos · October 18, 2016, 9:39pm

@Yakov

Ideally it would be removed from the dictionary. I believe it is an artifact from OpenOffice (OO), since it appears in more places.
Changing the tag is not a solution. This rule identifies determiners, nouns and adjectives and checks for number agreement. Removing the NCMP000 tag would exclude all masculine (M) common (C) nouns (N).
I have read the developer documentation but I do not remember seeing any tag for excluding matches. That would allow “Oo” and other possible errors to be removed from the results.
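To make the point about the tag concrete, here is a minimal Python sketch that checks tags against the first-token `postag` pattern of the masculine plural rule posted later in this thread. The helper name `matches_first_token` and the variant `WITHOUT_NOUNS` are made up for illustration, and Python's `re.fullmatch` only approximates how LanguageTool evaluates `postag_regexp`.

```python
import re

# Pattern copied from the first token of the masculine plural rule in this thread.
FIRST_TOKEN = r"D[ADIP][0123]MP[0P]|NC[MC]P000|AQ0[MC]P0"

# Hypothetical variant with the noun alternative dropped, for comparison only.
WITHOUT_NOUNS = r"D[ADIP][0123]MP[0P]|AQ0[MC]P0"


def matches_first_token(postag: str, pattern: str = FIRST_TOKEN) -> bool:
    """Return True when the whole tag matches one alternative of the pattern."""
    return re.fullmatch(pattern, postag) is not None


for tag in ["DA0MP0", "NCMP000", "AQ0MP0", "NCMS000"]:
    print(tag, matches_first_token(tag), matches_first_token(tag, WITHOUT_NOUNS))
```

With the noun alternative removed, `NCMP000` no longer matches at all, so every noun carrying that tag drops out of the rule, not just “oo”.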

@marcoagpinto
There are errors, it is true, but this is not the case. It is just an abbreviation that should not be there.

This morphological dictionary is in binary form and, as such, it is not editable. It is impossible to recreate a morphological dictionary from the available open Natura dictionaries (ISpell/MySpell/Hunspell based) because they lack the extensive morphological information that this one has (although they might have more words).
Trying that approach would revert a great deal of the work that was (very well) done on it and would create an inferior morphological dictionary.

The morphological dictionary is the core of any grammar tool. Ideally you would contact the author and ask him/her for the base dictionary, or request a new binary with the typos you want fixed. I would do that myself but the author is not referenced in any place I can find.

tiagosantos · October 18, 2016, 10:10pm

@marcoagpinto

We both know that these 10 rules cover more grammatical error situations than the majority of the existing rules combined. They produce false positives in situations with compound subjects, but it is nice to be warned about this type of inconsistency.

The false positives appear mostly in XX (dictionary should have it as number or neutral noun) and são (which is a plural verb and a singular noun for sane). I have not excluded verbs from the analysis because there are often issues with verbal consistencies but rules can and will be developed.
Remember that this type of false positives also happen in commercial software solutions.

Those are the kind of rules that the user expects from a very basic grammar checker, so they should be here from day 1. See:

It is easier to work when there is a willingness to create solutions and the criticism is constructive. Understanding the scope of each rule and pointing out where the issue may lie is also welcome. There are still a lot of basic rules to be implemented and they are long overdue.

PS - This post has been edited. My apologies to Marco for the excessive reaction. I hope to compensate with this late-hour edit and with a rework of the rules based on the test results, in which I fixed the most common false positives by working around dictionary issues.

marcoagpinto · October 18, 2016, 10:18pm

tiagosantos:

There are errors, it is true, but this is not the case. It is just an abbreviation that should not be there.

This morphological dictionary is in binary form and, as such, it is not editable. It is impossible to recreate a morphological dictionary from the available open Natura dictionaries (ISpell/MySpell/Hunspell based) because they lack the extensive morphological information that this one has (although they might have more words). Trying that approach would revert a great deal of the work that was (very well) done on it and would create an inferior morphological dictionary.

The morphological dictionary is the core of any grammar tool. Ideally you would contact the author and ask him/her for the base dictionary, or request a new binary with the typos you want fixed. I would do that myself but the author is not referenced in any place I can find.

We already had this discussion years ago on the mailing list and with Minho University:

Minho University has the needed information in the speller… I only need the knowledge and the time to develop some code that would convert the speller to binary.

Then we could join prereform + postreform, delete the duplicates and create a binary.

tiagosantos · October 19, 2016, 3:19am

Rule with extra exceptions for dictionary issues. Demonstrative and possessive determiners added to the logic.

<rule id="ERRO_DE_CONCORDNCIA_DO_MASCULINO_PLURAL_OS_O" name="Erro de concordância do masculino plural">
 <pattern>
  <marker>
  <token postag='D[ADIP][0123]MP[0P]|NC[MC]P000|AQ0[MC]P0' postag_regexp='yes'>
  <exception postag='CC|CS|RG|RN|SPS00' postag_regexp='yes'></exception>
  <exception regexp="yes">[IDMVX]|[IDMVX][IDMVX]|[IDMVX][IDMVX][IDMVX]</exception></token>
  <token postag='NC[MC]S000|AQ0[MC]S0' postag_regexp='yes'>
  <exception postag='P[ID][0123][CFM][SP]000|CC|CS|RG|RN|SPS00' postag_regexp='yes'></exception>
  <exception regexp="yes">há|são|ser</exception></token>
  </marker>
 </pattern>
 <message>Erro de concordância do plural.
  <suggestion><match no="1" postag="(D[ADIP][0123]MS[0S]|NC[MC]S000|AQ0[MC]S0)" postag_regexp="yes"/> <match no="2"/></suggestion> ou <suggestion><match no="1"/> <match no="2" postag="(NC[MC]P000|AQ0[MC]P0)" postag_regexp="yes"/></suggestion>.
  </message>
  <example correction='O cão|Os cães|Os cãos'><marker>Os cão</marker> está no pasto.</example>
</rule>
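The `[IDMVX]` exception in the rule above is what keeps Roman-numeral-like tokens such as XX from being flagged. Here is a rough Python sketch of which tokens it covers, assuming a case-sensitive whole-token match (LanguageTool's own matching may differ, for example in case handling). The function name `is_excepted` is made up for illustration.

```python
import re

# Exception pattern copied from the rule; equivalent to [IDMVX]{1,3}.
TOKEN_EXCEPTION = r"[IDMVX]|[IDMVX][IDMVX]|[IDMVX][IDMVX][IDMVX]"


def is_excepted(token: str) -> bool:
    """Return True when the whole token is one to three of the letters I, D, M, V, X."""
    return re.fullmatch(TOKEN_EXCEPTION, token) is not None


for token in ["XX", "XIV", "XVII", "XL", "cão"]:
    print(token, is_excepted(token))
```

Under these assumptions the pattern covers XX and XIV, but not four-letter numerals such as XVII or numerals containing L or C such as XL.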

tiagosantos · October 19, 2016, 3:20am

`

há|são|ser

[IDMVX]|[IDMVX][IDMVX]|[IDMVX][IDMVX][IDMVX]

Erro de concordância do plural.

  </message>
 <example correction=''><marker>O cães</marker> estão no pasto.</example>
</rule>

`

tiagosantos · October 19, 2016, 3:22am

<!-- Concordance error plural - A > AS -->
<!-- Created by Tiago F. Santos, Portuguese rule, 2016-10-15 -->
<rule id="ERRO_DE_CONCORDNCIA_DO_FEMININO_PLURAL_A_AS" name="Erro de concordância do feminino plural">
 <pattern>
  <marker>
  <token postag='D[ADIP][0123]FS[0S]|NC[FC]S000|AQ0[FC]S0' postag_regexp='yes'>
  <exception postag='CC|CS|RG|RN' postag_regexp='yes'></exception></token>
  <token postag='NC[FC]P000|AQ0[FC]P0' postag_regexp='yes'>
  <exception postag='P[ID][0123][CFM][SP]000|CC|CS|RG|RN|SPS00' postag_regexp='yes'></exception></token>
  </marker>
 </pattern>
 <message>Erro de concordância do plural:
  <suggestion><match no="1" postag="(D[ADIP][0123]FP[0P]|NC[FC]P000|AQ0[FC]P0)" postag_regexp="yes"/> <match no="2"/></suggestion> ou <suggestion><match no="1"/> <match no="2" postag="(NC[FC]S000|AQ0[FC]S0)" postag_regexp="yes"/></suggestion>.
  </message>
  <example correction='As vacas|A vaca'><marker>A vacas</marker> são malhadas.</example>
</rule>

tiagosantos · October 19, 2016, 3:24am

The As -> A and A -> As rules only get the extended range. Demonstrative and possessive determiners added to the logic.

<rule id="ERRO_DE_CONCORDNCIA_DO_FEMININO_PLURAL_AS_A" name="Erro de concordância do feminino singular">
 <pattern>
  <marker>
  <token postag='D[ADIP][0123]FP[0P]|NC[FC]P000|AQ0[FC]P0' postag_regexp='yes'>
  <exception postag='CC|CS|RG|RN|SPS00' postag_regexp='yes'></exception></token>
  <token postag='NC[FC]S000|AQ0[FC]S0' postag_regexp='yes'>
  <exception postag='P[ID][0123][CFM][SP]000|CC|CS|RG|RN|SPS00' postag_regexp='yes'></exception></token>
  </marker>
 </pattern>
 <message>Erro de concordância do plural.
  <suggestion><match no="1" postag="(D[ADIP][0123]FS[0S]|NC[FC]S000|AQ0[FC]S0)" postag_regexp="yes"/> <match no="2"/></suggestion> ou <suggestion><match no="1"/> <match no="2" postag="(NC[FC]P000|AQ0[FC]P0)" postag_regexp="yes"/></suggestion>.
  </message>
  <example correction='A vaca|As vacas'><marker>As vaca</marker> são malhadas.</example>
</rule>

dnaber · October 19, 2016, 7:26am

You can, however, use added.txt and removed.txt as a workaround to fix items without touching the binary dictionary. See here and here for an example in English; the same idea will work for Portuguese.

marcoagpinto · October 19, 2016, 8:38am

@tiagosantos

Marco “The Gate Keeper” has added your fixes plus your new rules.

Please notice that I have tested your fixes and they still produce false positives.

Way of testing them:
https://dl.dropboxusercontent.com/u/30674540/Tiago_Santos_-Concordancia_Singular_Plural-_20161019.odt

I opened the HTML with the nightly results, copied and pasted it into LO, and saved it as .ODT.

I downloaded the latest nightly OXT and added the grammar.xml into it after converting the OXT to ZIP.
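Since an OXT is a ZIP archive, the swap described above can also be scripted. This is only a sketch using Python's standard `zipfile` module. The function name `replace_member` is made up, and the path of grammar.xml inside the OXT is not given in this thread, so it is passed in as a parameter rather than guessed.

```python
import zipfile
from pathlib import Path


def replace_member(archive: Path, member: str, new_data: bytes, output: Path) -> None:
    """Copy `archive` to `output`, substituting the contents of `member`.

    ZIP files cannot be edited in place with zipfile, so every entry is
    copied into a new archive and only the requested member is swapped.
    """
    with zipfile.ZipFile(archive) as src:
        if member not in src.namelist():
            raise KeyError(f"{member!r} not found in {archive}")
        with zipfile.ZipFile(output, "w", zipfile.ZIP_DEFLATED) as dst:
            for info in src.infolist():
                data = new_data if info.filename == member else src.read(info.filename)
                dst.writestr(info, data)
```

A call would look like `replace_member(Path("nightly.oxt"), inner_path, Path("grammar.xml").read_bytes(), Path("patched.oxt"))`, where `inner_path` is whatever location grammar.xml has inside the extension and the file names are placeholders.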

Here is the latest grammar.xml:
https://dl.dropboxusercontent.com/u/30674540/grammar_v1_186.zip

Tiago, please try to add examples to the rules, so that the stand-alone tool shows them.

Thanks!

marcoagpinto · October 19, 2016, 8:54am

https://github.com/languagetool-org/languagetool/commit/3db4169d98e7d7a2bf0b618a6c9ebde7e0fa3ec3

tiagosantos · October 19, 2016, 1:36pm

Perfect solution.
That will fix any pending issues with the few words that lack morphological information or that are improperly cataloged.

If I understand correctly, we just need a removed.txt containing " oo oo NCMP000 " to fix the suggestions issue.

tiagosantos · October 19, 2016, 1:42pm

I will in due time, but first things first. Probably it can be instantly fixed by:

Sure, the examples have some importance, but with so many more relevant things to be done, I prefer not to focus on details.
Those regression tests are very useful for verifying this type of issue, so I will keep an eye on them and adjust the new rules accordingly.

marcoagpinto · October 19, 2016, 2:17pm

@tiagosantos

I have added “oo” to the file.

“cãos” is also wrongly in the morphological dictionary.

How do I add it to the removed.txt?

It says on the analysis:
cãos cão
AQ0MP0
NCMP000

I want to be sure I will do it right.

Yakov · October 19, 2016, 7:22pm

I changed the Java code to support “removed.txt” in Portuguese,
and improved “removed.txt”:

oo o NCMP000
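The entries in this thread all follow a three-field layout: inflected form, lemma, POS tag. Here is a small Python sketch that reads such lines, assuming whitespace-separated fields and `#` comment lines. Both are assumptions of this sketch (the real file may use tabs, which `split()` also handles), and `Entry` and `parse_entries` are illustrative names, not part of LanguageTool.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    form: str    # inflected word as it appears in text
    lemma: str   # base form
    postag: str  # reading to drop from the dictionary analysis


def parse_entries(text: str) -> list[Entry]:
    """Parse 'form lemma postag' lines, skipping blanks and '#' comments."""
    entries = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) != 3:
            raise ValueError(f"line {line_no}: expected 3 fields, got {len(parts)}")
        entries.append(Entry(*parts))
    return entries


print(parse_entries("oo o NCMP000\n"))
```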

tiagosantos · October 19, 2016, 10:05pm

@Yakov
Many thanks Yakov. Now anyone can easily fix the dictionary in a way that changes can be reviewed by anyone.

@marcoagpinto

Checking the regression test, the results with the new rules have been great. Considering that there are 6 more rules the end of the day result is this:

-Portuguese: 4468 total matches
-Portuguese: ø0,11 rule matches per sentence
+Portuguese: 3849 total matches
+Portuguese: ø0,10 rule matches per sentence
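For scale, the change between the two totals can be worked out directly. This short Python sketch uses only the two match counts quoted above, and the helper name `relative_change` is made up for illustration.

```python
def relative_change(before: int, after: int) -> float:
    """Return the change from `before` to `after` as a fraction of `before`."""
    return (after - before) / before


before, after = 4468, 3849  # total matches from the regression diff above
change = relative_change(before, after)
print(f"{before - after} fewer matches ({change:.1%})")
```

That is 619 fewer matches, a drop of roughly 14% in total matches.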

Considering that some of the “false positives” are actually valid grammar corrections, the result is even better:

`
+Line 1, column 132, Rule ID: ERRO_DE_CONCORDNCIA_DO_NMERO_DO_VERBO_3P[1]
+Message: Erro de concordância verbal.
+… mais bem servidos nessa área, ainda que em todos eles haja grandes

                                                                                           ^^^^^^^^

+Line 1, column 1, Rule ID: ERRO_DE_CONCORDNCIA_DO_NMERO_DO_VERBO_1S[1]
+Message: Erro de concordância verbal.
+Eu costuma jogar frequentemente tênis com ele nos domingos.
+^^^^^^^^^^ `

We can even reduce this a bit further by adding to the new removed.txt this:

oo oo NCMP000
cãos cãos NCMP000
cãos cãos AQ0MP0
uma uma VMIP2S0
uma uma VMIP2S0
umas umas VMIP2S0

I was going to post all the XML rules for punctuation, but many of the rules I have recreated already exist and are inactive by default in the LO extension.

They are active for other languages in the same build environment. Is there any pertinent bug that requires them to be inactive by default for the Portuguese language?

The Java rules are active by default in most (all?) other languages. The ones I have noticed that are inactive by default specifically in Portuguese are: “Capitalization”, “Word repetition”, “Double spacing” and both “Punctuation rules”.

When you have time, can you verify this?

tiagosantos · October 19, 2016, 10:54pm

For the verbal forms I will add an exception for ‘e’ (and) before ‘eu|tu|você|ele|ela’, as well as an exception for the controversial inflections of the verb “haver”. This will further reduce the false positives.

On the other hand, some of the extra “false positives” from yesterday were actually valid corrections, and they are the ones bloating the score.
+Os espanhóis abriram muitas mina de prata em suas

While I was reviewing this I was able to find a few more easy corrections to add to removed.txt.

oo oo NCMP000
cãos cãos NCMP000
cãos cãos AQ0MP0
uma uma VMIP2S0
uma uma VMIP2S0
umas umas VMIP2S0
o o NCMS000
há há NCMS000
fez fez NCMS000
imperador imperador NCMP000
imperador imperador AQ0MP0

marcoagpinto · October 20, 2016, 5:05am

@tiagosantos
I will add the words to the removed.txt in the morning after I get out of bed for real.

The merge will only happen at 10pm anyway.

Now it is 6am and I just came to check the e-mails.

Tiago, may I add your name to:
http://marcoagpinto.cidadevirtual.pt/getting_involved.html
in the part of LanguageTool?
I will release an update just to add your name.

Thanks!

marcoagpinto · October 20, 2016, 5:46am

@tiagosantos

I can’t sleep so I decided to add the words:

[pt] Added the words Tiago suggested:
oo oo NCMP000
cãos cãos NCMP000
cãos cãos AQ0MP0
uma uma VMIP2S0
umas umas VMIP2S0
o o NCMS000
há há NCMS000
fez fez NCMS000
imperador imperador NCMS000
imperador imperador AQ0MP0

Noticed that you had:

uma uma VMIP2S0

twice
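A repeated entry like this one is easy to catch mechanically before committing. Here is a minimal Python sketch using `collections.Counter`. The function name `find_duplicates` is made up, and the sample list is the one Tiago originally proposed.

```python
from collections import Counter


def find_duplicates(lines: list[str]) -> list[str]:
    """Return entries that occur more than once, ignoring spacing differences."""
    # Normalise runs of spaces or tabs so that visually identical entries compare equal.
    counts = Counter(" ".join(line.split()) for line in lines if line.strip())
    return [entry for entry, n in counts.items() if n > 1]


proposed = [
    "oo oo NCMP000",
    "cãos cãos NCMP000",
    "cãos cãos AQ0MP0",
    "uma uma VMIP2S0",
    "uma uma VMIP2S0",
    "umas umas VMIP2S0",
]
print(find_duplicates(proposed))
```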

Now I will try to add a rule or two, since I am awake.

tiagosantos · October 20, 2016, 2:10pm

It is your site, so it is up to you, but that is not required. The regular contributor credits in the “About dialog” are enough.

Nice find. That was another wrong copy-paste. My apologies for that.
Fixed ‘uma’ and a few more troubling words.

oo oo NCMP000
cãos cãos NCMP000
cãos cãos AQ0MP0
uma uma VMIP3S0
uma uma VMM02S0
umas umas VMIP2S0
o o NCMS000
há há NCMS000
fez fez NCMS000
imperador imperador NCMS000
imperador imperador AQ0MP0
eu eu NCMS000
nós nós NCMP000

Marco, this is going quite well. I can go slower if you do not have the time. Just do not disparage the work before looking at it properly. I know that what I am asking and sending takes time to review.

After you review the impact of your new rules and the removed words in the regression tests, as well as the LO extension punctuation configurations, I will post the improved detection of subjects in the verbal concordance rules.

Now it is able to detect compound subjects and it will reduce the false positives count even further. This is needed before the next significant set of rules that I want to share, gender concordance rules. They will raise the false positives a bit and they may require more exceptions added to the rule set.