European Portuguese (PT-PT) rule contributions

@tiagosantos

Marco “The Gate Keeper” has added your fixes plus your new rules.

Please note that I have tested your fixes and they still produce false positives.

Way of testing them:
https://dl.dropboxusercontent.com/u/30674540/Tiago_Santos_-Concordancia_Singular_Plural-_20161019.odt

I opened the HTML with the nightly results, copied and pasted it into LO, and saved it as .ODT.

I downloaded the latest nightly OXT and added the grammar.xml into it after converting the OXT to ZIP.
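For anyone who wants to script that step: an OXT is a ZIP archive, so the swap can be sketched as below. This is only a minimal sketch, not the official packaging procedure. The function name `replace_member` is made up for illustration, and the path of grammar.xml inside the OXT is an assumption you must check against your own nightly build.

```python
import zipfile


def replace_member(oxt_path, member_name, new_file, out_path):
    """Copy the archive at oxt_path to out_path, swapping one member for new_file.

    member_name is the path of the file inside the archive (hypothetical here:
    look it up in the OXT you downloaded).
    """
    with zipfile.ZipFile(oxt_path) as src, \
            zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            if info.filename == member_name:
                continue  # skip the old copy; the new one is written below
            dst.writestr(info, src.read(info.filename))
        # Add the replacement file under the same internal path.
        dst.write(new_file, arcname=member_name)
```

Writing to a new file instead of editing in place keeps the original nightly OXT untouched in case the modified one fails to load.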

Here is the latest grammar.xml:
https://dl.dropboxusercontent.com/u/30674540/grammar_v1_186.zip

Tiago, please try to add examples to the rules, so that the stand-alone tool shows them.

Thanks!

https://github.com/languagetool-org/languagetool/commit/3db4169d98e7d7a2bf0b618a6c9ebde7e0fa3ec3

Perfect solution.
That will fix any pending issues with the few words that lack morphological information or that are improperly cataloged.

If I understand correctly, we just need a removed.txt containing "oo oo NCMP000" to fix the suggestions issue.

I will in due time, but first things first. Probably it can be instantly fixed by:

Sure, the examples have some importance, but with so many more relevant things to be done, I prefer not to focus on details.
Those regression tests are very useful for verifying that type of issue, so I will keep an eye on them and adjust the new rules accordingly.

@tiagosantos

I have added “oo” to the file.

“cãos” is also wrongly in the morphological dictionary.

How do I add it to the removed.txt?

It says on the analysis:
cãos cão
AQ0MP0
NCMP000

I want to be sure I will do it right.

I changed the Java code to support “removed.txt” in Portuguese,
and improved “removed.txt”:

oo o NCMP000
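To illustrate what such an entry does, here is a minimal Python sketch. It is not the actual Java code in LanguageTool. It assumes each line is "form lemma tag" separated by whitespace and that lines starting with "#" are comments; the names `Reading`, `parse_removed` and `filter_readings` are hypothetical.

```python
from typing import NamedTuple


class Reading(NamedTuple):
    form: str
    lemma: str
    tag: str


def parse_removed(text):
    """Parse 'form lemma tag' lines; blank lines and '#' comments are skipped."""
    entries = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        form, lemma, tag = line.split()
        entries.add(Reading(form, lemma, tag))
    return entries


def filter_readings(readings, removed):
    """Drop every reading that exactly matches a removed.txt entry."""
    return [r for r in readings if r not in removed]
```

With the two readings the analysis page shows for “cãos”, removing only `cãos cão NCMP000` would leave the `AQ0MP0` reading in place, so each unwanted reading needs its own line.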

@Yakov
Many thanks Yakov. Now anyone can easily fix the dictionary in a way that changes can be reviewed by anyone.

@marcoagpinto

Checking the regression test, the results with the new rules have been great. Considering that there are 6 more rules, the end-of-day result is this:

-Portuguese: 4468 total matches
-Portuguese: ø0,11 rule matches per sentence
+Portuguese: 3849 total matches
+Portuguese: ø0,10 rule matches per sentence
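As a quick check of what those totals mean, the drop can be computed directly from the two figures in the diff. Nothing here is assumed beyond those two numbers.

```python
before, after = 4468, 3849  # total matches from the regression diff

removed = before - after
percent = round(100 * removed / before, 1)

print(removed, percent)  # 619 13.9
```

So the run reports 619 fewer matches, roughly 13.9% less than before.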

Considering that some of the false positives are actually valid grammar corrections, it is even better:

`
+Line 1, column 132, Rule ID: ERRO_DE_CONCORDNCIA_DO_NMERO_DO_VERBO_3P[1]
+Message: Erro de concordância verbal.
+… mais bem servidos nessa área, ainda que em todos eles haja grandes
+                                                          ^^^^^^^^

+Line 1, column 1, Rule ID: ERRO_DE_CONCORDNCIA_DO_NMERO_DO_VERBO_1S[1]
+Message: Erro de concordância verbal.
+Eu costuma jogar frequentemente tênis com ele nos domingos.
+^^^^^^^^^^ `

We can even reduce this a bit further by adding to the new removed.txt this:

oo oo NCMP000
cãos cãos NCMP000
cãos cãos AQ0MP0
uma uma VMIP2S0
uma uma VMIP2S0
umas umas VMIP2S0

I was going to post all XML rules for punctuation, but many of the rules I have recreated are available but inactive by default in the LO extension.

They are active for other languages in the same build environment. Is there any pertinent bug that requires them to be inactive by default for the Portuguese language?

The Java rules are active by default in most (all?) other languages. The ones I have noticed that are inactive by default specifically in Portuguese are: “Capitalization”, “Word repetition”, “Double spacing” and both “Punctuation rules”.

When you have time, can you verify this?

For the verbal forms I will add an exception for the ‘e’ (and) before ‘eu|tu|você|ele|ela’, as well as an exception for the controversial inflections of the verb “haver”. This will further reduce the false positives.
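The idea behind the ‘e’ exception can be sketched in Python as follows. This is only an illustration of the logic, not the XML rule itself. The function `agreement_errors` is hypothetical, the verb person labels (`"3S"`, `"1P"`) are supplied by hand instead of coming from the tagger, and mapping ‘você’ to the third person singular is my assumption.

```python
# Person expected after each subject pronoun (assumed mapping, for illustration).
PRONOUN_PERSON = {"eu": "1S", "tu": "2S", "você": "3S", "ele": "3S", "ela": "3S"}


def agreement_errors(tokens):
    """tokens is a list of (word, verb_person_or_None) pairs.

    Returns (index, pronoun, verb) for each pronoun followed by a verb whose
    person differs, unless the pronoun is preceded by 'e' (compound subject).
    """
    errors = []
    for i in range(len(tokens) - 1):
        word = tokens[i][0]
        expected = PRONOUN_PERSON.get(word.lower())
        if expected is None:
            continue
        if i > 0 and tokens[i - 1][0].lower() == "e":
            continue  # exception: "... e eu vamos" has a compound subject
        next_word, person = tokens[i + 1]
        if person is not None and person != expected:
            errors.append((i, word, next_word))
    return errors
```

With this, “Eu costuma” is still flagged, while “O João e eu vamos” is skipped because ‘eu’ is only half of the subject.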

On the other hand, some of the extra “false positives” from yesterday were actually valid corrections, and they are the ones bloating the score.
+Os espanhóis abriram muitas mina de prata em suas

While I was reviewing this I was able to find a few more easy corrections to add to removed.txt.

oo oo NCMP000
cãos cãos NCMP000
cãos cãos AQ0MP0
uma uma VMIP2S0
uma uma VMIP2S0
umas umas VMIP2S0
o o NCMS000
há há NCMS000
fez fez NCMS000
imperador imperador NCMP000
imperador imperador AQ0MP0

@tiagosantos
I will add the words to the removed.txt in the morning after I get out of bed for real.

The merge will only happen at 10pm anyway.

Now it is 6am and I just came to check the e-mails.

Tiago, may I add your name to:
http://marcoagpinto.cidadevirtual.pt/getting_involved.html
in the part of LanguageTool?
I will release an update just to add your name.

Thanks!

@tiagosantos

I can’t sleep so I decided to add the words:

[pt] Added the words Tiago suggested:
oo oo NCMP000
cãos cãos NCMP000
cãos cãos AQ0MP0
uma uma VMIP2S0
umas umas VMIP2S0
o o NCMS000
há há NCMS000
fez fez NCMS000
imperador imperador NCMS000
imperador imperador AQ0MP0

Noticed that you had:

uma uma VMIP2S0

twice
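A repeated line like that is easy to catch automatically before committing. Below is a minimal sketch; the helper name `find_duplicates` is hypothetical, and it normalises whitespace so that tab-separated and space-separated copies of the same entry count as equal.

```python
from collections import Counter


def find_duplicates(text):
    """Return the entries that appear more than once, in first-seen order."""
    lines = [" ".join(line.split()) for line in text.splitlines() if line.strip()]
    return [entry for entry, count in Counter(lines).items() if count > 1]
```

Run on the list as originally posted, it reports only `uma uma VMIP2S0`.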

Now I will try to add a rule or two, since I am awake.

It is your site, so it is up to you, but that is not required. The regular contributor credits in the “About dialog” are enough.

Nice find. That was another wrong copy-paste. My apologies for that.
Fixed ‘uma’ and a few more troubling words.

oo oo NCMP000
cãos cãos NCMP000
cãos cãos AQ0MP0
uma uma VMIP3S0
uma uma VMM02S0
umas umas VMIP2S0
o o NCMS000
há há NCMS000
fez fez NCMS000
imperador imperador NCMS000
imperador imperador AQ0MP0
eu eu NCMS000
nós nós NCMP000

Marco, this is going quite well. I can go slower if you do not have the time. Just do not disparage the work before looking at it properly. I know that what I am asking and sending takes time to review.

After you review the impact of your new rules and the removed words in the regression tests, as well as the LO extension punctuation configurations, I will post the perfected detection of subjects in the verbal concordance rules.

Now it is able to detect compound subjects and it will reduce the false positives count even further. This is needed before the next significant set of rules that I want to share, gender concordance rules. They will raise the false positives a bit and they may require more exceptions added to the rule set.

@tiagosantos

Daniel is going to give you commit rights.

Anyway, I have committed the removed.txt you just sent.

Thanks,

:slight_smile:

Awesome and many thanks Marco.
Anyway, I will keep pacing the commits in a way that allows you to review, find potential problems and suggest fixes.

Cheers.

I think it is necessary to add the base form of words to the list like:

oo o NCMP000
cãos cão NCMP000
cãos cão AQ0MP0
umas umar VMIP2S0
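If it helps, the correction can be applied mechanically once the base forms are known. The sketch below is hypothetical (the helper `fix_lemmas` does not exist in LanguageTool), the lemma mapping has to be written by hand, and it assumes removed.txt is tab-separated like added.txt.

```python
def fix_lemmas(lines, lemma_of):
    """Rewrite the second column of 'form lemma tag' lines using lemma_of.

    Forms missing from lemma_of keep whatever lemma they already have.
    """
    fixed = []
    for line in lines:
        form, lemma, tag = line.split()
        fixed.append("\t".join((form, lemma_of.get(form, lemma), tag)))
    return fixed


# Base forms taken from the examples above.
lemma_of = {"oo": "o", "cãos": "cão", "umas": "umar"}
```

Entries whose form already is the base form pass through unchanged.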

I will fix that accordingly. Only moments ago, I downloaded the git copy. When I feel more confident with git I will push the updated version.

@tiagosantos

Hello!

Yesterday, I kind of finished my thesis+project, so I opened the thesis with LibreOffice and LT.

I spent hours creating a list of missing words for the pt_PT speller from Minho University (who replied saying they will add them when they have the time).

I will try soon to post here a list of possible false positives, maybe after the nightly.

All I remember was that “NATO” gave a gender error, so moments ago I went to the morphological page and it appears as a normal word:
NATO nato AQ0MS0

because it recognises it as a normal word only and not as the NATO organisation.

Could you add “NATO” as well?

Thanks!

Kind regards,

Hello Marco,

Congratulations on the completion of your thesis.

The best way is to add them yourself, since there are always more words that can be added.
They should be placed in languagetool-language-modules/pt/src/main/resources/org/languagetool/resource/pt in the file added.txt. Do not forget the lemma and that in this file, columns must be tab separated.
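Since a space instead of a tab is an easy mistake to make, a small check like the one below can be run on each line before committing. It is only a sketch: the function name `check_added_line` is hypothetical, and the column order form, lemma, tag is assumed from the entries posted in this thread.

```python
def check_added_line(line):
    """Return a list of problems found in one added.txt line (empty if none)."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 3:
        return ["expected 3 tab-separated columns, found %d" % len(columns)]
    problems = []
    for name, value in zip(("form", "lemma", "tag"), columns):
        if not value.strip():
            problems.append("empty %s column" % name)
    return problems
```

An illustrative entry such as `"NATO\tNATO\tNP0FS0"` passes, while the same text typed with spaces is reported as a single column.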

It would also be interesting if you get acquainted with:
http://wiki.languagetool.org/archive-developing-a-tagger-dictionary
http://wiki.languagetool.org/developing-a-tagger-dictionary

These additions and removals work better if they are integrated into the main morphological dictionary and synthesizer, after a reasonable test period in the added.txt and removed.txt files.

Having a separate project with those lists (in text files so they can be reviewed) and updating the binaries only once per release, similar to the work in the German and Catalan projects, would be ideal.

Best regards

Regarding
NATO nato AQ0MS0
you could add the tag
NATO NATO NP0FS0
and it would fix the gender concordance false positives, but I am not sure it would not introduce other errors for the adjective ‘nato’. If I am not mistaken, the speller and rules are not case-sensitive. Either way, it is a good addition.

Any specific reason you’re pointing to the archived version of the page? Doesn’t Developing a tagger dictionary - LanguageTool Wiki work?

No good reason. First link I gathered and read. I have not changed afterwards due to this:

The manual process of creating and exporting a dictionary is documented at the Archive.

I will add the new link as well to both posts.