
[pt] suggest substantive instead of accent-less verb form


(Enno) #1

In the orthographically correct Portuguese phrase:

    No silencio (In the I silence),

what is meant is:

    No silêncio (In the silence).

There are several substantives in Portuguese with verb forms in the present singular (1st person singular when ending in -o, 3rd person singular when ending in -a) that are orthographically identical to them except for missing accent marks; most of them end in

    -ância, -êncio, -ário, -ática

and their corresponding verb forms

    -anciar, -enciar, -ariar, -aticar.

For example,

    silêncio/silenciar, substância/substanciar, dicionário/dicionariar, gramática/gramaticar.

Therefore, how about checking if the supposed substantive is preceded by an article, that is, if it is preceded by the regular expression

    (d|n|pel)?[ao]|(d|n)?uma?

Lightproof in LibreOffice does something similar.
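
The check proposed above can be sketched in a few lines of Python. This is only an illustration of the idea, not how LanguageTool or Lightproof implements it; the lookup table and the names `NOUN_FOR_VERB_FORM` and `suggest` are made up for the example, and the table holds just the four words from this post.

```python
import re

# Hypothetical lookup table: accent-less verb form -> accented substantive.
# Only the four examples from this thread are listed.
NOUN_FOR_VERB_FORM = {
    "silencio": "silêncio",
    "substancia": "substância",
    "dicionario": "dicionário",
    "gramatica": "gramática",
}

# The article pattern from the post, wrapped in a non-capturing group.
ARTICLE = r"(?:(?:d|n|pel)?[ao]|(?:d|n)?uma?)"

# An article as a whole word, whitespace, then one of the accent-less forms.
PATTERN = re.compile(
    r"\b(" + ARTICLE + r")\s+(" + "|".join(NOUN_FOR_VERB_FORM) + r")\b",
    re.IGNORECASE,
)


def suggest(text):
    """Return (found word, suggested substantive) pairs."""
    return [
        (m.group(2), NOUN_FOR_VERB_FORM[m.group(2).lower()])
        for m in PATTERN.finditer(text)
    ]
```

With this sketch, `suggest("No silencio da noite")` reports silencio with the suggestion silêncio, while "Eu silencio a sala" is left alone because no article precedes the verb form.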


(Tiago F. Santos) #2

Great suggestion! I have pushed a rule set with these examples.

I took the liberty of adding a few more possibilities to the mix. Please verify whether these solutions are in line with your idea.


We can probably expand this even further by simply replacing the regular expression with postags, i.e.
<token postag='(D|P|SP).*' postag_regexp='yes'></token>
but I have not tested this, so I need confirmation.

I adapted my spreadsheet for this paronym situation, but I cannot find good lists online with a compatible license.
If you can provide me today with a table of a few dozen or a few hundred examples of this exact type, we can see them integrated in the snapshot by tomorrow.

Please, post the list in the following line format:
'substantive'+'TAB'+'contraction/preposition to be used in the example'+'singular paronym form of the verb'

If they are recognized by a different structure, please also send the corresponding regexp, line format and formatted examples.

Cheers!

PS - Note that some examples, like 'dicionario' and 'gramatica', would already be detected by the spellchecker.


(Enno) #3

Dear Tiago,

Thanks for your help! Here's a continuously updated list of substantives and their verb forms:

portuguese verbalized substantives

Please explain the exact format again. At the moment, they are in the form

substantive verb-form

Existence has been checked against the Aurélio, Houaiss and Michaelis dictionaries and the Portuguese spell checker of the editor Vim.
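
A list in the "substantive verb-form" layout can be turned into a lookup table with a short script. The sketch below assumes that each line holds exactly two whitespace-separated words and that the verb form equals the substantive with its accent marks removed; the function names are invented for the example.

```python
import unicodedata


def strip_accents(word):
    """Remove all combining marks (accents, and also the cedilla)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def parse_pairs(lines):
    """Parse 'substantive verb-form' lines into {verb_form: substantive}."""
    table = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        noun, verb_form = parts
        # Keep only pairs that differ by accent marks alone.
        if noun != verb_form and strip_accents(noun) == verb_form:
            table[verb_form] = noun
    return table
```

The accent check doubles as a sanity test on the list: a pair that differs in more than diacritics is dropped instead of being turned into a rule.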


(Tiago F. Santos) #4

Dear Enno,

Today it was impossible to push anything else, so I will see if it is possible to do it tomorrow.
That list is great and very easy to adapt to the form I need, so I will add it ASAP, as well as your GitHub suggestion (adding an optional adjective between the article and the noun/verb to the recognized error patterns).

Thank you for verifying the validity in multiple dictionaries. I only check words against the European Portuguese spellchecker used by the system, but that is a good procedure that I will adopt from now on.

Portuguese also uses a morphological lexicon, so it is possible to match any adjective easily with the POSTAG='A.*'.
I thought you were familiar with the code and that was what I meant when I said:

This would look for all determiners (D.*), prepositions (P.*) and contractions (SP.*) in the first token position, instead of just recognizing the enumerated ones. Since I have not researched this rule, I do not know if this is adequate, but I will look into it later on.
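
To make the difference from an enumerated word list concrete, here is a toy Python sketch of tag-based matching. The lexicon and its tag strings are invented for the example and are not taken from LanguageTool's Portuguese tagset; the sketch also assumes that the regular expression has to match a whole tag.

```python
import re

# Made-up lexicon: word -> list of tags. The tag strings only imitate the
# D / P / SP prefixes discussed above and are NOT real LanguageTool tags.
TOY_TAGS = {
    "o": ["DA"],
    "a": ["DA", "SP"],
    "no": ["SP+DA"],
    "maria": ["NP"],
    "silencio": ["VM"],
}

# Same expression as in the postag attribute quoted earlier in the thread.
POSTAG = re.compile(r"(D|P|SP).*")


def has_context_tag(word):
    """True if any tag of the word fully matches the postag expression."""
    return any(POSTAG.fullmatch(tag) for tag in TOY_TAGS.get(word.lower(), []))


def flag(tokens, verb_forms):
    """Indices of accent-less verb forms preceded by a matching token."""
    return [
        i
        for i in range(1, len(tokens))
        if tokens[i].lower() in verb_forms and has_context_tag(tokens[i - 1])
    ]
```

In this toy setup, `flag(["No", "silencio"], {"silencio"})` flags the second token, while `["Maria", "silencio"]` is not flagged because the preceding token carries no matching tag.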

I have only now noticed that your name and nickname are different. I have credited you as konfekt. Do you prefer to be credited in the code as konfekt, Enno or something else?


(Enno) #5

Ah, postag is a part-of-speech tag. Thanks for introducing me to LanguageTool's syntax. Now it makes sense. Yes, then the postag rule is a better fit, provided it turns out to be robust.

Notice that the regular expressions differ for singular and plural forms (substancia / substancias). In the plural form, not only is there an extra terminating s (uma/umas), but numerals such as duas, três, ... are now also allowed.

Notice also that the masculine plural forms in the regex, for example uns, are not needed, because masculine plural forms such as silencios are already flagged as incorrect by the spell checker. (See the comments on the GitHub commit.)

The plural forms that pass the spell check, such as substancias, now also appear in the list.
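
A plural counterpart of the singular article pattern could look roughly like the Python sketch below. The feminine plural determiners are built by analogy with the singular expression, and the numeral list is a short, incomplete sample; both are assumptions for illustration, not the pattern from the actual commit.

```python
import re

# Feminine plural determiners by analogy with the singular pattern, plus a
# few numerals. The numeral list is illustrative and far from complete.
PLURAL_DETERMINER = r"(?:(?:d|n|pel)?as|(?:d|n)?umas|duas|três|quatro|cinco)"


def plural_rule(verb_form_plural):
    """Compile a pattern: plural determiner + accent-less plural form."""
    return re.compile(
        r"\b" + PLURAL_DETERMINER + r"\s+" + re.escape(verb_form_plural) + r"\b",
        re.IGNORECASE,
    )


rule = plural_rule("substancias")
```

Here `rule` matches "Duas substancias" and "nas substancias" but not "tu substancias", where the word really is a verb form.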

You can credit me as Enno.


(Tiago F. Santos) #6

Having extra detections is not a problem and it is easier to construct. I just concatenate an optional s to every second string and it covers all situations. At least to me, the extra compute time is negligible. That is why I included 'dicionario' and 'gramatica' in the first commit, despite seeing them recognized by the spellchecker.

Same as above, and in the latest daily snapshots there are new concordance rules to verify these inconsistencies.

I will push now the list with the fixed credits.

PS - I will push the postag change tomorrow after testing it better.