Edit rules while working with OmegaT

macios · October 12, 2016, 3:53am

First of all, thank you.

Second, I’m a translator working with OmegaT on the EN-US to PT-BR language pair. Some rules are currently wrong for PT-BR (I’m a native speaker), or at least OmegaT flags them that way.

We have had a new orthographic agreement since 2009, and many rules don’t seem to have been updated accordingly.

My question: as someone who knows nothing about programming, would I be able to edit some rules and fix them? Could I try changing them just for myself, as a way of testing?

Thanks in advance.

dnaber · October 12, 2016, 7:54am

Thanks for your interest in improving LT. You can find the error detection rules for pt-BR here. Usually, no programming is needed other than editing XML files, as documented here. You can do that locally by running the stand-alone version of LT (i.e. the *.zip that can be downloaded at https://languagetool.org).

macios · October 12, 2016, 2:20pm

Nice!

Thank you for the fast reply.

One question: I could see almost all the rules, but one of them concerns hyphenated words. That doesn’t seem to be written as a rule in the XML file.

Is there a way I can change the list of words that should or shouldn’t be hyphenated?

Thanks again.

dnaber · October 12, 2016, 3:07pm

I forgot to mention there are also rules not specific to pt-BR, but that apply to both variants of Portuguese here: languagetool/grammar.xml at master · languagetool-org/languagetool · GitHub

Also, some non-XML rules are at Browse LanguageTool Rules: 2.839 matches for Portuguese.

Can you provide an example of the issue you’re seeing with a hyphenated word?

macios · October 12, 2016, 3:36pm

Sure.

e.g.: When I write “dia a dia” OmegaT reports an issue saying “this word is hyphened”. It is not.

The same happens to “preto e branco”.

At least in PT-BR.

Thanks again.

dnaber · October 12, 2016, 5:31pm

These examples activate a rule called PreReformPortugueseCompoundRule, which takes its data from pre-reform-compounds.txt. Maybe @matheuspoletto can comment on whether it still makes sense to have the pre-reform as default. We cannot simply add a selector for pre/post reform; the details are discussed at Portuguese - pre+ post+ reform · Issue #96 · languagetool-org/languagetool · GitHub.

matheuspoletto · October 13, 2016, 2:07pm

Hello Daniel and mat;
I agree with mat: when we write “dia a dia”, the system says it is an error, but it is not. I think it is a good idea to have post-reform as the default, since it is the correct form in pt. @marcoagpinto can give his opinion here too, because he has worked longer on the pre/post-reform topic, which applies to both pt-BR and pt-EU.

Regards.

macios · October 13, 2016, 7:21pm

LT also flags:

Em uma - “you mean: numa” - Also not necessarily correct in PT-BR.

Would it be possible to have specific files/rules for each language PT-EU / PT-BR, not mixing them together?

Again, I’m only a translator interested in all the fantastic resources from OmegaT and LT. I’m just trying to suggest possible improvements, but I don’t know much about how to make them.

Let me know in case I’m being too annoying.

Cheers,

matheuspoletto · October 14, 2016, 5:32pm

Think big when you use LanguageTool! The possibilities are endless. For example, you can manipulate the rules when using the stand-alone version. In other words, you can create, alter and even remove a specific PT-EU rule (or all of them) if you need to. Go to the folder LanguageTool-3.5\LanguageTool-3.5\org\languagetool\rules\pt and open grammar.xml; this file has all the rules written for PT-EU. For the problem in your post, just remove the rule EM_UMA and you are done: the LT interface will no longer tell you that “em uma” is an error.
Also, if you are using LanguageTool in a Java program, for example, you can use the disableRules method to turn a rule off.
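
A less destructive alternative to deleting the rule is to switch it off in place with the `default="off"` attribute on the rule element. The following is only a minimal sketch, assuming your local grammar.xml contains a rule with the id EM_UMA; the name, pattern, message and example shown here are hypothetical stand-ins, not the real rule body, so keep whatever your file already has and add only the attribute:

```xml
<!-- Sketch only: everything except the id and default="off" is a hypothetical stand-in. -->
<rule id="EM_UMA" name="em uma / numa (hypothetical name)" default="off">
  <pattern>
    <token>em</token>
    <token>uma</token>
  </pattern>
  <message>Hypothetical message: <suggestion>numa</suggestion></message>
  <!-- The correction mirrors the literal suggestion in the message above. -->
  <example correction="numa">Ela mora <marker>em uma</marker> casa pequena.</example>
</rule>
```

Keeping the rule in the file this way makes the change easy to revert, and the rule should still be available to re-enable by hand if you want to compare results.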

If you need help, contact me on Facebook or by email at motaviop@gmail.com and I will help you with what you need, including creating new rules for pt-BR.

Regards.

tiagosantos · October 14, 2016, 11:55pm

Good evening,

I am also a Portuguese translator, and in recent months I have been trying to improve the pt-PT proofing tools used in LibreOffice.
So far I have managed to make some improvements to the auto-correction, the spell-checker (still under revision) and the thesaurus. Now I have started learning how to improve LanguageTool and I would like to share my first “patch” here, since it can also be used in other Portuguese variants (like pt-BR).
This pack of rules expands the existing rules for article-noun agreement.

PS - I am trying to show the code but the XML tags are processed and they are not shown in this post. Any better way to paste and share the rules from the rules’ editor? I tried the <“xmp id=“snippet-container””> <"/xmp"> (without the ") but it doesn’t work.

Replace Rule: A_FEMALE-PLURAL_SINGULAR-VERB

<!-- Portuguese rule, 2016-10-15 -->
<rule id="ERRO_DE_CONCORDNCIA_DO_PLURAL_FEMALE_ARTICLE_FEMALE_SINGULAR_SINGULAR_VERB" name="Erro de concordância do plural ARTIGOS_FEMININOS + FEMALE SINGULAR + SINGULAR VERB">
 <pattern>
  <token regexp='yes'>a|à|da|na|uma|duma|numa|alguma|pela</token>
  <marker>
  <token postag='NCFP000'></token>
  </marker>
  <token postag='VMIP3S0'></token>
 </pattern>
 <message>Erro de concordância do plural: <suggestion><match no="2"/></suggestion></message>
 <example correction=''>A <marker>vacas</marker> está no pasto.</example>
</rule>

Replace Rule: AS_FEMALE-SINGULAR_PLURAL-VERB

<!-- Portuguese rule, 2016-10-15 -->
<rule id="ERRO_DE_CONCORDNCIA_DO_PLURAL_PLURAL_FEMALE_ARTICLE_FEMALE_PLURAL_PLURAL_VERB" name="Erro de concordância do plural ARTIGOS_FEMININOS_PLURAL + FEMALE PLURAL + PLURAL VERB">
 <pattern>
  <token regexp='yes'>as|às|das|nas|umas|dumas|numas|algumas|pelas</token>
  <marker>
  <token postag='NCFS000'></token>
  </marker>
  <token postag='VMIP3P0'></token>
 </pattern>
 <message>Erro de concordância do plural: <suggestion><match no="2"/></suggestion></message>
 <example correction=''>Algumas <marker>vaca</marker> estão no pasto.</example>
</rule>

Replace Rule: O_MALE-PLURAL_SINGULAR-VERB

<!-- Portuguese rule, 2016-10-15 -->
<rule id="ERRO_DE_CONCORDNCIA_DO_PLURAL_ARTIGO_MASCULINO_MALE_SINGULAR_SINGULAR_VERB" name="Erro de concordância do plural ARTIGO MASCULINO + MALE SINGULAR + SINGULAR VERB">
 <pattern>
  <token regexp='yes'>o|ao|do|no|um|dum|num|algum|pelo</token>
  <marker>
  <token postag='NCMP000'></token>
  </marker>
  <token postag='VMIP3S0'></token>
 </pattern>
 <message>Erro de concordância do plural: <suggestion><match no="2"/></suggestion></message>
 <example correction=''>Dum <marker>bois</marker> está no pasto.</example>
</rule>

Replace Rule: OS_MALE-SINGULAR_PLURAL-VERB

<!-- Portuguese rule, 2016-10-15 -->
<rule id="ERRO_DE_CONCORDNCIA_DO_PLURAL_OS_MALE_PLURAL_PLURAL_VERB" name="Erro de concordância do plural OS + MALE PLURAL + PLURAL VERB">
 <pattern>
  <token regexp='yes'>os|aos|dos|nos|uns|duns|nuns|alguns|pelos</token>
  <marker>
  <token postag='NCMS000'></token>
  </marker>
  <token postag='VMIP3P0'></token>
 </pattern>
 <message>Erro de concordância do plural: <suggestion><match no="2"/></suggestion></message>
 <example correction=''>Os <marker>boi</marker> estão no pasto.</example>
</rule>

jaumeortola · October 15, 2016, 7:40am

You can use “Preformatted text” (icon: </>) or put the code between `…` for inline snippets. For example:

<rule id="CONFUSION_OF_BED_BAD" name="confusion of bed/bad">
    <pattern>
        <token>bed</token>
        <token>English</token>
    </pattern>
    <message>Did you mean <suggestion>bad English</suggestion>?</message>
    <example correction="bad English">Sorry for my <marker>bed English</marker>.</example>
</rule>

marcoagpinto · October 15, 2016, 9:46am

@tiagosantos

Tiago,

I will have a look at it on Monday, since I have my weekend job.

Please note that some of the rules won’t be accurate, though. Take “Dum bois está no pasto.” from Replace Rule: O_MALE-PLURAL_SINGULAR-VERB and replace “dum” with the other articles: some of the results won’t make sense.

I have to look carefully at each article.

Thanks!

marcoagpinto · October 15, 2016, 10:00am

@tiagosantos
The code snippets vanished when I downloaded from the server.

Could you repost them like jaumeortola suggested?

Also, could you post an example for each article?

I ask for examples because some of them don’t make sense.

Thanks!

Kind regards,

marcoagpinto · October 15, 2016, 10:25am

@tiagosantos

Looking at the source of the messages in Thunderbird, I can see the snippets.

I believe your extra articles could be used in a new rule:

“a vaca que/quando/enquanto/ainda está no pasto.”
This would make sense with:
a|à|da|na|uma|duma|numa|alguma|pela

I will try to replace que/quando/enquanto/ainda with a postag to hit any forgotten terms.

Now I need to go… I will return at night… take care

dnaber · October 15, 2016, 10:53am

I’ve edited Tiago’s post so the markup is visible again.

tiagosantos · October 15, 2016, 11:29am

Hello everybody,

This reply was fast. Many thanks.

@marcoagpinto
´Dum=de+um´ was a poorly chosen example, since it is not usually used in formal language.
But it can be applied. For example: Dum (de um) lado(s) da bancada.
I should not have included it, but this rule has been much improved in the meantime.

Yesterday I figured out the code a bit more and improved it, so these snippets should not be used.
I have added regex variations so that the agreement rules apply to more cases, as the general grammar rules point out.
I have tested it in my local build with newspaper articles and it is working great, but more refinements can be made. I will be working on them over the next week. I have also extended the markup to both words. This draws the user’s attention to the issue more clearly and allows a better choice of fix, especially for number agreement, which should follow extended RegEx rules like the ones I created for gender agreement (see first two rules).

@dnaber
Sorry for the dumb question, but… I cannot see the <> icon. I am using Opera and Firefox. I added … tags as well as ‘…’ but these do not seem to work either.

<rule id="ERRO_DE_CONCORDNCIA_DO_FEMININO_PLURAL" name="Erro de concordância do feminino singular">
 <pattern>
  <marker>
   <token postag='D[AI]0FP0|NCFP000|AQ0FP0' postag_regexp='yes'>
    <exception postag='CC|CS|RG|RN|SPS00' postag_regexp='yes'></exception></token>
   <token postag='NCFS000|AQ0FS0' postag_regexp='yes'>
    <exception postag='P[ID][0123][CFM][SP]000|CC|CS|RG|RN|SPS00' postag_regexp='yes'></exception></token>
  </marker>
 </pattern>
 <message>Erro de concordância do plural. <suggestion><match no="1" postag_match="(DA0FP0|NCFP000|AQ0FP0)" postag_regexp="yes" postag_replace="DA0FS0|NCFS000|AQ0FS0"/> <match no="2"/></suggestion> </message>
 <example correction=''><marker>As vaca</marker> são falcadas.</example>
</rule>

Rules moved to a new forum post for easier reading and neatness.
I have narrowed the set of rules shown here and made some changes to the ones I want reviewed at the moment.

dnaber · October 15, 2016, 11:46am

The icon is </>, the 6th icon, directly to the left of the upload icon. You can also use three backticks (not dots) to start and end the markup.

tiagosantos · October 15, 2016, 11:49am

Now I see it! Thank you again for the edit.

tiagosantos · October 15, 2016, 3:12pm

Forgot to thank you for the input. Cheers.

tiagosantos · October 15, 2016, 8:17pm

Ok. I kept on improving the rules. Today I was able to trim down most borderline cases with pronouns and prepositions.
Now I am stuck on the suggestions part.
In the following rule:

<rule id="ERRO_DE_CONCORDNCIA_DO_FEMININO_PLURAL" name="Erro de concordância do feminino singular">
 <pattern>
  <marker>
   <token postag='DA0FP0|NCFP000|AQ0FP0' postag_regexp='yes'>
    <exception postag='CC|RN|SPS00' postag_regexp='yes'></exception></token>
   <token postag='NCFS000|AQ0FS0' postag_regexp='yes'>
    <exception postag='P[ID][0123][CFM][SP]000|CC|RN|SPS00' postag_regexp='yes'></exception></token>
  </marker>
 </pattern>
 <message>Erro de concordância do plural. <suggestion><match no="1" postag_match="(DA0FP0|NCFP000|AQ0FP0)" postag_regexp="yes" postag_replace="DA0FS0|NCFS000|AQ0FS0"/> <match no="2"/></suggestion> </message>
 <example correction=''><marker>Linda vacas</marker> são falcadas.</example>
</rule>

My intention is to use RegEx to suggest: ‘1st word singular derivative’ + ‘2nd word as is’.
I read the developer manual thoroughly and it seems possible, as shown in this example:

<match no="1" postag="(adj|ppas|pact):sg:inst.*(:pos)" postag_regexp="yes" postag_replace="$1:sg:.*nom.*:n1\.n2.*$2"></match>

Does anyone know of an example from another language that I can use as a reference?
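
One possible shape for such a suggestion, modelled on the manual example above, is sketched here. Note that the manual example puts the regular expression in `postag`, not `postag_match`, and uses capture groups (`$1`, `$2`) in `postag_replace`. This is an untested sketch: the rule id and name are hypothetical, the pattern is narrowed to a single case (feminine plural determiner followed by a feminine singular noun), and it assumes the Portuguese synthesizer can generate a form for the rewritten tag.

```xml
<!-- Sketch only: hypothetical id/name, not verified against the Portuguese synthesizer. -->
<rule id="CONCORDANCIA_FEMININO_PLURAL_SKETCH" name="Sketch: determinante feminino plural + nome feminino singular">
  <pattern>
    <marker>
      <token postag="D[AI]0FP0" postag_regexp="yes"/>
      <token postag="NCFS000"/>
    </marker>
  </pattern>
  <!-- Group 1 keeps the tag prefix (e.g. DA0F), group 2 the trailing 0; only the number slot P is swapped for S. -->
  <message>Erro de concordância: <suggestion><match no="1" postag="(D[AI]0F)P(0)" postag_regexp="yes" postag_replace="$1S$2"/> <match no="2"/></suggestion></message>
  <!-- correction is left empty, as in the rules above, until the synthesized output has been checked. -->
  <example correction=''><marker>As vaca</marker> estão no pasto.</example>
</rule>
```

Capturing the unchanged parts of the tag and rewriting only the number slot avoids listing every full tag on both sides, which is why the replacement here is `$1S$2` rather than an alternation such as `DA0FS0|NCFS000|AQ0FS0`.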