Suggestion for fixing crase errors

Errors in the use of the grave accent known as crase are not handled correctly. In sentences such as “Fui à farmácia”, the checker does not flag the accent when it is missing (“Fui a farmácia”).

LanguageTool 5.0 will be released on Friday.

Until then I won’t make any changes, to avoid taking risks.

@jaumeortola

I can’t fix this rule.

Here is the code:

<rule>
  <antipattern>
      <token spacebefore='no' regexp='yes'>&hifen;</token>
      <token inflected='yes' spacebefore='no'>ir</token>
  </antipattern>
  <antipattern>
      <token inflected='yes'>ir</token>
      <token>a</token>
      <token postag_regexp='yes' postag='A...P.+' min='0'/>
      <token>eleições</token>
  </antipattern>
  <antipattern>
      <token>assim</token>
      <token inflected='yes' spacebefore='no'>ir</token>
      <token>a</token>
      <token>campanha</token>
  </antipattern>
  <pattern>
    <marker>
      <token inflected='yes' regexp='yes'>&requer_crase_verbos;
        <exception inflected='yes'>ser</exception></token>
      <token postag='R.' min='0' postag_regexp='yes'>
        <exception postag='C.+' postag_regexp='yes'/></token>
      <token regexp='yes'>as?</token>
    </marker>
      <token postag_regexp='yes' postag='N.F.+'>
        <exception postag='D..F.+|R.+' postag_regexp='yes'/></token>
  </pattern>
  <message>Esta palavra rege-se com a preposição "a".</message>
    <suggestion>\1 <match no='2' include_skipped='all'/> <match no='3' regexp_match='(a)(s?)' regexp_replace='à$2'/></suggestion>
  <url>https://pt.wikipedia.org/wiki/Crase</url>
  <short>Erro de crase</short>
  <example correction='Vamos às'><marker>Vamos as</marker> compras no supermercado.</example>
  <example correction='Iremos à'><marker>Iremos a</marker> escola, falar com os professores.</example>
<!--<example correction='foi à'>Ele <marker>foi a</marker> herdade.</example>--><!-- TODO Replace exception by 'ir' and 'ser' verbs disambiguation -->
  <example correction='adere à'>A cola <marker>adere a</marker> folha.</example>
  <example correction='Pertence provavelmente à'><marker>Pertence provavelmente a</marker> equipa dos seus amigos.</example>
  <example>A causa de sua morte foi a pneumonia.</example>
  <example>Os 6% restantes pertencem a outras nacionalidades.</example>
  <example>…terminada esta seguir-se-ia a construção…</example>
  <example>…viços de TV paga não se tornaram populares ou bem sucedidos enquanto as redes de televisão públicas ZDF e ARD oferecem um…</example>
</rule>

It works okay with “vamos”:

Vamos a farmácia.
Fomos a farmácia.
Fui a farmácia.

But with the verb “ir” it doesn’t trigger any error.

<!ENTITY requer_crase_verbos "(?:a(?:derir|gradar|ssistir)|comparecer|des(?:agradar|obedecer)|equivaler|ir|p(?:ertencer|roceder)|obedecer|re(?:agir|correr|sponder)|suceder)"><!-- accepts crase: avisar|limitar -->

As you can see, “ir” is there.

Why doesn’t it work?

Thanks!

“Fui” is also a form of the verb “ser”, which the rule lists as an exception, so the exception blocks the match:
<S> fui[ir/VMIS1S0,ser/VMIS1S0] a[a/SPS00] farmácia[farmácia/NCFS000,</S>]<P/>
We can disambiguate “fui a” as “ir”, if that doesn’t cause other problems, or we can repeat the rule for “ir” alone (even if it then also matches “ser”).

@jaumeortola

How can we disambiguate it?

Can you help?

Thanks!

Is “fui (ser) a + feminine noun” (eu fui a mulher) a correct and frequent structure? If it is frequent, then it will be difficult to disambiguate. I would try this: repeat the same rule, but only with the verb “ir” and without the exception “ser”, with default=temp_off, and we will see in the tests how many false alarms it causes.
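A minimal sketch of that suggestion, assuming the new rule sits in the same rule group as the original and reuses its message and suggestion. The first token matches only the lemma “ir”, the “ser” exception and the optional adverb token are dropped, and `default='temp_off'` keeps the rule switched off while its false alarms are evaluated. The antipatterns of the original rule are left out for brevity and would probably need to be carried over; the example sentence is illustrative.

```xml
<!-- Sketch only: the crase rule repeated for "ir" alone, without the "ser" exception -->
<rule default='temp_off'>
  <pattern>
    <marker>
      <!-- Any form whose lemma is "ir", including forms shared with "ser" such as "fui" -->
      <token inflected='yes'>ir</token>
      <token regexp='yes'>as?</token>
    </marker>
    <!-- Feminine noun, same constraint as in the original rule -->
    <token postag_regexp='yes' postag='N.F.+'>
      <exception postag='D..F.+|R.+' postag_regexp='yes'/></token>
  </pattern>
  <message>Esta palavra rege-se com a preposição "a".</message>
  <suggestion>\1 <match no='2' regexp_match='(a)(s?)' regexp_replace='à$2'/></suggestion>
  <url>https://pt.wikipedia.org/wiki/Crase</url>
  <short>Erro de crase</short>
  <example correction='Fui à'><marker>Fui a</marker> farmácia.</example>
</rule>
```

Because the exception is gone, sentences like “A causa de sua morte foi a pneumonia.” would be flagged too, which is exactly the false-alarm rate the test run is meant to measure.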

Thanks, I will repeat the same rule.

It is in my TO-DO list for this weekend.

I still want to enhance some other rules today.

Each Wikipedia Tool check takes 10 minutes for 200 000 sentences, and I always generate a “before.txt” and an “after.txt” whenever I improve antipatterns, so that I can compare the results.

@jaumeortola

It produces warnings in TESTRULES PT:

<!-- MARCOAGPINTO 2020-06-28 *START* -->
<!-- "Fui a farmácia." -->
<!-- "Fui a praia." -->
<!-- "Fui a casa da Ana." -->
    <rule>
      <pattern>
        <marker>
          <token inflected='yes' regexp='no'>fui</token>
          <token regexp='yes'>as?</token>
        </marker>
          <token postag_regexp='yes' postag='N.F.+'>
            <exception postag='D..F.+|R.+' postag_regexp='yes'/></token>
      </pattern>
      <message>Esta palavra rege-se com a preposição "a".</message>
        <suggestion>\1 <match no='2' regexp_match='(a)(s?)' regexp_replace='à$2'/></suggestion>
      <url>https://pt.wikipedia.org/wiki/Crase</url>
      <short>Erro de crase</short>
      <example correction='fui à'>De manhã <marker>fui a</marker> praia.</example>
      <example>De manhã fui a farmácia.</example>
      <example>De manhã fui a praia.</example>
      <example>De manhã fui a casa da Ana.</example>
    </rule>
<!-- MARCOAGPINTO 2020-06-28 *END* -->

This is the output:

2528 rules tested.
Exception in thread “main” org.languagetool.rules.patterns.PatternRuleTest$PatternRuleTestFailure: Test failure for rule CRASE_CONFUSION[8] in file /org/languagetool/rules/pt/grammar.xml: "De manhã fui a praia."
Errors expected: 1
Errors found : 0
Message: Esta palavra rege-se com a preposição “a”.
Analyzed token readings: [/SENT_START*] De[De manhã/RG*] [ /null*] manhã[De manhã/RG] [ /null*] fui[ir/VMIS1S0,ser/VMIS1S0] [ /null*] a[a/SPS00] [ /null*] praia[praia/NCFS000] .[./SENT_END*,./_PUNCT*]
Matches:
at org.languagetool.rules.patterns.PatternRuleTest.testBadSentences(PatternRuleTest.java:396)
at org.languagetool.rules.patterns.PatternRuleTest.testGrammarRulesFromXML(PatternRuleTest.java:318)
at org.languagetool.rules.patterns.PatternRuleTest.runTestForLanguage(PatternRuleTest.java:169)
at org.languagetool.rules.patterns.PatternRuleTest.runGrammarRulesFromXmlTestIgnoringLanguages(PatternRuleTest.java:152)
at org.languagetool.rules.patterns.PatternRuleTest.main(PatternRuleTest.java:683)

What could be wrong with the rule?

Thanks!

In <token inflected='yes' regexp='no'>fui</token>, you should remove “inflected”.
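For reference, a short sketch of the difference, based on how `inflected` is used elsewhere in this thread. With `inflected='yes'` the token text is compared against the lemma, and the readings of “fui” shown in the test output carry the lemmas “ir” and “ser”, never “fui”, so the first form below cannot match.

```xml
<!-- Never matches: "fui" is a surface form, but inflected='yes' looks for a lemma "fui" -->
<token inflected='yes' regexp='no'>fui</token>

<!-- Matches the surface form "fui" only -->
<token>fui</token>

<!-- Matches every form whose lemma is "ir" (vou, fui, iremos, ...) -->
<token inflected='yes'>ir</token>
```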

But I would use this instead, so that we see what happens with all the forms that “ser” and “ir” share:

<and>
<token inflected="yes">ir</token>
<token inflected="yes">ser</token>
</and>

Where would I place this?

<and>
<token inflected="yes">ir</token>
<token inflected="yes">ser</token>
</and>

Thanks!

Instead of:

<token inflected='yes' regexp='no'>fui</token>

@jaumeortola

If the above fails, I was thinking about building a list of the “ser” verb forms from:
https://conjuga-me.net/verbo-ser

EDIT: The list would be added to the exceptions.

Maybe this is the solution?

It gives tons of warnings:

<!-- MARCOAGPINTO 2020-06-28 *START* -->
<!-- "Fui a farmácia." -->
<!-- "Fui a praia." -->
<!-- "Fui a casa da Ana." -->
    <rule>
      <pattern>
        <marker>
          <and>
            <token inflected="yes">ir</token>
            <token inflected="yes">ser</token>
          </and>
          <token regexp='yes'>as?</token>
        </marker>
          <token postag_regexp='yes' postag='N.F.+'>
            <exception postag='D..F.+|R.+' postag_regexp='yes'/></token>
      </pattern>
      <message>Esta palavra rege-se com a preposição "a".</message>
        <suggestion>\1 <match no='2' regexp_match='(a)(s?)' regexp_replace='à$2'/></suggestion>
      <url>https://pt.wikipedia.org/wiki/Crase</url>
      <short>Erro de crase</short>
      <example correction='fui à'>De manhã <marker>fui a</marker> praia.</example>
      <example>De manhã fui a farmácia.</example>
      <example>De manhã fui a praia.</example>
      <example>De manhã fui a casa da Ana.</example>
    </rule>
<!-- MARCOAGPINTO 2020-06-28 *END* -->

I am going to try adding the verb forms to the exception manually.

I don’t understand your examples in this rule. “Fui a praia” is incorrect, but “fui a farmácia” is correct?
You also have “fui a praia” as both correct and incorrect. It is logically impossible to pass the tests.

@jaumeortola

I am stressed, so I can’t reason 100%.

Anyway, I will create this:

<!ENTITY requer_crase_verbo_ser "é|era|eram|éramos|eras|éreis|és|são|sede|seja|sejais|sejam|sejamos|sejas|ser|será|serão|serás|serdes|serei|sereis|serem|seremos|seres|seria|seriam|seríamos|serias|seríeis|sermos|sois|somos|sou">

And replace the “ser” exception (<exception inflected='yes'>ser</exception>) with one built on this entity.
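A sketch of how that replacement might look in the first token of the original rule. This is an assumption about the intended edit, not the change that was actually tested. The entity contains none of the forms shared with “ir” (fui, foi, fomos, ...), so those forms are no longer excluded, while forms that belong to “ser” alone still are.

```xml
<!-- Assumed form of the change: the exception now matches surface forms via the new entity -->
<token inflected='yes' regexp='yes'>&requer_crase_verbos;
  <exception regexp='yes'>&requer_crase_verbo_ser;</exception></token>
```

The exception here has no `inflected` attribute because the entity lists surface forms rather than lemmas.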

Is it a better approach?

Thanks!

The entity approach seems to have worked.

I am going to check with 200 000 sentences.

Well, it produces hundreds of false positives :(

I am planning to give up on it. :(

EDIT:
@jaumeortola

Your suggestions passed TESTRULES PT:

<!-- MARCOAGPINTO 2020-06-28 *START* -->
<!-- "Fui a farmácia." -->
<!-- "Fui a praia." -->
    <rule>
      <pattern>
        <marker>
          <and>
            <token inflected="yes">ir</token>
            <token inflected="yes">ser</token>
          </and>
          <token regexp='yes'>as?</token>
        </marker>
          <token postag_regexp='yes' postag='N.F.+'>
            <exception postag='D..F.+|R.+' postag_regexp='yes'/></token>
      </pattern>
      <message>Esta palavra rege-se com a preposição "a".</message>
        <suggestion>\1 <match no='2' regexp_match='(a)(s?)' regexp_replace='à$2'/></suggestion>
      <url>https://pt.wikipedia.org/wiki/Crase</url>
      <short>Erro de crase</short>
      <example correction='fui à'>De manhã <marker>fui a</marker> praia.</example>
      <example>De manhã fui à farmácia.</example>
      <example>De manhã fui à praia.</example>
    </rule>
<!-- MARCOAGPINTO 2020-06-28 *END* -->

I am going to run it against 200 000 sentences right now.

A possible solution is to write a rule with a list of nouns (farmácia, praia, praça, rua, loja…) that need “à”. The results won’t be comprehensive, but perhaps they will be good enough.
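A rough sketch of such a rule, reusing the `<and>` block and the message from the rules above. The noun list is just the five examples mentioned here (with an optional plural “s” added as an assumption) and would need to be extended. Restricting the third token to known place nouns trades coverage for fewer false alarms on “ser” sentences.

```xml
<!-- Sketch only: crase after forms shared by "ir" and "ser", limited to a fixed noun list -->
<rule>
  <pattern>
    <marker>
      <and>
        <token inflected="yes">ir</token>
        <token inflected="yes">ser</token>
      </and>
      <token regexp='yes'>as?</token>
    </marker>
    <!-- Illustrative list of feminine place nouns; not comprehensive -->
    <token regexp='yes'>farmácias?|praias?|praças?|ruas?|lojas?</token>
  </pattern>
  <message>Esta palavra rege-se com a preposição "a".</message>
  <suggestion>\1 <match no='2' regexp_match='(a)(s?)' regexp_replace='à$2'/></suggestion>
  <url>https://pt.wikipedia.org/wiki/Crase</url>
  <short>Erro de crase</short>
  <example correction='fui à'>De manhã <marker>fui a</marker> praia.</example>
  <example>De manhã fui à praia.</example>
</rule>
```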

It is a good idea, though I am not sure when I will have the chance to do it :)

Correction: *“identificar artigo” (identify the article), not “intensificar” (intensify).