[pt] Improve rule ID: PORTUGUESE_WORD_REPEAT_RULE

marcoagpinto · September 21, 2021, 9:11pm

The following sentence triggers an error:

É só telefonemas e e-mails de clientes a reclamar.

Since it is a Java rule and I don’t know Java, could one of you code an antipattern for it?

e
(space)
e, with no space after it
-, with no space after it

Thanks!

udomai · September 22, 2021, 10:14am

Hi!

This is REDUNDANT_CONJUNCTIONS, first subrule. It’s in grammar.xml. Could you add the antipattern for

<token>e</token>
<token>e</token>
<token spacebefore="no"/>

?

marcoagpinto · September 22, 2021, 10:25am

Let me try.

First I will run a check on 600,000 Tatoeba + Wikipedia sentences to compare the results before and after.

marcoagpinto · September 22, 2021, 10:38am

It didn’t work:

<!-- MARCOAGPINTO 2021-09-22 (25-JUN-2021+) *START* -->
<!--
É só telefonemas e e-mails de clientes a reclamar.
-->
      <antipattern>
        <token>e</token>
        <token>e</token>
        <token spacebefore="no"/>
      </antipattern>
<!-- MARCOAGPINTO 2021-09-22 (25-JUN-2021+) *END* -->

I searched grammar.xml for the wording of the message and couldn’t find it, so the rule must be in Java.

udomai · September 22, 2021, 11:29am

What message do you see? I see “Possível fragmento. Utilize apenas uma conjunção deste tipo.”

It is in the grammar.xml. The antipattern works. See this commit.

marcoagpinto · September 22, 2021, 12:48pm

"Possível erro de digitação. Repetiu uma palavra.

Examples:
Este é é apenas uma frase de exemplo. x
Este é apenas uma frase de exemplo. ✓"

udomai · September 22, 2021, 1:07pm

I can’t reproduce this. The sentence you gave above only triggers the XML rule:

É só telefonemas e e-mails de clientes a reclamar.

The second example you gave is a true positive (since there is a space after the second word).

marcoagpinto · September 22, 2021, 1:16pm

I was using the Stand-alone tool to get the examples.

marcoagpinto · September 22, 2021, 1:17pm

That is how I find the rules.

jaumeortola · September 22, 2021, 1:28pm

The best solution is to add “e-mail” and “e-mails” as tagged words. This way they will be tokenized as one token. Done here: [pt] e-mail (added.txt) · languagetool-org/languagetool@b829941 · GitHub
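
To illustrate why treating “e-mails” as a single token removes the false positive, here is a rough Python sketch. It is not LanguageTool's actual tokenizer or rule code: `tokenize`, `repeated_words` and `TAGGED_WORDS` are hypothetical names made up for this example, and the splitting logic is only an assumption about the general idea (hyphenated words get split unless they are known to the tagger).

```python
import re

# Hypothetical stand-in for the tagger's extra word list (e.g. added.txt).
TAGGED_WORDS = frozenset({"e-mail", "e-mails"})


def tokenize(text, tagged_words=frozenset()):
    """Split text into word, whitespace and punctuation tokens.

    A hyphenated word stays in one piece only if it is in tagged_words;
    otherwise it is split at each hyphen (assumed behaviour, for illustration).
    """
    tokens = []
    # Whitespace runs, letter sequences (optionally hyphenated), or any single character.
    for chunk in re.findall(r"\s+|[^\W\d_]+(?:-[^\W\d_]+)*|.", text):
        if len(chunk) > 1 and "-" in chunk and chunk.lower() not in tagged_words:
            tokens.extend(re.split(r"(-)", chunk))
        else:
            tokens.append(chunk)
    return tokens


def repeated_words(tokens):
    """Return every word that is immediately followed by the same word."""
    words = [t for t in tokens if not t.isspace()]
    return [a for a, b in zip(words, words[1:])
            if a.isalpha() and a.lower() == b.lower()]


sentence = "É só telefonemas e e-mails de clientes a reclamar."

# Without the tagged words, "e-mails" is split and "e e" looks like a repeat.
print(repeated_words(tokenize(sentence)))

# With "e-mails" known as a word, it stays one token and nothing is flagged.
print(repeated_words(tokenize(sentence, TAGGED_WORDS)))

# A genuine repeat is still reported either way.
print(repeated_words(tokenize("Este é é apenas uma frase de exemplo.", TAGGED_WORDS)))
```

With this approach no antipattern is needed for this case, because the second “e” never exists as a separate token.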