[pt] Problem creating an antipattern - 2021-09-02

marcoagpinto · September 2, 2021, 5:07am

I wanted to improve the rule Tiago created years ago that detects spaces before punctuation so that it can remove false positives when we are referring to file extensions.

The problem is that his rule uses <regexp> and has no <pattern>, so adding an <antipattern> triggers an error:

RULE:

  <rulegroup id="SPACE_BEFORE_PUNCTUATION" name="Espaços antes da pontuação">
    <!-- Based on German grammar.xml, by Tiago F. Santos, 2017-07-08 -->
    <rule>
      <regexp>\b([\p{L}\d]+) ([!?»”’,….])</regexp>
      <message>Remova o espaço antes deste sinal de pontuação.</message>
        <suggestion>\1\2</suggestion>
      <example correction="escapou!">Como é que isto me <marker>escapou !</marker></example>
    <!--example correction="escapou!">Como é que isto me <marker>escapou   !</marker></example-->
      <example correction="roda.">Existem duas estratégias possíveis: aproveitar o que existe ou reinventar a <marker>roda .</marker></example>
    </rule>
    <rule>
      <regexp>\b([\p{L}\d]+) ([:;])(?![\-o]?(?:[()/]|[DSP]\b))</regexp>
      <message>Remova o espaço antes deste sinal de pontuação.</message>
        <suggestion>\1\2</suggestion>
      <example correction="possíveis:">Existem duas estratégias <marker>possíveis :</marker> aproveitar o que existe ou reinventar a roda.</example>
      <example>Um sorriso :-)</example>
      <example>Um sorriso :)</example>
      <example>Um sorriso :(</example>
      <example>Um sorriso :-/</example>
      <example>Um sorriso :/</example>
      <example>Um sorriso :D</example>
      <example correction="Brasil;">Site de Instituto Ludwig von Mises <marker>Brasil ;</marker>Principais portais web</example>
    </rule>
  </rulegroup>

MY ANTIPATTERN:

<!-- MARCOAGPINTO 2021-09-02 (25-JUN-2021+) *START* -->
<!--
Os ficheiros .png não perdem qualidade.
-->
      <antipattern>
		<token regexp='yes'>extensão|extensões|ficheiros?</token>
		<token regexp='yes' spacebefore='yes'>[.]</token>
		<token spacebefore='no' postag='NC.+|UNKNOWN' postag_regexp='yes'/>
      </antipattern>
<!-- MARCOAGPINTO 2021-09-02 (25-JUN-2021+) *END* -->

Is there an easy fix?

Thanks!

Ruud_Baars · September 2, 2021, 6:05am

Why don’t you convert the regexp to a regular pattern? Use spacebefore="yes". That attribute might not have existed yet when the rule was created.

marcoagpinto · September 2, 2021, 6:19am

@Ruud_Baars

I don’t know how to do it.

udomai · September 2, 2021, 8:00am

<regexp>\b([\p{L}\d]+) ([!?»”’,….])</regexp>

=

<token regexp="yes">[\p{L}\d]+</token>
<token spacebefore="yes" regexp="yes">[!?»”’,….]</token>
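
Put together with the antipattern from the first post, the whole first rule might then look roughly like the sketch below. This is untested and assumes the antipattern is accepted once the rule uses <pattern> instead of <regexp>. The example without a correction is the sentence from the comment above the antipattern, added here as a sentence that should not trigger the rule.

```xml
<rule>
  <!-- Antipattern from the first post: "ficheiros .png" etc. should not be flagged -->
  <antipattern>
    <token regexp="yes">extensão|extensões|ficheiros?</token>
    <token regexp="yes" spacebefore="yes">[.]</token>
    <token spacebefore="no" postag="NC.+|UNKNOWN" postag_regexp="yes"/>
  </antipattern>
  <!-- Token-based version of \b([\p{L}\d]+) ([!?»”’,….]) -->
  <pattern>
    <token regexp="yes">[\p{L}\d]+</token>
    <token spacebefore="yes" regexp="yes">[!?»”’,….]</token>
  </pattern>
  <message>Remova o espaço antes deste sinal de pontuação.</message>
  <suggestion>\1\2</suggestion>
  <example correction="escapou!">Como é que isto me <marker>escapou !</marker></example>
  <example>Os ficheiros .png não perdem qualidade.</example>
</rule>
```

In a pattern rule, \1 and \2 in the suggestion refer to the first and second matched token, so the suggestion can stay as it is.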

marcoagpinto · September 2, 2021, 8:12am

lol @udomai

marcoagpinto · September 2, 2021, 8:13am

I will do it in the afternoon then.

I have a date today!

Thanks!

marcoagpinto · September 2, 2021, 2:35pm

@udomai

I checked the diff before and after your suggestions and the results are very different:

  <rulegroup id="SPACE_BEFORE_PUNCTUATION" name="Espaços antes da pontuação">
    <!-- Based on German grammar.xml, by Tiago F. Santos, 2017-07-08 -->
	<!-- Converted <regexp> to <pattern> thanks to Ruud and Udomai - 2021-09-02 -->
    <rule>
      <pattern>
        <token regexp="yes">[\p{L}\d]+</token>
        <token spacebefore="yes" regexp="yes">[!?»”’,….]</token>
        <!-- <regexp>\b([\p{L}\d]+) ([!?»”’,….])</regexp> *** TIAGO VERSION 2017-07-08 *** -->
      </pattern>
      <message>Remova o espaço antes deste sinal de pontuação.</message>
        <suggestion>\1\2</suggestion>
      <example correction="escapou!">Como é que isto me <marker>escapou !</marker></example>
    <!--example correction="escapou!">Como é que isto me <marker>escapou   !</marker></example-->
      <example correction="roda.">Existem duas estratégias possíveis: aproveitar o que existe ou reinventar a <marker>roda .</marker></example>
    </rule>
    <rule>
      <regexp>\b([\p{L}\d]+) ([:;])(?![\-o]?(?:[()/]|[DSP]\b))</regexp>
      <message>Remova o espaço antes deste sinal de pontuação.</message>
        <suggestion>\1\2</suggestion>
      <example correction="possíveis:">Existem duas estratégias <marker>possíveis :</marker> aproveitar o que existe ou reinventar a roda.</example>
      <example>Um sorriso :-)</example>
      <example>Um sorriso :)</example>
      <example>Um sorriso :(</example>
      <example>Um sorriso :-/</example>
      <example>Um sorriso :/</example>
      <example>Um sorriso :D</example>
      <example correction="Brasil;">Site de Instituto Ludwig von Mises <marker>Brasil ;</marker>Principais portais web</example>
    </rule>
  </rulegroup>

I tested against a 600 000 sentence corpus.

What could be wrong in it?

Thanks!

marcoagpinto · September 3, 2021, 7:06am

@udomai @Ruud_Baars @jaumeortola

Any idea how to fix the rule?

The converted rule should give the same results on the corpus as the original one.

Thanks!

jaumeortola · September 3, 2021, 7:33am

Can you post examples of the unexpected differences?

marcoagpinto · September 3, 2021, 7:44am

@jaumeortola

Yes, see the diff of before and after, attached here.

before_after.zip (658.0 KB)

Ruud_Baars · September 5, 2021, 3:44pm

You could try adding postag="SENT_END" to the last token.
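
A sketch of what that could look like, assuming the suggestion is meant for the second token of the converted pattern. It is untested. SENT_END is the tag LanguageTool puts on the last token of a sentence, so with this change the rule would only match punctuation that ends a sentence.

```xml
<pattern>
  <token regexp="yes">[\p{L}\d]+</token>
  <!-- Assumption: SENT_END goes on the punctuation token -->
  <token spacebefore="yes" regexp="yes" postag="SENT_END">[!?»”’,….]</token>
</pattern>
```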

marcoagpinto · September 6, 2021, 6:43am

@Ruud_Baars

With SENT_END the number of matches becomes even smaller.

It removes tons of positives.

See attachment for diff:
before=original rule
after=rule with the changes from a few days ago
after2=rule with SENT_END
before_after_after2.zip (832.2 KB)

Ruud_Baars · September 6, 2021, 6:06pm

That means that the regexp rule does not use the segmentation (?). It also means there might be deficiencies in segment.srx.