How to add a rule to ignore Portuguese abbreviated words?

andrew.yang · July 19, 2016, 8:14am

There’s a problem in Portuguese:

An abbreviated word must end with a period ("."), and the longer the word, the more possible abbreviations it can have. For this reason, it is not possible to provide a list of all possible abbreviations. Below are two examples:

Localização (means “localization”)
Localizaç.
Localiz.
Local.
Loc.

Selecionar (means verb “to select”)
Selecion.
Selec.
Sel.

Right now, the checking tool treats every period as a sentence-ending period and cannot recognize it as part of an abbreviation. Example:

Selecion. is reported as Selecion
Selec. is reported as Selec
Sel. is reported as Sel

To solve this issue, my suggestion is to create something like an IF function:

IF the word “ABCD” is followed by a period, then ignore the issue.
IF the word “ABCD” is NOT followed by a period, then it is an issue.

If these two measures are taken, the false positives will easily go down by 70%(!).

So, I want to add a rule to org/languagetool/resource/pt/disambiguation.xml

Can someone help me? Does this work?

  <rule id="ABBR" name="abbreviation">
    <pattern>
      <token regexp="yes">\w*\.</token>
    </pattern>
    <disambig action="ignore_spelling"/>
  </rule>

jaumeortola · July 19, 2016, 8:45am

First, you can improve the sentence tokenization by editing the file segment.srx

In disambiguation, you can do something like:

  <rule id="ABBR" name="abbreviation">
    <pattern>
      <marker>
        <token regexp="yes">\w+</token>
      </marker>
      <token>.</token>
    </pattern>
    <disambig action="ignore_spelling"/>
  </rule>

But I think this is problematic. You will remove many false positives, but you are going to cause false negatives (undetected spelling errors).

You should try to determine whether there is really a sentence ending (improving segment.srx). For example, if the period is followed by a lower-case word (and that word is not an error itself), then it is probably not a sentence ending.
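
As a rough sketch of that lower-case heuristic, a no-break rule in segment.srx could look like the one below. It assumes that the Portuguese section also contains a generic rule that breaks after a period (not shown in this thread) and that this no-break rule is placed before it, since SRX applies the first rule that matches at a given position. The pattern is only an illustration, not a rule taken from LanguageTool's actual file.

```xml
<!-- Sketch (assumption): do not break after "word." when the next word starts in lower case -->
<rule break="no">
  <beforebreak>\p{L}+\.\s+</beforebreak>
  <afterbreak>\p{Ll}</afterbreak>
</rule>
```

With such a rule, "Selec. a opção" would stay in one sentence, while "Selec. A opção" would still be split by the general period rule.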

Anyway, if you cannot provide a list of abbreviations, the problem is really hard. You can try to check whether some word in the dictionary can complete the abbreviation (this would have to be done in Java).

andrew.yang · July 19, 2016, 9:15am

Yes, thanks!

So how do I improve segment.srx? It’s difficult for me.

Should I just delete “.” in the Portuguese segment?

<languagerule languagerulename="Portuguese">
...
  <!-- Break rules -->
  <rule break="yes">
    <beforebreak>[!?…][\u0002|'|"|«|\)|\]|\}¹²³]?\s+</beforebreak>
    <afterbreak></afterbreak>
  </rule>
  <rule break="yes">
    <beforebreak>[!?…]['"\p{Pe}\u00BB\u201D]?</beforebreak>
    <afterbreak>\p{Lu}[^\p{Lu}]</afterbreak>
  </rule>
  <rule break="yes">
    <beforebreak>\s\p{L}[!?…]\s</beforebreak>
    <afterbreak>\p{Lu}\p{Ll}</afterbreak>
  </rule>
</languagerule>

jaumeortola · July 19, 2016, 9:43am

Most of the time, a period is a sentence ending, so breaking on it is a good general rule. You should try to add exceptions to this rule (i.e. expand the list of abbreviations). There is no silver bullet.

You can refine the rules depending on the context: abbreviations that can/cannot finish a sentence, abbreviations followed/preceded by numbers, etc. Take a look at other languages.
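
A minimal sketch of such exceptions, written as SRX no-break rules, might look like this. The abbreviation lists are hypothetical examples (the first reuses forms from the original question, and "pág." and "n." are assumed here only for illustration), and the rules are assumed to sit before the general rule that breaks after a period.

```xml
<!-- Sketch (assumption): listed abbreviations never end a sentence -->
<rule break="no">
  <beforebreak>\b(Loc|Localiz|Sel|Selec|Selecion)\.\s</beforebreak>
  <afterbreak></afterbreak>
</rule>
<!-- Sketch (assumption): abbreviations that do not end a sentence when followed by a number -->
<rule break="no">
  <beforebreak>\b(pág|n)\.\s</beforebreak>
  <afterbreak>\p{Nd}</afterbreak>
</rule>
```

The second rule shows the context-dependent case: the abbreviation blocks a break only when a digit follows, so it can still end a sentence elsewhere.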

marcoagpinto · July 19, 2016, 11:38am

Hello!

Years ago I sent Daniel Naber dozens of Portuguese abbreviations.

marcoagpinto · July 19, 2016, 12:07pm

Besides, the abbreviations you suggested will appear as typos because they aren’t in the Portuguese speller maintained by Minho University.

They can’t be found in the list of official abbreviations.