Spellchecker behaviour


(Tiago F. Santos) #1

Something has changed in the spellcheck behaviour for inflections.
On the Portuguese main page, one of the example sentences is:

Nós prometo ajudá-lo.

There used to be a single error, in Nós prometo (verb form agreement error); the spelling in the sentence is correct.
Now ajudá-lo is also marked as a spelling error.
This would be OK for verb forms that are not hyphenated, but that is not the case here.
Since I updated the hunspell dictionaries recently, I wondered if this is related to that update, but the change in behaviour is very recent (one week at most), and the dictionaries work perfectly in LibreOffice and with the system-wide Hunspell.
Note that if you choose pt-MZ, which uses the standard "Dicionários Natura", you find the same issue.
Since there were recent changes to the spellchecker mechanisms, could this bug be related to them?

https://github.com/languagetool-org/languagetool-website/blob/master/www/pt/images/LT_screenshot.png

https://www.languagetool.org/pt/

I have now checked my most recent change to the tokenizer, but the hyphens were not touched.


\u002A and \u002B are the escaped * and +, respectively.


(Daniel Naber) #2

Those changes should be limited to German. You can use git checkout to easily check out any previous version and see if it's affected. Actually, you can automate the process of finding the commit that introduced the change with git bisect.


(Tiago F. Santos) #3

I have seen the commits, including the ones reducing the scope to the German section, but I am not seeing other possibilities. The last spellchecking-related change in pt was this:

<rule name="Ignore punctuation" id="IGNORE_PUNCTUATION">
  <pattern>
      <token postag='_PUNCT'/>
  </pattern>
  <disambig action="ignore_spelling"/>
</rule>

It should not interfere.
I have only read about git bisect so far, and I believe this is a good excuse to learn how to use it. I will report back if I actually pinpoint the issue.


(Tiago F. Santos) #4

Indeed, it was the commit that added the IGNORE_PUNCTUATION disambiguation rule.
I went the fast way with git checkouts (there are too many commits per day to use brute force):

# [de] update to latest de_DE.info
git checkout ce81e5d
git bisect good
# Merge branch 'master' of github.com:languagetool-org/languagetool
git checkout 5ec3fad
git bisect good
# [pt] IGNORE_PUNCTUATION disambiguation rule added
git checkout cbb5ca6
git bisect bad
# [pt] add disambiguation rule
git checkout 72b1c5d
git bisect good

I will comment it out, but this raises another question: why does the ignore_spelling action interfere with Hunspell's validation of the whole compound word?
Is this the intended behaviour, or should I file it as a possible bug on GitHub for future reference?

PS: This rule was created to avoid errors in separators like: ---------------------------


(Daniel Naber) #5

Probably because of the way tokens are ignored: they get replaced by whitespace. Thus the tokenization in the following step is different in this case.
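
To make that explanation concrete, here is a toy Java sketch. It is not LanguageTool's actual code: the class and method names (IgnoredTokenSketch, blankIgnored, wordsToCheck) are made up for illustration, and it simply assumes that an ignored token is replaced by one space before the text is split into words for the speller.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only; names and logic are assumptions, not LanguageTool internals.
class IgnoredTokenSketch {

  // Rebuild the text, writing a single space in place of every ignored token.
  static String blankIgnored(List<String> tokens, List<Boolean> ignored) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < tokens.size(); i++) {
      sb.append(ignored.get(i) ? " " : tokens.get(i));
    }
    return sb.toString();
  }

  // Split the rebuilt text on whitespace; each piece is what a speller would see.
  static List<String> wordsToCheck(String text) {
    List<String> words = new ArrayList<>();
    for (String w : text.trim().split("\\s+")) {
      if (!w.isEmpty()) {
        words.add(w);
      }
    }
    return words;
  }

  public static void main(String[] args) {
    // "ajud\u00e1" is "ajudá"; the escape keeps the source file ASCII-only.
    List<String> tokens = Arrays.asList("ajud\u00e1", "-", "lo");

    // Hyphen tagged _PUNCT and marked ignore_spelling: the word falls apart.
    List<Boolean> hyphenIgnored = Arrays.asList(false, true, false);
    System.out.println(wordsToCheck(blankIgnored(tokens, hyphenIgnored)));

    // Nothing ignored: the hyphenated form reaches the speller as one word.
    List<Boolean> nothingIgnored = Arrays.asList(false, false, false);
    System.out.println(wordsToCheck(blankIgnored(tokens, nothingIgnored)));
  }
}
```

Under that assumption, the speller would receive two separate pieces instead of the hyphenated verb form, and a piece that is not a dictionary word on its own would be flagged.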


(Tiago F. Santos) #6

Many thanks for looking into this and pointing me in the right direction.
I noticed that most languages actually show a spelling error in '---------------------' (apart from [en], which uses a Morphologik speller variation).
I tweaked the code a bit.

   boolean isMisspelled(String word) {
     boolean isAlphabetic = true;
+    boolean isSeparator = false;
     if (word.length() == 1) { // hunspell dictionaries usually do not contain punctuation
       isAlphabetic = Character.isAlphabetic(word.charAt(0));
+    } else {
+      isSeparator = word.matches("-+");
     }
-    return (isAlphabetic && !word.equals("--") && hunspellDict.misspelled(word)) || isProhibited(removeTrailingDot(word));
+    return (isAlphabetic && !isSeparator && hunspellDict.misspelled(word)) || isProhibited(removeTrailingDot(word));
   }

Or should I push a local solution like:

<rule name="Ignore punctuation" id="IGNORE_PUNCTUATION">
  <pattern>
      <token postag='_PUNCT'/>
      <token postag='_PUNCT'/>
      <token postag='_PUNCT'/>
  </pattern>
  <disambig action="ignore_spelling"/>
</rule>
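
For anyone who wants to try the isMisspelled tweak above in isolation, here is a minimal, self-contained sketch. It makes some assumptions: a plain Set stands in for hunspellDict (the class SeparatorCheckSketch and the helper dictMisspelled are hypothetical), and the isProhibited/removeTrailingDot part of the real method is left out.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Sketch only; a Set replaces the real Hunspell dictionary.
class SeparatorCheckSketch {

  // Hypothetical stand-in dictionary. "ajud\u00e1-lo" is "ajudá-lo".
  private static final Set<String> KNOWN =
      new HashSet<>(Arrays.asList("ajud\u00e1-lo", "prometo"));

  // Stand-in for hunspellDict.misspelled(word).
  static boolean dictMisspelled(String word) {
    return !KNOWN.contains(word);
  }

  // Same control flow as the proposed change, minus the isProhibited check.
  static boolean isMisspelled(String word) {
    boolean isAlphabetic = true;
    boolean isSeparator = false;
    if (word.length() == 1) {
      // hunspell dictionaries usually do not contain punctuation
      isAlphabetic = Character.isAlphabetic(word.charAt(0));
    } else {
      // any run of hyphens counts as a separator, not only "--"
      isSeparator = word.matches("-+");
    }
    return isAlphabetic && !isSeparator && dictMisspelled(word);
  }

  public static void main(String[] args) {
    String[] samples = {"-", "--", "---------------------", "ajud\u00e1-lo", "xyzzy"};
    for (String w : samples) {
      System.out.println(w + " -> misspelled: " + isMisspelled(w));
    }
  }
}
```

With this stand-in dictionary, only the unknown word is reported; hyphen runs of any length are skipped, and the hyphenated verb form is still checked as a whole word.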

(Daniel Naber) #7

Sounds good to me, but what was the original reason to add a disambiguation here? Are there punctuation characters that hunspell complains about? Which ones?


(Tiago F. Santos) #8

Great, but I could not tell which solution I should push: the Hunspell change, or just the disambiguation in [pt]?

I noticed the problem with the hyphen, which I had also catalogued as _PUNCT because it is commonly used as a dash.
I generalized the rule to safeguard against every other possibility without actually having to test each one, but this "shotgun" approach may be excessive.
Limiting it to hyphens, as is done in HunspellRule, is probably enough, which is why I haven't added other possibilities there.


(Daniel Naber) #9

I'd suggest using the disambiguation in [pt].