[en] Expected behaviour of spelling.txt

Last week, I added Qur’an to spelling.txt. LT still gives a spelling error:
image

As a test, I added these terms to spelling.txt in my local copy of LT:
notcap’italised
Capti’lised

LT gives warnings about spelling for those terms. Should it give warnings?

There are two solutions for this issue:

  • To make a more complex word tokenizer that allows apostrophes inside a word.
  • To create a multiwords.txt file, where words like “Qur’an” or “Palme d’or” are tagged regardless of their number of tokens.

But there are hyphenated words in spelling.txt, and they give the expected result. (That is why I showed Palme d’or as an example.)

The current English word tokenizer creates one token for “avant-garde”, but four tokens for “Palme d’or”. If there is no error in “Palme d’or”, it is because each token individually is allowed as an independent word.
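To make the difference concrete, here is a minimal sketch of two toy tokenizers. It is not LanguageTool's actual tokenizer: the regular expressions and the function names are assumptions chosen only to reproduce the token counts described above (whitespace is dropped rather than returned as a token). The second pattern illustrates the "allow apostrophes inside a word" idea.

```python
import re

# Hypothetical pattern: letters joined by hyphens form one token,
# and every other non-space character is a token of its own.
SPLIT_ON_APOSTROPHE = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*|\S")

# Hypothetical variant: hyphens AND apostrophes may join letters inside a word.
APOSTROPHE_INSIDE_WORD = re.compile(r"[^\W\d_]+(?:[-’'][^\W\d_]+)*|\S")


def toy_tokenize(text, pattern=SPLIT_ON_APOSTROPHE):
    """Return the non-whitespace tokens of text according to pattern."""
    return pattern.findall(text)


print(toy_tokenize("avant-garde"))
print(toy_tokenize("Palme d’or"))
print(toy_tokenize("Qur’an"))
print(toy_tokenize("Qur’an", APOSTROPHE_INSIDE_WORD))
```

With the first pattern, a spelling.txt entry such as Qur’an can never match a single token, because the text is split into Qur, ’ and an before the speller sees it.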

So the current results are the expected results. The fix is one of the two solutions I mentioned.

If you want to tag expressions containing white spaces (like “Palme d’or”), the multiwords.txt file is necessary.
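As a rough illustration of the multiword idea, the sketch below marks every token that is covered by a multiword entry, however many tokens the entry spans. It is only a sketch under stated assumptions: the tokenizer pattern, the function names and the in-memory entry list are hypothetical, and this is not how multiwords.txt is implemented in LanguageTool. Whitespace is ignored here, which is a simplification.

```python
import re

# Hypothetical toy tokenizer: hyphenated words stay whole, apostrophes split.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*|\S")


def tokenize(text):
    return TOKEN_RE.findall(text)


def covered_by_multiwords(tokens, entries):
    """Return one boolean per token: True if a multiword entry covers it."""
    # Tokenize the entries the same way as the text, so both sides agree.
    patterns = [tokenize(entry) for entry in entries]
    covered = [False] * len(tokens)
    for start in range(len(tokens)):
        for pattern in patterns:
            end = start + len(pattern)
            if pattern and tokens[start:end] == pattern:
                for i in range(start, end):
                    covered[i] = True
    return covered


entries = ["Palme d’or", "Qur’an"]  # stand-in for the lines of a multiword list
tokens = tokenize("the Palme d’or winner")
print(list(zip(tokens, covered_by_multiwords(tokens, entries))))
```

A speller that consults such a mask could skip the covered tokens, so Palme would be accepted inside "Palme d’or" but still flagged on its own.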

image

Palme alone gives a spelling error, but not Palme d’or.

image

I see. “Palme d’or” has its own disambiguation rule.

You can write rules like this (for “Palme d’Or”, “Qur’an”, etc.) or you can use a multiwords.txt file. The result will be equivalent.

@jaumeortola, thanks, but I do not understand.

  1. Since when has the content of spelling.txt been related to parts of speech?
  2. Anyway, Qur’an has a disambiguation rule that assigns NNP to each of the 3 tokens.

In some languages, a token that has been tagged is automatically skipped by the spell checker (for example: languagetool/MorfologikCatalanSpellerRule.java at master · languagetool-org/languagetool · GitHub). That does not seem to be the case in English, and I was not aware of it.
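The difference between the two behaviours can be sketched as follows. This is a simplified model, not the code of any LanguageTool rule: the function name, the skip_tagged flag and the tiny dictionary are all assumptions made for illustration.

```python
def misspelled_words(tagged_tokens, dictionary, skip_tagged):
    """Return the words a toy speller would flag.

    tagged_tokens: list of (word, pos_tag) pairs; pos_tag is None if untagged.
    skip_tagged:   if True, any token that carries a tag is never flagged.
    """
    errors = []
    for word, tag in tagged_tokens:
        if skip_tagged and tag is not None:
            continue  # a tag counts as proof that the token is a known word
        if word.isalpha() and word not in dictionary:
            errors.append(word)
    return errors


# Qur’an split into three tokens, each tagged NNP by a disambiguation rule.
tokens = [("Qur", "NNP"), ("’", "NNP"), ("an", "NNP")]
dictionary = {"an"}

print(misspelled_words(tokens, dictionary, skip_tagged=True))   # tags suppress errors
print(misspelled_words(tokens, dictionary, skip_tagged=False))  # tags are ignored
```

In the second call the tags make no difference, which matches what was observed for English: the disambiguation rule assigns NNP, yet Qur is still reported.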

I see now that spelling.txt also allows multiwords. Perhaps it doesn’t support multiwords without white spaces (like Qur’an)? I don’t know. The implementation of multiwords in spelling.txt is different from what I implemented in multiwords.txt. I’m not able to provide more help. Perhaps @Knorr or @dnaber can help you.

Words with spaces work in spelling.txt because this has been implemented as a special case (here). This could surely be improved…

@jaumeortola, thanks for your comments. I didn’t know about the different behaviours in different languages.

@dnaber, Qur’an does not contain a space.

Fixed here: spelling.txt: support multiwords without white space · languagetool-org/languagetool@2d47afc · GitHub


@jaumeortola, great, thank you.