[en] Expected behaviour of spelling.txt

Mike_Unwalla · October 7, 2019, 8:15am

Last week, I added Qur’an to spelling.txt. LT gives a spelling error.

As a test, I added these terms to spelling.txt in my local copy of LT:
notcap’italised
Capti’lised

LT gives warnings about spelling for those terms. Should it give warnings?

jaumeortola · October 7, 2019, 8:39am

There are two solutions for this issue:

1. To make a more complex word tokenizer that allows apostrophes inside a word.
2. To create a multiwords.txt file, where words like “Qur’an” or “Palme d’or” are tagged regardless of their number of tokens.

Mike_Unwalla · October 7, 2019, 9:11am

But, there are hyphenated words in spelling.txt, and they give the expected result. (That is why I showed Palme d’or as an example.)

jaumeortola · October 7, 2019, 9:23am

The current English word tokenizer creates one token for “avant-garde”, but four tokens for “Palme d’or”. If there is no error in “Palme d’or”, it is because each token individually is allowed as an independent word.
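To illustrate the point about tokens, here is a toy sketch in Python. It is not LanguageTool's actual (Java) tokenizer: the regular expression and the names `tokenize` and `misspelled` are assumptions made for the example. It only mimics the behaviour described above, where a hyphenated word stays as one token but an apostrophe splits a word.

```python
import re

# Toy tokenizer (an assumption for illustration, not LanguageTool's code):
# - letters joined by hyphens form a single token
# - an apostrophe is always a token of its own
# - whitespace is dropped
TOKEN_RE = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*|['’]|\S")


def tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)


def misspelled(text, accepted):
    """Check each token on its own against the accepted word list."""
    return [
        tok
        for tok in tokenize(text)
        if any(ch.isalpha() for ch in tok) and tok not in accepted
    ]


print(tokenize("avant-garde"))  # ['avant-garde']
print(tokenize("Palme d’or"))   # ['Palme', 'd', '’', 'or']
print(tokenize("Qur’an"))       # ['Qur', '’', 'an']

# The entry "Qur’an" never matches, because no single token equals it.
print(misspelled("avant-garde Qur’an", {"avant-garde", "Qur’an", "an"}))  # ['Qur']
```

With a per-token check like this, an entry that the tokenizer splits into several tokens can never match, which is the behaviour reported at the top of the thread.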

So the current results are the expected results. The fix is one of the two solutions I mentioned.

If you want to tag expressions containing white spaces (like “Palme d’or”), the multiwords.txt file is necessary.


Mike_Unwalla · October 7, 2019, 9:26am

Palme alone gives a spelling error, but not Palme d’or.

jaumeortola · October 7, 2019, 9:33am

I see. “Palme d’or” has its own disambiguation rule.

You can write rules like this (for “Palme d’Or”, “Qur’an”, etc.) or you can use a multiwords.txt file. The result will be equivalent.

Mike_Unwalla · October 7, 2019, 2:05pm

@jaumeortola, thanks, but I do not understand.

Since when has the content of spelling.txt been related to parts of speech?
Anyway, Qur’an has a disambiguation rule that assigns NNP to each of the 3 tokens.

jaumeortola · October 7, 2019, 3:16pm

In some languages, when a token is tagged the spelling is ignored automatically (for example: languagetool/MorfologikCatalanSpellerRule.java at master · languagetool-org/languagetool · GitHub). That’s not the case in English, it seems, and I was not aware of it.
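As a rough sketch of that difference between languages (hypothetical Python, not the code of any LanguageTool rule; `Token`, `spelling_errors` and the `skip_tagged` switch are names invented for the example), a speller that skips tagged tokens could look like this:

```python
from dataclasses import dataclass, field


@dataclass
class Token:
    text: str
    pos_tags: list = field(default_factory=list)  # e.g. ["NNP"]


def spelling_errors(tokens, dictionary, skip_tagged):
    """Return the texts of tokens flagged as misspelled.

    skip_tagged is a hypothetical per-language switch: when True,
    any token that already carries a POS tag is never flagged.
    """
    errors = []
    for tok in tokens:
        if not any(ch.isalpha() for ch in tok.text):
            continue  # punctuation is not spell-checked
        if skip_tagged and tok.pos_tags:
            continue  # tagged token: spelling is ignored
        if tok.text not in dictionary:
            errors.append(tok.text)
    return errors


# "Qur’an" as three tokens, each tagged NNP by a disambiguation rule.
quran = [Token("Qur", ["NNP"]), Token("’", ["NNP"]), Token("an", ["NNP"])]

print(spelling_errors(quran, {"an"}, skip_tagged=True))   # []
print(spelling_errors(quran, {"an"}, skip_tagged=False))  # ['Qur']
```

The second call corresponds to what is described for English in this thread: the tokens are tagged, but the speller still flags them.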

I see now that spelling.txt also allows multiwords. Perhaps it doesn’t support multiwords without white spaces (like Qur’an)? I don’t know. The implementation of multiwords in spelling.txt is different from what I implemented in multiwords.txt. I’m not able to provide more help. Perhaps @Knorr or @dnaber can help you.

dnaber · October 7, 2019, 5:38pm

The fact that words with spaces work in spelling.txt is because this has been implemented as a special case (here). This could surely be improved…
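One way such a special case could be generalised is sketched below in Python. This is an assumption-laden illustration, not the LanguageTool implementation: the toy tokenizer and the names `tokenize` and `misspelled` are made up for the example. The idea is to tokenize each accepted entry and, if it yields more than one token, accept any run of tokens in the text that matches the whole sequence.

```python
import re

# Toy tokenizer (assumption): hyphens stay inside a word, apostrophes
# become separate tokens, whitespace is dropped.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:-[^\W\d_]+)*|['’]|\S")


def tokenize(text):
    return TOKEN_RE.findall(text)


def misspelled(text, accepted_entries):
    """Flag unknown tokens, but accept runs that match a multi-token entry."""
    tokens = tokenize(text)
    single = {e for e in accepted_entries if len(tokenize(e)) == 1}
    multi = [tokenize(e) for e in accepted_entries if len(tokenize(e)) > 1]

    # Mark every token covered by a complete multi-token entry.
    covered = [False] * len(tokens)
    for seq in multi:
        n = len(seq)
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == seq:
                for j in range(i, i + n):
                    covered[j] = True

    return [
        tok
        for i, tok in enumerate(tokens)
        if not covered[i]
        and any(ch.isalpha() for ch in tok)
        and tok not in single
    ]


entries = {"the", "and", "Qur’an", "Palme d’or"}
print(misspelled("the Qur’an and the Palme d’or", entries))  # []
print(misspelled("the Palme", entries))                      # ['Palme']
```

Because entries are compared as token sequences, it does not matter whether the entry contains white space ("Palme d’or") or not ("Qur’an"), and a partial match such as "Palme" on its own is still flagged.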

Mike_Unwalla · October 8, 2019, 7:21am

@jaumeortola, thanks for your comments. I didn’t know about the different behaviours in different languages.

@dnaber, Qur’an does not contain a space.

jaumeortola · October 8, 2019, 8:47am

Fixed here: spelling.txt: support multiwords without white space · languagetool-org/languagetool@2d47afc · GitHub

Mike_Unwalla · October 8, 2019, 10:18am

@jaumeortola, great, thank you.