This ensures that the spell checker does not complain about the word when it stands alone. A compound word like Bahá’í-Weltzentrum, however, still confuses the spell checker. The only solution I have found so far is to extend the rule with each known compound word:
I don’t think so, as it’s supposed to be used for single words, but the apostrophe will probably cause the word to be split by the tokenizer. Anyway, added.txt is for the internal part-of-speech tagger and not related to spelling (but you probably know that).
OK. Thanks for the quick answer. I had assumed that a tokenized word would automatically be treated as correctly spelled. After a few tests and a few articles I realized that this is not the case: the spell checker and the tokenizer are separate.
I will see what I can do about tokenizing my words at a later point. At the moment I am using this adaptation of your solution:
I found a very interesting thread on Stack Overflow explaining how Hunspell works. Since LanguageTool uses Hunspell, the explanation there also applies to the LanguageTool spell checker:
This is a good example of why the Unicode standard is important. The U+0027 apostrophe character is punctuation, not intended to be part of a word. Use the word-forming code point U+02BC instead.
So instead of using the disambiguation rule, I can simply add the word using the word-forming code point ‘MODIFIER LETTER APOSTROPHE’ (U+02BC):
Old: Bahá’í
New: Baháʼí
You need a hex editor to see that the two apostrophes are actually different!
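If no hex editor is at hand, a few lines of Python should show the same difference. This is only a sketch using the standard unicodedata module; the helper name describe is made up for this example. It also lists U+2019, the typographic apostrophe that many editors insert automatically, which is punctuation as well.

```python
import unicodedata

def describe(ch: str) -> str:
    """Return code point, Unicode name and general category of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)} (category {unicodedata.category(ch)})"

# Three characters that look almost identical in most fonts.
for ch in ("\u0027", "\u2019", "\u02BC"):
    print(describe(ch))
```

The first two characters report a punctuation category (Po and Pf), while U+02BC reports Lm, a letter category, which fits the explanation quoted above.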
So now I can add the new form to spelling.txt and the spell checker can handle it, either alone or as a part of compound words.
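For anyone with a longer word list, the replacement could be scripted before the entries go into spelling.txt. The following is a rough sketch, not something LanguageTool provides: the function name to_word_forming is hypothetical, and it assumes each input is a single word-list entry, since running it over normal text would also change real quotation marks.

```python
MODIFIER_APOSTROPHE = "\u02BC"  # MODIFIER LETTER APOSTROPHE, a word-forming character

# Map both punctuation apostrophes (U+0027 and U+2019) to U+02BC.
_APOSTROPHE_TABLE = str.maketrans({
    "\u0027": MODIFIER_APOSTROPHE,
    "\u2019": MODIFIER_APOSTROPHE,
})

def to_word_forming(entry: str) -> str:
    """Replace punctuation apostrophes in one word-list entry (assumed to be a single word)."""
    return entry.translate(_APOSTROPHE_TABLE)

# Example entries, written with escapes so the code points are unambiguous.
for entry in ("Bah\u00e1\u2019\u00ed", "Bah\u00e1'\u00ed-Weltzentrum"):
    print(to_word_forming(entry))
```

The text being checked has to use U+02BC as well, otherwise the spell checker will still see the old form.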