Adding a word with an apostrophe to the spelling checker: Bahá'í

I am checking German text which includes the word Bahá’í.

Bahá’í can be used as:

  • a noun: Er traf einen Bahá’í. (He met a Bahá’í)
  • part of a compound word: Er besuchte das Bahá’í-Weltzentrum. (He visited the Bahá’í World Center)

I have applied the solution described in another post to make the spell checker ignore the word. This is the result:

<rule name="Bahá’í" id="BAHAI">
    <pattern>
        <marker>
            <token>Bahá</token>
            <token spacebefore="no">’</token>
            <token spacebefore="no">í</token>
        </marker>
    </pattern>
    <disambig action="ignore_spelling"/>
</rule>

This ensures that the spell checker does not complain about the word when it stands alone. A compound word like Bahá’í-Weltzentrum, however, still confuses the spell checker. The only solution I have found so far is to extend the rule with each known compound word:

<rule name="Bahá’í" id="BAHAI">
    <pattern>
        <marker>
            <token>Bahá</token>
            <token spacebefore="no">’</token>
            <token spacebefore="no">í|í-Weltzentrum</token>
        </marker>
    </pattern>
    <disambig action="ignore_spelling"/>
</rule>

A list of all possible compound words would be infinitely long.

How can I add markup for the word to avoid adding a long list of compound words?

Would something like this work?

<disambig postag="SUB"/>

It should be possible to use a regex here to match anything:

<token spacebefore="no" regexp="yes">í|í-.*</token>

Thanks for the answer! If I understand correctly, spelling mistakes in the second word would go undetected.

  1. Would it be possible to just add the word with the apostrophe to added.txt?
  2. Can added.txt handle words with apostrophes?

Basically, I want this and other words with apostrophes to be treated as normal nouns or names.

I don’t think so, as it’s supposed to be used for single words, and the apostrophe will probably cause the word to be split by the tokenizer. Anyway, added.txt is for the internal part-of-speech tagger and not related to spelling (but you probably know that).

OK. Thanks for the quick answer. I had assumed that a tokenized word would automatically be defined as being spelled correctly. After a few tests and a few articles I realized that is not the case: the spell checker and tokenizer are separate.

I will see what I can do about tokenizing my words at a later point. At the moment I am using this adaptation of your solution:

<rule name="Bahá’í" id="BAHAI">
    <pattern>
        <token>Bahá</token>
        <token spacebefore="no" regexp="yes">['’]</token>
        <token spacebefore="no" regexp="yes">í(-.*)?</token>
    </pattern>
    <disambig action="ignore_spelling" />
</rule>
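
To see what the third token's pattern accepts, here is a minimal Python sketch. It assumes that a LanguageTool token regex has to match the whole token (mimicked here with `re.fullmatch`) and that the tokenizer delivers `í-Weltzentrum` as a single token, as the rule above implies. The helper name `third_token_matches` is made up for illustration.

```python
import re

# Pattern of the third token in the rule above.
THIRD_TOKEN = re.compile(r"í(-.*)?")

def third_token_matches(token: str) -> bool:
    """Return True if the whole token matches (assumed LanguageTool behaviour)."""
    return THIRD_TOKEN.fullmatch(token) is not None

# Try the bare ending, a compound, a misspelled compound and a non-match.
for token in ["í", "í-Weltzentrum", "í-Weltzentrumm", "íx"]:
    print(token, third_token_matches(token))
```

The misspelled `í-Weltzentrumm` is accepted as well, which is the drawback mentioned earlier in the thread: typos in the second part of the compound go undetected.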

I found a very interesting thread on Stack Overflow explaining how Hunspell works. Since LanguageTool uses Hunspell, the explanation there applies to the LanguageTool spelling checker:

This is a good example of why the Unicode standard is important. The U+0027 apostrophe character is punctuation, not intended to be part of a word. Use the word-forming code point U+02BC instead.

So instead of using the disambiguation rule I can simply add the word using the word-forming code point ‘MODIFIER LETTER APOSTROPHE’:

Old: Bahá’í
New: Baháʼí

You need a hex editor to see that the two apostrophes are actually different!

So now I can add the new form to spelling.txt and the spell checker can handle it, either alone or as a part of compound words.
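
A hex editor is not strictly required to tell the apostrophes apart. The following Python sketch prints the code point, Unicode name and general category of each candidate character. The second half uses Python's `\w` only as a rough stand-in for "word-forming"; it is an assumption that this resembles what Hunspell or the LanguageTool tokenizer do, and the names `describe`, `OLD` and `NEW` are made up for illustration.

```python
import re
import unicodedata

def describe(ch: str) -> tuple[str, str, str]:
    """Return (code point, Unicode name, general category) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))

# Escapes are used so the two look-alike characters cannot be confused.
OLD = "Bah\u00e1\u2019\u00ed"  # typographic apostrophe, as typed in this thread
NEW = "Bah\u00e1\u02bc\u00ed"  # MODIFIER LETTER APOSTROPHE

# ASCII apostrophe, typographic apostrophe, modifier letter apostrophe.
for ch in ("\u0027", "\u2019", "\u02bc"):
    print(describe(ch))

# Split each spelling into runs of word characters.
print(re.findall(r"\w+", OLD))
print(re.findall(r"\w+", NEW))
```

U+0027 and U+2019 are in punctuation categories (`Po` and `Pf`), while U+02BC is a letter (`Lm`). Accordingly, the old spelling splits into two pieces at the apostrophe and the new one stays whole.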