
[sv] Handling proper nouns when spell checking

(Martin Flodin) #1

The texts we spell check contain many proper nouns that get flagged as possible spelling mistakes. Most are names of people and places, and because we are checking news articles, many of them are foreign.

What is the proper way to handle this? Should we simply add every variation to the dictionary, e.g. Hussain, Hussein, Hossain, Hosain, etc.? Is there a common dictionary for names, or do they have to be duplicated in each language?

Names often appear several times in the same text. Is it possible to check them for consistency? For example, if a name is spelt Hussain four times and Hossain once, we could assume that the odd one out is a typo. This would not catch a name that is consistently misspelt as another common variant, but that seems almost impossible to detect anyway. Perhaps the error could still be caught when the name is part of a known full name, e.g. Barack Hossain Obama would be corrected to Barack Hussein Obama.
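To make the consistency idea concrete, here is a minimal Python sketch. It is not LanguageTool code: the function name `find_inconsistent_names`, the capitalized-word heuristic, and the similarity threshold of 0.7 are all assumptions chosen for illustration. It counts capitalized words and suggests the more frequent spelling whenever a rarer, similar-looking one appears in the same text.

```python
from collections import Counter
from difflib import SequenceMatcher
import re


def find_inconsistent_names(text, threshold=0.7):
    """Return {minority_spelling: majority_spelling} for similar capitalized words."""
    # Treat every capitalized word as a candidate proper noun (a crude heuristic).
    counts = Counter(re.findall(r"\b[A-Z][a-z]+\b", text))
    suggestions = {}
    # Most frequent spellings first; ties are broken alphabetically.
    names = sorted(counts, key=lambda n: (-counts[n], n))
    for i, frequent in enumerate(names):
        for rare in names[i + 1:]:
            if counts[rare] >= counts[frequent]:
                continue  # no clear majority, so do not guess
            similarity = SequenceMatcher(None, frequent, rare).ratio()
            if similarity >= threshold and rare not in suggestions:
                suggestions[rare] = frequent
    return suggestions


sample = ("Hussain spoke first. Later Hussain left. Hussain and Hussain "
          "agreed, but Hossain did not.")
print(find_inconsistent_names(sample))
```

As noted above, this only helps when the text contains a clear majority spelling. A name that is misspelt consistently, or two genuinely different people with similar names, would need other evidence such as a list of known full names.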

(Daniel Naber) #2

There's no language-independent dictionary for names. To accept words, you can add them to spelling.txt for each language. To check for coherency, you'd need to activate WordCoherencyRule, i.e. add it to getRelevantRules() in, then add them to coherency.txt (which currently only accepts two variants per word).
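The general idea of a coherency check over known two-variant pairs can be sketched as follows. This is a standalone Python illustration, not the WordCoherencyRule implementation and not the actual coherency.txt format: `VARIANT_PAIRS` and `check_coherency` are hypothetical names, and treating the first spelling seen as the intended one is an assumption.

```python
import re

# Hypothetical variant pairs; each entry lists exactly two spellings,
# mirroring the two-variants-per-word limit mentioned above.
VARIANT_PAIRS = [("Hussain", "Hossain")]


def check_coherency(text, pairs=VARIANT_PAIRS):
    """Return (flagged, kept) tuples for pairs whose two spellings both occur."""
    words = re.findall(r"\w+", text)
    first_seen = {}
    for position, word in enumerate(words):
        # Remember only the first position of each word.
        first_seen.setdefault(word, position)
    problems = []
    for a, b in pairs:
        if a in first_seen and b in first_seen:
            # Assumption: the spelling that appears first is the intended one.
            kept, flagged = sorted((a, b), key=first_seen.get)
            problems.append((flagged, kept))
    return problems


print(check_coherency("Hussain arrived. Hossain left."))
```

Unlike a frequency-based check, this approach only fires for pairs that someone has listed in advance, which keeps false alarms low but means every name variant has to be maintained by hand.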

(Martin Flodin) #3

Ok. Since spelling.txt is empty for Swedish, I thought that it was only supposed to be used locally, but now I see that it contains some words for other languages.

What is the standard operating procedure for contributing new words? Should I just add them to spelling.txt and then do a pull request? Or is it better to update the base dictionary (which I am not really sure how to do)? Perhaps they are occasionally moved to the real dictionary by a maintainer?

I think I read that one would need to add all variations of a word like conjugations and such. Is that correct?

Is it recommended to add names to the list, or does that bloat it too much?

(Daniel Naber) #4

As we use an existing hunspell dictionary, any word we add should also be reported to the maintainer of the dictionary:

And we should make sure the latest dictionary is used. We can still add words to spelling.txt so we don't have to wait for new releases of the dictionary.

Yes, you need to add all forms, but you can use simple suffixes as we have done for German. It works like a simplified hunspell system: e.g. you can add /S to accept the word both as-is and with an s appended (I think the Java code would need to be adapted so you can have suffixes useful for Swedish).
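To illustrate what such a suffix flag does, here is a small Python sketch of the expansion. It is an assumption-laden illustration rather than LanguageTool's Java code: `expand_entries` is a hypothetical name, the `#` comment handling is assumed, and only the /S flag is taken from the description above.

```python
def expand_entries(lines, suffixes=None):
    """Expand 'word/FLAGS' lines into the set of accepted word forms."""
    # Only the /S flag comes from the discussion; other flags would be added here.
    suffixes = suffixes or {"S": "s"}
    accepted = set()
    for line in lines:
        entry = line.strip()
        if not entry or entry.startswith("#"):
            continue  # skip blank lines and comments
        word, _, flags = entry.partition("/")
        accepted.add(word)  # the base form is always accepted
        for flag in flags:
            # Flags without a known suffix are silently ignored in this sketch.
            if flag in suffixes:
                accepted.add(word + suffixes[flag])
    return accepted


print(sorted(expand_entries(["Hussein/S", "Obama"])))
```

With the entries above, Hussein and Husseins are accepted, while Obama is accepted only in its base form because it carries no flag.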

Whether to add names is hard to answer in general. I'd only add common ones.