Change in run-on words spell checker behavior

Hi,

For German, we have quite a few problems getting good spell checking suggestions (see #731, #725 and others). One improvement is to rank the split suggestion for run-on words higher, e.g. so that vorallem suggests vor allem first. I’m now going to commit a patch that does that, but it breaks the Catalan tests. I’ll fix those, but I’m not sure whether my change introduces more issues than it fixes for Catalan. Please let me know (e.g. @jaumeortola). For German, I think it helps quite a bit. Other languages will also be affected; they just don’t seem to have any test cases that are triggered by this change (except one for English, which I have commented out for now).

Regards
Daniel

Hi,

This change makes the Catalan suggestions a lot worse, and I think it will worsen the suggestions in other languages too (even in German?).

I suggest another approach. The run-on words should not be given priority, except in some cases. For example, in German, words starting with: wie, und, vor, der, die, das, mit…

This is done in Catalan. See the methods orderSuggestions (very simple) or getAdditionalTopSuggestions (adds apostrophes or hyphens) in MorfologikCatalanSpellerRule.
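
To make the prefix idea concrete, here is a minimal sketch. It is not the code from MorfologikCatalanSpellerRule; the class name, the method name and the word list are made up for illustration (the list just repeats the examples given above), and a real rule would plug this into its own suggestion ordering.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of LanguageTool.
class PrefixSplitOrdering {

  // Assumption: first words for which a split suggestion should be preferred.
  private static final Set<String> PRIORITY_FIRST_WORDS =
      new HashSet<>(Arrays.asList("wie", "und", "vor", "der", "die", "das", "mit"));

  // Moves two-word suggestions whose first word is in the list to the top;
  // the relative order of all other suggestions is kept.
  static List<String> moveSplitsToTop(List<String> suggestions) {
    List<String> top = new ArrayList<>();
    List<String> rest = new ArrayList<>();
    for (String suggestion : suggestions) {
      String[] parts = suggestion.split(" ");
      boolean preferred = parts.length == 2
          && PRIORITY_FIRST_WORDS.contains(parts[0].toLowerCase(Locale.GERMAN));
      if (preferred) {
        top.add(suggestion);
      } else {
        rest.add(suggestion);
      }
    }
    top.addAll(rest);
    return top;
  }
}
```

With this, a split such as "vor allem" moves up, while a split like "zu Hause" stays where the speller ranked it, because "zu" is not in the list.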

My longer-term idea is to sort suggestions by their probability in context, i.e. using ngram data. That will take some time; until then I could revert my change for non-German languages.

I think that’s better. You can override orderSuggestions for German:

There is a simpler approach. With “vorallem”, look up “vor” and “allem” in the dictionary, and if the sum of their frequencies is high enough, put the split suggestion at the top of the list. This approach could be useful for different languages.
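
A rough sketch of that frequency check, under stated assumptions: the frequencies come from a plain map here (whether and how they can be read from the Morfologik dictionary is exactly the open question), the threshold is arbitrary, lookups are exact-match, and all names are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of LanguageTool.
class FrequencySplitOrdering {

  private final Map<String, Integer> frequency; // assumption: word -> frequency value
  private final int threshold;                  // assumption: minimum summed frequency

  FrequencySplitOrdering(Map<String, Integer> frequency, int threshold) {
    this.frequency = frequency;
    this.threshold = threshold;
  }

  // Tries every split position; returns "left right" for the split with the
  // highest summed frequency that reaches the threshold, or null if none does.
  String bestSplit(String word) {
    String best = null;
    int bestScore = -1;
    for (int i = 1; i < word.length(); i++) {
      Integer left = frequency.get(word.substring(0, i));
      Integer right = frequency.get(word.substring(i));
      if (left == null || right == null) {
        continue; // one of the parts is not a known word
      }
      int score = left + right;
      if (score >= threshold && score > bestScore) {
        best = word.substring(0, i) + " " + word.substring(i);
        bestScore = score;
      }
    }
    return best;
  }

  // Puts the best split (if any) first and keeps the other suggestions in order.
  List<String> order(String word, List<String> suggestions) {
    String split = bestSplit(word);
    List<String> result = new ArrayList<>();
    if (split != null) {
      result.add(split);
    }
    for (String suggestion : suggestions) {
      if (!suggestion.equals(split)) {
        result.add(suggestion);
      }
    }
    return result;
  }
}
```

Compared with a fixed word list, this needs no per-language maintenance, but it depends on frequency data being available for the language.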

I’m not sure if the frequency in the Morfologik speller dictionary can be read from LanguageTool. In Catalan the frequency is in the tagger dictionary (the only dictionary we have), but in German it is in the speller dictionary, isn’t it? Or just take a list with the 100 most common words in German.

Of course, with n-grams you could get better results and detect other kinds of typos.

I’ve reverted this change now except for German.