Better spelling suggestions would be really nice to have

Jan_Schreiber · February 25, 2017, 12:01am

Today a LanguageTool user suggested the word ‘analpherbet’ should be added to the German dictionary, almost certainly a misspelling of ‘Analphabet’.
Aspell’s first suggestion is ‘Analphabet’, but LanguageTool (via Hunspell) doesn’t offer any suggestions, which is why that user proposed adding the word to the dictionary.
IMO, this is an extremely disappointing result: Users apparently assume that ‘analpherbet’ isn’t in the dictionary due to some oversight and think their spelling is correct because we do not offer a reasonable correction.
Would it be possible to base the LT suggestions on a fixed list of words, for example? The suggestions Hunspell has to offer are often weird, and sometimes it fails to offer any.

dnaber · February 25, 2017, 12:23pm

Actually, Hunspell’s suggestion is correct; you can check that at Online spell checker. But LT doesn’t use Hunspell, as its generation of suggestions is too slow. For German, we use a Morfologik-based approach that doesn’t always work exactly like Hunspell. ‘Analpherbet’ (uppercase A) works okay, for example.

I don’t have a solution for this. Any solution would need to be 100% Java, very fast, and would need to support compound words. As far as I know, such a spell checker doesn’t exist yet.

dnaber · February 25, 2017, 12:38pm

Actually, we don’t need everything on our server to be 100% Java. If someone knows a fast spell checker that supports compounds and comes with a dictionary for German, let me know.

SkyCharger001 · February 25, 2017, 12:56pm

The main problem with the accuracy of Hunspell is that it doesn’t have the option to sort by phonetic distance. (I’ve had many cases where the proper word was rendered ‘invisible’ by its position in the list.)

dnaber · February 25, 2017, 3:15pm

I just ran a test by setting MAX_EDIT_DISTANCE = 3; (instead of 2). With that, analpherbet will get its correct suggestion. Unfortunately, this slows down checking by a factor of 2.5 (measured with org.languagetool.rules.patterns.PerformanceTest), so I don’t think that’s an option.
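
To illustrate why the limit matters here, below is a minimal sketch that assumes the limit is compared against a plain, case-sensitive Levenshtein distance. That is an assumption; the Morfologik-based speller may weight edits differently. The function name `levenshtein` is illustrative and not part of LanguageTool.

```python
def levenshtein(a: str, b: str) -> int:
    """Case-sensitive Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


lower = levenshtein("analpherbet", "Analphabet")  # lowercase a
upper = levenshtein("Analpherbet", "Analphabet")  # uppercase A
print(lower, upper)
```

Under that assumption the lowercase form is one edit further away than the capitalized one, because the wrong case of the first letter counts as an edit of its own. That would be consistent with ‘Analpherbet’ getting a suggestion at a limit of 2 while ‘analpherbet’ needs 3.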

tiagosantos · February 25, 2017, 5:06pm

Phonetic matching of this kind is done in the REP section of the affix file (e.g. de_DE.aff).
The main issue is that each replacement pair (phonetic equivalent) has to be added manually to the list.
Hunspell seems fast to me, but it does a lot of processing. Its speed decreases even further when extra logic is added to the aff file or, for example, with very good and extensive word lists, like the ones that are maintained here for German.
Extra rules, similar to the phonetic replacement lists, are the most time-consuming.
The more you improve, the slower it gets. No way around it. The logic also applies to grammar rules, though.
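
To make the REP idea concrete, here is a rough sketch of how replacement pairs can produce suggestions. The pairs and the tiny word list are made up for illustration and are not taken from de_DE.aff, and `rep_suggestions` is a hypothetical helper; real Hunspell does considerably more than this.

```python
def rep_suggestions(word, rep_pairs, dictionary):
    """Apply each replacement pair at every position where it matches
    and keep the results that are dictionary words."""
    found = []
    for wrong, right in rep_pairs:
        start = word.find(wrong)
        while start != -1:
            candidate = word[:start] + right + word[start + len(wrong):]
            if candidate in dictionary and candidate not in found:
                found.append(candidate)
            start = word.find(wrong, start + 1)
    return found


# Hypothetical pairs; an affix file would list them as lines like "REP er a".
rep_pairs = [("er", "a"), ("f", "ph")]
dictionary = {"Analphabet", "Alphabet"}

print(rep_suggestions("Analpherbet", rep_pairs, dictionary))
print(rep_suggestions("Alfabet", rep_pairs, dictionary))
```

Every pair is tried at every matching position and each candidate costs a dictionary lookup, so a longer replacement list means more work per misspelled word.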

Jan_Schreiber · February 25, 2017, 5:12pm

Too bad it didn’t work out. Thanks for giving it a shot anyway!

curon · February 27, 2017, 9:37pm

SymSpell is an interesting approach. There is no implementation that supports compounds, but there is a suggestion in this reply:
http://blog.faroo.com/2012/06/24/1000x-faster-spelling-correction-source-code-released/#comment-961485

Maybe this could be incorporated into Morfologik?

Jan_Schreiber · March 1, 2017, 1:03am

I made a feature request on GitHub with a sketch of a very simple solution.