
Better spelling suggestions would be really nice to have


(Jan Schreiber) #1

Today a LanguageTool user suggested adding the word 'analpherbet' to the German dictionary. It is almost certainly a misspelling of 'Analphabet'.
Aspell's first suggestion is 'Analphabet', but LanguageTool (via Hunspell) doesn't offer any suggestions, which is why that user proposed adding the word to the dictionary.
IMO, this is an extremely disappointing result: Users apparently assume that 'analpherbet' isn't in the dictionary due to some oversight and think their spelling is correct because we do not offer a reasonable correction.
Would it be possible to base the LT suggestions on a fixed list of words, for example? The suggestions Hunspell has to offer are often weird, and sometimes it fails to offer any.


(Daniel Naber) #2

Actually, hunspell's suggestion is correct, you can check that at https://j3e.de/cgi-bin/spellchecker. But LT doesn't use hunspell, as its generation of suggestions is too slow. For German, we use a Morfologik-based approach that doesn't always work exactly like hunspell. Analpherbet (uppercase A) works okay, for example.

I don't have a solution for this. Any solution would need to be 100% Java, very fast, and would need to support compound words. As far as I know, such a spell checker doesn't exist yet.


(Daniel Naber) #3

Actually, we don't need everything on our server to be 100% Java. If someone knows a fast spell checker that supports compounds and comes with a dictionary for German, let me know.


(Lodewijk Arie van Brienen) #4

The main problem with the accuracy of Hunspell is that it doesn't have the option to sort by phonetic distance. (I've had many cases where the proper word was rendered 'invisible' by its position in the list.)


(Daniel Naber) #5

I just ran a test by setting MAX_EDIT_DISTANCE = 3; (instead of 2). With that, analpherbet will get its correct suggestion. Unfortunately, this slows down checking by a factor of 2.5 (measured with org.languagetool.rules.patterns.PerformanceTest), so I don't think that's an option.
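
To illustrate why the threshold matters, here is a minimal sketch. It assumes the speller's distance behaves like plain, case-sensitive Levenshtein distance, which may not be exactly what the Morfologik-based speller computes. The class and method names are made up for this example and are not LanguageTool code.

```java
public class EditDistanceDemo {

    /** Plain Levenshtein distance (insert, delete, substitute), case-sensitive. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            // Reuse the two rows instead of allocating a full matrix.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("Analpherbet", "Analphabet")); // prints 2
        System.out.println(levenshtein("analpherbet", "Analphabet")); // prints 3
    }
}
```

Under that assumption, the capitalized form is two edits away from 'Analphabet' and the lowercase form is three, which would be consistent with the lowercase word only getting its suggestion once the limit is raised to 3.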


(Tiago F. Santos) #6

This is done in the REP section of the affix file (e.g. de_DE.aff).
The main issue is that each replacement pair (phonetic equivalent) has to be added manually to the list.
Hunspell seems fast to me, but it does a lot of processing. It gets even slower when extra logic is added to the aff file, or when the word list is very good and extensive, like the ones maintained here for German.
Extra rules, similar to the phonetic replacement lists, are the most time-consuming.
The more you improve, the slower it gets. No way around it. The logic also applies to grammar rules, though.
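
As a rough illustration of the replacement-pair idea, here is a minimal Java sketch. It is not Hunspell's implementation: the pairs ("er" to "a", "f" to "ph"), the tiny word list, and all names are assumptions made up for this example and are not taken from de_DE.aff.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

public class ReplacementPairDemo {

    /**
     * Applies each {from, to} pair at every position where "from" occurs and
     * keeps the results that are dictionary words.
     */
    static Set<String> suggest(String word, String[][] pairs, Set<String> dictionary) {
        Set<String> suggestions = new LinkedHashSet<>();
        for (String[] pair : pairs) {
            String from = pair[0];
            String to = pair[1];
            if (from.isEmpty()) {
                continue; // an empty pattern would match everywhere
            }
            int idx = word.indexOf(from);
            while (idx >= 0) {
                String candidate = word.substring(0, idx) + to
                        + word.substring(idx + from.length());
                if (dictionary.contains(candidate)) {
                    suggestions.add(candidate);
                }
                idx = word.indexOf(from, idx + 1);
            }
        }
        return suggestions;
    }

    /** Runs the sketch on a hypothetical pair list and a three-word dictionary. */
    static Set<String> demo() {
        String[][] pairs = { { "er", "a" }, { "f", "ph" } };
        Set<String> dictionary =
                new HashSet<>(Arrays.asList("Analphabet", "Alphabet", "Analyse"));
        return suggest("Analpherbet", pairs, dictionary);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [Analphabet]
    }
}
```

Every additional pair means more candidates to generate and look up for each unknown word, which is one way to picture the speed cost described above.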


(Jan Schreiber) #7

Too bad it didn't work out. Thanks for giving it a shot anyway!


(Curon Davies) #8

SymSpell is an interesting approach. There is no implementation that supports compounds, but there is a suggestion in this reply:
http://blog.faroo.com/2012/06/24/1000x-faster-spelling-correction-source-code-released/#comment-961485

Maybe this could be incorporated into Morfologik?
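
For readers who don't know the approach, here is a minimal sketch of the symmetric-delete idea behind SymSpell, written from scratch in Java. It is not the SymSpell source and does nothing for compounds. The class name, the word list, and the limit of two deletes are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class SymmetricDeleteDemo {

    private final Map<String, Set<String>> index = new HashMap<>();
    private final int maxDeletes;

    /** Precomputes the delete variants of every dictionary word. */
    SymmetricDeleteDemo(Collection<String> words, int maxDeletes) {
        this.maxDeletes = maxDeletes;
        for (String word : words) {
            for (String variant : deletes(word, maxDeletes)) {
                index.computeIfAbsent(variant, k -> new TreeSet<>()).add(word);
            }
        }
    }

    /** The word itself plus every string obtained by removing up to n characters. */
    static Set<String> deletes(String word, int n) {
        Set<String> all = new HashSet<>();
        all.add(word);
        Set<String> frontier = new HashSet<>(all);
        for (int round = 0; round < n; round++) {
            Set<String> next = new HashSet<>();
            for (String s : frontier) {
                for (int i = 0; i < s.length(); i++) {
                    next.add(s.substring(0, i) + s.substring(i + 1));
                }
            }
            all.addAll(next);
            frontier = next;
        }
        return all;
    }

    /** Dictionary words that share at least one delete variant with the input. */
    Set<String> candidates(String input) {
        Set<String> found = new TreeSet<>();
        for (String variant : deletes(input, maxDeletes)) {
            Set<String> hits = index.get(variant);
            if (hits != null) {
                found.addAll(hits);
            }
        }
        return found;
    }

    /** Runs the sketch on a hypothetical three-word dictionary. */
    static Set<String> demo() {
        SymmetricDeleteDemo speller = new SymmetricDeleteDemo(
                Arrays.asList("Analphabet", "Alphabet", "Analyse"), 2);
        return speller.candidates("Analpherbet");
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [Analphabet]
    }
}
```

The lookup only generates deletes of the input, so it avoids trying inserts and substitutions at query time. The price is a larger precomputed index. The candidates are a superset of the real matches, so a real implementation would still verify them with a true edit-distance check and rank them.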


(Jan Schreiber) #9

I made a feature request on GitHub with a sketch of a very simple solution.