Ngram spellchecker

Somehow, spell checking is usually treated as a process for single words. But there are lots of fixed and very common word groups for which spell checking could be useful when done over multiple ‘tokens’.
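
As a minimal sketch of why this helps (the counts below are made up, not from a real corpus): every word in ‘I would of thought’ passes a single-word spell check, but a lookup in a table of bigram counts can flag ‘would of’ as far less likely than ‘would have’.

```python
# Minimal sketch: flag adjacent token pairs that are rare in a bigram table.
# These counts are hypothetical; a real table would be built from a corpus.
BIGRAM_COUNTS = {
    ("i", "would"): 80000,
    ("would", "have"): 120000,
    ("would", "of"): 40,       # the classic multi-token error
    ("of", "thought"): 9000,
}

def suspicious_bigrams(tokens, threshold=100):
    """Yield adjacent token pairs whose corpus count is below the threshold."""
    for pair in zip(tokens, tokens[1:]):
        if BIGRAM_COUNTS.get(pair, 0) < threshold:
            yield pair

print(list(suspicious_bigrams("I would of thought".lower().split())))
# [('would', 'of')]
```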

I have suggested a method like this several times, but unfortunately it was never picked up by anyone (no problem; there is so much to do around grammar…).
But I will give it a try myself in the time to come. Not in Java and not in LanguageTool, since I never mastered the (at least for me) complex ecosystem of LT.

I will try this for English, since most of the people here understand that. When I manage to get a prototype running (webserver, PHP), I will get back to you so you can check it out. Not only will it report possible errors and maybe suggestions, it will also ask for feedback, so the software might learn.

It is rather ambitious, but: no guts, no glory.

Actually, Fabian is working on this… so maybe wait 1-2 weeks, and then we can see what his results will be.

Ah. Nice. If I may contribute ideas:

  • An ngram’s status may vary from certainly bad to certainly okay, and everything in between. Feedback from language pros will be essential.
    - It is a choice whether ngrams are also built across sentence boundaries or not. Doing so might result in better detection of forgotten sentence endings; not doing so limits ngrams to sentence size, which could be good for shorter sentences (see the extraction sketch further down).
  • Part of the ngram checking could (should) be done offline, like precomputing the most likely alternatives for a common ngram. It is a lot like spell checking, but treating the space as a contributing character: ‘runon’ is then almost equivalent to ‘run-on’ and ‘run on’, even though the number of tokens is different (see the first sketch after this list).
  • The data might be too large for memory, even when hashed. My experience with hashing, sorting, and indexed direct access on disk is that it is almost as fast as in-memory lookup, because of smart OS caching (see the second sketch after this list).
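
On the ‘space as a contributing character’ point, a minimal sketch (plain Levenshtein distance; nothing here comes from an existing tool): once the space and the hyphen are ordinary characters, ‘runon’ and ‘run-on’ are each within edit distance 1 of ‘run on’, even though they tokenize differently.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance; spaces and hyphens count as ordinary characters."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (ca != cb))) # substitution
        prev = cur
    return prev[-1]

for variant in ("run-on", "runon"):
    print(variant, edit_distance("run on", variant))  # both print 1
```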
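
And on the size point, a sketch of indexed direct access on disk, using sqlite3 from the Python standard library (the file name and schema are invented for illustration): once the index pages are warm in the OS cache, lookups are nearly as fast as an in-memory dict.

```python
import sqlite3

# Hypothetical on-disk ngram table; the PRIMARY KEY index gives direct access
# without loading everything into memory, and the OS page cache keeps it fast.
con = sqlite3.connect("ngrams.db")
con.execute("CREATE TABLE IF NOT EXISTS trigrams (gram TEXT PRIMARY KEY, freq INTEGER)")
con.executemany(
    "INSERT OR REPLACE INTO trigrams VALUES (?, ?)",
    [("would have thought", 50000), ("would of thought", 12)],
)
con.commit()

def trigram_count(gram: str) -> int:
    row = con.execute("SELECT freq FROM trigrams WHERE gram = ?", (gram,)).fetchone()
    return row[0] if row else 0

print(trigram_count("would of thought"))  # 12
```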

Anyway, I just started making the ngrams; that will take a lot of time.
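
For what it is worth, a rough sketch of how the extraction could look, with the sentence-boundary choice from the list above as a flag (the splitting rules are deliberately naive, and all names are mine):

```python
import re
from collections import Counter

def ngrams(text: str, n: int = 3, cross_sentences: bool = False) -> Counter:
    """Count word ngrams, optionally letting them span sentence boundaries."""
    # Naive splitting -- a real tokenizer and sentence splitter would do better.
    sentences = [text] if cross_sentences else re.split(r"[.!?]+", text)
    counts = Counter()
    for sentence in sentences:
        tokens = re.findall(r"[\w'-]+", sentence.lower())
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

text = "It ends here. It starts again."
# The only difference is the bigram that spans the boundary: ('here', 'it').
print(ngrams(text, 2, cross_sentences=True) - ngrams(text, 2))
```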

I would be happy to contribute to the functional testing.

By the way, you know I have quite a bit of data on ‘almost’ every language. So if the idea is successful and could be implemented for other languages, I am willing to contribute by building a corpus per language.