[GSOC] Context-based spellchecker suggestions

Hi everyone

I would like to present the core idea of the GSOC project I intend to work on, and ask for the community's opinion. :slight_smile:
Drawing inspiration from the current confusion pairs approach, which uses a neural net combined with word2vec models, I would like to propose creating a similar model for spellchecker suggestions.
A word2vec model of a language is a vector space in which each word is represented as a point (a vector) in a high-dimensional space. One very interesting feature of this representation is that it captures semantic similarities between words: words with similar meanings, or words that are used in similar contexts, end up close together in the same region of the space.
From what I have gathered, suggestions for misspellings don't consider the surrounding words; only the misspelled word itself is corrected, using Hunspell. With a word2vec model it may be possible to gather a list of semantically close words based on the surrounding words, which could reduce the computation needed to create the suggestion list. Instead of having to compute permutations of the misspelled word, we might use the word2vec model to search for spatially close words. Training a neural net that predicts a word from its surroundings is definitely possible; the CBOW architecture used in word2vec is one example, and it gives good results.
This approach could reduce the computing time and produce suggestion lists of better quality. Later on, a neural net could be used to filter the results and find the best match.
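
To make the idea a bit more concrete, here is a minimal Python sketch of ranking candidates by how close they are to the averaged vector of the surrounding words (the same averaging CBOW applies to its input). Everything in it is an assumption for illustration: the three-dimensional vectors are hand-written toy values rather than a trained model, the candidate list is hard-coded where Hunspell output would normally go, and the function names are made up. Note that it re-ranks an existing candidate list; it does not yet replace candidate generation.

```python
import math

# Toy 3-dimensional vectors, hand-written for illustration only.
# A real model would have hundreds of dimensions and be learned from a corpus.
TOY_VECTORS = {
    "drink": [0.9, 0.1, 0.0],
    "cup":   [0.8, 0.2, 0.1],
    "tea":   [0.8, 0.1, 0.1],
    "tee":   [0.1, 0.9, 0.2],
    "ten":   [0.0, 0.1, 0.9],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def context_vector(context_words, vectors):
    """Average the vectors of the context words that the model knows."""
    known = [vectors[w] for w in context_words if w in vectors]
    if not known:
        return None
    dims = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dims)]


def rank_candidates(candidates, context_words, vectors):
    """Sort candidates by similarity to the averaged context, best first."""
    ctx = context_vector(context_words, vectors)
    if ctx is None:
        return list(candidates)  # no usable context: keep the original order

    def score(word):
        # Out-of-vocabulary candidates get the lowest score.
        return cosine(vectors[word], ctx) if word in vectors else -1.0

    return sorted(candidates, key=score, reverse=True)


# "I drink a cup of tes": "tes" is misspelled; the list stands in for Hunspell output.
suggestions = rank_candidates(["ten", "tee", "tea"], ["drink", "cup"], TOY_VECTORS)
print(suggestions)
```

With these toy values, "tea" ends up first because it lies closest to the average of "drink" and "cup". With a real model the quality of the ranking would of course depend on the training data and on how many context words are used.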

This sounds promising. For data-based algorithms, I think you should also define what data the algorithms will be trained on, for how many languages data is available, and how the solution is going to be evaluated.

fastText could be an alternative approach you may want to look at, as it conveniently has pre-trained models for 157 different languages. There is some documentation on this here:

There doesn’t appear to be an official native Java library for fastText, but someone has created a port, which may need some work: