[GSoC] Idea - Spelling error detection and text correction

I would love to use a deep learning-based approach for spelling error detection.

I intend to use a char-level model trained on the Billion Word dataset.

Some examples of errors that can be detected:

original : he had dated forI much of the past
corrected : he had dated for much of the past

original : Since then, the bigjest players in
corrected : Since then, the biggest players in

original : in te third quarter of last year,
corrected : in the third quarter of last year,

That’s interesting, but pure spelling errors are those errors that LT can already detect. So it’s only interesting if the suggestions are better than what we have now. We also don’t want to be limited to English, so would this work for other languages with fewer data? Are the examples from some prototype or are these just examples?

As this is a deep learning-based approach, I expect the suggestions to be better than the ones LT produces right now.
And yes, these examples are from a prototype I'm working on.

I'm still not sure how well it'll work for languages with less data, but we can give it a try.

Could you please mention some of the languages that you would like to include?

If you think this is an acceptable GSoC project, I can draft a proper proposal as well.

I could even extend this to a text corrector.
Here are some examples:

input : Kvothe went to market
output : Kvothe went to the market

input : the Cardinals did better then the Cubs in the off season
output : the Cardinals did better than the Cubs in the off season

Currently, the detection of errors is based on hunspell dictionaries. As this is a simple and easily maintainable approach, we should stick to it. For suggestions, I'm not sure I have understood how your approach works. Will it simply suggest the most probable sequence of characters, given an input? What kind of data is in the billion word corpus; will it, for example, also work for colloquial style?
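For context, the current detection step is essentially a dictionary lookup plus edit-distance suggestions. A minimal Python sketch of that idea using the pyhunspell bindings (LT's real implementation is in Java, and the dictionary paths below are assumptions for a typical Linux install):

```python
# Rough sketch of dictionary-based detection, as LT does it today.
# Not LT's actual code; paths and package are assumptions.
import hunspell

checker = hunspell.HunSpell('/usr/share/hunspell/en_US.dic',
                            '/usr/share/hunspell/en_US.aff')

sentence = "Since then, the bigjest players in"
for token in sentence.split():
    word = token.strip(",.")
    if not checker.spell(word):                    # detection: plain dictionary lookup
        print(word, "->", checker.suggest(word))   # edit-distance-style suggestions
```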

In the end, it should work for all maintained languages, if possible (LanguageTool - Supported Languages).

I think so.

I'm using an LSTM seq2seq model with char-level inputs rather than word-level ones.
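To make that concrete, here is a minimal sketch of the architecture in Keras; the layer sizes, toy vocabulary, and sequence length are illustrative placeholders, not the prototype's actual configuration:

```python
# Minimal char-level seq2seq sketch (sizes and vocabulary are placeholders).
from tensorflow import keras
from tensorflow.keras import layers

chars = sorted(set("abcdefghijklmnopqrstuvwxyz ,."))  # toy character vocabulary
vocab_size = len(chars) + 1                           # +1 reserves index 0 for padding
max_len = 40                                          # max characters per sentence

# Encoder: reads the (possibly misspelled) sentence character by character.
enc_in = keras.Input(shape=(max_len,), name="noisy_chars")
enc_emb = layers.Embedding(vocab_size, 64, mask_zero=True)(enc_in)
_, state_h, state_c = layers.LSTM(256, return_state=True)(enc_emb)

# Decoder: generates the corrected character sequence from the encoder state.
dec_in = keras.Input(shape=(max_len,), name="shifted_clean_chars")
dec_emb = layers.Embedding(vocab_size, 64, mask_zero=True)(dec_in)
dec_seq = layers.LSTM(256, return_sequences=True)(dec_emb,
                                                  initial_state=[state_h, state_c])
char_probs = layers.Dense(vocab_size, activation="softmax")(dec_seq)

model = keras.Model([enc_in, dec_in], char_probs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```

The encoder compresses the noisy character sequence into its final LSTM state, and the decoder generates the corrected sequence from that state; working at the character level keeps the vocabulary tiny and lets the model handle misspellings it has never seen.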

The dataset is based on news articles, but getting it to work for colloquial style won't be a big task: I can combine data from a few datasets from different domains (news, Reddit, the Cornell Movie-Dialogs Corpus, etc.) to make it more generalised, so that it works for both grammatical error correction and spelling correction.
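Since these corpora are mostly clean text, one way to obtain (noisy, clean) training pairs is to inject synthetic character-level errors into clean lines. This is just one possible training setup, and the noise types and the rate below are illustrative assumptions:

```python
# One possible way to turn clean corpus lines into (noisy, clean) pairs.
# The error types and 5% rate are illustrative, not tuned values.
import random

def add_noise(text, rate=0.05):
    out = []
    for c in text:
        r = random.random()
        if r < rate / 3:
            continue                          # deletion:     "the" -> "te"
        elif r < 2 * rate / 3:
            out.append(random.choice("abcdefghijklmnopqrstuvwxyz"))  # substitution
        elif r < rate:
            out.append(c)
            out.append(c)                     # duplication:  "biggest" -> "bigggest"
        else:
            out.append(c)
    return "".join(out)

clean = "in the third quarter of last year,"
print(add_noise(clean), "->", clean)          # the model learns to map left to right
```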

Please bear in mind that some languages (Dutch, German, Danish) have very long words; in everyday Dutch texts around 35 characters at most, but in principle unlimited.

I'm planning to work with 3-4 languages (English being one of them) for the GSoC project; is that fine?
Should I start drafting the proposal?

What are those languages? Can the approach be extended to more languages?

Please do.

Yes, this approach can be extended to other languages.
I'm planning to choose languages that have a decent corpus available, like French.