Interest in GSoC project

Hi,

My name is Pramodith and I’m a graduate Computer Science student at the Georgia Institute of Technology. I am interested in working on the core AI aspect of the tool. I found the project of using a seq2seq deep learning model to clear up confusion between words and improving the spell checker to be really interesting. I have experience using PyTorch and Keras for implementing deep neural networks in Python. I’m also quite comfortable using Java. I would really appreciate advice on what I need to do next to have my application accepted, and also help in choosing the project that would suit my skills best.

Hi, thanks for your interest in LT and GSoC. The simple answer on what to do next is: start working on it, plan the details, and report about your progress here. Also see what others are working on, e.g. at

Hi Daniel,

Is there a specific data set that you want me to use or can I choose whatever suits my needs best?

You can use whatever is available under a free license. It also helps of course if it’s not just available for English.

Hey Daniel, I’ve been trying to create a seq2seq model as a toy version that can solve the problem. The memory requirements of even a small dataset of around 10,000 samples with a vocabulary of 3,000 words are far beyond the limits of my system, which has just a 2GB GPU. I was wondering whether we would be provided any GPU clusters or whether we would have to use AWS during the GSoC program. Also, do you have any suggestions on how I can show my work? The problem with making the dataset smaller is that the deep net doesn’t learn as well as it should. I will try decreasing these parameters and see how it goes, but would welcome any suggestions.

Meanwhile, I have also identified how we can deploy the deep learning framework using Java. I have also been studying different “attention” techniques so that the decoder can identify which parts of the encoder’s hidden states it needs to take into consideration while decoding the result. I also observed that the pairs of confusing words have a very small Levenshtein distance. We can make use of this heuristic to ensure that the seq2seq model doesn’t recommend replacing words that are in no way related.
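To make the filtering idea concrete, here is a minimal sketch of what I have in mind (the threshold of 3 and the function names are just placeholders, not part of any existing code):

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost  # substitution
                            ))
        prev = curr
    return prev[-1]

def plausible_replacement(original, candidate, max_distance=3):
    """Suppress seq2seq suggestions that are lexically unrelated to the
    original word. The threshold is a placeholder to be tuned."""
    return levenshtein(original, candidate) <= max_distance

print(plausible_replacement("story", "games"))  # False, so the suggestion is dropped
```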

We have a very limited budget for that. Can your university provide servers, maybe?

That’s often the case, but not always, e.g. “mite” and “might” have a distance of 3, which isn’t really small, considering how short the words are.

True, but a seq2seq model might have a tendency to replace a word with something completely unrelated, e.g. “story” might be replaced with “games”. So any false positives flagged by the seq2seq model can be suppressed, especially for longer words. Also, would we just need to worry about the words present in the confusion-word pairs on GitHub, or do we expand beyond this? If the former is true, I guess we won’t need to worry about the distance heuristic, as long as the word is only replaced by its corresponding confusion-pair word.

I’ll ask around but can’t assure anything.

In the long term, we should consider more words. The current list is mostly manually created. A list could also be created automatically, going further, e.g. so that the singular and plural forms of a noun are considered a confusion pair, etc.
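For illustration, a naive sketch of how such pairs could be generated from a plain word list (the file name and the “+s” pluralization check are just assumptions; a real version would use a morphological dictionary):

```python
def build_noun_pairs(words):
    """Pair each word with its naive '+s' plural if both forms appear
    in the vocabulary (a very rough heuristic)."""
    vocab = set(words)
    return [(word, word + "s") for word in vocab if word + "s" in vocab]

# hypothetical word list, one word per line
with open("wordlist.txt", encoding="utf-8") as f:
    words = [line.strip() for line in f if line.strip()]

for singular, plural in sorted(build_noun_pairs(words)):
    print(singular, plural)
```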

Hey Daniel, I was thinking about using a completely different approach to solve the problem. Word embeddings such as GloVe and word2vec are created using the continuous bag of words / skip-gram concept. The basic idea is that you encode the semantic meaning of a word by using its surrounding words (both to the left and right of the word). These word vectors are in fact created by training a deep net on a word’s neighbors and having it predict the word, or vice versa.

Since these vectors are freely available online, we can use them to create a function that gives a score to each word in a confusion pair. For example, if the sentence is “what is the some of two and two”, we calculate the score of “some” and “sum” given (“what”, “is”, “the”, “of”, “two”, “two”), and the right word is the one with the highest score.
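A rough sketch of that scoring function, assuming pre-trained vectors loaded with gensim’s KeyedVectors (the file path is hypothetical, and cosine similarity against the averaged context is just one possible choice of score):

```python
import numpy as np
from gensim.models import KeyedVectors

# hypothetical path to pre-trained vectors in word2vec text format
vectors = KeyedVectors.load_word2vec_format("embeddings.txt", binary=False)

def score(candidate, context_words):
    """Cosine similarity between the candidate's vector and the average
    of the context word vectors."""
    context = [vectors[w] for w in context_words if w in vectors]
    if candidate not in vectors or not context:
        return float("-inf")
    ctx = np.mean(context, axis=0)
    cand = vectors[candidate]
    return float(np.dot(cand, ctx) / (np.linalg.norm(cand) * np.linalg.norm(ctx)))

context = ["what", "is", "the", "of", "two", "two"]
print(max(["some", "sum"], key=lambda w: score(w, context)))  # ideally "sum"
```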

The advantages of this method are:
1. We don’t need to train neural networks (we can if we want to, though).
2. There are word vectors freely available for more than 100 languages.
3. This method will surely be fast, since we just need to calculate a log probability score.
4. It’s easier to migrate to Java as well.

What do you think about this?

This is similar to what we do with n-grams. The n-gram approach currently has these drawbacks:

  1. it requires large data sets
  2. it only works up to 5-grams (because more is usually not available), and we even use only 3-grams
  3. it doesn’t work if the original word and the replacement have a different number of tokens (not because it cannot work, but because a solution hasn’t been coded yet)

I think the approach you suggest wouldn’t address 1, as the vectors are usually quite large, aren’t they? But if the result is better in terms of quality (precision/recall), we might accept it. So it’s worth a try.
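For comparison, a toy sketch of what the 3-gram style lookup amounts to (the counts and the smoothing constant are made up; the real implementation looks counts up in a large corpus index):

```python
import math

# toy 3-gram counts; real data would come from a large corpus
trigram_counts = {
    ("is", "the", "sum"): 800,
    ("the", "sum", "of"): 1200,
    ("is", "the", "some"): 15,
    ("the", "some", "of"): 3,
}

def ngram_score(tokens, position, candidate, smoothing=1.0):
    """Sum of log counts of all 3-grams that cover the candidate word."""
    tokens = tokens[:position] + [candidate] + tokens[position + 1:]
    total = 0.0
    for start in range(max(0, position - 2), min(position, len(tokens) - 3) + 1):
        trigram = tuple(tokens[start:start + 3])
        total += math.log(trigram_counts.get(trigram, 0) + smoothing)
    return total

tokens = ["what", "is", "the", "some", "of", "two", "and", "two"]
print(max(["some", "sum"], key=lambda w: ngram_score(tokens, 3, w)))  # "sum"
```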

BTW, have you checked Neural Network Rules?

We can choose the vector length for each embedding. For example, Polyglot’s pre-trained word embeddings are vectors of length 64. GloVe embeddings give us the choice of 100, 200, 300, and 500 dimensions. Some other options available are word2vec and fastText.

Won’t we need large data sets only if we want to train the embeddings from scratch? With pre-trained vectors available online, we won’t need to train a model; we can directly use what’s available. To compute the scores we would just take a dot product over a vector and sum it up. Since we will restrict computing the dot products to only the word pair in question, I believe it’s computationally light.

However, I will come up with a small demo and check. Thanks for pointing me to the Neural Network Rules.

Also, how well does the tool perform at identifying confused words in other languages? This approach might be a good baseline for languages that aren’t supported as well as English, given English’s predominance.

This is what I thought would be a good approach initially. You are basically talking about a neural language model, right? Where the “score” is the probability of the word occurring given its context. But I think the problem is: how do you know that it is “some” and its corresponding confusion pair “sum” that you have to calculate the score for? I.e., how will you know that “some” was used incorrectly here, so that you choose it for scoring?

My proposal is to calculate the score every time a sentence contains a word from a confusion pair. If the word was used in the right context, its score would be higher than that of the word it’s paired with in the list of confusion pairs. It’s not different from a seq2seq model; with a seq2seq model you would also have to run the predictor for every sentence.

But then, if there is more than one confusion-pair word in the sentence, you would have to run the algorithm for all possible combinations of these confusion words. So if you have n occurrences of confusion-pair words in the sentence, the algorithm's complexity will be O(2^n).
Function words like “and” are present in the confusion set, and they occur a lot in a sentence. So, if you have a long, correct sentence containing a lot of “and”s, won’t it run 2^n computations even when there is no error in the sentence?

Yeah, you are right. The seq2seq model would learn these scores implicitly.

Yeah, that’s true. I have to figure out a way around that problem.

How do you plan to handle different languages? Creating a model for each language is going to be quite tedious.

Having looked at https://github.com/gulp21/languagetool-neural-network/blob/master/src/main/python/nn_word_sequence.py

I realized that the solution I proposed is very similar to what has been implemented and deployed here. However, we can surely go beyond the five-surrounding-words limitation. A Python library called gensim exists, and it can be used to create a continuous bag of words model for any language, provided we have good training data. It has a function called score that assigns scores to sentences. We can limit the scoring to just the confusion pair. Since it’s possible to have multiple confusing words in a sentence, we can use a sliding window across a span and score the window to avoid an explosion of the input space.
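A hedged sketch of that sliding-window idea, assuming a gensim (4.x) skip-gram model trained with hierarchical softmax, which is what Word2Vec.score() expects; the corpus, window size, and the toy confusion-pair dictionary are placeholders:

```python
from gensim.models import Word2Vec

CONFUSION_PAIRS = {"some": "sum", "sum": "some"}  # toy subset for illustration

def train_model(sentences):
    """sentences: iterable of token lists from a (hypothetical) corpus.
    sg=1 and hs=1 are required for Word2Vec.score() to work."""
    return Word2Vec(sentences, vector_size=100, window=5,
                    sg=1, hs=1, negative=0, min_count=1)

def best_variant(model, tokens, index, window=5):
    """Score a window around the suspicious token for each variant and
    return the higher-scoring word."""
    original = tokens[index]
    alternative = CONFUSION_PAIRS[original]
    lo, hi = max(0, index - window), index + window + 1
    scores = {}
    for word in (original, alternative):
        span = tokens[lo:index] + [word] + tokens[index + 1:hi]
        scores[word] = float(model.score([span])[0])  # log likelihood of the span
    return max(scores, key=scores.get)
```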

I would also like to know if there are any problems with the existing neural net approach, and whether what I have suggested so far could make a good proposal.

IIRC, the main issue is that it doesn’t deal with multi-token terms yet (e.g. your vs. you’re). Also, someone would need to take the time to “convert” all the n-gram pairs we have to the NN approach. This is tricky, as training material is needed even for rare words.