[GSoC] Extending AI approach

Hi,
So, over the past few days, I tried various encoder-decoder models which take in incorrect sentences as input and outputs correct sentences. (Just your/you’re confusion pair)

Character Level seq2seq model
First I tried a character level model. Trained it for 2 days, 120 epochs, dataset of size 26046. But the results were not promising as the number of parameters to train in a character level model is very huge.
And then there is the problem of long term dependencies when the sentences are long.

Word Level seq2seq model
I used GloVe embeddings, trained it for 2 days ,100 epochs, It gave some correct results but still not good enough.
Here are some results.

Input : Thanks for your’re explanation.
Output : Thanks for your explanation

Input : Open your’re mouth!
Output : Open your face!

Input : Which is your’re luggage?
Output : Which is your luggage

Input : I need your advice.
Output : I need your advice

Input : Put your’re hands down!
Output : Put your hands up!

Input : Enjoy your’re meal!
Output : Enjoy your bed!

Input : Here is your change.
Output : Here is your change

Input : On your feet, children!
Output : On your story, please!

But longer sentences becomes a problem. It learned to correctly distinguish between when to use ‘your’ and when to use ‘you’re’ but the generating decoder part is erroneous. It can be made better (maybe) if I gave it more training data and trained it for more epochs.

Shall I try another approach which I think might work? Since the model knows when to use ‘your’ and when to use ‘you’re’ depending upon the context, I can train it to figure out where exactly in the sentence (which words) are incorrect. And change just those words. This would greatly reduce the complexity as it would avoid generating the whole sentence.

That sounds like a good idea. Thanks for your efforts so far! Are you training on a powerful computer with a GPU?

I have access to a cluster consisting of forty Boston SYS-7048GR-TR nodes equipped with dual Intel Xeon E5-2640 v4 processors. Each node has 128 GB of 2400MT/s DDR4 ECC RAM and four Nvidia GeForce GTX 1080 Ti GPUs.

Update

So I trained a new model following the new approach. This just took 2 hours to train 52000 sentences for 100 epoch. The results look promising.

Input sentence: How old are your brothers and sisters
output : 0 0 0 0 0 0 0

Input sentence: How old are you’re brothers and sisters
output : 0 0 0 1 0 0 0

Input sentence: How long does it take to get from here to your house on foot
output : 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Input sentence: How long does it take to get from here to you’re house on foot
output : 0 0 0 0 0 0 0 0 0 0 1 0 0 0

Input sentence: Does you’re dog really have a library card
output : 0 1 0 0 0 0 0 0

Input sentence: Wash you’re plate, please
output : 0 1 0 0

Input sentence: How many times do I have to tell you that you’re method is right
output : 0 0 0 0 0 0 0 0 0 0 1 0 0 0

Input sentence: If your going that way please get me you’re tape recorder
output : 0 1 0 0 0 0 0 0 1 0 0

If the ith position in the output is 1 , that means the ith word in the input sentence is wrong. Ill put the code up on github as soon as I clean it up a bit.

So, this model correctly predicts which words in a sentence are wrong. So ill just check which words produces 1 in the output and if that word is ‘your’ , ill just change it to ‘you’re’ and vice versa.
But to generalize this approach to other confusion pairs I need a dictionary of common confusion pairs. Does LT have such a dictionary or a list of common confusion words?

For English, see confusion_* at languagetool/languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en at master · languagetool-org/languagetool · GitHub - it doesn’t contain multi-token sets, though.

I have extended the model for these confusion pairs as well.
[‘edition’, ‘addition’],
[‘shall’, ‘shell’],
[‘see’, ‘sea’],
[‘role’, ‘roll’],
[‘isle’, ‘aisle’],
[‘your’ , “you’re”],
[‘which’, ‘witch’],
[‘wrong’, ‘wring’],
[‘wrongs’, ‘wrings’],
[‘things’, ‘thinks’],
[“dont”, “don’t”]

Try running the code from here
You can add your input sentences to “test.txt” and see the results.
Let me know the cases where the outputs are incorrect.
You will need the latest versions of keras and tensorflow installed though.
This approach seems to work and can be easily extended to other confusion pairs as well. Shall I start writing my proposal based on this approach ?

Multi token sets? Can you give me an example?

e.g. youryou're (you're is three tokens in LT) - but as you’ve used don't already, that should be covered (BTW, is dont even a word?)

I haven’t tested this yet, but: yes. Please remember that we’re not looking for a stand-alone model, but we want to extend LT, so we need this to be integrated into LT, which is Java-based. We also can only integrate features that don’t slow us down too much. For example, checking an average-length sentence with LT currently takes maybe 20-40ms or so.

It’s a common spelling mistake. Oh wait, are simple spelling mistakes already handled by LT? I just realized all the words in the confusion pairs are actual words in English. I had assumed they were words with spelling mistakes that needed to be corrected. But now that I think about it, correcting all minor spelling mistakes might be beyond the scope of this model. example: the model will not fix “behynd” to “behind” unless I explicitly tell the model that this is a confusion pair. And manually listing all possible incorrect spellings of a word won’t be feasible. So, spelling mistakes won’t be handled, but it will correctly learn when to use the words listed in the confusion pairs. Is that good enough?

Please do, I just hoped it will work on other systems other than mine :joy:

I have coded in Java before. Won’t be a problem writing a Java wrapper code around the models I build in python.

I’ll test the speed of the current model and let you know.

Yes, so detecting dont won’t be useful - unless your approach provides better suggestions.

So how about using the existing LT model to find the spelling mistakes and then running my model to detect confusion pair errors ?
Can you please direct me to the code where spelling mistakes are found, on the LT git page? Or any documentation related to how LT handles spell checking. Want to familiarize myself with the LT codebase.

Yes, that’s how LT works anyway: all rules are applied to the text. The speller rule will find spelling errors, your rule would find confusion errors. They don’t run after another, but they are independent. It’s important to understand that we already detect confusion pairs (here), so a new approach only makes sense if it’s better (precision/recall), faster, more general (e.g. multi-token matches), needs less data (we now rely on Google data available only for 6 languages), or all of that.

Documentation can be found at Spell check - LanguageTool Wiki. The code for English is here: languagetool/languagetool-language-modules/en/src/main/java/org/languagetool/rules/en/AbstractEnglishSpellerRule.java at master · languagetool-org/languagetool · GitHub

I went through the current LT code for finding confusion errors here. Wanted to know if my understanding is right.

Given input sentence : “Show that it is you’re problem to do this”.
You first find all the words in the confusion pair list . (In this case you’re) And then calculate the tri-gram probabilities of “it is you’re” and “it is your” and choose the one with the highest probability.

But what if there are multiple confusion words in a sentence?
Example : “Show that it is you’re end problem to do this”.
In this case both “your’re” and “end” is there in the confusion pair list. [your , you’re] and [end , and]
Do you find the probabilities of all the possible combinations of these words?
i.e, Do you find the tri-gram probablilities of all these combinations : “is you’re end” , “is your end” , “is your and” and “is you’re and” ?

Also going through the conf pair list I found pairs like
[same, sane]
[same, some]
[some, sum]
If “sum” is confused with “some” and “some” is confused with “same”, is it safe to assume that “sum” can be confused with “same”. As in, does transitivity apply here?

Personally I think that “same” and “sum” differ a little too much in meaning for transitivity to apply.

Here’s a documentation of how it works: Finding errors using n-gram data - LanguageTool Wiki. Also, this doesn’t apply to you're as multi-token words are not yet supported by the ngram approach.

Thanks! So no transitivity.