[GSoC] Extending AI approach

drex · March 2, 2018, 3:46pm

Hi,
So, over the past few days, I tried various encoder-decoder models which take in incorrect sentences as input and outputs correct sentences. (Just your/you’re confusion pair)

Character Level seq2seq model
First I tried a character level model. Trained it for 2 days, 120 epochs, dataset of size 26046. But the results were not promising as the number of parameters to train in a character level model is very huge.
And then there is the problem of long term dependencies when the sentences are long.

Word Level seq2seq model
I used GloVe embeddings, trained it for 2 days ,100 epochs, It gave some correct results but still not good enough.
Here are some results.

Input : Thanks for your’re explanation.
Output : Thanks for your explanation

Input : Open your’re mouth!
Output : Open your face!

Input : Which is your’re luggage?
Output : Which is your luggage

Input : I need your advice.
Output : I need your advice

Input : Put your’re hands down!
Output : Put your hands up!

Input : Enjoy your’re meal!
Output : Enjoy your bed!

Input : Here is your change.
Output : Here is your change

Input : On your feet, children!
Output : On your story, please!

But longer sentences becomes a problem. It learned to correctly distinguish between when to use ‘your’ and when to use ‘you’re’ but the generating decoder part is erroneous. It can be made better (maybe) if I gave it more training data and trained it for more epochs.

Shall I try another approach which I think might work? Since the model knows when to use ‘your’ and when to use ‘you’re’ depending upon the context, I can train it to figure out where exactly in the sentence (which words) are incorrect. And change just those words. This would greatly reduce the complexity as it would avoid generating the whole sentence.

dnaber · March 2, 2018, 5:01pm

That sounds like a good idea. Thanks for your efforts so far! Are you training on a powerful computer with a GPU?

drex · March 3, 2018, 6:53am

I have access to a cluster consisting of forty Boston SYS-7048GR-TR nodes equipped with dual Intel Xeon E5-2640 v4 processors. Each node has 128 GB of 2400MT/s DDR4 ECC RAM and four Nvidia GeForce GTX 1080 Ti GPUs.

drex · March 6, 2018, 4:41pm

Update

So I trained a new model following the new approach. This just took 2 hours to train 52000 sentences for 100 epoch. The results look promising.

Input sentence: How old are your brothers and sisters
output : 0 0 0 0 0 0 0

Input sentence: How old are you’re brothers and sisters
output : 0 0 0 1 0 0 0

Input sentence: How long does it take to get from here to your house on foot
output : 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Input sentence: How long does it take to get from here to you’re house on foot
output : 0 0 0 0 0 0 0 0 0 0 1 0 0 0

Input sentence: Does you’re dog really have a library card
output : 0 1 0 0 0 0 0 0

Input sentence: Wash you’re plate, please
output : 0 1 0 0

Input sentence: How many times do I have to tell you that you’re method is right
output : 0 0 0 0 0 0 0 0 0 0 1 0 0 0

Input sentence: If your going that way please get me you’re tape recorder
output : 0 1 0 0 0 0 0 0 1 0 0

If the ith position in the output is 1 , that means the ith word in the input sentence is wrong. Ill put the code up on github as soon as I clean it up a bit.

So, this model correctly predicts which words in a sentence are wrong. So ill just check which words produces 1 in the output and if that word is ‘your’ , ill just change it to ‘you’re’ and vice versa.
But to generalize this approach to other confusion pairs I need a dictionary of common confusion pairs. Does LT have such a dictionary or a list of common confusion words?

dnaber · March 6, 2018, 6:48pm

For English, see confusion_* at languagetool/languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en at master · languagetool-org/languagetool · GitHub - it doesn’t contain multi-token sets, though.

drex · March 10, 2018, 6:45am

I have extended the model for these confusion pairs as well.
[‘edition’, ‘addition’],
[‘shall’, ‘shell’],
[‘see’, ‘sea’],
[‘role’, ‘roll’],
[‘isle’, ‘aisle’],
[‘your’ , “you’re”],
[‘which’, ‘witch’],
[‘wrong’, ‘wring’],
[‘wrongs’, ‘wrings’],
[‘things’, ‘thinks’],
[“dont”, “don’t”]

Try running the code from here
You can add your input sentences to “test.txt” and see the results.
Let me know the cases where the outputs are incorrect.
You will need the latest versions of keras and tensorflow installed though.
This approach seems to work and can be easily extended to other confusion pairs as well. Shall I start writing my proposal based on this approach ?

drex · March 10, 2018, 6:46am

Multi token sets? Can you give me an example?

dnaber · March 10, 2018, 8:00am

e.g. your ↔ you're (you're is three tokens in LT) - but as you’ve used don't already, that should be covered (BTW, is dont even a word?)

I haven’t tested this yet, but: yes. Please remember that we’re not looking for a stand-alone model, but we want to extend LT, so we need this to be integrated into LT, which is Java-based. We also can only integrate features that don’t slow us down too much. For example, checking an average-length sentence with LT currently takes maybe 20-40ms or so.

drex · March 10, 2018, 11:31am

It’s a common spelling mistake. Oh wait, are simple spelling mistakes already handled by LT? I just realized all the words in the confusion pairs are actual words in English. I had assumed they were words with spelling mistakes that needed to be corrected. But now that I think about it, correcting all minor spelling mistakes might be beyond the scope of this model. example: the model will not fix “behynd” to “behind” unless I explicitly tell the model that this is a confusion pair. And manually listing all possible incorrect spellings of a word won’t be feasible. So, spelling mistakes won’t be handled, but it will correctly learn when to use the words listed in the confusion pairs. Is that good enough?

drex · March 10, 2018, 11:34am

Please do, I just hoped it will work on other systems other than mine

drex · March 10, 2018, 11:35am

I have coded in Java before. Won’t be a problem writing a Java wrapper code around the models I build in python.

drex · March 10, 2018, 11:37am

I’ll test the speed of the current model and let you know.

dnaber · March 10, 2018, 2:02pm

Yes, so detecting dont won’t be useful - unless your approach provides better suggestions.

drex · March 10, 2018, 3:33pm

So how about using the existing LT model to find the spelling mistakes and then running my model to detect confusion pair errors ?
Can you please direct me to the code where spelling mistakes are found, on the LT git page? Or any documentation related to how LT handles spell checking. Want to familiarize myself with the LT codebase.

dnaber · March 10, 2018, 4:09pm

Yes, that’s how LT works anyway: all rules are applied to the text. The speller rule will find spelling errors, your rule would find confusion errors. They don’t run after another, but they are independent. It’s important to understand that we already detect confusion pairs (here), so a new approach only makes sense if it’s better (precision/recall), faster, more general (e.g. multi-token matches), needs less data (we now rely on Google data available only for 6 languages), or all of that.

Documentation can be found at Spell check - LanguageTool Wiki. The code for English is here: languagetool/languagetool-language-modules/en/src/main/java/org/languagetool/rules/en/AbstractEnglishSpellerRule.java at master · languagetool-org/languagetool · GitHub

drex · March 17, 2018, 10:40am

I went through the current LT code for finding confusion errors here. Wanted to know if my understanding is right.

Given input sentence : “Show that it is you’re problem to do this”.
You first find all the words in the confusion pair list . (In this case you’re) And then calculate the tri-gram probabilities of “it is you’re” and “it is your” and choose the one with the highest probability.

But what if there are multiple confusion words in a sentence?
Example : “Show that it is you’re end problem to do this”.
In this case both “your’re” and “end” is there in the confusion pair list. [your , you’re] and [end , and]
Do you find the probabilities of all the possible combinations of these words?
i.e, Do you find the tri-gram probablilities of all these combinations : “is you’re end” , “is your end” , “is your and” and “is you’re and” ?

drex · March 17, 2018, 10:40am

Also going through the conf pair list I found pairs like
[same, sane]
[same, some]
[some, sum]
If “sum” is confused with “some” and “some” is confused with “same”, is it safe to assume that “sum” can be confused with “same”. As in, does transitivity apply here?

SkyCharger001 · March 17, 2018, 11:01am

Personally I think that “same” and “sum” differ a little too much in meaning for transitivity to apply.

dnaber · March 17, 2018, 11:33am

Here’s a documentation of how it works: Finding errors using n-gram data - LanguageTool Wiki. Also, this doesn’t apply to you're as multi-token words are not yet supported by the ngram approach.

drex · March 17, 2018, 12:17pm

Thanks! So no transitivity.