Back to LanguageTool Homepage - Privacy - Imprint

[GSoC weekly] Confusion Pair Correction Using Sequence to Sequence Models


I will provide weekly updates on my GSoC project in this thread. Previous discussions on this project can be found here.

Project : Confusion Pair Correction Using Sequence to Sequence Models
Mentor : @jaumeortola


[Update May 15th]
-> Created training data that covers instances of all confusion pairs.
-> Dataset contains 12,750,694 sentences out of which 921,997 are correct sentences and the rest are incorrect variations of the correct ones.
-> After doing some research, I figured i’ll use FastText embeddings rather than GloVe embeddings (Although I would like to do a comparison of FastText vs GloVe performance if time permits.)

Target for the next update
-> preprocessing for seq-to-seq model
-> Coding up the seq-to-seq model with bi-directionals LSTM for the encoder.


[Update May 18th]
-> Implemented preprocessing.
-> Implemented bi-directional encoder.
-> Going to start training the model.

-> What is the max input length of the sentence that LT takes? Right now im configuring the model to support max length = 20
-> While the model is training, I want to start working on the integration of the model to LT. I will use the DeepLearning4j library in Java for this. What would be the entry point of my system in the LT codebase? Need some help with that.

(jaumeortola) #4

There is no such a limit in LT. Sentences can be of any length. You can loop over every token in a sentence, and look at the 10 tokens before and 10 tokens after. That sounds reasonable. You can adjust these parameters later after testing.

You will have to look carefully if there is any discrepancy between LanguageTool word tokenization and what you expect it to be. For example, currently in LT “you’re” is represented with three tokens. LT analysis can be seen here. If you need another tokenization, you can just join the whole sentence and split it again.

(jaumeortola) #5

The entry point will be what we call a “Java rule”. Here you have a simple demo rule.

In a Java rule you can loop over the tokens of a sentence. Each token has a word form, and one or more lemmas and POS tags. If the token has been properly disambiguated, there is only one lemma and one POS tag, but that is not always possible or desirable. I guess you plan to use only word forms. But the other parameters could be useful. I will write more about this.

As a result, the Java rule returns an array of matches (possible errors). See here.

You will probably need an abstract Java rule plus another rule for each language (English, German, etc.) with the language data. See some abstract rules here and some English implementations here.


[Update June 2nd]

-> Finished training the model (English Version 1.0) on the tatoeba dataset
-> Git repo :
-> The data and the pickle files are not uploaded on github as they are too large for github
-> Started training the model on the British National Corpus


[Update June 13th]
-> Model trained on the BNC corpus. The git repo have been updated.
-> Regarding DL4J integration, I have raised an issue on their git and they are looking into it.

On 300 sentences of seen (trained) data :
Precision : 0.96327716754214391
Recall : 0.9012584882220365

On 300 sentences of unseen data from the europarl corpus:
Precision : 0.8451816745655608
Recall : 0.8143074581430746

Avg Time taken per sentences : 0.12467655142826432 seconds

I expect the results to improve once I have trained it on the europarl corpus.
The output of my system is a sequence of 0s and 1s
I also have a ground truth sequence of 0s and 1s
Here is how I have defined my metrics.

Precision = TP/TP + FP
Recall = TP/TP+FN

(Daniel Naber) #8

Thanks for the numbers. Sorry if this has been discussed before, but will there be an easy way to tune for better precision?


There are a few thing I want to try out that will improve the results :

  1. The most obvious one is to train it on more data. I plan to train it on the europarl corpus as well.
  2. I have noticed that some of the wrong predictions are caused due to “you’re”. I use NLTKs tweet tokenizer which treats the the entire word “you’re” as 1 token. But Glove does not have an embedding for “you’re”, it has embeddings for “you” and “'re” . Therefore while training it treats “you’re” as an or unknown token. Need to fix that part.
  3. FastText embeddings perform better than Glove. Will need to test if FastText produces better results than Glove.

(Daniel Naber) #10

Thanks. I didn’t only mean general improvement, but is there an easy way to tune for precision, accepting a lower recall? We do that with our ngram approach by changing a threshold and it helps to avoid false alarms.


I’m not exactly sure on how to do that. Are you saying in your ngram approach you kept a threshold that gave fewer False Positives? I will check it out

(jaumeortola) #12

As Daniel said, it would be preferable to have more precision (fewer false positives), even at the expense of less recall (more false negatives). Let’s see if that is possible.

Your test set should be larger to find more interesting information.

I have tested some thousand correct sentences from the Europarl corpus, containing some words in the pair list (either, bounds, man…). Around 13% of the sentences give a false positive. We should try to minimize this number of false alarms.

Strangely, some false positives are related to words not present in the list. Is that possible? Could we simply ignore these positives? I’m not sure if we have talked about this before.

For example, this false positive is for the word “as”, but “as” is not in the list, or is it?

['I', 'would', 'stress', 'here', 'quite', 'clearly', 'and', 'unequivocally', 'that', ',', 'as', 'far', 'as', 'the', 'objectives', 'are', 'concerned', 'and', 'the', 'need', 'to', 'provide', 'the', 'necessary', 'funds', 'for', 'them', ',', 'there', 'is', 'no', 'dissent', 'whatsoever', '.'] 000000000010000000000

This is for “would”, but “would” is not in the list:

['I', 'would', 'point', 'out', ',', 'in', 'any', 'case', ',', 'that', 'one', 'of', 'the', 'organisations', 'with', 'which', 'Osama', 'bin', 'Laden', 'is', 'linked', 'is', 'called', 'the', "'", 'International', 'Front', 'for', 'Fighting', 'Jews', 'and', 'Crusades', "'", '.'] 010000000000000000000

I attach a file with more examples like these: (45.0 KB)

Could you publish a plain text version of the list of confusion pairs you are using so we can see it?


The selected confusion pairs :

Not selected :

“as” was in the confusion pairs list but I had removed it before training the model.
“would” is not there in the confusion pairs.

These cases will be ignored. When a ‘1’ is encountered, the system will do a lookup in the confusion_dict to find its corresponding confusion word. But in the case of “as” it finds that “as” is not in the confusion dict and hence does not change it. But this is a sign that the model should be improved, it’s not supposed to tag non-confusion words as ‘1’.


@dnaber the french confusion pairs found here have only ~50 confusion pairs compared to ~700 for english. Shall I train the model with just these pairs or is there any way I can get more to train on?

(Daniel Naber) #15

You could try to automatically find similar words. Or find lists of commonly confused words on the web. But starting with those ~50 pairs would also be fine.


[Update July 11th]

I have streamlined the code for the training process. Instruction on how to run is added to the Git repo.

Regarding the French confusion pair correction: So far I have trained 6 different version of the model but the results produced are not good. This is because the number of confusion pairs in French is less as compared to that in English. So the number of incorrect sentences formed <<< number of correct sentences.

I have fixed the problem now by using (Tatoeba + europarl + French News Crawl Articles 2013 + French News Crawl Articles 2008) and removing sentences which doesn’t have a confusion word (with a probability of 1/10). The model is being trained now.