I’d prefer if you could discuss it here in the forum.
So far, the model has been formulated as a generative model that generates a sequence of 1s and 0s. I want to change that to a classification model: as soon as my model finds a confusion word, it classifies it as a 0 or a 1. Instead of generating 0s and 1s, I classify each confusion word as a 0 or a 1.
For this, I have to train a separate model for each confusion pair. So there will be a model trained for your/you’re and a separate model trained for roll/role, etc. The encoder that I am currently using will stay more or less the same, but the decoder will be removed and replaced with a classifier.
-> A non-confusion word will never be classified as an error.
-> Sequence-to-sequence generative models require a lot of data. This would require significantly less.
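A minimal sketch of the per-pair setup described above, assuming a hypothetical `classifiers` dictionary that maps each confusion pair to its own model. The names `CONFUSION_PAIRS`, `find_pair`, `check_sentence` and `toy_roll_role` are illustrative only, the label convention (1 = used incorrectly) is an assumption, and the stand-in classifier is a trivial rule rather than a trained network.

```python
from typing import Callable, Dict, FrozenSet, List, Optional, Tuple

# Hypothetical confusion pairs; each pair would get its own trained model.
CONFUSION_PAIRS: List[FrozenSet[str]] = [
    frozenset({"your", "you're"}),
    frozenset({"roll", "role"}),
]

# A classifier takes (tokens, index) and returns 0 or 1.
# Assumption: 1 means "used incorrectly", 0 means "used correctly".
Classifier = Callable[[List[str], int], int]


def find_pair(word: str) -> Optional[FrozenSet[str]]:
    """Return the confusion pair that contains `word`, or None."""
    for pair in CONFUSION_PAIRS:
        if word.lower() in pair:
            return pair
    return None


def check_sentence(
    tokens: List[str], classifiers: Dict[FrozenSet[str], Classifier]
) -> List[Tuple[int, str, int]]:
    """Classify only confusion words; all other words are never flagged."""
    results = []
    for i, token in enumerate(tokens):
        pair = find_pair(token)
        if pair is not None and pair in classifiers:
            results.append((i, token, classifiers[pair](tokens, i)))
    return results


def toy_roll_role(tokens: List[str], i: int) -> int:
    """Stand-in for a trained roll/role model: flags 'roll' after a possessive."""
    prev = tokens[i - 1].lower() if i > 0 else ""
    return int(tokens[i].lower() == "roll" and prev in {"your", "her"})


classifiers = {frozenset({"roll", "role"}): toy_roll_role}
tokens = "Your roll in this project is different from her role".split()
print(check_sentence(tokens, classifiers))
```

Because only words found in a confusion pair ever reach a classifier, a non-confusion word cannot be reported as an error, which is the first advantage listed above.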
My theory is that the current model is not working because the output space of the decoder is too large. Also, in encoder-decoder models, the ith prediction of the decoder depends on the repeat vector (from the encoder) and the (i-1)th prediction (of the decoder). This works well in machine translation because the ith generated word depends on the word that comes before it. But it’s not suitable for our task, because the ith label does not depend on the previous 0 or 1. This is my best guess. In hindsight, I should have treated it as a classification problem from the start.
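Written out roughly, with notation introduced here only for illustration (x is the input sentence, c the encoder's repeat vector, y_i the 0/1 label at position i, and C the set of positions that hold a confusion word), the two formulations differ as follows. This is the standard way such models are factorised, not a formula taken from the project documentation.

```latex
% Encoder-decoder: each label is conditioned on the earlier decoder outputs
p(y_1, \dots, y_n \mid x) = \prod_{i=1}^{n} p(y_i \mid y_{<i}, c), \qquad c = \mathrm{enc}(x)

% Classification: each confusion word is labelled on its own
p(y_i \mid x, i), \qquad i \in C
```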
Sounds good. Not much time is left for GSoC, though, so what are your other To Do items?
The seq2seq approach is now also used by Google, it seems (source), but maybe they just have much more data, and some tricks we don’t know about.
Do you think you will be able to try this new approach in the remaining time? There are very few days before the deadline.
Our objective should be to learn as much as we can (with proper evaluations) to decide which approach is the best and what to do next.
Priorities should be:
- Try the new approach (?). Evaluate (in English). Include precision and recall for each pair, so that we can compare with other methods currently used (n-grams, neural network).
- Make sure the documentation is clear enough to reproduce the work in other languages.
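For the per-pair evaluation, a minimal sketch of how precision and recall could be computed for each confusion pair. It assumes predictions are stored as hypothetical `(pair, gold, predicted)` records with label 1 meaning "used incorrectly"; the function name and the toy records are invented for illustration and are not results from the project.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Each record: (pair name, gold label, predicted label).
# Assumption: label 1 means "used incorrectly", 0 means "used correctly".
Record = Tuple[str, int, int]


def precision_recall_per_pair(
    records: Iterable[Record],
) -> Dict[str, Tuple[float, float]]:
    """Return {pair: (precision, recall)} for the 'incorrect' class."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for pair, gold, pred in records:
        c = counts[pair]  # touching the key ensures every pair is reported
        if pred == 1 and gold == 1:
            c["tp"] += 1
        elif pred == 1 and gold == 0:
            c["fp"] += 1
        elif pred == 0 and gold == 1:
            c["fn"] += 1

    scores = {}
    for pair, c in counts.items():
        flagged = c["tp"] + c["fp"]      # everything the model reported
        real_errors = c["tp"] + c["fn"]  # everything it should have reported
        precision = c["tp"] / flagged if flagged else 0.0
        recall = c["tp"] / real_errors if real_errors else 0.0
        scores[pair] = (precision, recall)
    return scores


# Invented toy records, only to show the call.
records = [
    ("roll/role", 1, 1),
    ("roll/role", 1, 0),
    ("roll/role", 0, 1),
    ("roll/role", 0, 0),
    ("your/you're", 1, 1),
    ("your/you're", 0, 0),
]
for pair, (p, r) in precision_recall_per_pair(records).items():
    print(f"{pair}: precision={p:.2f} recall={r:.2f}")
```

Reporting the two numbers per pair, rather than one overall accuracy, keeps the results comparable with the other methods mentioned above.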
As for integration in LanguageTool, I have tried a Java-Python mock-up and it seems quite straightforward. We’ll try it this week.
Yep, they use a char level seq2seq model. It will require a ton of data and a lot of computational power.
Already started working on it. Yes, I believe I can have a working model before the deadline that supports a few confusion pairs, but I plan on working on this even after the GSoC period ends. This is the topic I have chosen for my Master’s thesis, so I’d be working on this either way.
I agree. The entire documentation will be ready by Monday.
During the last 3 months, I developed a sequence-to-sequence model for confusion pair correction. The model takes in a sequence of words and outputs a sequence of 1s and 0s indicating whether each word was used incorrectly in the sentence.
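As a rough illustration of that input/output format, the sketch below builds a 0/1 target sequence by comparing a sentence as written with its corrected version. Both the label convention (1 = used incorrectly) and the idea of deriving targets from a corrected sentence are assumptions made for this example; `error_labels` is a hypothetical helper, not code from the project.

```python
from typing import List


def error_labels(written: List[str], intended: List[str]) -> List[int]:
    """Return one label per token: 1 where the written word differs, else 0."""
    if len(written) != len(intended):
        raise ValueError("both sentences must have the same number of tokens")
    return [int(w != c) for w, c in zip(written, intended)]


written = "Your roll in this project is important".split()
intended = "Your role in this project is important".split()
print(error_labels(written, intended))  # [0, 1, 0, 0, 0, 0, 0]
```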
All my work along with the documentation can be found here.
Currently, the model supports English and French. Since the model does not produce the desired accuracy, it was not extended to German. Rather, I spent the last phase of GSoC figuring out a new way to solve the problem. I realized the inherent flaw was treating the task as a generative problem rather than a classification one. I am going to continue working on this project by adopting a classification approach for this task. The new approach is described in the “Results and Future Work” section in the link provided above.
I’d like to thank the LT community for the opportunity and @jaumeortola for his guidance.
Hi @drex, thanks for your work during GSoC! We’re looking forward to your future contributions.
I was once told there are no stupid questions. But this may be one. I read the documentation, and was struck by this: The text states the entire sentence is used, but then it says: ‘If the ith word is a confusion word, then LSTM2 is used to capture the context from start_of_sentence to the ith word’. Which is only the part before.
I think the text after could be just as significant… Am I reading this the wrong way?
Sorry, it wasn’t explained thoroughly. Consider we are making a model to just correct role/roll confusion pairs.
Take this sentence for example: "Your roll in this project is different from her role in the documentation."
As you can see, roll in the 2nd position is used incorrectly and role in the 10th position is used correctly.
First we run LSTM1 to get the meaning of the entire sentence.
Then we iterate through each word in the sentence and we see that roll in the 2nd position is in our confusion-pairs list. We use LSTM2 from start_of_sentence to the 2nd position. The outputs of both LSTM1 and LSTM2 are given to the classifier to predict whether roll in the 2nd position is used correctly or not. The same is done when we iterate to role in the 10th position.
So, in short, we use LSTM2 to differentiate between roll in the 2nd position and role in the 10th position. LSTM2 is used to provide positional information to the model.
If we don’t have LSTM2, then the model will not know which roll/role occurrence is used incorrectly in the sentence.
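The steps above can be sketched without any neural network code, just to show which tokens each LSTM would see. This is an illustration under assumptions: `classifier_inputs` and the dictionary keys are hypothetical names, tokenisation is a plain whitespace split, and no actual LSTM or classifier is run.

```python
from typing import Dict, List, Set


def classifier_inputs(tokens: List[str], confusion_words: Set[str]) -> List[Dict]:
    """For each confusion word, list the token spans fed to LSTM1 and LSTM2."""
    examples = []
    for i, token in enumerate(tokens):
        if token.lower() in confusion_words:
            examples.append(
                {
                    "position": i + 1,               # 1-based, as in the post
                    "word": token,
                    "lstm1_input": tokens,           # the entire sentence
                    "lstm2_input": tokens[: i + 1],  # start of sentence up to the word
                }
            )
    return examples


sentence = "Your roll in this project is different from her role in the documentation"
tokens = sentence.split()
for ex in classifier_inputs(tokens, {"roll", "role"}):
    print(ex["position"], ex["word"], ex["lstm2_input"])
```

Both examples share the same `lstm1_input`; only `lstm2_input` changes, and that difference is what tells the classifier which occurrence it is judging.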
I am very interested in using this for Dutch. Would it be of any use to start making a language model now?
For the model architecture I am working on, no language model needs to be trained, as I will use pre-trained word embeddings. If the n-gram model that LT currently uses has not been built for Dutch and there is Dutch data available, then yes!
Can someone help me to get started on this?
There is Dutch data. But I need to know the requirements of the data itself. What should it look like? How ‘clean’ should it be?
Hey, if you want to make an n-gram LM as described here, then the data should just be Dutch sentences from which n-grams can be made.
The cleaner the better! Preferably from a gold standard dataset of Dutch sentences.
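To give an idea of what "sentences from which n-grams can be made" means in practice, here is a minimal sketch that counts n-grams in plain sentences. It is not LanguageTool's actual n-gram tooling: `count_ngrams` is a hypothetical helper, the whitespace split stands in for a real tokeniser, and the two Dutch sentences are invented examples.

```python
from collections import Counter
from typing import Iterable


def count_ngrams(sentences: Iterable[str], n: int) -> Counter:
    """Count every run of n consecutive tokens across the sentences."""
    counts: Counter = Counter()
    for sentence in sentences:
        tokens = sentence.split()  # simplification: whitespace tokenisation
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i : i + n])] += 1
    return counts


# Invented example input: one sentence per string.
sentences = ["ik zie de hond", "ik zie de kat"]
bigrams = count_ngrams(sentences, 2)
for ngram, count in bigrams.most_common():
    print(" ".join(ngram), count)
```

Noise such as markup remnants, duplicated lines or misspelled text ends up directly in these counts, which is why cleaner input gives a better model.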
Hm. As far as I know there is no ‘gold standard set’. But I can try to collect the best data there is.
For me, the rest of the procedure seems to be too technical.