[GSoC] Extending AI approach

I went through that doc but it doesn’t explain how it handles multiple confusion pairs in a sentence.

What would be the trigrams chosen if the sentence is “Find the sum and difference of your thing”?
“sum” occurs in the confusion pair [“some”, “sum”]; “and” occurs in the confusion pair [“and”, “end”]

I think it doesn’t. Both occurrences will be handled independently of each other, IIRC.
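For what it's worth, here is a minimal sketch of what that independent handling could look like: each confusion-pair occurrence gets its own trigram context, extracted separately from the others (the function name, padding tokens, and windowing scheme are my own assumptions, not LT's actual code):

```python
# Sketch: extract a trigram context around each confusion-pair hit
# independently (hypothetical illustration, not LT's real pipeline).
CONFUSION_PAIRS = [("some", "sum"), ("and", "end")]

def trigram_contexts(sentence):
    tokens = sentence.lower().split()
    pad = ["<s>"] + tokens + ["</s>"]   # sentence-boundary padding
    contexts = []
    for i, tok in enumerate(tokens):
        for pair in CONFUSION_PAIRS:
            if tok in pair:
                # trigram centered on the confusion word:
                # (previous token, the word itself, next token)
                contexts.append((tok, (pad[i], pad[i + 1], pad[i + 2])))
    return contexts

print(trigram_contexts("Find the sum and difference of your thing"))
# → [('sum', ('the', 'sum', 'and')), ('and', ('sum', 'and', 'difference'))]
```

Note that the trigram for “sum” overlaps the one for “and”, but each occurrence is still scored on its own.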

Hey,
I have finished the first draft of my proposal. You can find it here
Please let me know how I can make it better!
Thanks!

Thanks, I think the proposal looks good. Some ideas / remarks (sorry if these have already been answered here in the thread - in that case, they should just be added to the proposal):

  • Where exactly will the data come from? I understand it must be high quality data without errors?
  • “Find community members for future user testing of the proposed model for English , French and German.” -> I think evaluation should be mostly automatic, i.e. you cannot rely on people to “test” your models in the sense that they look at its results for hours. You should in general write more about evaluation. How exactly will this be evaluated? e.g. precision/recall?
  • Do you speak French or why was French your choice?
  • Some words about how this can be integrated into a Java-based software like LT would be nice, e.g. are there problems to expect? Is the library you use for training also available for Java or is that not needed etc.
  • “The proposed model does not treat multiple confusion words separately but it would take the entire sentence as context and will detect all the errors in the sentence.” -> This should be explained a bit - it depends on the training data also having these kinds of “multi errors”, doesn’t it?
  • If possible, please list your prior experience in this area and any other arguments why you should be selected (as we might not get enough slots, we might need to make a choice).

Hi drex,
I have read your proposal and I think your idea is really great! But I still have some questions.

  • Module 1 generates wrong sentences from correct ones by replacing words with confusion words. But as we know, spelling errors are unpredictable, so how can you ensure every misspelled word is in the confusion set?

    For example, the confusion set for “you” is {your, you’re, yours, youth}. How can you ensure that non-word errors like {yu, yo, yoou, yuo} are corrected if they are not in the confusion set?

  • For sentences like "I will go a home now" and "Can you give me pen?", which have one extra or one missing word, how can the system detect them?

  • If the sentence has a wrong verb conjugation, like "Can you gave me a pen?", can the system discover it?

  • In addition, proper nouns are sometimes a very important part of text proofreading. A NN model will not tag a POS for each word, and in consideration of computational efficiency, low-frequency words may never be decoded successfully.

    For example: "Do you know the C programming language? No, I'm not good at C#, but I have some knowledge of C++."

I think seq2seq may be a good approach, but it seems wasteful to use it only for checking confusion words. Perhaps you can get some inspiration from my proposal. I came up with a similarity algorithm to solve the confusion-words problem; its worst-case time complexity is O(n), and if I adjust the query algorithm the average time complexity is O(1). It can do the same job as yours but more efficiently. However, I haven’t tried it on English yet, but I think it is not a bad idea. :grinning::grinning::grinning:

Thanks for the feedback!
Where exactly will the data come from? I understand it must be high quality data without errors?
Since I’m preparing the training data in Module 1, the dataset will just be correct sentences in the language. I have given a lot of thought to which dataset to use, but I think it’s best if I don’t constrain myself to one dataset. I want the sentences to be as diverse as possible, spanning multiple domains. So I’ll take sentences from Tatoeba, Google’s Billion Word dataset (English), the IMDB movie review dataset, a Twitter dataset, etc. Basically, I want to incorporate as many sentences as possible, up to the point of overfitting.

“Find community members for future user testing of the proposed model for English , French and German.” -> I think evaluation should be mostly automatic, i.e. you cannot rely on people to “test” your models in the sense that they look at its results for hours. You should in general write more about evaluation. How exactly will this be evaluated? e.g. precision/recall?
The week before every deliverable I will give a detailed report on the model, including its precision and recall, the time it takes to run, the number of parameters trained, training time, etc.
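As a sketch of what the automatic part of that report could compute (the edit-tuple format here is just an assumption for illustration, not the actual evaluation harness):

```python
# Sketch: precision/recall for confusion-word corrections, comparing
# predicted edits against gold edits (hypothetical edit format).
def precision_recall(predicted, gold):
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)              # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

# Each edit: (sentence_id, token_index, suggested_word)
pred = [(0, 2, "sum"), (0, 5, "and"), (1, 1, "their")]
gold = [(0, 2, "sum"), (1, 1, "there")]
p, r = precision_recall(pred, gold)
print(p, r)  # 1 of 3 predictions correct, 1 of 2 gold errors found
```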
I wanted some language experts to just “check” whether the model is working and give their input on which test cases it failed. They don’t evaluate the model.

They can also help me determine if the dataset that I have collected is of high quality.

Do you speak French or why was French your choice?
I studied French for 2 years in high school, although I am nowhere near fluent. I chose French because I can find a lot of French sentences online, which provides ample training data. Same for German.

Some words about how this can be integrated into a Java-based software like LT would be nice, e.g. are there problems to expect? Is the library you use for training also available for Java or is that not needed etc.
I’ll add this to the proposal. Thanks!

“The proposed model does not treat multiple confusion words separately but it would take the entire sentence as context and will detect all the errors in the sentence.” -> This should be explained a bit - it depends on the training data also having these kinds of “multi errors”, doesn’t it?
I’ll add how sentences with multiple errors are generated by Module 1. Thanks!
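To sketch the idea (the names, probability, and confusion sets here are illustrative assumptions, not the actual Module 1 code): corrupting every confusion-set member in a correct sentence naturally produces training pairs that contain multiple errors at once.

```python
import random

# Sketch (illustrative, not the real Module 1): corrupt a correct
# sentence by replacing each confusion-set member with one of its
# alternatives with probability p, yielding multi-error training pairs.
CONFUSION_SETS = {"sum": ["some"], "some": ["sum"],
                  "and": ["end"], "end": ["and"]}

def corrupt(sentence, p=0.9, rng=None):
    rng = rng or random.Random()
    out = []
    for tok in sentence.split():
        alts = CONFUSION_SETS.get(tok.lower())
        if alts and rng.random() < p:
            out.append(rng.choice(alts))   # inject a confusion error
        else:
            out.append(tok)                # keep the correct token
    return " ".join(out)

# One correct sentence can yield a sentence with several errors:
print(corrupt("Find the sum and difference", p=1.0))
# → "Find the some end difference"
```

The (corrupted, original) pair then serves directly as a (source, target) training example for the seq2seq model.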

If possible, please list your prior experience in this area and any other arguments why you should be selected (as we might not get enough slots, we might need to make a choice).
Okay, I will add this to the proposal. Thanks!

Hey t0iiz,
Thank you very much for the feedback!

The scope of this project is just for correcting confusion pairs. So, spellchecking and other features you mentioned are not handled by this project.

I think the seq2seq may be a good approach. But I think it is a waste if you only use it to check the confusion words.
Well, I found this project on the Missing Features page here
Spellchecking and other features like verb conjugation that you mentioned could be achieved if I used a char-level seq-to-seq model, but then I would need more powerful GPUs and a lot more data. (Trust me, I tried :smile:. Got very bad results.)

Perhaps you can get some inspiration from my proposal
Forgive me, but I didn’t quite understand your methodology. Why convert it to a shortest-path problem on a DAG? Do you use bi-grams for confusion words?

I come up with a similarity algorithm to solve the confusion words problem and its worst time complexity is O(n), if I adjust the query algorithm the average time complexity is O(1)
AFAIK, the time complexity of a seq-to-seq model is also only O(n).

It can do the same job with yours but more efficiently.
I doubt it :stuck_out_tongue:

There will be incorrect sentences in those corpora; how can we deal with that? Or will we ignore that, hoping it doesn’t matter?

Depends. Colloquial styles may be used in, say, a Twitter dataset. Would that be categorized as incorrect?
If there is one incorrect sentence per 6000 correct sentences in a dataset of 1 million sentences, then I don’t think it will have a significant impact on the learning of the model. A few outliers won’t change the weights of the model significantly. If there are too many incorrect sentences, then I won’t use that dataset to train.

I guess it would be safe to assume that any dataset I find online will only have a few incorrect sentences. I will confirm that before using it to train.

LT usually doesn’t consider colloquial style as incorrect, although there are a few style rules about it (at least for German).

Sorry… Maybe I didn’t explain clearly. I mean that every word has its own shape and pronunciation information, and you can build a unique number sequence for each word from that information. Since misspelled words are similar to the true word, the number sequences of these words are also similar to each other. For that reason, building a confusion pair for a word means finding sequences similar to that word’s sequence.
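If I follow, something roughly like this (purely illustrative: I'm using the plain character sequence and difflib's similarity ratio as a stand-in for the real shape/pronunciation encoding):

```python
import difflib

# Sketch of the "similar number sequence" idea: encode each word as a
# sequence (here: just its characters) and look up vocabulary words
# whose sequence is similar enough to the input.
VOCAB = ["you", "your", "you're", "yours", "youth"]

def candidates(word, threshold=0.6):
    # difflib.SequenceMatcher.ratio() returns a similarity in [0, 1]
    return [w for w in VOCAB
            if difflib.SequenceMatcher(None, word, w).ratio() >= threshold]

# A non-word typo maps to nearby vocabulary words:
print(candidates("yu"))  # → ['you', 'your']
```

The real version would presumably replace the raw character sequence with the shape/pronunciation number sequence and use an indexed lookup to get the O(1) average query time you mentioned.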

Hello,
I have finished making the changes to my proposal, and the changes are marked so they are easy to find.
Please let me know if I have missed something.
Thanks!

“Compiling such a dataset will not be hard as the internet contains millions of sentences in English, French and German.” - in case you have a more detailed idea, it should probably be mentioned. For example, are you aware of http://commoncrawl.org?

Hey, I am new to this platform. I have downloaded and tried out the tool.
Can I apply for GSoC?

Hi!
I have done the following course on machine learning:

I am currently doing the following course on machine learning:

I am familiar with AngularJS and Python.
I want to pursue a career in data science.
I found these project ideas that may suit the skill set I possess.
I am new to all of this.
I need some insight into how my skill set correlates with the requirements of the organisation.
Any sort of guidance will be appreciated!

Hi Sababa, thanks for your interest, but you’re a little late, as the deadline ends in about 24 hours. Sorry!

Hey Daniel, when I wrote “the internet contains a lot of sentences” I meant that it contains a lot of freely available datasets that I can use to train each language. If I understood commoncrawl correctly, it provides a dataset of web pages? But they might be noisy, and the sentences may not be correct (because it’s just raw web pages?).

If the proposal needs to be more structured, then how about I stick to the Europarl dataset for the 3 languages? It contains ~2 million sentences per language. The datasets can be found here. And also Google’s Billion Word dataset for English.

Okay, I just wanted to make sure you’re aware that selecting and maybe filtering the data set(s) is an important part of the task. Yes, commoncrawl is unstructured and probably not easy to work with, but it has very broad coverage.

Thanks, I’ll make the necessary changes in the proposal.

@drex When you say “The current LT model” in the proposal, does that refer to the ngram model we use and/or to the approach described here?