LT and GSoC 2018 - looking for students

I don’t know if it is exciting, but maybe this is a chance to get a new approach to word suggestions?
All I would need is a list of entries in the form:
word:flags:suggestion
e.g.:

BTW:oh:btw
meaning: BTW is an optional way of writing the word btw.
The rule should be able to use the flag (much like a POS tag, although it is of course not a POS tag).

or

rijwiel:lf:fiets
using a rule that creates the message: using ‘fiets’ makes the text less formal.
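
To make the proposed format concrete, here is a minimal parser sketch for such entries. The function name is hypothetical, and the flag meanings (‘oh’ for an optional/other form, ‘lf’ for less formal) are just the examples from above, not an existing LT API:

```python
def parse_entries(lines):
    """Parse entries like 'rijwiel:lf:fiets' into word -> (flags, suggestion).

    The word:flags:suggestion format is the one proposed above; the flag
    values ('oh' = other/optional form, 'lf' = less formal) are examples.
    """
    entries = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        word, flags, suggestion = line.split(":", 2)
        entries[word] = (flags, suggestion)
    return entries
```

A rule could then look up the flag for a matched word and emit the corresponding message.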

I don’t know whether LT is looking for deep learning proposals for its spell checker tool or not, but I have something in mind that I could do for GSoC ’18. I am familiar with CNNs and RNNs, and I think I can build a model that takes a sentence with spelling mistakes as input and outputs the same sentence with the mistakes corrected, using RNNs. I can do this in Python using TensorFlow. But I wonder whether this is needed by LT.
If the org is interested in this, I can write a good proposal for it. It would be very interesting for me. Thanks.

This would be very interesting for Dutch at least, not for simple spelling issues, but for spell-checking multi-word groups. Especially as it could simply use pairs of wrong and corrected sentences as training material.

Yes, we’re very much interested in this. For non-ML programming, we usually expect students to submit a pull request with a small bug fix or small feature — do you think there’s any equivalent for ML?

Hey! My name is Vishakha and I am a GSoC ’18 aspirant. Learning that AI is the main approach for LanguageTool to detect and correct errors more efficiently got me intrigued. As part of a project I recently worked on, I used ML algorithms for classification. The crux of the process was to create triples of sentences and their relationships as POS (part-of-speech) tags. I used the triples over some trained sentences to analyse whether they were questions, statements (pertaining to the application) or chat sentences. I strongly believe this method could be extended to train on correct sentences to identify errors in test sentences. I used Java-based dependencies to extract the POS tags of words. This could be achieved with the methodological use of neural networks; moreover, TensorFlow could be a great tool for it. Does that sound like something that could be taken up? I see that the existing issues at wiki.languagetool.org are related to web development. Is there any other place to go through LanguageTool’s existing AI approach where I could learn and contribute? Thank you!

Hi Vishakha, thanks for your interest in LanguageTool! Did you have a look at this thread? I think this is close to what you’re suggesting? So what we need is something that’s the next step: either extend the existing approach to also cover multi-token confusion (like your vs. you're), or a seq2seq approach that can find errors that are more complex than just word confusions.

Hey,
In the LT Ideas Page I came across the “Extend AI approach” task and it seems like something that I can contribute to as I am currently doing my research in encoder-decoder models. Is there any dataset that I can use to start training the seq-2-seq model on?

There’s no public data set, but we collect corrections from users who allow that. Anyway, the first thing I’d try is to generate errors, e.g. take sentences with “your” (e.g. from Tatoeba) and replace it with “you’re”, and you have incorrect sentences. As there might be errors in Tatoeba, this might not work in 100% of cases, but let’s hope the approach is robust enough to deal with that.
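
The generation step described above might look like this sketch (the helper name and the word pair are illustrative):

```python
import re

def make_error_pairs(sentences, correct="your", wrong="you're"):
    """Create (incorrect, correct) sentence pairs by swapping in a wrong word."""
    pattern = re.compile(r"\b%s\b" % re.escape(correct))
    pairs = []
    for sentence in sentences:
        if pattern.search(sentence):
            # Substitute every whole-word occurrence to produce the noisy side.
            pairs.append((pattern.sub(wrong, sentence), sentence))
    return pairs
```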

Hello sir!
I was wondering if LanguageTool would be willing to take in a completely new language. I am quite proficient in Hindi and would love to help make it part of the LanguageTool family!

Also, I am really interested in machine learning methods such as CNNs.

Hi, welcome to LanguageTool. Yes, adding a new language could be part of a GSoC project. However, as adding a new language does not necessarily require a lot of programming, it would probably not be enough on its own. If you read this thread, you’ll find some links to the wiki with more ideas. And machine learning is interesting, we’ll need a good plan on how to approach it, though.

Daniel, we could have several approaches. There is the 5-gram-to-one-word confusion approach now. I can imagine one for sentence-ending detection as well, one for POS tagging, and a wrong-sentence-to-right-sentence approach. These could be different projects.

Hello all, my name is Heet, and I am very much interested in LanguageTool and want to work in this organization. I am interested in, as well as experienced with, ML, NLP and TensorFlow. As my college project, I recently created a mail classifier that routes mail to the relevant department of a company just by reading its content. I know it is a beginner’s project, but I really want to contribute to this organization to learn and implement new ML algorithms. I know both Python and Java.

Hi Heet, thanks for your interest in LT. It would be great if you - and everybody else interested in ML - could come up with a more detailed plan and some prototype code for your ideas. In the end, you’ll need to write an application anyway. This is quite a bit of work, but it vastly increases your chances of being selected for GSoC at LT.

So, I found some confusion sets here. I combined all the pairs in confusion_sets.txt, confusion_set_candidates.txt, confusion_sets_extended.txt, and confusion_sets.README to get 1256 confusion sets in total.
Here is the Git repo. You can find the confusion sets in “conf_list.p”, a Python pickle dump.
“eng_sent.txt” contains 921,997 sentences from Tatoeba, and “incorrect_sentences.txt” is the parallel ‘noisy’ dataset.
Let me know how I can proceed!
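
For reference, a loader for such files might look like this sketch. I’m assuming semicolon-separated pairs with an optional factor and ‘#’ comments, which may not match the exact format of LT’s files:

```python
def load_confusion_pairs(lines):
    """Collect (word1, word2) pairs from confusion-set style lines.

    Assumes lines like "their; there; 10  # optional comment"; the exact
    format of LT's confusion set files may differ slightly.
    """
    pairs = set()
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        parts = [p.strip() for p in line.split(";")]
        if len(parts) >= 2:
            pairs.add((parts[0], parts[1]))
    return pairs
```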

I think a more viable approach is to focus on a single confusion pair first and make that work. You could generate input by artificially creating errors for that confusion pair.

Hmmmm. Well, I had considered that, but I think there would be a problem with training the model then.
Let’s take the case of “your” and “you’re”. If I take some sentences and replace all occurrences of “your” with “you’re”, then when I train the seq2seq model it will just learn to replace every “you’re” with “your” regardless of the context of the sentence, i.e. it won’t even consider the meaning of the sentence; it will just learn to output “your” whenever it sees “you’re”.
It’s something I noticed before when I used seq2seq for MT. Let me know if I am wrong.

Wouldn’t this be solved by also doing the opposite? your -> you're and you're -> your (in other sentences) - then you have examples for both cases?
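
Generating both directions from the same corpus could be sketched like this (hypothetical helpers, not LT code):

```python
import re

def swap_words(sentence, a, b):
    """Replace whole-word occurrences of a with b, and of b with a, in one pass."""
    pattern = re.compile(r"\b(?:%s|%s)\b" % (re.escape(a), re.escape(b)))
    return pattern.sub(lambda m: b if m.group(0) == a else a, sentence)

def make_balanced_pairs(sentences, a="your", b="you're"):
    """Create (incorrect, correct) pairs covering both swap directions."""
    pattern = re.compile(r"\b(?:%s|%s)\b" % (re.escape(a), re.escape(b)))
    return [(swap_words(s, a, b), s) for s in sentences if pattern.search(s)]
```

Because sentences containing either word contribute a pair, the model sees corrections in both directions instead of one degenerate mapping.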

Oh yeah, that’s true. My bad.

Hello, my name is Oleg and I hope to participate in GSoC this year.
I’ve read the ideas list and am now really interested in improving the spell checker, for the following reasons: I’ve practiced NLP and spelling-correction tool development, and I am experienced with Elasticsearch (and suggest using Elasticsearch in the server-based spell checker).
Also, when thinking about compressing the data when (or if) moving the task to the client side, the first idea that comes to mind is using data structures such as suffix trees to deal with the size of the dictionaries.
Now I’m going to find some Java-related bug in the issue list and fix it, but I am happy to discuss the ideas stated above.
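
To illustrate the space-sharing idea, here is a minimal sketch using a prefix trie rather than a suffix tree (the same principle of storing shared substrings once; this is illustrative code, not anything from LT):

```python
class TrieNode:
    """One node of a prefix trie; shared prefixes are stored only once."""
    __slots__ = ("children", "is_word")

    def __init__(self):
        self.children = {}  # char -> TrieNode
        self.is_word = False


class Trie:
    """Minimal dictionary trie supporting add and membership lookup."""

    def __init__(self, words=()):
        self.root = TrieNode()
        for word in words:
            self.add(word)

    def add(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def contains(self, word):
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word
```

A production dictionary would more likely use a minimized finite-state automaton (such as Lucene’s FSTs), but the trie shows the principle.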

Hi Oleg, thanks for your interest in LT. We already use Lucene in LT, so maybe we can avoid the complexity of Elasticsearch.