LT and GSoC 2018 - looking for students

I’m happy to announce that LanguageTool has been accepted as a mentoring organization for Google Summer of Code 2018 (GSoC). Soon, students will look for interesting tasks in LT, and some of them will also appear here on the forum. Let’s help them find something exciting to work on!

Are you a student? See here for information on how to get started.

In case you don’t know GSoC: it’s a program sponsored by Google that pays students for working on Open Source software. The students apply to organizations like LanguageTool, and if they are selected, they can work almost full time on LanguageTool for 3 months. Google pays them something around USD 2400-6000 when they successfully finish their tasks.


I don’t know if it is exciting, but maybe this is a chance for a new suggestion approach for words?
All I would need is entries of the form:
word:flags:suggestion
e.g.:

BTW:oh:btw

meaning: “BTW” is an optional way of writing the word “btw”.
The rule should be able to use the flag (much like a postag, although it is of course not a postag).

or

rijwiel:lf:fiets
using a rule that creates the message: using ‘fiets’ makes the text less formal.
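For illustration, here is a rough sketch (not existing LT code, just my own idea of how it could work) of reading such a colon-separated file and turning the flags into messages; the flag names and message texts are only the examples above:

```python
# Hypothetical flag -> message template mapping, based on the examples above.
FLAG_MESSAGES = {
    "oh": "'{word}' is an optional way of writing '{suggestion}'.",
    "lf": "Using '{suggestion}' makes the text less formal.",
}

def load_suggestions(path):
    """Parse lines of the form word:flag:suggestion into a lookup dict."""
    suggestions = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.count(":") < 2:
                continue  # skip blank or malformed lines
            word, flag, suggestion = line.split(":", 2)
            suggestions[word] = (flag, suggestion)
    return suggestions

def message_for(word, suggestions):
    """Return the rule message for a word, or None if there is no entry."""
    entry = suggestions.get(word)
    if entry is None:
        return None
    flag, suggestion = entry
    template = FLAG_MESSAGES.get(flag, "Consider writing '{suggestion}'.")
    return template.format(word=word, suggestion=suggestion)
```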

I don’t know whether LT is looking for deep learning proposals for its spell checker tool or not, but I still have something in mind that I could do for GSoC ’18. I am familiar with CNNs and RNNs, and I think I can build a model that takes a sentence with spelling mistakes as input and outputs the same sentence with the mistakes corrected, using RNNs. I can do this in Python using TensorFlow. But I wonder whether this is needed by LT or not.
If the org is interested in this, I can write a good proposal for it. It would be very interesting for me. Thanks.

This would be very interesting for Dutch at least, not for simple spelling issues, but for spellchecking multi-word groups, especially as it could simply use pairs of wrong and corrected sentences as training material.

Yes, we’re very much interested in this. For non-ML programming, we usually expect students to submit a pull request with a small bug fix or a small feature. Do you think there’s any equivalent for ML?

Hey! My name is Vishakha and I am a GSoC ’18 aspirant. Learning that AI is the main approach for LanguageTool to detect errors and correct them more efficiently got me intrigued. As part of a project that I recently worked on, I used some interesting ML algorithms for classification. The crux of the process was to create triples of sentences and their relationships as POS (part-of-speech) tags. I used the triples over some trained sentences to analyse whether they were questions, statements (pertaining to the application), or chat sentences. I strongly believe that this method could be extended to train on correct sentences and identify errors in test sentences. I used Java-based dependencies to extract the POS tags of words.

This application could be achieved with neural networks, and TensorFlow could be a great tool to go ahead with. Does that sound like something that could be taken up? I see that the existing issues at wiki.languagetool.org are related to web development. Is there any other place to look at the existing AI approach of LanguageTool where I could learn and contribute? Thank you!

Hi Vishakha, thanks for your interest in LanguageTool! Did you have a look at this thread? I think it is close to what you’re suggesting. So what we need is the next step: either extend the existing approach to also cover multi-token confusions (like your vs. you're), or a seq2seq approach that can find errors that are more complex than just word confusions.

Hey,
In the LT Ideas Page I came across the “Extend AI approach” task, and it seems like something I can contribute to, as I am currently doing my research on encoder-decoder models. Is there any dataset that I can use to start training the seq2seq model on?

There’s no public data set, but we collect corrections from users who allow that. Anyway, the first thing I’d try is to generate errors: e.g. take sentences that contain “your” (e.g. from Tatoeba) and replace it with “you’re”, and you have incorrect sentences. As there might be errors in Tatoeba, this might not work in 100% of the cases, but let’s hope the approach is robust enough to deal with that.
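A minimal sketch of what I mean (the file names are just placeholders; any plain-text list of sentences, one per line, would work):

```python
# Sketch: create artificial "your" -> "you're" errors from a list of correct
# sentences, writing "corrupted<TAB>correct" pairs for seq2seq training.
import re

def write_noisy_pairs(in_path="tatoeba_sentences.txt", out_path="noisy_pairs.tsv"):
    pattern = re.compile(r"\byour\b")  # only lowercase "your", to keep it simple
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            sentence = line.strip()
            if pattern.search(sentence):
                corrupted = pattern.sub("you're", sentence)
                dst.write(corrupted + "\t" + sentence + "\n")

if __name__ == "__main__":
    write_noisy_pairs()
```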

Hello sir!
I was wondering if LanguageTool would be willing to take on a completely new language. I am quite proficient in Hindi and would love to help make it part of the LanguageTool family!

Also, I am really interested in machine learning methods like CNNs.

Hi, welcome to LanguageTool. Yes, adding a new language could be part of a GSoC project. However, as adding a new language does not necessarily require a lot of programming, it would probably not be enough on its own. If you read this thread, you’ll find some links to the wiki with more ideas. And machine learning is interesting, but we’ll need a good plan for how to approach it.

Daniel, we could have several approaches. There is the 5-gram to single-word confusion approach now. I can imagine one for sentence-end detection as well, one for POS tagging, and a wrong-sentence-to-right-sentence approach. These could be different projects.

Hello all, my name is Heet, and I am very much interested in LanguageTool and want to work in this organization. I am both interested and experienced in ML, NLP, and TensorFlow. As my college project, I recently created a mail classifier that routes mail to the relevant department of a company just by reading its content. I know it is a beginner’s project, but I really want to contribute to this organization to learn and implement new ML algorithms. I know both Python and Java.

Hi Heet, thanks for your interest in LT. It would be great if you - and everybody else interested in ML - can come up with a more detailed plan and some prototype code for your ideas. In the end, you’ll need to write an application anyway. This is quite a bit of work, but it vastly increases your chances of being selected for GSoC at LT.


So, I found some confusion sets here. I combined all the pairs in confusion_sets.txt, confusion_set_candidates.txt, confusion_sets_extended.txt, and confusion_sets.README to get 1256 confusion sets in total.
Here is the git repo. You can find the confusion sets in the “conf_list.p” Python pickle dump.
“eng_sent.txt” contains 921997 sentences from Tatoeba, and “incorrect_sentences.txt” is the parallel ‘noisy’ dataset.
Let me know how I can proceed!

I think a more viable approach is to focus on a single confusion pair first and make that work. You could generate input by artificially creating errors for that confusion pair.

Hmmmm. Well, I had considered that, but I think there would be a problem with training the model then.
Let’s take the case of “your” and “you’re”. If I take some sentences and replace all occurrences of “your” with “you’re”, then when you train the seq2seq model it will just learn to replace all the “you’re”s with “your” regardless of the context of the sentence, i.e. it won’t even consider the meaning of the sentence; it will just learn to output “your” whenever it sees “you’re”.
It’s something that I noticed before when I used seq2seq for MT. Let me know if I am wrong.

Wouldn’t this be solved by also doing the opposite? your -> you're and you're -> your (in other sentences) - then you have examples for both cases?
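Roughly like this, as a sketch: the target is always the original (correct) sentence, and the source is corrupted in one direction or the other, so the model can only get it right by looking at the context:

```python
# Sketch of the "both directions" idea: corrupt "your" -> "you're" in some
# sentences and "you're" -> "your" in others; targets stay unchanged.
import re

YOUR = re.compile(r"\byour\b")
YOURE = re.compile(r"\byou're\b")

def training_pairs(sentences):
    """Yield (source, target) pairs with artificial errors in both directions."""
    for s in sentences:
        if YOUR.search(s):
            yield YOUR.sub("you're", s), s   # "your" corrupted to "you're"
        elif YOURE.search(s):
            yield YOURE.sub("your", s), s    # "you're" corrupted to "your"

# Example:
# list(training_pairs(["I like your hat.", "I think you're right."]))
# -> [("I like you're hat.", "I like your hat."),
#     ("I think your right.", "I think you're right.")]
```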

Oh, yeah, that’s true. My bad.