Currently my work is not published because it mostly consists of small parts that are not yet bound together. I'm planning to start committing things to GitHub at the end of the upcoming week.
During the past week I've arrived at a complete vision of the task. I've also written out by hand the workflow for transforming training data into features, as well as the model-scoring algorithm, so now I have to implement everything I've written down.
During the week I received the initial training set to work with, produced using the feature-extraction tool. The tool will be extended later to extract more features and to work with other languages.
Started developing the suggestion-scoring model using TensorFlow. The notebook will be shared in a couple of days, but I won't share the training set because of privacy limitations.
I've extracted from the dataset, using Java, the same features that After the Deadline uses (Java makes it easy to integrate the feature-extraction step with the LT code). Today I'm going to train the model that will order suggestions.
I've trained the first model (colab notebook link). No complicated feature engineering and no model tuning were applied; this model is just a starting point. It achieves ~65% accuracy at predicting whether the user will choose a given suggestion.
The features used are:
left and right 3-gram context probability
edit distance between a misspelling and the suggestion
whether the first letter of the suggestion matches the first letter of the misspelling
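The features above can be sketched roughly as follows. This is a hypothetical illustration, not the actual extraction code: the function and field names are made up, and the 3-gram context probabilities are assumed to come from a separate language model that isn't shown here.

```python
def edit_distance(a: str, b: str) -> int:
    # Classic dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,        # deletion
                           cur[j - 1] + 1,     # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def extract_features(misspelling: str, suggestion: str,
                     left_prob: float, right_prob: float) -> dict:
    # left_prob / right_prob would come from a 3-gram language model
    # evaluated on the left and right context (not shown here).
    return {
        "left_context_prob": left_prob,
        "right_context_prob": right_prob,
        "edit_distance": edit_distance(misspelling, suggestion),
        "first_letter_match": misspelling[:1].lower() == suggestion[:1].lower(),
    }
```

For example, extract_features("qick", "quick", 0.1, 0.2) yields an edit distance of 1 and a first-letter match.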
Today I will compare the model with LT's current approach using a more relevant quality metric (the number of times the top-1 suggestion was selected by the user) and will work on integrating the model with the spellchecker's ordering algorithm.
Had trouble accessing my Google Colaboratory notebook, so I've set up Azure Notebooks as a backup workspace.
Worked mostly on integrating the TensorFlow model with LT.
I've finally compared LT's current corrections orderer with the ML-based one. LT scores 86%, while the ML-based orderer scores only 70%. The scoring function is the scaled number of times the top-1 suggestion was selected by the user.
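The metric is straightforward to compute. A minimal sketch, assuming the evaluation data is a list of (ordered suggestion list, user's actual choice) pairs; the function name is my own:

```python
def top1_selection_rate(records) -> float:
    """Fraction of cases where the first-ranked suggestion
    was the one the user actually selected."""
    hits = sum(1 for suggestions, choice in records
               if suggestions and suggestions[0] == choice)
    return hits / len(records)

# Toy example: the orderer ranked correctly in one of two cases.
rate = top1_selection_rate([
    (["quick", "quack"], "quick"),   # top-1 hit
    (["jumps", "jamps"], "jamps"),   # top-1 miss
])
print(rate)  # 0.5
```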
Now I’m going to
finish the integration of the ML-based orderer with LT
I just couldn't resist trying to beat the current LT solution's score. Recently I tried XGBoost on the same feature set, and it scores >87%, which beats the current baseline.
Now I'll continue working on integrating the ML-based solution with LT and on the other tasks mentioned above.
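Since the notebook isn't public, here is a hypothetical sketch of the same idea using scikit-learn's gradient boosting (XGBoost's scikit-learn API is nearly identical). The data below is synthetic, with columns standing in for the four features described earlier; it is not the real training set.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in data: four columns mimic the feature set
# (left/right 3-gram probability, edit distance, first-letter match).
rng = np.random.default_rng(0)
X = rng.random((200, 4))
# Pretend that a low "edit distance" value means the user picked the suggestion.
y = (X[:, 2] < 0.5).astype(int)

model = GradientBoostingClassifier(n_estimators=50, max_depth=3)
model.fit(X, y)
print(model.score(X, y))  # training accuracy on this toy data
```

With XGBoost one would swap in xgboost.XGBClassifier with essentially the same fit/score calls.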
Found a prototype for the keyboard distance feature extractor.
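The idea behind a keyboard-distance feature is that typos tend to involve physically adjacent keys (e.g. "qick" for "quick"). A minimal sketch, assuming a plain QWERTY layout and ignoring row stagger; this is my own illustration, not the prototype mentioned above:

```python
# Map each lowercase letter to a (row, column) position on a QWERTY keyboard.
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
COORDS = {c: (r, col)
          for r, row in enumerate(QWERTY_ROWS)
          for col, c in enumerate(row)}

def key_distance(a: str, b: str) -> float:
    """Euclidean distance between two keys on the layout."""
    (r1, c1), (r2, c2) = COORDS[a], COORDS[b]
    return ((r1 - r2) ** 2 + (c1 - c2) ** 2) ** 0.5

print(key_distance("q", "w"))  # 1.0 (adjacent keys)
```

A per-word feature could then sum key distances between aligned characters of the misspelling and the suggestion.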
Working on deploying the version of LT that uses the ML-based suggestions orderer on a Linux instance in the cloud.
We actually have something like that in our code base at org.languagetool.dev.wordsimilarity.BaseKeyboardDistance (and its subclasses). I'm not sure if it has been tested yet, though.
Received the training set for almost all of the supported languages.
Working on deploying the LT version that uses the ML-based spellchecker in the cloud, and on publishing the sources.
Started the ML-based suggestions orderer in the cloud. The API is still the same, so the 18.191.120.72:8081/v2/check endpoint can be used to check text (my favorite example is "A qick brown fox jamps ower the lasy dog"):
oleg@DESKTOP-UFQCH1N:/mnt/c/Users/olegs$ curl --data "language=en-US&text=A qick brown fox jamps ower the lasy dog" http://18.191.120.72:8081/v2/check
Right now it works only with English texts; I'll add more languages and publish the sources today, June 10.