[GSoC reports] spellchecker, server-side framework and build tool tasks

Hello,

I will be a part of the LanguageTool team during the Google Summer of Code 2018. The tasks I will work on are:

  • spellchecker suggestions sorting improvement
  • migration to the modern server-side framework
  • migration from Maven to Gradle

In this topic I will post weekly summaries of the work done. I’m also happy to discuss any related thoughts here.


Welcome to the LanguageTool community!

Week 1: 23 Apr – 27 Apr

I discussed plans and tasks with my mentor, @Yakov.
I started the development of auxiliary utilities needed for the project.

Thanks for the update. Are you working in a fork? Could you post a link?

Sorry for the delayed response.

Currently my work isn’t published because it mostly consists of small parts that aren’t tied together yet. I’m planning to start committing things to GitHub at the end of the upcoming week.

Week 2: 30 Apr – 4 May

During the past week I’ve arrived at a complete vision of the task. I’ve also written out on paper the workflow for transforming the training data into features and the algorithm for scoring models, so now I have to implement all of it.

Week 3: 7 May – 11 May

During the week I’ve received the initial training set to work with, using the feature-extraction tool. The tool will be extended later to extract more features and to work with different languages.

Week 4: 14 May

Started the development of the suggestion-scoring model using TensorFlow. The notebook will be shared in a couple of days, but I won’t share the training set because of privacy limitations.
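The notebook isn’t shared yet, so as an illustration only, here is my guess at the rough shape of such a model: a small feed-forward binary classifier over per-suggestion features, predicting whether the user will pick the suggestion (the layer sizes and feature count are stand-ins, not the actual notebook):

```python
import tensorflow as tf

# Minimal sketch: features in, P(user picks this suggestion) out.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # probability output
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
```

The model is built lazily on first call, so it works for any feature-vector width.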

Week 4: 15 May – 16 May

I’ve extracted the same features from the dataset using Java (which makes it easy to integrate the feature-extraction step with the LT code). Today I’m going to train the model that will order suggestions.

Week 4: 17 May – 18 May

I’ve trained the first model (colab notebook link). No complicated feature engineering and no model tuning were applied; this model is just a first one to work with. It has ~65% accuracy on the task of guessing whether the user will choose a given suggestion or not.
The features used are:

  • left and right 3-gram context probability
  • edit distance between the misspelling and the suggestion
  • whether the first letter of the suggestion matches the first letter of the misspelling

Today I will compare the model with LT’s current approach using a more relevant quality metric (the number of times the top-1 suggestion was selected by the user) and will work on integrating the model with the spellchecker’s ordering algorithm.
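As a sketch of the feature vector described above (the n-gram probabilities would come from a language model, so here they are just passed in as plain numbers; all names are my own, not LT’s):

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def features(misspelling: str, suggestion: str,
             lm_left: float, lm_right: float) -> list:
    """One feature vector per (misspelling, suggestion) pair."""
    return [
        lm_left,                                 # left 3-gram context probability
        lm_right,                                # right 3-gram context probability
        edit_distance(misspelling, suggestion),  # edit distance
        float(misspelling[:1].lower() == suggestion[:1].lower()),  # first letters match
    ]

features("qick", "quick", lm_left=0.01, lm_right=0.02)
# edit distance 1 (one inserted letter), first letters both "q"
```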

Week 4: 20 May

Had trouble accessing my Google Colaboratory notebook, so I’ve set up Azure Notebooks as a backup place to work.
Worked mostly on the integration of the TensorFlow model with LT.

UPD: today Colaboratory works fine :\

Week 5: 21 May

I’ve finally compared LT’s current corrections orderer with the ML-based one. LT scores 86% while the ML-based orderer reaches only 70%. The scoring function is the scaled number of times the top-1 suggestion was selected by the user.
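The metric can be sketched as follows (names and data are mine, for illustration): for each case, the orderer’s ranked suggestion list is compared against the suggestion the user actually selected, and we count how often the top-ranked one matches.

```python
def top1_selection_rate(cases) -> float:
    """cases: list of (ranked_suggestions, user_choice) pairs.

    Returns the fraction of cases where the first-ranked suggestion
    is the one the user actually selected.
    """
    hits = sum(1 for ranked, chosen in cases
               if ranked and ranked[0] == chosen)
    return hits / len(cases)

cases = [
    (["quick", "quid", "wick"], "quick"),  # top-1 hit
    (["jumpers", "jumps"], "jumps"),       # miss: correct one ranked second
]
top1_selection_rate(cases)  # 0.5
```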

Now I’m going to

  • finish the integration of the ML-based orderer with LT
  • analyse the cases where the algorithms fail
  • play with the model’s architecture
  • do some feature engineering

Week 5: 22 – 24 May

I just could not resist trying to beat the current LT solution’s score. Lately I’ve tried XGBoost on the same feature set, and it scores >87%, which beats the current baseline.
Now I’ll continue working on the integration of the ML-based solution with LT and on the other tasks mentioned above.

Week 5: 25 May

Cleaned up the model training code to publish it on GitHub.

Week 6: 28 May – 30 May

Published the model training code on GitHub.

Week 6: 31 May, 1 June, Week 7: 4 June

Preparing the feature-extraction tool to work with all languages supported by LT.

Week 7: 5 – 6 June

Found a prototype for the keyboard-distance feature extractor.
Working on deploying the version of LT that uses the ML-based suggestions orderer on a Linux instance in the cloud.
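The idea behind a keyboard-distance feature: typos tend to hit keys near the intended one, so a suggestion whose differing letter sits next to the typed key should score better than one involving a far-away key. A toy sketch of that, with a simplified QWERTY grid of my own (not LT’s implementation):

```python
# Simplified QWERTY layout: row/column coordinates per key.
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
KEY_POS = {ch: (r, c) for r, row in enumerate(QWERTY_ROWS)
           for c, ch in enumerate(row)}

def key_distance(a: str, b: str) -> float:
    """Euclidean distance between two keys on the simplified grid."""
    (r1, c1), (r2, c2) = KEY_POS[a.lower()], KEY_POS[b.lower()]
    return ((r1 - r2) ** 2 + (c1 - c2) ** 2) ** 0.5

key_distance("w", "e")  # adjacent keys -> 1.0
```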

We actually have something like that in our code base at org.languagetool.dev.wordsimilarity.BaseKeyboardDistance (and its subclasses). I’m not sure if it has been tested yet, though.

Week 7: 7 – 8 June

Received the training set for almost all the supported languages.
Working on deploying the LT version that uses the ML-based spellchecker in the cloud and on publishing the sources.

Week 7: 9 June

Started the ML-based suggestions orderer in the cloud. The API is still the same, so the 18.191.120.72:8081/v2/check endpoint can be used to check text (my favorite example is “A qick brown fox jamps ower the lasy dog”):

oleg@DESKTOP-UFQCH1N:/mnt/c/Users/olegs$ curl --data "language=en-US&text=A qick brown fox jamps ower the lasy dog" http://18.191.120.72:8081/v2/check

For now it works only with English texts; I will add more languages and publish the sources today, the 10th of June.
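The response comes back in the LanguageTool v2/check JSON format, with the replacements already in the orderer’s ranked order. A sketch of reading them (the payload below is hand-written in that shape for illustration, not a real server reply):

```python
import json

# Hand-written sample shaped like a v2/check response for the word "qick".
sample = json.loads("""
{"matches": [{
  "message": "Possible spelling mistake found",
  "offset": 2, "length": 4,
  "replacements": [{"value": "quick"}, {"value": "wick"}]
}]}
""")

match = sample["matches"][0]
suggestions = [r["value"] for r in match["replacements"]]
# suggestions are already in the server's ranked order
```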