Currently my work is not published because it mostly consists of small parts that are not yet bound together. I'm planning to start committing things to GitHub at the end of the upcoming week.
During the past week I've arrived at a complete vision of the task. I've also written out by hand the workflow for transforming training data into features, as well as the model-scoring algorithm, so now I have to implement everything I've written down.
During the week I received the initial training set to work with, produced using the feature-extraction tool. The tool will be extended later to extract more features and to work with other languages.
Started developing the suggestion-scoring model using TensorFlow. The notebook will be shared in a couple of days, but I won't share the training set because of privacy limitations.
I've extracted from the dataset, using Java, the same features that After the Deadline uses (Java makes it easy to integrate the feature-extraction step with the LT code). Today I'm going to train the model that will order suggestions.
I've trained the first model (colab notebook link). No complicated feature engineering and no model tuning were applied; this model is just a starting point. It achieves ~65% accuracy at predicting whether the user will choose a given suggestion.
The features used are:
left and right 3-gram context probability
edit distance between a misspelling and the suggestion
whether the first letter of the suggestion matches the first letter of the misspelling
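The features above can be sketched roughly as follows. This is a hypothetical illustration, not the actual extraction code: the function and field names are made up, and the 3-gram context probabilities are assumed to come from a separate language model that isn't shown here.

```python
def edit_distance(a: str, b: str) -> int:
    # Classic dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,        # deletion
                           cur[j - 1] + 1,     # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def extract_features(misspelling: str, suggestion: str,
                     left_prob: float, right_prob: float) -> dict:
    # left_prob / right_prob would come from a 3-gram language model
    # evaluated on the left and right context (not shown here).
    return {
        "left_context_prob": left_prob,
        "right_context_prob": right_prob,
        "edit_distance": edit_distance(misspelling, suggestion),
        "first_letter_match": misspelling[:1].lower() == suggestion[:1].lower(),
    }
```

For example, extract_features("qick", "quick", 0.1, 0.2) yields an edit distance of 1 and a first-letter match.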
Today I will compare the model with LT's current approach using a more relevant quality metric (the number of times the top-1 suggestion was selected by the user) and will work on integrating the model with the spellchecker's ordering algorithm.
Had trouble accessing my Google Colaboratory notebook, so I've set up Azure Notebooks as a backup workspace.
Worked mostly on integrating the TensorFlow model with LT.
I've finally compared LT's current corrections orderer with the ML-based one. LT scores 86%, while the ML-based orderer scores only 70%. The scoring function is the scaled number of times the top-1 suggestion was selected by the user.
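The metric is straightforward to compute. A minimal sketch, assuming the evaluation data is a list of (ordered suggestion list, user's actual choice) pairs; the function name is my own:

```python
def top1_selection_rate(records) -> float:
    """Fraction of cases where the first-ranked suggestion
    was the one the user actually selected."""
    hits = sum(1 for suggestions, choice in records
               if suggestions and suggestions[0] == choice)
    return hits / len(records)

# Toy example: the orderer ranked correctly in one of two cases.
rate = top1_selection_rate([
    (["quick", "quack"], "quick"),   # top-1 hit
    (["jumps", "jamps"], "jamps"),   # top-1 miss
])
print(rate)  # 0.5
```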
Now I’m going to
finish the integration of the ML-based orderer with LT
I just couldn't resist trying to beat the current LT solution's score. Recently I tried XGBoost on the same feature set, and it scores >87%, which beats the current baseline.
Now I'll continue working on integrating the ML-based solution with LT and on the other tasks mentioned above.
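Since the notebook isn't public, here is a hypothetical sketch of the same idea using scikit-learn's gradient boosting (XGBoost's scikit-learn API is nearly identical). The data below is synthetic, with columns standing in for the four features described earlier; it is not the real training set.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in data: four columns mimic the feature set
# (left/right 3-gram probability, edit distance, first-letter match).
rng = np.random.default_rng(0)
X = rng.random((200, 4))
# Pretend that a low "edit distance" value means the user picked the suggestion.
y = (X[:, 2] < 0.5).astype(int)

model = GradientBoostingClassifier(n_estimators=50, max_depth=3)
model.fit(X, y)
print(model.score(X, y))  # training accuracy on this toy data
```

With XGBoost one would swap in xgboost.XGBClassifier with essentially the same fit/score calls.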
Found a prototype for the keyboard distance feature extractor.
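The idea behind a keyboard-distance feature is that typos tend to involve physically adjacent keys (e.g. "qick" for "quick"). A minimal sketch, assuming a plain QWERTY layout and ignoring row stagger; this is my own illustration, not the prototype mentioned above:

```python
# Map each lowercase letter to a (row, column) position on a QWERTY keyboard.
QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
COORDS = {c: (r, col)
          for r, row in enumerate(QWERTY_ROWS)
          for col, c in enumerate(row)}

def key_distance(a: str, b: str) -> float:
    """Euclidean distance between two keys on the layout."""
    (r1, c1), (r2, c2) = COORDS[a], COORDS[b]
    return ((r1 - r2) ** 2 + (c1 - c2) ** 2) ** 0.5

print(key_distance("q", "w"))  # 1.0 (adjacent keys)
```

A per-word feature could then sum key distances between aligned characters of the misspelling and the suggestion.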
Working on deploying the version of LT that uses the ML-based suggestions orderer on a Linux instance in the cloud.
We actually have something like that in our code base at org.languagetool.dev.wordsimilarity.BaseKeyboardDistance (and its subclasses). I'm not sure if it has been tested yet, though.
Received the training set for almost all of the supported languages.
Working on deploying the LT version that uses the ML-based spellchecker in the cloud, and on publishing the sources.
Started the ML-based suggestions orderer in the cloud. The API is still the same, so the 18.191.120.72:8081/v2/check endpoint can be used to check text (my favorite example is "A qick brown fox jamps ower the lasy dog"):
oleg@DESKTOP-UFQCH1N:/mnt/c/Users/olegs$ curl --data "language=en-US&text=A qick brown fox jamps ower the lasy dog" http://18.191.120.72:8081/v2/check
Right now it works only with English texts; I'll add more languages and publish the sources today, June 10.