
[GSoC reports] spellchecker, server-side framework and build tool tasks

(Oleg) #11

Week 4: 20 May

I had trouble accessing my Google Colaboratory notebook, so I’ve set up Azure Notebooks as a backup place to work.
Worked mostly on integrating the TensorFlow model with LT.

UPD: today colaboratory works fine :\

(Oleg) #12

Week 5: 21 May

I’ve finally compared LT’s current corrections orderer with the ML-based one. LT scores 86%, while the ML-based orderer reaches only 70%. The score is the scaled number of times the top-1 suggestion was selected by the user.
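
To make the scoring function concrete, here is a minimal sketch of a top-1 score. It assumes that "scaled" means dividing the hit count by the number of cases; the function name and the sample data are made up for illustration and are not taken from the published training code.

```python
from typing import Sequence


def top1_accuracy(suggestions: Sequence[Sequence[str]],
                  selected: Sequence[str]) -> float:
    """Fraction of cases where the first suggestion is the one the user picked."""
    if len(suggestions) != len(selected):
        raise ValueError("suggestions and selected must have the same length")
    if not selected:
        return 0.0
    hits = sum(
        1
        for ranked, chosen in zip(suggestions, selected)
        if ranked and ranked[0] == chosen
    )
    return hits / len(selected)


# Illustrative data only: ranked suggestions and the user's actual choice.
ranked_lists = [["quick", "quirk"], ["jumps", "jams"], ["over", "owner"], ["lady", "lazy"]]
user_choices = ["quick", "jumps", "over", "lazy"]
score = top1_accuracy(ranked_lists, user_choices)
```

Three of the four first suggestions match the user's choice here, so the score is 0.75.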

Now I’m going to

  • finish the integration of the ML-based orderer with the LT
  • analyse the cases where algorithms fail
  • play with the model’s architecture
  • do some feature engineering

(Oleg) #13

Week 5: 22 – 24 May

I just could not resist trying to beat the score of LT’s current solution. I recently tried XGBoost on the same feature set, and it reaches >87%, which beats the current baseline.
Now I’ll continue working on the integration of the ML-based solution with LT and on the other tasks mentioned above.

(Oleg) #14

Week 5: 25 May

Cleaned up the model training code to publish it on GitHub.

(Oleg) #15

Week 6: 28 May – 30 May

Published the model training code on GitHub.

(Oleg) #16

Week 6: 31 May, 1 June, Week 7: 4 June

Preparing the feature extraction tool to work with all languages supported by LT.

(Oleg) #17

Week 6: 5 – 6 June

Found a prototype for the keyboard distance feature extractor.
Working on deploying the version of LT that uses the ML-based suggestions orderer on a Linux instance in the cloud.
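
As an illustration of what a keyboard distance feature could look like, here is a rough sketch. It is not the prototype mentioned above: the US QWERTY layout, the row offsets, and both function names are assumptions made for this example.

```python
import math
from typing import Optional

# Assumed US QWERTY letter rows; the horizontal offsets approximate the
# physical stagger between rows and are illustrative, not measured.
_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
_ROW_OFFSETS = [0.0, 0.25, 0.75]

_KEY_POS = {
    ch: (col + _ROW_OFFSETS[row], float(row))
    for row, keys in enumerate(_ROWS)
    for col, ch in enumerate(keys)
}


def key_distance(a: str, b: str) -> Optional[float]:
    """Euclidean distance between two keys, or None if a key is not in the layout."""
    pa, pb = _KEY_POS.get(a.lower()), _KEY_POS.get(b.lower())
    if pa is None or pb is None:
        return None
    return math.hypot(pa[0] - pb[0], pa[1] - pb[1])


def substitution_distance(typed: str, suggestion: str) -> Optional[float]:
    """Sum of key distances over the differing positions of two equal-length words.

    Returns None when the lengths differ or a differing character is off-layout.
    """
    if len(typed) != len(suggestion):
        return None
    total = 0.0
    for a, b in zip(typed, suggestion):
        if a.lower() == b.lower():
            continue
        d = key_distance(a, b)
        if d is None:
            return None
        total += d
    return total
```

With this layout, `substitution_distance("qick", "wick")` is 1.0 because `q` and `w` are neighbours, while a length mismatch such as `"qick"` vs. `"quick"` yields `None` and would need a separate edit-alignment step.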

(Daniel Naber) #18

We actually have something like that in our code base at (and its subclasses). I’m not sure if it has been tested yet, though.

(Oleg) #19

Week 6: 7 – 8 June

Received the training set for almost all the supported languages.
Working on deploying the LT version that uses the ML-based spellchecker in the cloud, and on publishing the sources.

(Oleg) #20

Week 6: 9 June

Started the ML-based suggestions orderer in the cloud. The API is still the same, so the endpoint can be used to check text (my favorite example is “A qick brown fox jamps ower the lasy dog”):

oleg@DESKTOP-UFQCH1N:/mnt/c/Users/olegs$ curl --data "language=en-US&text=A qick brown fox jamps ower the lasy dog"

For now it works only with English texts; I will add more languages and publish the sources today, 10 June.

(Daniel Naber) #21

Thanks! For those who want to compare with the current state, you can use this command:

curl --data "language=en-US&text=A qick brown fox jamps ower the lasy dog"

On Linux, you can use json_pp to pretty-print the result, i.e.

curl --data "language=en-US&text=A qick brown fox jamps ower the lasy dog" | json_pp

Trying this manually is of limited use of course - what’s interesting will be the evaluation based on real data.

(Yakov) #22

Working well.
I also tried it with our Firefox extension, but it works only over HTTPS…

(Oleg) #23

The evaluation on the real hold-out data shows >87% accuracy, while the currently deployed solution has ~86%. This does not take into account the distance of the correct suggestion from the first position, so I’ll try to find a neat way to use and display that information.
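
One common way to account for how far the correct suggestion is from the first position is the mean reciprocal rank (MRR). The sketch below shows only that one option, not necessarily the metric that will be used here; the function name and the sample data are made up for illustration.

```python
from typing import Sequence


def mean_reciprocal_rank(suggestions: Sequence[Sequence[str]],
                         selected: Sequence[str]) -> float:
    """Average of 1/rank of the user's choice; a missing choice contributes 0."""
    if len(suggestions) != len(selected):
        raise ValueError("suggestions and selected must have the same length")
    if not selected:
        return 0.0
    total = 0.0
    for ranked, chosen in zip(suggestions, selected):
        if chosen in ranked:
            # list.index is 0-based, ranks are 1-based
            total += 1.0 / (list(ranked).index(chosen) + 1)
    return total / len(selected)


# Illustrative data only: the second choice sits at position 2.
mrr = mean_reciprocal_rank([["quick", "quirk"], ["lady", "lazy"]], ["quick", "lazy"])
```

In this example the first choice is ranked first (1/1) and the second is ranked second (1/2), so the MRR is 0.75, whereas plain top-1 accuracy would give 0.5.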

I will try to enable SSL, but it requires some extra time to create a certificate (self-signed, I think).

(Daniel Naber) #24

It’s not that difficult using

(Oleg) #25

Thanks, I’ll use it!

(Oleg) #26

Week 6: 10 June

Working on multi-language support for the suggestions orderer; I hope to deploy tonight.
Found a way to use XGBoost painlessly with Java: jpmml-xgboost.

(Oleg) #27

Week 7: 11 June

Training data preprocessing (took more time than I expected).
Working on multi-language support for the suggestions orderer: added mock models for all the languages, as placeholders only while the real models are being trained.

(Oleg) #28

Week 7: 12 June – 15 June

  • Studying jpmml-xgboost
  • Working on the feature extractor update
  • Training the models

(Oleg) #29

The quality of the rule-specific models.

Rule ID                 Accuracy
MORFOLOGIK_RULE_EN_US   0.9049421571963253
MORFOLOGIK_RULE_UK_UA   0.7677888293802602
MORFOLOGIK_RULE_PL_PL   0.8061653091785675
MORFOLOGIK_RULE_RU_RU   0.7840147872251716
MORFOLOGIK_RULE_EN_GB   0.9112796833773087
MORFOLOGIK_RULE_ES      0.8221775207442796
MORFOLOGIK_RULE_CA_ES   0.7149931224209078
MORFOLOGIK_RULE_RO_RO   0.7615658362989324
MORFOLOGIK_RULE_NL_NL   0.8649409116488463
MORFOLOGIK_RULE_IT_IT   0.8079537237888648
MORFOLOGIK_RULE_SK_SK   0.7133757961783439
MORFOLOGIK_RULE_EN_CA   0.916058394160584
MORFOLOGIK_RULE_BR_FR   0.7222222222222222
MORFOLOGIK_RULE_EL_GR   0.4852941176470588
MORFOLOGIK_RULE_EN_AU   0.9209431345353676
MORFOLOGIK_RULE_TL      0.7744360902255639
MORFOLOGIK_RULE_EN_NZ   0.8695652173913043
MORFOLOGIK_RULE_BE_BY   0.7940647482014388
MORFOLOGIK_RULE_EN_ZA   0.8979591836734694
MORFOLOGIK_RULE_SL_SI   0.8928571428571429

I will now measure the quality of the currently released solution.
There is not enough data for some languages to train and validate models, so I’ll group them. I will also experiment with grouping all the languages as well as subsets of them, and will collect and use POS-tag n-gram frequencies from the correct-sentences data.
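
The grouping of low-data languages could be sketched roughly as follows. The threshold, the group name, and the sample counts are all hypothetical; the real per-language data sizes are not given in this thread.

```python
from typing import Dict


def assign_model_groups(sample_counts: Dict[str, int],
                        min_samples: int,
                        group_name: str = "GROUPED") -> Dict[str, str]:
    """Map each rule id to its own model, or to a shared one if it has too little data."""
    return {
        rule_id: rule_id if count >= min_samples else group_name
        for rule_id, count in sample_counts.items()
    }


# Hypothetical sample counts, for illustration only.
counts = {
    "MORFOLOGIK_RULE_EN_US": 50000,
    "MORFOLOGIK_RULE_EL_GR": 70,
    "MORFOLOGIK_RULE_BR_FR": 40,
}
groups = assign_model_groups(counts, min_samples=1000)
```

With these made-up numbers, `MORFOLOGIK_RULE_EN_US` keeps its own model while the two small rule ids share the `GROUPED` one. Grouping by language family or script instead of a single catch-all group would be a natural variation.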

(Oleg) #30

Committed a version with models in PMML syntax, so it can now be installed without painful handling of extra dependencies, which was the main problem with the original XGBoost model evaluator.
The model currently requires ngram data to work, so I’m now working on models for the languages that don’t have ngram data. The integration is almost done.

Still to do:

  • finish the automatic handling of ngram data presence: if there is no ngram data for a language, the proper model should be chosen automatically
  • finish models for the languages not using ngram data
  • improve all the models
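
The automatic model choice from the first item could look roughly like the sketch below. The directory layout, the file naming scheme, and the function name are assumptions made for this example, not how LT or this branch actually stores ngram data or models.

```python
from pathlib import Path


def choose_model(language: str, ngram_root: str, models_dir: str) -> Path:
    """Pick the ngram-based model if ngram data exists for the language,
    otherwise fall back to the model trained without ngram features.

    Assumes (hypothetically) one ngram subdirectory per language code and
    model files named '<language>_ngram.pmml' / '<language>_no_ngram.pmml'.
    """
    has_ngrams = (Path(ngram_root) / language).is_dir()
    variant = "ngram" if has_ngrams else "no_ngram"
    return Path(models_dir) / f"{language}_{variant}.pmml"
```

For example, if only `ngrams/en` exists, `choose_model("en", "ngrams", "models")` points to `models/en_ngram.pmml`, while `choose_model("el", "ngrams", "models")` points to `models/el_no_ngram.pmml`.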

I also had to spend a week without my laptop. During that time I learned some things about the gradle-maven migration, so I’ve committed a couple of migration steps. It now builds without errors, but not all the tests pass and the final .zip package is not created yet.