[GSoC reports] spellchecker, server-side framework and build tool tasks

Week 6: 10 June

Working on multi-language support for the suggestions orderer; I hope to deploy tonight.
Found a painless way to use XGBoost from Java: jpmml-xgboost.

Week 7: 11 June

Training data preprocessing (took more time than I expected).
Working on multi-language support for the suggestions orderer: added mock models for all languages, to be used only while the real models are being trained.

Week 7: 12 June – 15 June

  • Studying jpmml-xgboost
  • Working on features extractor update
  • Training the models

The quality of the rule-specific models.

rule id accuracy
MORFOLOGIK_RULE_EN_US 0.9049421571963253
MORFOLOGIK_RULE_UK_UA 0.7677888293802602
MORFOLOGIK_RULE_PL_PL 0.8061653091785675
MORFOLOGIK_RULE_RU_RU 0.7840147872251716
MORFOLOGIK_RULE_EN_GB 0.9112796833773087
MORFOLOGIK_RULE_ES 0.8221775207442796
MORFOLOGIK_RULE_CA_ES 0.7149931224209078
MORFOLOGIK_RULE_RO_RO 0.7615658362989324
MORFOLOGIK_RULE_NL_NL 0.8649409116488463
MORFOLOGIK_RULE_IT_IT 0.8079537237888648
MORFOLOGIK_RULE_SK_SK 0.7133757961783439
MORFOLOGIK_RULE_EN_CA 0.916058394160584
MORFOLOGIK_RULE_BR_FR 0.7222222222222222
MORFOLOGIK_RULE_EL_GR 0.4852941176470588
MORFOLOGIK_RULE_EN_AU 0.9209431345353676
MORFOLOGIK_RULE_TL 0.7744360902255639
MORFOLOGIK_RULE_EN_NZ 0.8695652173913043
MORFOLOGIK_RULE_BE_BY 0.7940647482014388
MORFOLOGIK_RULE_AST 1.0
MORFOLOGIK_RULE_EN_ZA 0.8979591836734694
MORFOLOGIK_RULE_SR_EKAVIAN 1.0
MORFOLOGIK_RULE_SL_SI 0.8928571428571429

Next, I will measure the quality of the currently released solution.
There is not enough data for some languages to train and validate models, so I’ll group them. I will also experiment with grouping all the languages together and with grouping subsets of them. In addition, I will collect POS-tag ngram frequencies from the correct-sentence data and use them.

Committed a version with models in PMML syntax, so it can now be installed without painful handling of extra dependencies, which was the main problem with the original XGBoost model evaluator.
The model now requires ngram data to work, so I’m working on models for the languages that don’t have ngram data. The integration is almost done.

todo

  • finish the automatic handling of ngram data presence: if there is no ngram data for a language, the proper model should be chosen automatically
  • finish models for the languages not using ngram data
  • improve all the models
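
The first to-do item could be sketched roughly as follows. This is only an illustration in Python, not the code from the pull request: the function name `choose_model`, the `ModelChoice` type and the model file naming scheme are all hypothetical.

```python
from dataclasses import dataclass
from typing import Set


@dataclass(frozen=True)
class ModelChoice:
    language: str
    uses_ngrams: bool
    model_file: str  # hypothetical naming scheme, for illustration only


def choose_model(language: str, languages_with_ngrams: Set[str]) -> ModelChoice:
    """Pick the ngram-based model when ngram data exists, else the fallback."""
    if language in languages_with_ngrams:
        return ModelChoice(language, True, f"{language}_ngram.pmml")
    # No ngram data for this language: fall back to the model trained
    # without ngram features.
    return ModelChoice(language, False, f"{language}_no_ngram.pmml")


if __name__ == "__main__":
    available = {"en", "de"}
    print(choose_model("en", available))
    print(choose_model("sk", available))
```

The point of the sketch is that the caller never has to know which kind of model a language has; the choice depends only on whether ngram data is present.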

I also had to spend a week without my laptop. During that time I read up on the Maven-to-Gradle migration and committed a couple of migration steps. The project now builds without errors, but not all the tests pass and the final .zip package is not created yet.

Is there any code showing how to create a Lucene index for ngrams? Does Lucene build 1-grams, 2-grams and 3-grams directly from the text, or should the frequencies be counted manually and then given to Lucene?

Lucene is very low level: it just takes an ngram and its count, so you need to do everything manually. There’s AggregatedNgramToLucene, which takes a text file and turns it into a Lucene index.
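
Counting the ngrams manually could look roughly like the sketch below. It is a minimal Python illustration, not LanguageTool code: the function names are made up, and the tab-separated "ngram, count" output is an assumed format, so check what AggregatedNgramToLucene actually expects before feeding it this output.

```python
from collections import Counter
from typing import Iterable, List


def count_ngrams(sentences: Iterable[List[str]], max_n: int = 3) -> Counter:
    """Count all 1-grams up to max_n-grams over tokenized sentences."""
    counts: Counter = Counter()
    for tokens in sentences:
        for n in range(1, max_n + 1):
            # Slide a window of size n over the sentence.
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return counts


def to_lines(counts: Counter) -> List[str]:
    """Render counts as 'ngram<TAB>count' lines (assumed format)."""
    return [f"{ngram}\t{count}" for ngram, count in sorted(counts.items())]


if __name__ == "__main__":
    # Tokens can be words or, as discussed in this thread, POS tags.
    tagged = [["DT", "NN", "VBZ"], ["DT", "NN", "VBD"]]
    for line in to_lines(count_ngrams(tagged)):
        print(line)
```

Ngrams never cross sentence boundaries here, because each sentence is counted separately.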

Thanks! I will use it to store ngrams of POS tags.

As your changes have been merged now, do you have an up-to-date evaluation that shows by how much results have been improved due to your changes? Also, are there any performance issues to be expected when the feature is activated?

Also, do you have some specific example where your new code improved suggestion ordering? I’d like to try it.

Please see my review comments on pull request #1115, “ML-based suggestions ordering (AGPL dependency removed)” by oserikov, in languagetool-org/languagetool on GitHub.

I will provide the evaluation within the next couple of days.

The evaluations

rule id accuracy_ml_new accuracy_current_released_solution
MORFOLOGIK_RULE_EN_US 0.9049392696942199 0.7372791981570426
MORFOLOGIK_RULE_UK_UA 0.7864791493603516 0.8071081056333842
MORFOLOGIK_RULE_PL_PL 0.8038934620449879 0.7868830815863438
MORFOLOGIK_RULE_RU_RU 0.7840144589937147 0.8655292229893561
MORFOLOGIK_RULE_EN_GB 0.9112796877257558 0.8876274252877447
MORFOLOGIK_RULE_ES 0.8220406822258833 0.7980048416935459
MORFOLOGIK_RULE_CA_ES 0.8138392842631836 0.8144466213863527
MORFOLOGIK_RULE_RO_RO 0.7792548989332385 0.7507114743129868
MORFOLOGIK_RULE_NL_NL 0.8644711747876072 0.775136995658376
MORFOLOGIK_RULE_IT_IT 0.8937213852018758 0.7995754820037403
MORFOLOGIK_RULE_SK_SK 0.7209596989620216 0.6885896726544022
MORFOLOGIK_RULE_EN_CA 0.9192615657923624 0.8619113479121857
MORFOLOGIK_RULE_BR_FR 0.7640469106346021 0.6888343980787189
MORFOLOGIK_RULE_EL_GR 0.6839547335261514 0.44664833026351014
MORFOLOGIK_RULE_EN_AU 0.9209433397606021 0.757935159588512
MORFOLOGIK_RULE_TL 0.8096563572006417 0.5686071321365249
MORFOLOGIK_RULE_EN_NZ 0.8867790338855502 0.8206145253240272
MORFOLOGIK_RULE_BE_BY 0.8071202303110907 0.7959978488496677
MORFOLOGIK_RULE_AST 1.0 1.0
MORFOLOGIK_RULE_EN_ZA 0.8979523533368317 0.6879253794778564
MORFOLOGIK_RULE_SR_EKAVIAN 1.0 1.0
MORFOLOGIK_RULE_SL_SI 0.9091255255113319 0.7883841170500472
AUSTRIAN_GERMAN_SPELLER_RULE 0.8441627650114452 0.7690113874458014
FR_SPELLING_RULE 0.8546088374113552 0.8512218322235491
GERMAN_SPELLER_RULE 0.8490684264414519 0.7249694043971036

Isn’t your code active for German or didn’t you run an evaluation for it?

I have opened some issues for what I think are the remaining problems to solve before this feature can be activated in the production system:

Edited the post and added the missing evaluation.

Thanks. Is the code available to re-run the evaluation? What are your future plans? Is there a chance you’re going to work on the remaining issues linked above?

GSoC 2018 Work Summary

What was done

During this Summer of Code I worked on several tasks.
First, improving the sorting of spellchecker suggestions with a machine learning approach, which included the following submissions to the languagetool repository on GitHub:

Code for the model learning part is in this repo.
Suggestions are now ordered by the predictions of the trained model (XGBoost was used), and the quality of the resulting ordering has improved.
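
The idea of ordering by model predictions can be sketched as below. This is a simplified Python illustration rather than the actual implementation: `order_suggestions` is a hypothetical name, and the toy scorer stands in for the trained XGBoost model, whose features are not shown here.

```python
from typing import Callable, List, Sequence


def order_suggestions(suggestions: Sequence[str],
                      score: Callable[[str], float]) -> List[str]:
    """Return suggestions sorted by predicted score, best first.

    sorted() is stable, so suggestions with equal scores keep the
    order in which the spellchecker originally produced them.
    """
    return sorted(suggestions, key=score, reverse=True)


if __name__ == "__main__":
    # Stand-in for a trained model: fixed, made-up scores per suggestion.
    toy_scores = {"their": 0.7, "there": 0.2, "three": 0.1}
    ranked = order_suggestions(["three", "their", "there"],
                               lambda s: toy_scores.get(s, 0.0))
    print(ranked)
```

Keeping the original order for ties means the previous ordering still acts as a tie-breaker when the model cannot distinguish two candidates.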

Second, switching to the modern server-side framework:

Third, migration from Maven to Gradle:

Other submissions:

Future work

I’m willing to continue contributing to LanguageTool outside GSoC. In particular, I plan to do the following within my project:

  • further improve the ml model quality (parameter tuning, feature engineering, adding new features)
  • finish transition to Gradle
  • finish transition to Spring
  • address all suggested corrections and get open PRs merged.

Oleg, thanks for taking part in GSoC! We’re looking forward to your future contributions.

@oserikov Can you already estimate when you might be able to work on these issues?