The evaluation on the real hold-out data shows >87% accuracy, while the solution currently deployed on LT.org has ~86%. That does not take into account how far the correct suggestion is from the first position, so I'll try to find a clean way to use and display that information as well.
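One way to capture the position information would be something like mean reciprocal rank alongside plain accuracy; a minimal sketch (the class and method names are mine, not the project's):

```java
import java.util.List;

// Minimal sketch, not the actual evaluation code: besides top-1 accuracy,
// also reward correct suggestions that end up in positions 2, 3, ...
// `rankOfCorrect` holds, per evaluated error, the 0-based position of the
// correct suggestion in the reordered list, or -1 if it is missing entirely.
public class RankingEval {

    public static void report(List<Integer> rankOfCorrect) {
        int top1 = 0;
        double reciprocalRankSum = 0.0;
        for (int rank : rankOfCorrect) {
            if (rank == 0) {
                top1++;                                     // correct suggestion is first
            }
            if (rank >= 0) {
                reciprocalRankSum += 1.0 / (rank + 1);      // MRR contribution
            }
        }
        int n = rankOfCorrect.size();
        System.out.printf("accuracy@1: %.3f%n", (double) top1 / n);
        System.out.printf("MRR:        %.3f%n", reciprocalRankSum / n);
    }
}
```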
Will try to enable SSL, but it requires some extra time to create a (probably self-signed) certificate.
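For reference, a minimal sketch of serving HTTPS from a self-signed keystore with the JDK's built-in HttpsServer; the file name, password and port are placeholders, and the actual server wiring may well differ:

```java
import com.sun.net.httpserver.HttpsConfigurator;
import com.sun.net.httpserver.HttpsServer;
import java.io.FileInputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.security.KeyStore;
import javax.net.ssl.KeyManagerFactory;
import javax.net.ssl.SSLContext;

// Sketch only: serve HTTPS from a self-signed keystore created with e.g.
//   keytool -genkeypair -alias lt -keyalg RSA -keysize 2048 -keystore keystore.jks -validity 365
public class SslSketch {

    public static void main(String[] args) throws Exception {
        char[] password = "changeit".toCharArray();
        KeyStore keyStore = KeyStore.getInstance("JKS");
        try (FileInputStream in = new FileInputStream("keystore.jks")) {
            keyStore.load(in, password);
        }
        KeyManagerFactory kmf = KeyManagerFactory.getInstance(KeyManagerFactory.getDefaultAlgorithm());
        kmf.init(keyStore, password);
        SSLContext sslContext = SSLContext.getInstance("TLS");
        sslContext.init(kmf.getKeyManagers(), null, null);

        HttpsServer server = HttpsServer.create(new InetSocketAddress(8443), 0);
        server.setHttpsConfigurator(new HttpsConfigurator(sslContext));
        server.createContext("/", exchange -> {
            byte[] body = "ok".getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            exchange.getResponseBody().write(body);
            exchange.close();
        });
        server.start();
    }
}
```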
Working on multi-language support for the suggestions orderer; I hope to deploy tonight.
Found a way to painlessly use XGBoost with Java: jpmml-xgboost.
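A rough sketch of scoring a converted model with jpmml-evaluator, based on its README example; the path and feature map are placeholders, and class names may differ slightly between library versions:

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import org.dmg.pmml.FieldName;
import org.dmg.pmml.PMML;
import org.jpmml.evaluator.EvaluatorUtil;
import org.jpmml.evaluator.FieldValue;
import org.jpmml.evaluator.InputField;
import org.jpmml.evaluator.ModelEvaluator;
import org.jpmml.evaluator.ModelEvaluatorFactory;
import org.jpmml.model.PMMLUtil;

// Sketch: evaluate an XGBoost model that has been converted to PMML.
public class PmmlScoringSketch {

    public static Object score(String pmmlPath, Map<String, ?> features) throws Exception {
        PMML pmml;
        try (InputStream is = new FileInputStream(pmmlPath)) {
            pmml = PMMLUtil.unmarshal(is);
        }
        ModelEvaluator<?> evaluator = ModelEvaluatorFactory.newInstance().newModelEvaluator(pmml);
        evaluator.verify();

        // Prepare raw feature values into the types the model expects.
        Map<FieldName, FieldValue> arguments = new LinkedHashMap<>();
        for (InputField inputField : evaluator.getInputFields()) {
            FieldName name = inputField.getName();
            arguments.put(name, inputField.prepare(features.get(name.getValue())));
        }

        Map<FieldName, ?> results = evaluator.evaluate(arguments);
        FieldName targetName = evaluator.getTargetFields().get(0).getName();
        return EvaluatorUtil.decode(results.get(targetName));  // plain Java value, e.g. a score
    }
}
```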
Training data preprocessing (took more time than I thought).
Working on multi-language support for the suggestions orderer: added mock models for all the languages as placeholders, just for the time the real models are being trained.
Will now measure the quality of the currently released solution.
There is not enough data for some languages to train and validate models, so I'll group them. I will also experiment with grouping all the languages as well as subsets of them. In addition, I will collect and use POS-tag n-gram frequencies from the correct-sentence data.
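For the POS-tag n-gram frequencies, the counting itself is straightforward; a small sketch that assumes the sentences are already POS-tagged (class and method names are illustrative only):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: collect POS-tag trigram counts from already-tagged correct sentences.
// The tagging step and the data format are out of scope here.
public class PosNgramCounter {

    private final Map<String, Long> trigramCounts = new HashMap<>();

    // `posTags` is the POS-tag sequence of one correct sentence, e.g. [DT, NN, VBZ, ...]
    public void addSentence(List<String> posTags) {
        for (int i = 0; i + 2 < posTags.size(); i++) {
            String trigram = posTags.get(i) + " " + posTags.get(i + 1) + " " + posTags.get(i + 2);
            trigramCounts.merge(trigram, 1L, Long::sum);
        }
    }

    public long count(String trigram) {
        return trigramCounts.getOrDefault(trigram, 0L);
    }
}
```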
Committed a version with models in PMML syntax, so it can now be installed without painful handling of extra dependencies – that was the main problem with the original XGBoost model evaluator.
The model now requires ngram data to work, and I'm working on models for the languages that don't have ngram data. The integration is almost done.
TODO:
finish the automatic handling of ngram data presence – if there is no ngram data for a language, the proper model should be chosen automatically (see the sketch after this list)
finish models for the languages not using ngram data
improve all the models
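For the first TODO item, the fallback could look roughly like the sketch below; the class, method and file names are purely illustrative, not actual project code:

```java
import java.io.File;

// Hypothetical sketch of the fallback logic: if there is no ngram data
// directory for a language, fall back to a model trained without
// ngram-based features.
public class ModelSelector {

    private final File ngramBaseDir;  // e.g. /data/ngrams/<langCode>/

    public ModelSelector(File ngramBaseDir) {
        this.ngramBaseDir = ngramBaseDir;
    }

    public String modelPathFor(String langCode) {
        File ngramDir = new File(ngramBaseDir, langCode);
        if (ngramDir.isDirectory()) {
            return langCode + "/model_with_ngrams.pmml";   // full feature set
        }
        return langCode + "/model_no_ngrams.pmml";         // reduced feature set
    }
}
```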
I also had to spend a week without my laptop. During this time I learned about the Gradle/Maven migration and then committed a couple of migration steps. It now builds without errors, but not all the tests pass and the final .zip package is not created yet.
Is there any code showing how to create a Lucene index for ngrams? Does Lucene build 1-grams, 2-grams and 3-grams just from the text, or should the frequencies be counted manually and then given to Lucene?
Lucene is very low-level; it just takes an ngram and its count, so you need to do everything manually. There's AggregatedNgramToLucene, which takes a text file and turns it into a Lucene index.
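In rough terms, such an indexer does something like the following; the field names ("ngram", "count") and the tab-separated input format are my assumptions here, so check AggregatedNgramToLucene for the actual ones:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

// Sketch: read "ngram<TAB>count" lines that were counted beforehand and
// write one Lucene document per ngram.
public class NgramIndexer {

    public static void main(String[] args) throws IOException {
        try (IndexWriter writer = new IndexWriter(
                     FSDirectory.open(Paths.get("ngram-index")),
                     new IndexWriterConfig(new KeywordAnalyzer()));
             BufferedReader reader = Files.newBufferedReader(Paths.get("aggregated-ngrams.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t");                             // e.g. "the red car\t1234"
                Document doc = new Document();
                doc.add(new StringField("ngram", parts[0], Field.Store.YES));  // exact-match term
                doc.add(new StoredField("count", Long.parseLong(parts[1])));   // stored count
                writer.addDocument(doc);
            }
        }
    }
}
```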
Now that your changes have been merged, do you have an up-to-date evaluation that shows by how much the results have improved? Also, are any performance issues to be expected when the feature is activated?