Week 6: 10 June
Working on multi-language support for the suggestions orderer; I hope to deploy tonight.
Found a painless way to use XGBoost from Java: jpmml-xgboost.
Training data preprocessing (took more time than I expected).
Working on multi-language support for the suggestions orderer: added mock models for all the languages, to be used only while the real models are being trained.
The quality of the rule-specific models:
rule id | accuracy |
---|---|
MORFOLOGIK_RULE_EN_US | 0.9049421571963253 |
MORFOLOGIK_RULE_UK_UA | 0.7677888293802602 |
MORFOLOGIK_RULE_PL_PL | 0.8061653091785675 |
MORFOLOGIK_RULE_RU_RU | 0.7840147872251716 |
MORFOLOGIK_RULE_EN_GB | 0.9112796833773087 |
MORFOLOGIK_RULE_ES | 0.8221775207442796 |
MORFOLOGIK_RULE_CA_ES | 0.7149931224209078 |
MORFOLOGIK_RULE_RO_RO | 0.7615658362989324 |
MORFOLOGIK_RULE_NL_NL | 0.8649409116488463 |
MORFOLOGIK_RULE_IT_IT | 0.8079537237888648 |
MORFOLOGIK_RULE_SK_SK | 0.7133757961783439 |
MORFOLOGIK_RULE_EN_CA | 0.916058394160584 |
MORFOLOGIK_RULE_BR_FR | 0.7222222222222222 |
MORFOLOGIK_RULE_EL_GR | 0.4852941176470588 |
MORFOLOGIK_RULE_EN_AU | 0.9209431345353676 |
MORFOLOGIK_RULE_TL | 0.7744360902255639 |
MORFOLOGIK_RULE_EN_NZ | 0.8695652173913043 |
MORFOLOGIK_RULE_BE_BY | 0.7940647482014388 |
MORFOLOGIK_RULE_AST | 1.0 |
MORFOLOGIK_RULE_EN_ZA | 0.8979591836734694 |
MORFOLOGIK_RULE_SR_EKAVIAN | 1.0 |
MORFOLOGIK_RULE_SL_SI | 0.8928571428571429 |
I will now measure the quality of the currently released solution.
There is not enough data for some languages to train and validate models, so I’ll group them. I will also experiment with grouping all the languages and with subsets of them. I will also collect and use POS-tag ngram frequencies from the data of correct sentences.
Committed a version with PMML-based models, so it can now be installed without painful handling of extra dependencies, which was the main problem with the original XGBoost model evaluator.
The model now requires ngram data to work, so I’m working on models for the languages that don’t have ngram data. The integration is almost done.
I also had to spend a week without my laptop. During that time I read up on the Maven-to-Gradle migration and committed a couple of migration steps. It now builds without errors, but not all the tests pass and the final .zip package is not created yet.
Is there any code showing how to create a Lucene index for ngrams? Does Lucene build 1-grams, 2-grams and 3-grams just from the text, or should the frequencies be counted manually and then given to Lucene?
Lucene is very low-level: it just takes an ngram and its count, so you need to do everything manually. There’s AggregatedNgramToLucene, which takes a text file and turns it into a Lucene index.
Thanks! I will use it to store ngrams of POS tags.
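Since the counting has to be done manually before anything is handed to Lucene, here is a minimal sketch of that step for POS-tag ngrams. The class name `PosNgramCounter`, the space-joined ngram format and the toy tags are assumptions made for illustration; this is not LanguageTool code, and it does not reproduce the input format expected by AggregatedNgramToLucene.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: counts POS-tag ngrams in sentences that are already tagged. */
final class PosNgramCounter {

    /**
     * Counts every ngram of order 1..maxOrder.
     * Each sentence is a list of POS tags; tags inside an ngram are joined by a space.
     */
    static Map<String, Integer> count(List<List<String>> taggedSentences, int maxOrder) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> tags : taggedSentences) {
            for (int n = 1; n <= maxOrder; n++) {
                // Slide a window of width n over the sentence.
                for (int start = 0; start + n <= tags.size(); start++) {
                    String ngram = String.join(" ", tags.subList(start, start + n));
                    counts.merge(ngram, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Toy input; the tag names are illustrative only.
        List<List<String>> sentences = List.of(
                List.of("DT", "NN", "VBZ"),
                List.of("DT", "NN"));
        count(sentences, 3).forEach((ngram, c) -> System.out.println(ngram + "\t" + c));
    }
}
```

Ngrams never cross sentence boundaries here, which is a design choice of this sketch; whether boundary markers should be added depends on how the counts are used as features.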
As your changes have been merged now, do you have an up-to-date evaluation that shows by how much results have been improved due to your changes? Also, are there any performance issues to be expected when the feature is activated?
Also, do you have some specific example where your new code improved suggestion ordering? I’d like to try it.
Please see my review comments at ML-based suggestions ordering (AGPL dependency removed) by oserikov · Pull Request #1115 · languagetool-org/languagetool · GitHub
I will provide the evaluation in the next couple of days.
The evaluations
rule id | accuracy_ml_new | accuracy_current_released_solution |
---|---|---|
MORFOLOGIK_RULE_EN_US | 0.9049392696942199 | 0.7372791981570426 |
MORFOLOGIK_RULE_UK_UA | 0.7864791493603516 | 0.8071081056333842 |
MORFOLOGIK_RULE_PL_PL | 0.8038934620449879 | 0.7868830815863438 |
MORFOLOGIK_RULE_RU_RU | 0.7840144589937147 | 0.8655292229893561 |
MORFOLOGIK_RULE_EN_GB | 0.9112796877257558 | 0.8876274252877447 |
MORFOLOGIK_RULE_ES | 0.8220406822258833 | 0.7980048416935459 |
MORFOLOGIK_RULE_CA_ES | 0.8138392842631836 | 0.8144466213863527 |
MORFOLOGIK_RULE_RO_RO | 0.7792548989332385 | 0.7507114743129868 |
MORFOLOGIK_RULE_NL_NL | 0.8644711747876072 | 0.775136995658376 |
MORFOLOGIK_RULE_IT_IT | 0.8937213852018758 | 0.7995754820037403 |
MORFOLOGIK_RULE_SK_SK | 0.7209596989620216 | 0.6885896726544022 |
MORFOLOGIK_RULE_EN_CA | 0.9192615657923624 | 0.8619113479121857 |
MORFOLOGIK_RULE_BR_FR | 0.7640469106346021 | 0.6888343980787189 |
MORFOLOGIK_RULE_EL_GR | 0.6839547335261514 | 0.44664833026351014 |
MORFOLOGIK_RULE_EN_AU | 0.9209433397606021 | 0.757935159588512 |
MORFOLOGIK_RULE_TL | 0.8096563572006417 | 0.5686071321365249 |
MORFOLOGIK_RULE_EN_NZ | 0.8867790338855502 | 0.8206145253240272 |
MORFOLOGIK_RULE_BE_BY | 0.8071202303110907 | 0.7959978488496677 |
MORFOLOGIK_RULE_AST | 1.0 | 1.0 |
MORFOLOGIK_RULE_EN_ZA | 0.8979523533368317 | 0.6879253794778564 |
MORFOLOGIK_RULE_SR_EKAVIAN | 1.0 | 1.0 |
MORFOLOGIK_RULE_SL_SI | 0.9091255255113319 | 0.7883841170500472 |
AUSTRIAN_GERMAN_SPELLER_RULE | 0.8441627650114452 | 0.7690113874458014 |
FR_SPELLING_RULE | 0.8546088374113552 | 0.8512218322235491 |
GERMAN_SPELLER_RULE | 0.8490684264414519 | 0.7249694043971036 |
Isn’t your code active for German or didn’t you run an evaluation for it?
I have opened some issues for what I think are the remaining tasks before this feature can be activated on the production system:
Edited the post and added missing evaluation.
Thanks, is the code available to re-run the evaluation? What are your future plans, is there a chance you’re going to work on the remaining issues linked above?
During this Summer of Code I worked on several tasks.
First, the improvement of spellchecker suggestion sorting using a machine learning approach, which included the following submissions to the languagetool repository on GitHub:
Code for the model learning part is in this repo.
Suggestions are now ordered by the predictions of a trained model (XGBoost was used), which improved the quality of the resulting ordering.
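To illustrate the idea of ordering by model predictions, here is a minimal sketch. The class `SuggestionReorderer` and the `scorer` parameter are hypothetical: the scorer stands in for whatever returns the trained model's prediction for a suggestion, and none of this is the actual LanguageTool implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleFunction;

/** Hypothetical helper: reorders spelling suggestions by a model score. */
final class SuggestionReorderer {

    /**
     * Returns the suggestions sorted by descending score.
     * The scorer is called once per suggestion; ties keep their original order.
     */
    static List<String> orderByScore(List<String> suggestions, ToDoubleFunction<String> scorer) {
        int size = suggestions.size();
        double[] scores = new double[size];
        List<Integer> order = new ArrayList<>(size);
        for (int i = 0; i < size; i++) {
            // Score each candidate once, since a model call may be expensive.
            scores[i] = scorer.applyAsDouble(suggestions.get(i));
            order.add(i);
        }
        // List.sort is stable, so equal scores preserve the incoming order.
        order.sort((a, b) -> Double.compare(scores[b], scores[a]));

        List<String> result = new ArrayList<>(size);
        for (int index : order) {
            result.add(suggestions.get(index));
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up scores in place of real model predictions.
        List<String> ordered = orderByScore(
                List.of("teh", "ten", "the"),
                s -> s.equals("the") ? 0.9 : s.equals("ten") ? 0.4 : 0.1);
        System.out.println(ordered);
    }
}
```

Keeping ties in their incoming order means that, when the model cannot distinguish two candidates, the ordering produced by the original speller is preserved.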
Second, switching to the modern server-side framework:
Third, migration from Maven to Gradle:
Other submissions:
I’m willing to continue contributing to languagetool outside GSoC; in particular, I plan to do the following within my project:
Oleg, thanks for taking part in GSoC! We’re looking forward to your future contributions.