The evaluation on the real hold-out data shows >87% accuracy, while the solution currently deployed on LT.org has ~86%. That does not take into account how far the correct suggestion is from the first position, so I'll try to find a clean way to use and display that information as well.
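One way to capture the position information would be something like mean reciprocal rank alongside plain accuracy; a minimal sketch (the class and method names are mine, not the project's):

```java
import java.util.List;

// Minimal sketch, not the actual evaluation code: besides top-1 accuracy,
// also reward correct suggestions that end up in positions 2, 3, ...
// `rankOfCorrect` holds, per evaluated error, the 0-based position of the
// correct suggestion in the reordered list, or -1 if it is missing entirely.
public class RankingEval {

    public static void report(List<Integer> rankOfCorrect) {
        int top1 = 0;
        double reciprocalRankSum = 0.0;
        for (int rank : rankOfCorrect) {
            if (rank == 0) {
                top1++;                                     // correct suggestion is first
            }
            if (rank >= 0) {
                reciprocalRankSum += 1.0 / (rank + 1);      // MRR contribution
            }
        }
        int n = rankOfCorrect.size();
        System.out.printf("accuracy@1: %.3f%n", (double) top1 / n);
        System.out.printf("MRR:        %.3f%n", reciprocalRankSum / n);
    }
}
```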
Will try to enable SSL, but it requires some extra time to create a (probably self-signed) certificate.
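For reference, a minimal sketch of serving HTTPS from a self-signed keystore with the JDK's built-in HttpsServer; the file name, password and port are placeholders, and the actual server wiring may well differ:

```java
import com.sun.net.httpserver.HttpsConfigurator;
import com.sun.net.httpserver.HttpsServer;
import java.io.FileInputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.security.KeyStore;
import javax.net.ssl.KeyManagerFactory;
import javax.net.ssl.SSLContext;

// Sketch only: serve HTTPS from a self-signed keystore created with e.g.
//   keytool -genkeypair -alias lt -keyalg RSA -keysize 2048 -keystore keystore.jks -validity 365
public class SslSketch {

    public static void main(String[] args) throws Exception {
        char[] password = "changeit".toCharArray();
        KeyStore keyStore = KeyStore.getInstance("JKS");
        try (FileInputStream in = new FileInputStream("keystore.jks")) {
            keyStore.load(in, password);
        }
        KeyManagerFactory kmf = KeyManagerFactory.getInstance(KeyManagerFactory.getDefaultAlgorithm());
        kmf.init(keyStore, password);
        SSLContext sslContext = SSLContext.getInstance("TLS");
        sslContext.init(kmf.getKeyManagers(), null, null);

        HttpsServer server = HttpsServer.create(new InetSocketAddress(8443), 0);
        server.setHttpsConfigurator(new HttpsConfigurator(sslContext));
        server.createContext("/", exchange -> {
            byte[] body = "ok".getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            exchange.getResponseBody().write(body);
            exchange.close();
        });
        server.start();
    }
}
```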
Working on multi-language support for the suggestions orderer; I hope to deploy tonight.
Found a way to painlessly use XGBoost with Java: jpmml-xgboost.
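A rough sketch of scoring a converted model with jpmml-evaluator, based on its README example; the path and feature map are placeholders, and class names may differ slightly between library versions:

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.LinkedHashMap;
import java.util.Map;
import org.dmg.pmml.FieldName;
import org.dmg.pmml.PMML;
import org.jpmml.evaluator.EvaluatorUtil;
import org.jpmml.evaluator.FieldValue;
import org.jpmml.evaluator.InputField;
import org.jpmml.evaluator.ModelEvaluator;
import org.jpmml.evaluator.ModelEvaluatorFactory;
import org.jpmml.model.PMMLUtil;

// Sketch: evaluate an XGBoost model that has been converted to PMML.
public class PmmlScoringSketch {

    public static Object score(String pmmlPath, Map<String, ?> features) throws Exception {
        PMML pmml;
        try (InputStream is = new FileInputStream(pmmlPath)) {
            pmml = PMMLUtil.unmarshal(is);
        }
        ModelEvaluator<?> evaluator = ModelEvaluatorFactory.newInstance().newModelEvaluator(pmml);
        evaluator.verify();

        // Prepare raw feature values into the types the model expects.
        Map<FieldName, FieldValue> arguments = new LinkedHashMap<>();
        for (InputField inputField : evaluator.getInputFields()) {
            FieldName name = inputField.getName();
            arguments.put(name, inputField.prepare(features.get(name.getValue())));
        }

        Map<FieldName, ?> results = evaluator.evaluate(arguments);
        FieldName targetName = evaluator.getTargetFields().get(0).getName();
        return EvaluatorUtil.decode(results.get(targetName));  // plain Java value, e.g. a score
    }
}
```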
Training data preprocessing (took more time than I thought).
Working on multi-language support for the suggestions orderer: added mock models for all the languages as placeholders, just for the time the real models are being trained.
Will now measure the quality of the currently released solution.
There is not enough data for some languages to train and validate models, so I'll group them. I will also experiment with grouping all the languages as well as subsets of them. In addition, I will collect and use POS-tag n-gram frequencies from the correct-sentence data.
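For the POS-tag n-gram frequencies, the counting itself is straightforward; a small sketch that assumes the sentences are already POS-tagged (class and method names are illustrative only):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: collect POS-tag trigram counts from already-tagged correct sentences.
// The tagging step and the data format are out of scope here.
public class PosNgramCounter {

    private final Map<String, Long> trigramCounts = new HashMap<>();

    // `posTags` is the POS-tag sequence of one correct sentence, e.g. [DT, NN, VBZ, ...]
    public void addSentence(List<String> posTags) {
        for (int i = 0; i + 2 < posTags.size(); i++) {
            String trigram = posTags.get(i) + " " + posTags.get(i + 1) + " " + posTags.get(i + 2);
            trigramCounts.merge(trigram, 1L, Long::sum);
        }
    }

    public long count(String trigram) {
        return trigramCounts.getOrDefault(trigram, 0L);
    }
}
```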
Committed a version with models in PMML syntax, so it can now be installed without painful handling of extra dependencies – that was the main problem with the original XGBoost model evaluator.
The model now requires ngram data to work, and I'm working on models for the languages that don't have ngram data. The integration is almost done.
TODO:
finish the automatic handling of ngram data presence – if there is no ngram data for a language, the proper model should be chosen automatically (see the sketch after this list)
finish models for the languages not using ngram data
improve all the models
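For the first TODO item, the fallback could look roughly like the sketch below; the class, method and file names are purely illustrative, not actual project code:

```java
import java.io.File;

// Hypothetical sketch of the fallback logic: if there is no ngram data
// directory for a language, fall back to a model trained without
// ngram-based features.
public class ModelSelector {

    private final File ngramBaseDir;  // e.g. /data/ngrams/<langCode>/

    public ModelSelector(File ngramBaseDir) {
        this.ngramBaseDir = ngramBaseDir;
    }

    public String modelPathFor(String langCode) {
        File ngramDir = new File(ngramBaseDir, langCode);
        if (ngramDir.isDirectory()) {
            return langCode + "/model_with_ngrams.pmml";   // full feature set
        }
        return langCode + "/model_no_ngrams.pmml";         // reduced feature set
    }
}
```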
I also had to spend a week without my laptop. During this time I learned about the Gradle/Maven migration and then committed a couple of migration steps. It now builds without errors, but not all the tests pass and the final .zip package is not created yet.
Is there any code showing how to create a Lucene index for ngrams? Does Lucene build 1-grams, 2-grams and 3-grams just from the text, or should the frequencies be counted manually and then given to Lucene?
Lucene is very low-level; it just takes an ngram and its count, so you need to do everything manually. There's AggregatedNgramToLucene, which takes a text file and turns it into a Lucene index.
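In rough terms, such an indexer does something like the following; the field names ("ngram", "count") and the tab-separated input format are my assumptions here, so check AggregatedNgramToLucene for the actual ones:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

// Sketch: read "ngram<TAB>count" lines that were counted beforehand and
// write one Lucene document per ngram.
public class NgramIndexer {

    public static void main(String[] args) throws IOException {
        try (IndexWriter writer = new IndexWriter(
                     FSDirectory.open(Paths.get("ngram-index")),
                     new IndexWriterConfig(new KeywordAnalyzer()));
             BufferedReader reader = Files.newBufferedReader(Paths.get("aggregated-ngrams.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t");                             // e.g. "the red car\t1234"
                Document doc = new Document();
                doc.add(new StringField("ngram", parts[0], Field.Store.YES));  // exact-match term
                doc.add(new StoredField("count", Long.parseLong(parts[1])));   // stored count
                writer.addDocument(doc);
            }
        }
    }
}
```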
Now that your changes have been merged, do you have an up-to-date evaluation that shows by how much the results have improved? Also, are any performance issues to be expected when the feature is activated?