How to convert ngrams to a Lucene database?

DonFrosto · February 16, 2025, 7:56pm

Hi, I am trying to create my own ngrams. There was a similar topic here, "N-grams for Polish - how to build a Lucene database?", but it seems to have died.

According to the information there, the ngrams I downloaded need to be converted to a Lucene database, if I am not wrong.

And here is the issue: there are no prebuilt tools for this available to download.
After searching, I found that it should be possible using org.languagetool.dev.bigdata.CommonCrawlToNgram and org.languagetool.dev.bigdata.AggregatedNgramToLucene, which are available in languagetool-dev-tools.jar.
Still, building with the install.sh script from the git website does not create such a jar file, and building from source with Maven fails. I forced the build by skipping the tests and got such a jar file, but it is broken, as it throws the errors below:

java -cp /home/user/languagetool/languagetool-dev/target/languagetool-dev-6.6-SNAPSHOT.jar org.languagetool.dev.bigdata.CommonCrawlToNgram /home/user/pol/ngrams/* /home/user/pol/mod/
Error: Unable to initialize main class org.languagetool.dev.bigdata.CommonCrawlToNgram
Caused by: java.lang.NoClassDefFoundError: org/languagetool/tokenizers/Tokenizer

And indeed, the built jar file contains no such class.
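
The missing class most likely belongs to a different module of the repo (org.languagetool.tokenizers is part of the core code as far as I know), so the dev jar alone cannot satisfy the classpath. Below is a minimal sketch for checking which classes a jar actually contains. It assumes only that a jar is an ordinary zip archive; the function name jar_contains_class is made up for illustration and is not part of LanguageTool.

```python
import zipfile


def jar_contains_class(jar_path, class_name):
    """Return True if the jar has a .class entry for the given class name.

    class_name is a dotted Java name, e.g.
    "org.languagetool.tokenizers.Tokenizer".
    """
    # Inside a jar, a class is stored as a path with '/' separators
    # and a ".class" suffix.
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()


# Example call (the path is the one from the command above):
# jar_contains_class(
#     "/home/user/languagetool/languagetool-dev/target/languagetool-dev-6.6-SNAPSHOT.jar",
#     "org.languagetool.tokenizers.Tokenizer",
# )
```

If the check returns False for the dev jar, the class has to come from another jar on the classpath. This is also why running the classes from an IDE, which resolves all modules, tends to be simpler.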

Is there really no easier way to create ngrams for LanguageTool?

dnaber · February 16, 2025, 9:04pm

You might want to open the LT code in an IDE and just run the classes from there. They are in the repo, but they haven’t been touched or maintained for years.

DonFrosto · February 16, 2025, 11:39pm

I did try:

<langCode> <input.xz> <ngramIndexDir> <simpleEvalFile>
<simpleEvalFile> a plain text file with simple error markup

"pl-PL" file_path output_path <this one I do not have>

I am stuck. So far I have no idea where to get the simpleEvalFile or how to generate it.

If only there were ngrams for pl-PL on the official website here: Index of /download/ngram-data/

dnaber · February 17, 2025, 9:00am

This format is documented here: languagetool-dev/src/main/java/org/languagetool/dev/errorcorpus/SimpleCorpus.java (master branch of languagetool-org/languagetool on GitHub)
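
To make "plain text with simple error markup" more concrete, here is a rough sketch of a parser for one plausible shape of such a file. The line format used here is an assumption, not confirmed in this thread: an optional running number, a sentence with exactly one error wrapped in underscores, then "=>" and the correction. Check it against the Javadoc in SimpleCorpus.java before relying on it. The function name parse_eval_line is hypothetical.

```python
import re

# ASSUMED line shape (verify against SimpleCorpus.java):
#   1. This is _a_ error. => an
LINE_RE = re.compile(r"^\s*(?:\d+\.\s*)?(?P<text>.*?)\s*=>\s*(?P<fix>.*?)\s*$")
MARK_RE = re.compile(r"_(.+?)_")


def parse_eval_line(line):
    """Split one marked-up line into sentence, error span, and correction."""
    m = LINE_RE.match(line)
    if not m:
        raise ValueError("no '=>' separator found: %r" % line)
    text = m.group("text")
    marked = MARK_RE.search(text)
    if not marked:
        raise ValueError("no _marked_ error found: %r" % line)
    error = marked.group(1)
    # Remove the underscores so offsets refer to the plain sentence.
    sentence = text[:marked.start()] + error + text[marked.end():]
    start = marked.start()
    return {
        "sentence": sentence,
        "start": start,
        "end": start + len(error),
        "error": error,
        "correction": m.group("fix"),
    }
```

Under that assumption, a Polish evaluation file would be a hand-written list of sentences, each with one known mistake, marked up in the same way.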