Hi, I am trying to create my own ngrams. There was a similar topic here, "N-grams for Polish - how to build a Lucene database?", but it seems to have died.
So, according to the information there, the ngrams that I downloaded need to be converted to a Lucene database, if I am not wrong.
And here is the issue: there are no prebuilt tools for this ready to download.
After searching, I found information that it should be possible using org.languagetool.dev.bigdata.CommonCrawlToNgram and org.languagetool.dev.bigdata.AggregatedNgramToLucene, which are available in languagetool-dev-tools.jar.
Still, building with the install.sh script from the git website does not create such a jar file, and building from source with Maven fails. I forced the build by skipping tests and got such a jar file, but it is broken, as it throws the errors below:
java -cp /home/user/languagetool/languagetool-dev/target/languagetool-dev-6.6-SNAPSHOT.jar org.languagetool.dev.bigdata.CommonCrawlToNgram /home/user/pol/ngrams/* /home/user/pol/mod/
Error: Unable to initialize main class org.languagetool.dev.bigdata.CommonCrawlToNgram
Caused by: java.lang.NoClassDefFoundError: org/languagetool/tokenizers/Tokenizer
And indeed, the built jar file contains no such class.
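For anyone who wants to reproduce the check, here is a minimal sketch of how the missing class can be confirmed. The helper names class_to_entry and jar_has_class are made up for illustration, and the sketch assumes unzip is installed (jar tf would list the entries as well).

```shell
# Hypothetical helpers, not part of LanguageTool.

# Convert a fully qualified class name into the entry path used inside a jar,
# e.g. org.languagetool.tokenizers.Tokenizer -> org/languagetool/tokenizers/Tokenizer.class
class_to_entry() {
  printf '%s.class\n' "$(printf '%s' "$1" | tr '.' '/')"
}

# Succeed (exit status 0) only if the jar's file listing contains the class entry.
# $1 = path to the jar, $2 = fully qualified class name
jar_has_class() {
  unzip -l "$1" | grep -qF "$(class_to_entry "$2")"
}

# Usage (path taken from the command above):
# jar_has_class /home/user/languagetool/languagetool-dev/target/languagetool-dev-6.6-SNAPSHOT.jar \
#   org.languagetool.tokenizers.Tokenizer && echo present || echo missing
```

If this reports the class as missing, the jar was built without its dependencies bundled in, so they would have to be put on the classpath separately.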
Is there really no easier way to create ngrams for LanguageTool?