How to index my own ngram files?

rudolfs · June 13, 2024, 7:16am

I am experimenting with ngrams for the Latvian language. There are no ready-made Google ngrams for it, so I’m trying to build my own data, but I am having trouble creating the Lucene index.

How do I build the index? I have tried using Luke to create it, but it either crashes or produces an empty index. Are there scripts or programs that were used to build the index for the supported languages? I don’t have experience programming in Java, and I wasn’t able to find any information on what LT expects to find in that index.

dnaber · June 13, 2024, 7:18am

The only documentation we have is the “Finding errors using Big Data” page on the LanguageTool Wiki, but it’s probably outdated.