N-grams for Polish - how to build a Lucene database?

LeslieFH · October 17, 2024, 7:17pm

There’s a 7 GB zip file with a set of n-grams for Polish created by the Wrocław University of Science and Technology, released under a Creative Commons BY-SA 4.0 license. It has 2-grams, 3-grams and even some 4-grams, but they’re all in .txt format, as tables of words.

https://zasobynauki.pl/zasoby/n-gramy-jezykowe,18469/

How do I convert this into a Lucene database usable by LanguageTool?

dnaber · October 17, 2024, 7:26pm

You might find background information at Finding errors using Big Data - LanguageTool Wiki. Please be aware that having ngrams available won’t do anything by itself: making use of them in LT requires significant investment besides building the ngram data.
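
For the data-preparation part only, here is a minimal sketch of reading n-gram text tables and grouping them by order (2-grams, 3-grams, 4-grams) before any indexing step. The file layout is an assumption, since the thread only says the files are "tables of words": each line is taken to be tab-separated, with the occurrence count in the last column. The function names `parse_ngram_line` and `group_by_order` are hypothetical, and writing the actual Lucene index in the layout LanguageTool expects is not shown here; the wiki page mentioned above is the place to check for that.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def parse_ngram_line(line: str) -> Tuple[str, int]:
    """Split one line into (ngram, count).

    ASSUMPTION: columns are tab-separated and the count is the last column.
    Adjust this function if the real files use a different layout.
    """
    fields = line.rstrip("\n").split("\t")
    *words, count = fields
    # Join the word columns so both "nie\tma\t120" and "nie ma\t120" work.
    return " ".join(words), int(count)


def group_by_order(lines: Iterable[str]) -> Dict[int, Dict[str, int]]:
    """Group n-grams by their order (number of words), summing duplicate counts."""
    grouped: Dict[int, Dict[str, int]] = defaultdict(dict)
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        ngram, count = parse_ngram_line(line)
        order = len(ngram.split(" "))
        grouped[order][ngram] = grouped[order].get(ngram, 0) + count
    return grouped
```

In practice `group_by_order` would be fed an open file handle, one file at a time. For a 7 GB archive, holding everything in memory like this is unlikely to scale, so a real conversion would probably stream each parsed line straight into the index writer instead.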