I do have ngram data sets, but I would not know how to create Lucene indexes from those, nor how to get the confusion rule initiated. The wiki is clear enough for me on this; how would I start org.languagetool.dev.FrequencyIndexCreator for a specific file?
I’ve indexed Ruud’s data, so here’s some short documentation. The easiest way is to present the data in a way that org.languagetool.dev.bigdata.FrequencyIndexCreator can understand:
- Each ngram file consists of lines; each line contains the ngram, followed by a tab, followed by the occurrence count.
- The data is split into three files: one each for the 1grams, 2grams, and 3grams.
- FrequencyIndexCreator is then called three times with these parameters:
lucene input/1 output/1grams
lucene input/2 output/2grams
lucene input/3 output/3grams
- Finally, use org.languagetool.dev.bigdata.NGramLookup on the output of the commands above to check some occurrence counts manually.
- To actually activate the confusion rule, see English.java for an example implementation.
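For reference, the input layout described above can be sketched like this. Only the "ngram, tab, occurrence count" line format comes from the description; the directory layout under input/, the file names, and the counts are made up here for illustration:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

public class SampleNgramWriter {

    // Format one line as "ngram<TAB>count", as described above.
    static String line(String ngram, long count) {
        return ngram + "\t" + count;
    }

    public static void main(String[] args) throws IOException {
        // One directory per ngram order, matching the input/1 .. input/3
        // parameters above; file names and counts are illustrative only.
        Path base = Paths.get("input");
        Files.createDirectories(base.resolve("1"));
        Files.createDirectories(base.resolve("2"));
        Files.createDirectories(base.resolve("3"));
        Files.write(base.resolve("1").resolve("ngrams.txt"),
                List.of(line("the", 12345), line("of", 6789)));
        Files.write(base.resolve("2").resolve("ngrams.txt"),
                List.of(line("of the", 2345), line("in the", 1234)));
        Files.write(base.resolve("3").resolve("ngrams.txt"),
                List.of(line("one of the", 345)));
        System.out.println(line("of the", 2345));
    }
}
```

With input prepared like this, each of the three directories would be passed to FrequencyIndexCreator as shown in the parameter list above.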
Is there a stand-alone app or start option to create these indexes? There will be better versions of the ngram data. Or does it require Lucene to be installed?
Will 4grams, 5grams and higher-order ngrams be of use?
As for all the other languages that do not have ngrams from Google: I am able to generate ngrams for most languages. The value will depend on the size of the data I have collected over the years. But if you want to give it a try, just say so.
There’s currently no stand-alone version of the class that creates the index. Lucene is just a dependency and does not need to be installed; Maven will make sure it’s available.
4grams and 5grams might be of use, but that remains to be tested.
Okay, I will generate up to 7grams; what to do with them will depend on how big and useful they are.
Is it hard to turn such a class into a stand-alone tool? I could then put my second computer (an i7) to work generating the files.