
Confusion rule setup

(Ruud Baars) #1

I do have ngram data sets, but I would not know how to create Lucene indexes from those, nor how to get the confusion rule initiated. The wiki is not clear enough for me on this; how would I start for a specific file?

(Daniel Naber) #2

I’ve indexed Ruud’s data and here’s a short documentation. The easiest way is to present the data in a way that FrequencyIndexCreator can understand:

  • Each ngram file consists of lines, each line contains the ngram, followed by a tab, followed by the occurrence count.
  • The data is put in three files:
    • 1/1/1-output.csv.gz (unigrams)
    • 2/2/2-output.csv.gz (bigrams)
    • 3/3/3-output.csv.gz (trigrams)
  • FrequencyIndexCreator is then called three times with these parameters:
    • lucene input/1 output/1grams
    • lucene input/2 output/2grams
    • lucene input/3 output/3grams
  • Finally, check some occurrence counts manually on the output of the commands above.
  • To actually activate the confusion rule, implement getRelevantLanguageModelRules in <Language>.java; an existing language class that already overrides it can serve as an example.
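The file layout from the first two bullets can be sketched in a few lines of Python. This is only an illustration of the expected format (one "ngram<TAB>count" line per entry, gzipped, stored as n/n/n-output.csv.gz); all ngrams, counts, and the temporary directory are invented for the example:

```python
import gzip
import os
import tempfile

# Invented example counts; real data would come from your corpus.
counts = {
    1: {"huis": 1200, "de": 54000},
    2: {"het huis": 310, "de huizen": 95},
    3: {"in het huis": 42},
}

base = tempfile.mkdtemp()
for n, grams in counts.items():
    # Layout described above: 1/1/1-output.csv.gz, 2/2/2-output.csv.gz, ...
    dir_ = os.path.join(base, str(n), str(n))
    os.makedirs(dir_, exist_ok=True)
    with gzip.open(os.path.join(dir_, f"{n}-output.csv.gz"), "wt",
                   encoding="utf-8") as f:
        for ngram, count in grams.items():
            # One line per ngram: the ngram, a tab, the occurrence count.
            f.write(f"{ngram}\t{count}\n")

# Read the bigram file back to check that the format round-trips.
with gzip.open(os.path.join(base, "2", "2", "2-output.csv.gz"), "rt",
               encoding="utf-8") as f:
    rows = [line.rstrip("\n").split("\t") for line in f]
print(rows)  # [['het huis', '310'], ['de huizen', '95']]
```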

(Ruud Baars) #3

Is there a stand-alone app or start option to create these indexes? There will be better versions of the ngram data.
Or does it require Lucene to be installed?

Will 4, 5 and higher x-grams be of use?

For all of the other languages that do not have ngrams from Google… I am able to generate ngrams for most languages. Their value will depend on the size of the data I have collected over the years. But if you want to give it a try, just say so.

(Daniel Naber) #4

There’s currently no stand-alone version of the class that creates the index. Lucene is just a dependency and does not need to be installed; Maven will make sure it’s available.
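Since Maven resolves the Lucene dependency, the indexer could be run through Maven's exec plugin rather than a stand-alone build. The sketch below is an assumption, not taken from the thread: the fully qualified class name and the languagetool-dev module are guesses, so check your own checkout before using it.

```shell
# Assumed invocation via mvn exec:java; the main class name and module
# directory are assumptions -- verify them in your LanguageTool checkout.
cd languagetool-dev
for n in 1 2 3; do
  mvn exec:java \
    -Dexec.mainClass="org.languagetool.dev.FrequencyIndexCreator" \
    -Dexec.args="lucene input/$n output/${n}grams"
done
```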

4grams and 5grams might be of use, but that remains to be tested.

(Ruud Baars) #5

Okay, I will generate up to 7-grams; what to do with them will depend on how big and useful they are.

Is it hard to turn such a class into a stand-alone tool? I could then put my second computer (an i7) to work generating the files.