Confusion rule setup

I do have ngram data sets, but I would not know how to create Lucene indexes from those, nor how to get the confusion rule initiated. The wiki is not clear enough for me on this; how would I start org.languagetool.dev.FrequencyIndexCreator for a specific file?

I’ve indexed Ruud’s data, and here is some short documentation. The easiest way is to present the data in a way that org.languagetool.dev.bigdata.FrequencyIndexCreator can understand:

  • Each ngram file consists of lines; each line contains the ngram, followed by a tab, followed by the occurrence count (e.g. in the house<tab>421).
  • The data is put in three files:
    • 1/1/1-output.csv.gz (unigrams)
    • 2/2/2-output.csv.gz (bigrams)
    • 3/3/3-output.csv.gz (trigrams)
  • FrequencyIndexCreator is then called three times with these parameters:
    • lucene input/1 output/1grams
    • lucene input/2 output/2grams
    • lucene input/3 output/3grams
  • Finally, use org.languagetool.dev.bigdata.NGramLookup on the output of the commands above to check some occurrence counts manually.
  • To actually activate the confusion rule, implement getRelevantLanguageModelRules in <Language>.java. See English.java for an example.
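For illustration, here is a minimal sketch of that override, modeled on what English.java does. Treat it as an assumption and copy from the English.java in your checkout; the exact signature differs between LanguageTool versions (newer ones also take a UserConfig parameter):

    // Sketch for <Language>.java; check English.java in your LT version
    // for the exact method signature and rule constructor.
    import java.io.IOException;
    import java.util.Arrays;
    import java.util.List;
    import java.util.ResourceBundle;
    import org.languagetool.languagemodel.LanguageModel;
    import org.languagetool.rules.Rule;
    import org.languagetool.rules.en.EnglishConfusionProbabilityRule;

    @Override
    public List<Rule> getRelevantLanguageModelRules(ResourceBundle messages,
        LanguageModel languageModel) throws IOException {
      // The rule combines the Lucene ngram index with the language’s
      // confusion_sets.txt; for a new language, add a small subclass of
      // org.languagetool.rules.ngrams.ConfusionProbabilityRule like this one:
      return Arrays.<Rule>asList(
          new EnglishConfusionProbabilityRule(messages, languageModel, this)
      );
    }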

Is there a stand-alone app or start option to create these indexes? There will be better versions of the ngram data.
Or does it require Lucene to be installed?

Will 4-grams, 5-grams and higher n-grams be of use?

For all of the other languages that do not have ngrams from Google… I am able to generate ngrams for most languages. The value will depend on the size of the data I have collected over the years. But if you want to give it a try, just say so.

There’s currently no stand-alone version of the class that creates the index. Lucene is just a dependency and does not need to be installed; Maven will make sure it’s available.
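Until then, you can call the class directly from the command line once LanguageTool is built; a sketch (the classpath placeholder is an assumption; fill it in from your build, e.g. the languagetool-dev module plus its dependencies):

    java -cp <languagetool-dev-classpath> org.languagetool.dev.bigdata.FrequencyIndexCreator lucene input/1 output/1grams
    java -cp <languagetool-dev-classpath> org.languagetool.dev.bigdata.FrequencyIndexCreator lucene input/2 output/2grams
    java -cp <languagetool-dev-classpath> org.languagetool.dev.bigdata.FrequencyIndexCreator lucene input/3 output/3grams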

4grams and 5grams might be of use, but that remains to be tested.

Okay, I will generate up to 7-grams; what to do with them will depend on how big and useful they are.

Is it hard to make such a class into a stand-alone tool? I could put my second computer, an i7, on generating the files then.

I have an academically relevant uni-, bi- and trigram dataset with corresponding Wikipedia occurrence counts. I tried the Index of /download/ngram-data/ dataset, and I want to use my custom academic dataset with LanguageTool. I created a similar 1-output.tsv.gz file and passed it to the FrequencyIndexCreator class, but it gives me this message: “Skipping 1-output.tsv.gz - doesn’t match regex googlebooks-[a-z]{3}-all-[1-5]gram-20120701-(.*?).gz, [a-z0-9]+-[a-z0-9]+-[a-z0-9]+-[a-z0-9]+-[a-z0-9]+_-.gz, or ([a-z0-9]{1,2}|other|pos|punctuation|(ADJ|ADP|ADV|CONJ|DET|NOUN|NUM|PRON|PRT|VERB)_)”.

Could you help me with this query?

Which version of lucene-core and lucene-analyzers-common should I use to create the Lucene indexes?

I used version 6.6.1, created the indexes, and I am able to do an NGramLookup. Configuring the default n-gram data worked, but when I provide the path to my custom n-gram Lucene indexes and query the http://127.0.0.1:8081/v2/check endpoint, I get “HTTP Error 400: Bad request”.

We’re still at Lucene 5.5.5 in LanguageTool. The 400 Bad Request error doesn’t seem to be related to ngrams, though. What does the LT server print on the command line?

Yes, I found that out. Now I am able to start the server with the ngram data; for that I am using this command: “/usr/bin/java -cp language_tool_python/LanguageTool-4.9/languagetool-server.jar org.languagetool.server.HTTPServer --languageModel path/to/ngrams/data/”.

I am a bit confused by this statement from your older response: “To actually activate the confusion rule, implement getRelevantLanguageModelRules in <Language>.java. See English.java for an example.” Also, I am using the Python wrapper language_tool_python (https://github.com/jxmorris12/language_tool_python) for spell correction.

I want to verify with you that the procedure I have followed is correct:

  1. G-zipped the TSV n-gram data into files (named like 1-1-1-1-1-.tsv.gz; I tried the name 1-output.tsv.gz, but it failed, as mentioned in my previous query).
  2. Ran the FrequencyIndexCreator class on the uni-, bi- and trigram .gz files and created the corresponding Lucene indexes.
  3. Verified occurrences and counts manually using the NGramLookup class.
  4. Changed the --languageModel flag to point at the Lucene indexes (stored as ngrams/en/{1grams,2grams,3grams}) and started the server. (I changed the code of the Python wrapper and added the additional --languageModel parameter for this.)

My questions are:

  1. Is this the correct way of adding custom n-gram data?
  2. How do I verify that the n-gram data is correctly configured with LanguageTool? I tried passing some sentences with mistakes from my academic n-gram data, and it still does not provide corrections.
  3. Do I need to add confusion pairs here, or should it work without them?

To debug ngrams, I think you’ll need to set DEBUG in org.languagetool.rules.ngrams.ConfusionProbabilityRule and org.languagetool.languagemodel.BaseLanguageModel to true.
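If you build from source, those DEBUG flags are constants in the two classes, so (an assumption about the exact declaration; check the source of your LT version) debugging means flipping them and rebuilding:

    // In org.languagetool.rules.ngrams.ConfusionProbabilityRule and
    // org.languagetool.languagemodel.BaseLanguageModel; rebuild afterwards.
    private static final boolean DEBUG = true;  // default: false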

Without confusion pairs, the rule will do nothing (but there’s a default set of confusion pairs). You can add pairs to en/confusion_sets.txt.
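For illustration, entries in that file look roughly like this (a sketch; check the en/confusion_sets.txt bundled with your LT version for the exact format and factor semantics):

    # format: word1; word2; factor
    # the factor is roughly how much more likely the alternative must be,
    # according to the ngram counts, before LanguageTool suggests it
    their; there; 10
    than; then; 10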

Does that mean I have to add confusion pairs covering my academically relevant data in order to use academically relevant n-gram data with LanguageTool?

What kind of errors do you want to detect? The rule only works if it has confusion pair data; otherwise it does nothing.