I am a new LanguageTool user and I am interested in using LanguageTool with our own n-gram data. I was trying to visualize the Google n-gram dataset but was unable to do so. I know how to create n-grams along with their frequencies, but in order to use them with Lucene, do I need a particular representation?
Hi, thanks for your interest in LanguageTool. You can inspect the existing Google ngram index with Luke (https://github.com/DmitryKey/luke). It will show a very simple structure: each ngram has one document with an ngram field that contains the ngram and a count field with the occurrence count.
Additionally, one document is needed per index that stores the total token count. The key is totalTokenCount and the value is that count. 1grams, 2grams, and 3grams have to be in their own index.
Here’s a code snippet on how to build a single document:
Thanks so much for your reply. I am now able to visualize the data. I want to build a similar data set from my own data, which is spread over several text files. Is there a way in LanguageTool to create my own n-gram data set from it?
We don’t have supported and documented code for that, but you could try running org.languagetool.dev.bigdata.CommonCrawlToNgram to get ngrams and then org.languagetool.dev.bigdata.AggregatedNgramToLucene to turn that into a Lucene index. The easiest way is probably to run this code directly from a Java IDE. The alternative is to write your own small Java program that creates a Lucene index. How much text data do you have?
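For the "write your own small Java program" route, the counting step could look roughly like the sketch below. It is only an illustration using the plain JDK: the class name NgramCounter, the whitespace tokenizer, and joining tokens with a single space are all assumptions, not what LanguageTool itself does, so check the real index with Luke before relying on that format. Writing the counts into a Lucene index is left out here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, not part of LanguageTool.
public class NgramCounter {

    // Naive whitespace tokenizer (an assumption; LanguageTool's own
    // tokenization is likely to differ, e.g. for punctuation).
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Counts every run of n consecutive tokens. The n-gram key is the
    // tokens joined by one space (assumed format).
    public static Map<String, Long> countNgrams(List<String> tokens, int n) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            String ngram = String.join(" ", tokens.subList(i, i + n));
            counts.merge(ngram, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize("the cat sat on the mat and the cat slept");
        // One map per n, mirroring the separate 1grams/2grams/3grams indexes.
        for (int n = 1; n <= 3; n++) {
            System.out.println(n + "grams: " + countNgrams(tokens, n));
        }
        // The value that would go into the totalTokenCount document.
        System.out.println("totalTokenCount: " + tokens.size());
    }
}
```

Each map entry would then become one document (ngram plus count) in the index for that n, with the token total stored once per index.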
I am planning to merge the Lucene-indexed Google ngram data that I downloaded from here (google-ngram) with the Lucene index created from my own dataset. I am using Lucene 5.0.0 to create the index from my data.
I am using the following command:
java -cp lucene-core-5.0.0.jar:lucene-backward-codecs-5.0.0.jar:lucene-misc-5.0.0.jar org.apache.lucene.misc.IndexMergeTool
I was able to merge the 1grams and 2grams folders, but when I try to merge the 3grams folder I get this error:
Exception in thread "main" org.apache.lucene.index.IndexFormatTooNewException: Format version is not supported (resource BufferedChecksumIndexInput(MMapIndexInput(path="/mnt/LUCENE_INDEX/google_ngram/3grams/segments_1"))): 5 (needs to be between 0 and 4)
I realize it has something to do with a Lucene version problem, but I can see that LanguageTool itself uses Lucene 5.2, and it looks like the Google n-gram index was created with Lucene 4.10. Is there any other way I can get an updated Google n-gram index? Or is it possible to find the raw data from which I could create the n-gram index?
I already added that jar:
java -cp lucene-core-5.0.0.jar:lucene-backward-codecs-5.0.0.jar:lucene-misc-5.0.0.jar
I was getting an error when I tried to merge the 1grams index, and the message suggested adding that jar, so I did and it works fine. With 3grams, however, it does not work and I get the error above.