
Creating an n-gram data set instead of using the Google n-gram data for finding errors

Hello everyone,
I am a new LanguageTool user and would like to use LanguageTool with our own n-gram data. I tried to inspect the Google n-gram dataset but was unable to do so. I know how to create n-grams along with their frequencies, but do they need any particular representation in order to be used by Lucene?

Any help will be greatly appreciated

Thanks

Hi, thanks for your interest in LanguageTool. You can inspect the existing Google ngram index with Luke (https://github.com/DmitryKey/luke). It will show a very simple structure: each ngram has one document with an ngram field that contains the ngram and a count field with the occurrence count.

Additionally, one document is needed per index that stores the total token count. The key is totalTokenCount and the value is that count. 1grams, 2grams, and 3grams have to be in their own index.

Here’s a code snippet on how to build a single document:
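What follows is not LanguageTool's own code. It is a minimal, Lucene-free sketch of the layout described above, modelling each document as a plain map of field name to value. The class and method names (NgramIndexLayout, ngramDoc, totalTokenCountDoc, indexDocs) are hypothetical, and joining the tokens of an ngram with a single space is an assumption. In a real index each map would become one Lucene document.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Lucene-free model of the documents that make up one ngram index (hypothetical names). */
class NgramIndexLayout {

    /** One document per ngram: an "ngram" field and a "count" field. */
    static Map<String, String> ngramDoc(String ngram, long count) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("ngram", ngram);                 // assumption: tokens joined by one space
        doc.put("count", Long.toString(count));  // occurrence count of this ngram
        return doc;
    }

    /** One extra document per index that stores the total token count. */
    static Map<String, String> totalTokenCountDoc(long totalTokens) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("totalTokenCount", Long.toString(totalTokens));
        return doc;
    }

    /** All documents for a single index, for example the 2grams index. */
    static List<Map<String, String>> indexDocs(Map<String, Long> counts, long totalTokens) {
        List<Map<String, String>> docs = new ArrayList<>();
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            docs.add(ngramDoc(e.getKey(), e.getValue()));
        }
        docs.add(totalTokenCountDoc(totalTokens));
        return docs;
    }

    public static void main(String[] args) {
        // Made-up sample counts, only for illustration.
        Map<String, Long> bigrams = new LinkedHashMap<>();
        bigrams.put("the cat", 42L);
        bigrams.put("cat sat", 7L);
        for (Map<String, String> doc : indexDocs(bigrams, 1000L)) {
            System.out.println(doc);
        }
    }
}
```

The same structure would be built three times, once each for the 1grams, 2grams, and 3grams indexes. Comparing the result against the existing index in Luke is a reasonable way to check field names and types before indexing a large corpus.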

Thanks so much for your reply. I am now able to inspect the data. I want to build a similar data set from my own data, which I have in several text files. Is there a way in LanguageTool to create my own n-gram data set from it?

If not, how can I do that?

Thanks

We don’t have supported and documented code for that, but you could try running org.languagetool.dev.bigdata.CommonCrawlToNgram to get ngrams and then org.languagetool.dev.bigdata.AggregatedNgramToLucene to turn that into a Lucene index. The easiest way is probably to run this code directly from a Java IDE. The alternative is to write your own small Java program that creates a Lucene index. How much text data do you have?
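For the write-your-own route, the counting step might look like the sketch below. It is not the code from CommonCrawlToNgram. The class name NgramCounter, the whitespace tokenizer, and the space-joined ngram strings are all assumptions made for illustration, and LanguageTool's own tokenization will likely differ.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Counts 1-, 2-, and 3-grams in memory (hypothetical helper, illustration only). */
class NgramCounter {

    /** Splits on whitespace; a real tokenizer would also handle punctuation. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    /** Returns how often each ngram of length n occurs in the token list. */
    static Map<String, Long> count(List<String> tokens, int n) {
        Map<String, Long> counts = new HashMap<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            String ngram = String.join(" ", tokens.subList(i, i + n));
            counts.merge(ngram, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize("the cat sat on the cat mat");
        // One map per ngram order, matching the one-index-per-order layout.
        for (int n = 1; n <= 3; n++) {
            System.out.println(n + "grams: " + count(tokens, n));
        }
        // The value that would go into the totalTokenCount document.
        System.out.println("totalTokenCount: " + tokens.size());
    }
}
```

Holding every count in a HashMap only works for small inputs. For a corpus of hundreds of gigabytes the counts would have to be aggregated in batches or on disk before they are written to the index.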

Thanks for your valuable suggestion. Yes, I think I will write my own. I have around 500 GB of raw text, which has been edited and proofread by a professional proofreader.

Hello,
I am planning to merge the Lucene-indexed Google n-gram data that I downloaded from here (google-ngram) with the Lucene index created from my own dataset. I am using Lucene 5.0.0 to create the index from my data.

I am using the following command:
java -cp lucene-core-5.0.0.jar:lucene-backward-codecs-5.0.0.jar:lucene-misc-5.0.0.jar org.apache.lucene.misc.IndexMergeTool

I was able to merge the 1grams and 2grams folders, but when I try to merge the 3grams folder I get this error:
Exception in thread "main" org.apache.lucene.index.IndexFormatTooNewException: Format version is not supported (resource BufferedChecksumIndexInput(MMapIndexInput(path="/mnt/LUCENE_INDEX/google_ngram/3grams/segments_1"))): 5 (needs to be between 0 and 4)
at org.apache.lucene.codecs.CodecUtil.checkHeaderNoMagic(CodecUtil.java:217)
at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:291)
at org.apache.lucene.index.SegmentInfos$1.doBody(SegmentInfos.java:427)
at org.apache.lucene.index.SegmentInfos$1.doBody(SegmentInfos.java:424)
at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:642)
at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:594)
at org.apache.lucene.index.SegmentInfos.readLatestCommit(SegmentInfos.java:424)
at org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.java:2350)
at org.apache.lucene.misc.IndexMergeTool.main(IndexMergeTool.java:49)

I realize this has something to do with the Lucene version, but I can see that LanguageTool itself uses Lucene 5.2, and it looks like the Google n-gram index was created with Lucene 4.10. Is there any other way to get an updated Google n-gram index? Or is it possible to find the raw data from which I could create the n-gram index myself?

Thanks

There’s a JAR to support old codecs, maybe it helps if you add that to the classpath: http://search.maven.org/#search|ga|1|a%3A"lucene-backward-codecs"

I have already added that JAR:
java -cp lucene-core-5.0.0.jar:lucene-backward-codecs-5.0.0.jar:lucene-misc-5.0.0.jar
I got an error when I tried to merge the 1-grams, and the message suggested adding that JAR. I did, and it worked fine. With the 3-grams, however, it does not work and I get the error above.

The raw data is at http://storage.googleapis.com/books/ngrams/books/datasetsv2.html, but it’s huge. Maybe you can iterate over all terms in the index and re-create the same index with a new version of Lucene.

Thanks so much for your help