
Creating an n-gram data set instead of using Google n-gram data for finding errors


(Md Asadul Islam) #1

Hello everyone,
I am a new LanguageTool user. I am interested in using LanguageTool with our own n-gram data. I was trying to visualize the Google n-gram dataset but was unable to do so. I know how to create n-grams along with their frequencies, but in order for Lucene to use them, do I need any particular representation?

Any help will be greatly appreciated

Thanks


(Daniel Naber) #2

Hi, thanks for your interest in LanguageTool. You can inspect the existing Google ngram index with Luke (https://github.com/DmitryKey/luke). It will show a very simple structure: each ngram has one document with an ngram field that contains the ngram and a count field with the occurrence count.

Additionally, one document is needed per index that stores the total token count. The key is totalTokenCount and the value is that count. 1grams, 2grams, and 3grams have to be in their own index.

Here's a code snippet on how to build a single document:
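A minimal sketch of such a document, using the Lucene 5.x API and the ngram/count/totalTokenCount field names described above (the exact field types, and storing the total token count as a string, are assumptions, not the original snippet):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongField;
import org.apache.lucene.document.StringField;

public class NgramDoc {

    // One document per ngram: an indexed "ngram" field so it can be
    // looked up as a term, and a stored "count" field with the
    // occurrence count.
    static Document buildNgramDoc(String ngram, long count) {
        Document doc = new Document();
        doc.add(new StringField("ngram", ngram, Field.Store.YES));
        doc.add(new LongField("count", count, Field.Store.YES));
        return doc;
    }

    // One additional document per index stores the total token count
    // under the key "totalTokenCount".
    static Document buildTotalTokenCountDoc(long totalTokens) {
        Document doc = new Document();
        doc.add(new StringField("totalTokenCount",
                String.valueOf(totalTokens), Field.Store.YES));
        return doc;
    }
}
```

Remember that 1grams, 2grams, and 3grams each go into their own index, and each index gets its own totalTokenCount document.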


(Md Asadul Islam) #3

Thanks so much for your reply. I am able to visualize the data now. I want to build a similar data set using my own data, which I have in several text files. Is there a way in LanguageTool to create my own n-gram data set from it?

If not, how can I do that?

Thanks


(Daniel Naber) #4

We don't have supported and documented code for that, but you could try running org.languagetool.dev.bigdata.CommonCrawlToNgram to get ngrams and then org.languagetool.dev.bigdata.AggregatedNgramToLucene to turn that into a Lucene index. The easiest way is probably to run this code directly from a Java IDE. The alternative is to write your own small Java program that creates a Lucene index. How much text data do you have?
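For the "own small Java program" route, the counting step is plain Java; here is a minimal sketch (the whitespace tokenization is a simplifying assumption, and real code would then write each entry as a Lucene document via IndexWriter):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class NgramCounter {

    // Counts n-grams of the given order in a whitespace-tokenized text.
    // Each resulting (ngram, count) entry would become one document in
    // the corresponding Lucene index; the total number of tokens across
    // all input files goes into the totalTokenCount document.
    static Map<String, Long> countNgrams(String text, int n) {
        String[] tokens = text.trim().split("\\s+");
        Map<String, Long> counts = new LinkedHashMap<>();
        for (int i = 0; i + n <= tokens.length; i++) {
            StringBuilder sb = new StringBuilder(tokens[i]);
            for (int j = 1; j < n; j++) {
                sb.append(' ').append(tokens[i + j]);
            }
            counts.merge(sb.toString(), 1L, Long::sum);
        }
        return counts;
    }
}
```

With 500 GB of text the counts will not fit in one in-memory map, so in practice the counting has to be chunked and aggregated, which is roughly what CommonCrawlToNgram and AggregatedNgramToLucene split between them.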


(Md Asadul Islam) #5

Thanks for your valuable suggestion. Yes, I think I will write my own. I have around 500 GB of raw text which has been edited and proofread by professional proofreaders.


(Md Asadul Islam) #6

Hello
I am planning to merge the Lucene-indexed Google ngram data that I downloaded (google-ngram) with the Lucene index created from my own dataset. I am using Lucene 5.0.0 for creating the index from my data.

I am using the following command:
java -cp lucene-core-5.0.0.jar:lucene-backward-codecs-5.0.0.jar:lucene-misc-5.0.0.jar org.apache.lucene.misc.IndexMergeTool

I was able to merge the 1grams and 2grams folders, but when I try to merge the 3grams I get this error:
Exception in thread "main" org.apache.lucene.index.IndexFormatTooNewException: Format version is not supported (resource BufferedChecksumIndexInput(MMapIndexInput(path="/mnt/LUCENE_INDEX/google_ngram/3grams/segments_1"))): 5 (needs to be between 0 and 4)
at org.apache.lucene.codecs.CodecUtil.checkHeaderNoMagic(CodecUtil.java:217)
at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:291)
at org.apache.lucene.index.SegmentInfos$1.doBody(SegmentInfos.java:427)
at org.apache.lucene.index.SegmentInfos$1.doBody(SegmentInfos.java:424)
at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:642)
at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:594)
at org.apache.lucene.index.SegmentInfos.readLatestCommit(SegmentInfos.java:424)
at org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.java:2350)
at org.apache.lucene.misc.IndexMergeTool.main(IndexMergeTool.java:49)

I realize it has something to do with a Lucene version problem, but I can see LanguageTool itself is using Lucene 5.2, and it looks like the Google n-gram index was created with Lucene 4.10. Is there any other way I can get an updated Google n-gram index? Or is it possible to find the raw data from which I could create the n-gram index myself?

Thanks


(Daniel Naber) #7

There's a JAR to support old codecs, maybe it helps if you add that to the classpath: http://search.maven.org/#search|ga|1|a%3A%22lucene-backward-codecs%22


(Md Asadul Islam) #8

I already added that JAR:
java -cp lucene-core-5.0.0.jar:lucene-backward-codecs-5.0.0.jar:lucene-misc-5.0.0.jar. I was getting an error when trying to merge the 1grams, and the message suggested adding that JAR, so I did and it worked fine. But with the 3grams it's not working and I get the error above.


(Daniel Naber) #9

The raw data is at http://storage.googleapis.com/books/ngrams/books/datasetsv2.html, but it's huge. Maybe you can iterate over all terms in the index and re-create the same index with a new version of Lucene.
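The re-indexing idea could look roughly like this, assuming the old index is readable with the backward-codecs JAR on the classpath (this is an untested sketch; note that copying stored documents loses the original index-time field types, so fields may need to be rebuilt explicitly):

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class ReindexNgrams {

    // Reads every stored document from the old index and adds it to a
    // fresh index written with the current Lucene version.
    // args[0] = path to old index, args[1] = path for new index
    public static void main(String[] args) throws Exception {
        try (Directory oldDir = FSDirectory.open(Paths.get(args[0]));
             Directory newDir = FSDirectory.open(Paths.get(args[1]));
             IndexReader reader = DirectoryReader.open(oldDir);
             IndexWriter writer = new IndexWriter(newDir,
                     new IndexWriterConfig(new StandardAnalyzer()))) {
            for (int i = 0; i < reader.maxDoc(); i++) {
                Document doc = reader.document(i);  // stored fields only
                writer.addDocument(doc);
            }
        }
    }
}
```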


(Md Asadul Islam) #10

Thanks so much for your help