Creating an n-gram data set instead of using the Google n-gram data for finding errors

mdasadul · March 16, 2016, 4:20pm

Hello everyone,
I am a new LanguageTool user and would like to use LanguageTool with our own n-gram data. I tried to visualize the Google n-gram data set but was unable to do so. I know how to create n-grams along with their frequencies, but do I need a particular representation in order to use them with Lucene?

Any help will be greatly appreciated.

Thanks

dnaber · March 16, 2016, 5:26pm

Hi, thanks for your interest in LanguageTool. You can inspect the existing Google ngram index with Luke (DmitryKey/luke on GitHub, a mavenised build of Luke, the Lucene Toolbox Project). It will show a very simple structure: each ngram has one document with an ngram field that contains the ngram and a count field that contains the occurrence count.

Additionally, one document is needed per index that stores the total token count. The key is totalTokenCount and the value is that count. 1grams, 2grams, and 3grams have to be in their own index.

Here’s a code snippet on how to build a single document:

languagetool-org/languagetool/blob/master/languagetool-dev/src/main/java/org/languagetool/dev/bigdata/FrequencyIndexCreator.java#L287


      
        // End of the preceding block in the linked file, which opens the IndexWriter:
        config.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
        //config.setRAMBufferSizeMB(1000);
        Directory directory = FSDirectory.open(indexDir.toPath());
        writer = new IndexWriter(directory, config);
      }

      @Override
      void addDoc(String text, long count) throws IOException {
        if (text.length() > 1000) {
          System.err.println("Ignoring doc, ngram is > 1000 chars: " + text.substring(0, 50) + "...");
        } else {
          Document doc = new Document();
          // The ngram is indexed as a single term (searchable) but not stored:
          doc.add(new Field("ngram", text, StringField.TYPE_NOT_STORED));
          // The count is stored (retrievable) but not indexed:
          FieldType fieldType = new FieldType();
          fieldType.setStored(true);
          Field countField = new Field("count", String.valueOf(count), fieldType);
          doc.add(countField);
          // Running total, later written as the totalTokenCount document:
          totalTokenCount += count;
          writer.addDocument(doc);
        }
      }
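
As a rough illustration of the layout described in this post, here is a Lucene-free sketch that models the documents of one index as plain maps. The class and method names (`NgramIndexLayout`, `toDocuments`) are hypothetical and not part of LanguageTool. It assumes the total is the sum of all counts in that index, which is how the `addDoc` snippet accumulates `totalTokenCount`.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Models the documents of ONE ngram index as plain maps (no Lucene involved). */
class NgramIndexLayout {

  /** Returns one "document" per ngram, followed by one document holding the total. */
  static List<Map<String, String>> toDocuments(Map<String, Long> ngramCounts) {
    List<Map<String, String>> docs = new ArrayList<>();
    long totalTokenCount = 0;
    for (Map.Entry<String, Long> entry : ngramCounts.entrySet()) {
      Map<String, String> doc = new LinkedHashMap<>();
      doc.put("ngram", entry.getKey());                     // the ngram text
      doc.put("count", String.valueOf(entry.getValue()));   // its occurrence count
      docs.add(doc);
      totalTokenCount += entry.getValue();                  // assumption: total = sum of counts
    }
    // One extra document per index: key "totalTokenCount", value = the total.
    Map<String, String> totalDoc = new LinkedHashMap<>();
    totalDoc.put("totalTokenCount", String.valueOf(totalTokenCount));
    docs.add(totalDoc);
    return docs;
  }

  public static void main(String[] args) {
    // 1grams, 2grams, and 3grams would each get their own index; this is a 2gram example.
    Map<String, Long> bigrams = new LinkedHashMap<>();
    bigrams.put("the cat", 3L);
    bigrams.put("cat sat", 2L);
    for (Map<String, String> doc : toDocuments(bigrams)) {
      System.out.println(doc);
    }
  }
}
```

In a real index each of these maps would become a Lucene `Document` built the way `addDoc` does it.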

mdasadul · March 16, 2016, 7:24pm

Thanks so much for your reply. I am now able to visualize the data. I want to build a similar data set from my own data, which I have in several text files. Is there a way in LanguageTool to create my own n-gram data set from them?

If not, how can I do that?

Thanks

dnaber · March 16, 2016, 7:49pm

We don’t have supported and documented code for that, but you could try running org.languagetool.dev.bigdata.CommonCrawlToNgram to get ngrams and then org.languagetool.dev.bigdata.AggregatedNgramToLucene to turn that into a Lucene index. The easiest way is probably to run this code directly from a Java IDE. The alternative is to write your own small Java program that creates a Lucene index. How much text data do you have?
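
For the "write your own small Java program" route, the counting step could look roughly like the sketch below. It is only an illustration: `NgramCounter` and `count` are hypothetical names, tokens are split on whitespace (an assumption; the tokenization behind the existing index may differ), ngrams do not cross line boundaries, and everything is held in memory, so a large corpus would need to be processed in chunks and aggregated.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Counts word ngrams of a fixed order using plain whitespace tokenization. */
class NgramCounter {

  /** Returns ngram -> occurrence count for the given order (1 = 1grams, 2 = 2grams, ...). */
  static Map<String, Long> count(List<String> lines, int order) {
    Map<String, Long> counts = new TreeMap<>();
    for (String line : lines) {
      String trimmed = line.trim();
      if (trimmed.isEmpty()) {
        continue; // nothing to count on blank lines
      }
      String[] tokens = trimmed.split("\\s+"); // assumption: whitespace-separated tokens
      // Slide a window of `order` tokens over the line.
      for (int i = 0; i + order <= tokens.length; i++) {
        String ngram = String.join(" ", Arrays.copyOfRange(tokens, i, i + order));
        counts.merge(ngram, 1L, Long::sum);
      }
    }
    return counts;
  }

  public static void main(String[] args) {
    List<String> lines = Arrays.asList("the cat sat", "the cat ran");
    // One map per order, matching the one-index-per-order layout.
    for (int order = 1; order <= 3; order++) {
      System.out.println(order + "grams: " + count(lines, order));
    }
  }
}
```

Each resulting map would then be written to its own Lucene index, one document per entry.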

mdasadul · March 16, 2016, 8:02pm

Thanks for your valuable suggestion. Yes, I think I will write my own. I have around 500 GB of raw text that has been edited and proofread by a professional proofreader.

mdasadul · May 25, 2016, 3:18pm

Hello
I am planning to merge the Lucene-indexed Google n-gram data that I downloaded from here (google-ngram) with the Lucene index created from my own data set. I am using Lucene 5.0.0 to create the index from my data.

I am using the following command:
java -cp lucene-core-5.0.0.jar:lucene-backward-codecs-5.0.0.jar:lucene-misc-5.0.0.jar org.apache.lucene.misc.IndexMergeTool

I was able to merge the 1grams and 2grams folders, but when I try to merge the 3grams I get this error:
Exception in thread "main" org.apache.lucene.index.IndexFormatTooNewException: Format version is not supported (resource BufferedChecksumIndexInput(MMapIndexInput(path="/mnt/LUCENE_INDEX/google_ngram/3grams/segments_1"))): 5 (needs to be between 0 and 4)
    at org.apache.lucene.codecs.CodecUtil.checkHeaderNoMagic(CodecUtil.java:217)
    at org.apache.lucene.index.SegmentInfos.readCommit(SegmentInfos.java:291)
    at org.apache.lucene.index.SegmentInfos$1.doBody(SegmentInfos.java:427)
    at org.apache.lucene.index.SegmentInfos$1.doBody(SegmentInfos.java:424)
    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:642)
    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:594)
    at org.apache.lucene.index.SegmentInfos.readLatestCommit(SegmentInfos.java:424)
    at org.apache.lucene.index.IndexWriter.addIndexes(IndexWriter.java:2350)
    at org.apache.lucene.misc.IndexMergeTool.main(IndexMergeTool.java:49)

I realize it has something to do with the Lucene version, but I can see that LanguageTool itself uses Lucene 5.2, and it looks like the Google n-gram index was created with Lucene 4.10. Is there any other way I can get an updated Google n-gram index? Or is it possible to find the raw data so that I can create the n-gram index myself?

Thanks

dnaber · May 25, 2016, 3:57pm

There’s a JAR to support old codecs; maybe it helps if you add that to the classpath (link: Maven Central Repository Search).

mdasadul · May 25, 2016, 4:00pm

I already added that JAR:
java -cp lucene-core-5.0.0.jar:lucene-backward-codecs-5.0.0.jar:lucene-misc-5.0.0.jar

When I first tried to merge the 1grams I got an error that suggested adding that JAR, so I did, and it works fine. With the 3grams, however, it does not work and I get the error above.

dnaber · May 25, 2016, 4:17pm

The raw data is at Google Ngram Viewer, but it’s huge. Maybe you can iterate over all terms in the index and re-create the same index with a new version of Lucene.
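
If you go the raw-data route, the per-year lines have to be collapsed into one count per ngram before indexing. The sketch below is only illustrative and rests on an assumption about the file format: that each line is tab-separated and starts with the ngram, the year, and the match count. Please verify this against the data set's own documentation. `RawNgramAggregator` and `aggregate` are hypothetical names, and whether the existing index sums over all years is not stated in this thread.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Sums per-year match counts into a single count per ngram. */
class RawNgramAggregator {

  /** Assumed line format: ngram TAB year TAB match_count [TAB further columns]. */
  static Map<String, Long> aggregate(List<String> rawLines) {
    Map<String, Long> totals = new TreeMap<>();
    for (String line : rawLines) {
      String[] fields = line.split("\t");
      if (fields.length < 3) {
        continue; // line does not match the assumed format
      }
      long matchCount;
      try {
        matchCount = Long.parseLong(fields[2].trim());
      } catch (NumberFormatException e) {
        continue; // skip lines with a non-numeric count
      }
      // Sum over all years (fields[1] is ignored here).
      totals.merge(fields[0], matchCount, Long::sum);
    }
    return totals;
  }

  public static void main(String[] args) {
    List<String> lines = Arrays.asList(
        "the cat\t1999\t10\t5",
        "the cat\t2000\t7\t3",
        "cat sat\t2000\t2\t1");
    System.out.println(aggregate(lines));
  }
}
```

Because the raw files are huge, a real run would stream them file by file instead of loading lines into a list.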

mdasadul · May 25, 2016, 4:18pm

Thanks so much for your help