En n-gram data

Aimily · December 3, 2015, 8:30am

Hi, I have downloaded the ngram data. The 3grams index is so big that it can't be opened in lukeall. Which Lucene version was used to create it, is it 5.2.1? How can I print the gram field of the 3grams when I search with Lucene? Does the gram field not use an analyzer? And how can I sort the search results by count?

dnaber · December 3, 2015, 8:42am

We’re using Lucene 5.2.1 in LanguageTool currently. I don’t remember what Lucene version was used to create the index, but 5.2.1 should be able to open it. You cannot print the ngram as it’s not stored, you can only search for ngrams. The ngram field doesn’t use an analyzer so you usually only get back one result (so sorting doesn’t make much sense).

Aimily · December 4, 2015, 12:51am

Thank you for your reply.


Aimily · December 7, 2015, 1:52am

Hi dnaber,
Which version of the original Google data does LT use?
Thanks,
Regards
Aimily

dnaber · December 7, 2015, 7:43am

I’m not sure I understand your question. LT always uses a Lucene index of the Google data, it cannot directly use the Google data in its original format.

Aimily · December 8, 2015, 2:24am

As I see on the website: “As of 20150617, the data used by LT is a mixup of v2 (1gram, 2gram) and v1 (3gram), see above.” and “org.languagetool.dev.FrequencyIndexCreator can be used to build a Lucene index with ngrams and their occurrence count.” I don't fully understand how to use the FrequencyIndexCreator class for indexing. I'm also confused that the class uses StandardAnalyzer.

dnaber · December 8, 2015, 8:09am

I’ve updated the wiki: the ngram indexes from August (the latest ones) are based on v2 of the Google data.

Mility · January 4, 2016, 12:35pm

Hi dnaber,
I'm using the 3grams index “ngrams-20140910”. If I want to look up the count of an ngram, which field name should I use?

dnaber · January 4, 2016, 6:56pm

The field ngram contains the term, the field count contains its occurrence count. The field totalTokenCount contains the total token count (usually there’s only one document with that field).
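A minimal sketch of such a lookup, assuming Lucene 5.x with the analyzers-common module on the classpath. The class name NgramCountLookup, the toy in-memory index, and the sample ngram are made up for illustration; only the field names ngram and count come from this thread. The sketch also assumes the count is stored as a string that can be parsed as a long, which may not match the real index exactly.

```java
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class NgramCountLookup {

  /** Sums the stored "count" values of all documents whose "ngram" term matches exactly. */
  static long lookupCount(IndexSearcher searcher, String ngram) throws IOException {
    // TermQuery does no analysis: the whole ngram string must match the indexed term.
    TopDocs hits = searcher.search(new TermQuery(new Term("ngram", ngram)), 10);
    long total = 0;
    for (ScoreDoc hit : hits.scoreDocs) {
      Document doc = searcher.doc(hit.doc);
      // Assumption: "count" is a stored field whose value parses as a long.
      total += Long.parseLong(doc.get("count"));
    }
    return total;
  }

  /** Builds a one-document toy index (hypothetical layout, not the real LT index). */
  static Directory buildToyIndex() throws IOException {
    Directory dir = new RAMDirectory();
    try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
      Document doc = new Document();
      // StringField is indexed as a single term and bypasses the analyzer;
      // Store.NO means the ngram text cannot be read back from a hit.
      doc.add(new StringField("ngram", "the quick brown", Field.Store.NO));
      doc.add(new StoredField("count", "42"));
      writer.addDocument(doc);
    }
    return dir;
  }

  public static void main(String[] args) throws IOException {
    try (DirectoryReader reader = DirectoryReader.open(buildToyIndex())) {
      IndexSearcher searcher = new IndexSearcher(reader);
      System.out.println(lookupCount(searcher, "the quick brown")); // prints 42
      System.out.println(lookupCount(searcher, "quick"));           // prints 0: no partial match
    }
  }
}
```

To try this against the downloaded data, the toy directory would presumably be swapped for FSDirectory.open(Paths.get(...)) pointing at the 3grams index folder. The second lookup returns 0 because the field holds the whole ngram as one term, so a single word from it does not match.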