
Cannot update ngram data to lucene index 6 or later?

I have a frustrating problem. My LanguageTool is installed on openSUSE using Lucene 8.5.
I cannot use the ngram data since the 1ngrams are in Lucene 3 format and the 3ngrams in Lucene 5.
The frustrating part is that I was able to downgrade Lucene to 6.6 and upgrade the 3grams.

However, there is no easily available codec to upgrade the 1grams to a current Lucene index format.
The worst part is that with Lucene 6.6 I can upgrade the 3grams to version 6, and Lucene 8.5 can read version 6 but cannot upgrade it! The whole system seems to be designed to make version compatibility very difficult.

LanguageTool will not run without the 1 and 2 ngrams. Is there a way to get a version of the ngrams in Lucene index version 6 or later?

many thanks wbm

Here are my personal notes on this topic:

Lucene cannot read indexes written by its own older versions, so the index needs to be upgraded. For that, download the binary Lucene distribution and call (upgrading a Lucene 4.x index to 5.x in this example):
java -cp ./backward-codecs/lucene-backward-codecs-5.5.5.jar:./core/lucene-core-5.5.5.jar org.apache.lucene.index.IndexUpgrader path-to-index
It seems Lucene version x can only read and upgrade indexes from Lucene version x-1.
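
Since each Lucene version only upgrades from the one before it, getting an index across several major versions means running IndexUpgrader once per step. The following is a minimal sketch that only prints the command lines for such a chain. The function name `upgrade_commands` and the release numbers passed in are illustrative assumptions, and it assumes every release uses the same jar layout as the 5.5.5 distribution in the command above, which may not hold for all versions.

```python
def upgrade_commands(index_path, from_major, to_major, releases):
    """Return one IndexUpgrader command line per major-version step.

    `releases` maps a major version to the concrete release whose jars
    should be used, e.g. {5: "5.5.5"}. The jar paths follow the layout of
    the command shown above (an assumption for other releases).
    """
    if to_major < from_major:
        raise ValueError("an index cannot be downgraded")
    commands = []
    for major in range(from_major + 1, to_major + 1):
        release = releases[major]  # KeyError if a step has no release given
        classpath = ":".join([
            f"./backward-codecs/lucene-backward-codecs-{release}.jar",
            f"./core/lucene-core-{release}.jar",
        ])
        commands.append(
            f"java -cp {classpath} "
            f"org.apache.lucene.index.IndexUpgrader {index_path}"
        )
    return commands


if __name__ == "__main__":
    # Example release numbers only; use whichever releases you downloaded.
    for cmd in upgrade_commands("path-to-index", 4, 6, {5: "5.5.5", 6: "6.6.0"}):
        print(cmd)
```

Each printed command has to be run from inside the unpacked distribution of the matching release, and it is worth keeping a backup copy of the index before starting, because the upgrade rewrites it in place.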

Yes - I was successful in updating indexes in 3ngrams doing this. But the problem I had is the 1ngrams and 2ngrams that are in version 3.

I could not find a .jar that would upgrade from 3 to 4. The code seemed to be missing from the Apache site, and the Maven repo only had the code for later versions.

Is there a way to rebuild the 1ngrams and 2ngrams with a recent version of lucene?

I just checked the version, and this is what I get:

en/1grams 4.10.1
en/2grams 4.9.0 and 4.10.1
en/3grams 5.2.1

So I don’t see Lucene 3.x there. But anyway, I’m not sure about your setup. You say “installed on opensuse using lucene 8.5”, but LanguageTool comes with all its libraries in the correct version, including Lucene 5.5. The Lucene on your operating system should not be relevant. Or is it relevant because you develop your own Java-based software which already uses Lucene 8.5?

The langtool that comes with opensuse 15.1 and works with emacs uses lucene 8.5. I have reindexed up to version 7, but version 8.5 requires the index to be rebuilt from the original data and refuses to use the current archive since it was built with a version below 7 (there is a marker in the file that tells lucene that the file started with a lower version). This is a documented feature of lucene. Langtool works fine in emacs with lucene 8.5. Is there a single archive with the original data to rebuild the indexes? 8.5 is supposed to be faster, so would this lead to better performance? thanks so much wbm.

PS - the person who built the rpm for opensuse 15.1 uses lucene 8.5. If I try to downgrade to lucene 6.6 the package manager removes language tool.
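
The two rules mentioned in this thread (version x reads and upgrades only x-1, and 8.5 additionally rejects an index that was originally built with a version below 7) can be written down as a small check. This is only a rule-of-thumb sketch of what is described here, not Lucene's actual implementation; the function name `can_open` and its parameters are hypothetical.

```python
def can_open(reader_major, written_major, created_major):
    """Rough compatibility check based on the rules described in this thread.

    reader_major  -- major version of the Lucene doing the reading
    written_major -- major version that last wrote (or upgraded) the index
    created_major -- major version the index was originally built with
    """
    # Rule 1: a reader handles its own format and the previous major only.
    if written_major > reader_major or written_major < reader_major - 1:
        return False
    # Rule 2 (assumed to apply from Lucene 8 on): the creation marker must
    # not be older than the previous major, even after upgrading.
    if reader_major >= 8 and created_major < reader_major - 1:
        return False
    return True


if __name__ == "__main__":
    # An index built with 5.x and upgraded step by step to 7.x:
    print(can_open(7, 7, 5))  # opened by Lucene 7
    print(can_open(8, 7, 5))  # opened by Lucene 8
```

Under these assumptions the first call returns True and the second returns False, which matches the behaviour reported above: the upgraded index is usable up to version 7, but for 8.5 it would have to be rebuilt from the original data.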

There’s no single archive that can be used to easily re-index the data. I’d suggest contacting the maintainer of the package and letting them know that LT currently uses Lucene 5.5 and that using any other version will lead to problems.

Many thanks. I got it to work with lucene 6.6 using archive en-20150817, which I upgraded to version 6 indexes. Langtool in emacs now runs (it did not run before I upgraded the indexes, so I know it is finding them). This version does not detect the errors in your test phrases, and so does not work better than the version on opensuse using lucene 8.5. The upgrade process may have removed some data.

At this point I will give up, though I might get my son (who is supposed to be doing computer engineering) to build indexes using lucene 8.5. I assume the best source for this is the 2012 google ngrams you mention on your site? thanks so much for the help. wbm

There was a bug in LT 4.9 which caused some errors not to be found. With a recent snapshot, that should be fixed.

Yes, at least it’s the data used for our ngrams.