I have a frustrating problem. My language tool is installed on opensuse using lucene 8.5.
I cannot use the ngram data since the 1ngrams are in lucene 3 and 3ngrams in lucene 5.
The frustrating part is that I was able to downgrade lucene to 6.6 and upgrade the 3grams.
However, there is no easily available codec to upgrade the 1grams to a current lucene index format.
The worst is that with lucene 6.6 I can upgrade the 3grams to version 6, and lucene 8.5 can read version 6 but cannot upgrade it! The whole system seems to be designed to make version compatibility very difficult.
Language tool will not run without the 1grams and 2grams. Is there a way to get a version of the ngrams in lucene index version 6 or later?
Lucene cannot read indexes written by old Lucene versions, so the index needs to be upgraded. For that, download the binary Lucene distribution and call (upgrading a Lucene 4.x index to 5.x in this example): java -cp ./backward-codecs/lucene-backward-codecs-5.5.5.jar:./core/lucene-core-5.5.5.jar org.apache.lucene.index.IndexUpgrader path-to-index
It seems Lucene version x can only read and upgrade indexes from Lucene version x-1.
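Because each release only handles the previous major version, an old index has to be walked up one release at a time. A minimal Python sketch of that chain, assuming one unpacked binary Lucene distribution per step with the same jar layout as in the command above (the function names, directory names and version numbers are hypothetical placeholders, not something the tools prescribe):

```python
from pathlib import Path


def upgrader_command(dist_dir, version, index_path):
    """Build the IndexUpgrader command line for one Lucene distribution.

    Assumes the jar layout used in the example command:
      <dist_dir>/backward-codecs/lucene-backward-codecs-<version>.jar
      <dist_dir>/core/lucene-core-<version>.jar
    """
    dist = Path(dist_dir)
    # Unix-style classpath separator, as in the example command.
    classpath = ":".join([
        str(dist / "backward-codecs" / f"lucene-backward-codecs-{version}.jar"),
        str(dist / "core" / f"lucene-core-{version}.jar"),
    ])
    return ["java", "-cp", classpath,
            "org.apache.lucene.index.IndexUpgrader", str(index_path)]


def upgrade_chain(steps, index_path):
    """Return one command per (dist_dir, version) step, oldest release first."""
    return [upgrader_command(d, v, index_path) for d, v in steps]


if __name__ == "__main__":
    # Hypothetical locations; adjust to wherever the distributions are unpacked.
    steps = [("lucene-5.5.5", "5.5.5"), ("lucene-6.6.6", "6.6.6")]
    for cmd in upgrade_chain(steps, "ngrams/en/3grams"):
        # Only prints the commands; run them with subprocess.run(cmd, check=True).
        print(" ".join(cmd))
```

IndexUpgrader rewrites the index in place, so it is safer to run the chain on a copy of the ngram directory.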
Yes - I was successful in updating indexes in 3ngrams doing this. But the problem I had is the 1ngrams and 2ngrams that are in version 3.
I could not find the .jar files that would upgrade from 3 to 4. The code seemed to be missing from the Apache site, and the Maven repo only had the code for later versions.
Is there a way to rebuild the 1ngrams and 2ngrams with a recent version of lucene?
I just checked the version, and this is what I get:
en/1grams 4.10.1
en/2grams 4.9.0 and 4.10.1
en/3grams 5.2.1
So I don’t see Lucene 3.x there. But anyway, I’m not sure about your setup. You say “installed on opensuse using lucene 8.5”, but LanguageTool comes with all its libraries in the correct version, including Lucene 5.5. The Lucene on your operating system should not be relevant. Or is it relevant because you develop your own Java-based software which already uses Lucene 8.5?
The langtool that comes with opensuse 15.1 and works with emacs uses lucene 8.5. I have reindexed up to version 7, but version 8.5 requires the index to be rebuilt from the original data and refuses to use the current archive since it was built with a version below 7 (there is a marker in the file that tells lucene that the file started with a lower version). This is a documented feature of lucene. Langtool works fine in emacs with lucene 8.5. Is there a single archive with the original data to rebuild the indexes? 8.5 is supposed to be faster, so this should lead to better performance? thanks so much wbm.
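The refusal described here can be modelled in a few lines of Python. This is a simplified sketch of the rule as reported in this thread, not Lucene's actual code; it assumes that the creation marker is only enforced from Lucene 8 on, and the function and parameter names are made up for illustration:

```python
def can_open(reader_major, created_major, segment_major):
    """Simplified model of the compatibility check (not Lucene's code).

    reader_major  -- major version of the Lucene library opening the index
    created_major -- major version recorded when the index was first created
    segment_major -- major version that last wrote or upgraded the segments
    """
    oldest_supported = reader_major - 1
    # The segments must be in a format the reader still understands.
    if segment_major < oldest_supported or segment_major > reader_major:
        return False
    # Assumption: from Lucene 8 on, the creation marker is checked as well,
    # so running the upgrader on an older index does not help.
    if reader_major >= 8 and created_major < oldest_supported:
        return False
    return True


if __name__ == "__main__":
    # Index created with 5.x, upgraded step by step to 7.x, opened by 8.x.
    print(can_open(reader_major=8, created_major=5, segment_major=7))
    # Index built from scratch with 7.x, opened by 8.x.
    print(can_open(reader_major=8, created_major=7, segment_major=7))
```

Under this model the first case is rejected and the second is accepted, which matches the need to rebuild from the original data rather than upgrade the existing archive.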
There’s no single archive that can be used to easily re-index the data. I’d suggest contacting the maintainer of the package and letting them know that LT currently uses Lucene 5.5 and that using any other version will lead to problems.
Many thanks. I got it to work with lucene 6.6 using archive en-20150817, which I upgraded to version 6 indexes. Langtool in emacs now runs (it did not run before I upgraded the indexes, so I know it is finding them). This version does not detect the errors in your test phrases, and so does not work better than the version on opensuse using lucene 8.5. The upgrade process may have removed some data.
At this point I will give up, though I might get my son (who is supposed to be doing computer engineering) to build indexes using lucene 8.5. I assume the best source for this is the 2012 google ngrams you mention on your site? thanks so much for the help. wbm
I proposed a pull request to update LT to Lucene 8.11.3 at PR#10810.
I have also tried to update the ngram data to the Lucene 8 index format (format version 8). You can try it from the GitHub repository miurahr:lt-8-ngram-data, which was converted from en-20150817.