N-Gram question

helz · June 18, 2017, 10:48am

Hi,

I’m using my own server with the n-gram data. I downloaded this archive: http://languagetool.org/download/ngram-data/untested/ngram-ru-20150914.zip

When I use a command like this:

java -cp languagetool.jar org.languagetool.dev.NGramStats /media/8E9ED52D9ED50E99/ngram-ru-20150914/ru "some_wrong_word"

it returns 0 occurrences, and that’s correct. But when I run the server like this:

java -cp languagetool-server.jar org.languagetool.server.HTTPServer --config server.properties --port 8082 --allow-origin '*'

and check the same “wrong_word”, no error is reported. LT just returns that everything is OK.

Do you have any idea why this might happen?

P.S.
I’m using the latest version of LT.
And here are my “server.properties” settings:

languageModel=/media/8E9ED52D9ED50E99/ngram-ru-20150914/

Thanks!
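
For anyone reproducing the server-side check described above, here is a minimal sketch in Python. It assumes the server started with the command above is reachable at localhost:8082 and answers on LanguageTool's `/v2/check` HTTP endpoint with the `language` and `text` form parameters; the helper names `build_check_request` and `check` are made up for this example.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the server from the command above listens on localhost:8082.
SERVER_URL = "http://localhost:8082/v2/check"


def build_check_request(text, language="ru", url=SERVER_URL):
    """Build (but do not send) a POST request for the /v2/check endpoint."""
    body = urllib.parse.urlencode({"language": language, "text": text})
    return urllib.request.Request(url, data=body.encode("utf-8"), method="POST")


def check(text, language="ru", url=SERVER_URL):
    """Send the text to the server and return the list of reported matches."""
    request = build_check_request(text, language, url)
    with urllib.request.urlopen(request) as response:
        return json.load(response)["matches"]
```

An empty list from `check(...)` corresponds to the "everything is OK" result described above: no rule that is active on the server flagged the text.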

dnaber · June 18, 2017, 11:57am

The ngrams only work in context; they are not related to the other checks (like the spell check). So if “wrongword” is accepted, that’s because the spellchecker has it in its dictionary. If you think the word should not be in the dictionary, please open a bug report in the languagetool-org/languagetool issue tracker on GitHub.

helz · June 18, 2017, 1:00pm

Yeah, you’re right. Sorry for the confusion. My “wrongword” is not actually wrong on its own. It may be wrong depending on the word before it.

So, here is how I check it:

java -cp languagetool.jar org.languagetool.dev.NGramStats /media/8E9ED52D9ED50E99/ngram-ru-20150914/ru "context_word_1 verifiable_word" - returns n occurrences, and that’s right

java -cp languagetool.jar org.languagetool.dev.NGramStats /media/8E9ED52D9ED50E99/ngram-ru-20150914/ru "context_word_2 verifiable_word" - returns 0 occurrences, and that’s right too, because in this case it means that there is no such combination of words

I also checked both combinations at https://languagetool.org/ and didn’t get the correct result. But I’m sure there is an error in one of these cases.

dnaber · June 18, 2017, 1:26pm

The Russian confusion rule so far only has two pairs that are checked (не/ни and шасси/шоссе), so if your word isn’t one of those, the ngram rule is not active at all for your case. (We only check specific hand-chosen pairs to avoid getting too many false alarms.)
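
To illustrate why a zero ngram count alone does not trigger an error, here is a rough sketch in Python of a check restricted to confusion pairs. It is not LanguageTool's actual implementation: the bigram counts, the `MIN_FACTOR` threshold, and the function names are all invented for this example, and only the two pairs come from the reply above.

```python
# Hypothetical toy data: these counts are invented for illustration and
# are not taken from the real ngram-ru data set.
BIGRAM_COUNTS = {
    ("на", "шоссе"): 120,
    ("на", "шасси"): 15,
    ("выпустил", "шасси"): 40,
    ("выпустил", "шоссе"): 0,
}

# The only pairs the rule looks at, as listed in the reply above.
CONFUSION_PAIRS = [("не", "ни"), ("шасси", "шоссе")]

# Assumed threshold: how much more frequent the alternative must be.
MIN_FACTOR = 10


def alternative_for(word):
    """Return the other member of the word's confusion pair, or None."""
    for first, second in CONFUSION_PAIRS:
        if word == first:
            return second
        if word == second:
            return first
    return None


def check_bigram(prev_word, word, counts=BIGRAM_COUNTS):
    """Return a suggested replacement for `word`, or None if nothing is flagged."""
    alternative = alternative_for(word)
    if alternative is None:
        # Words outside the hand-chosen pairs are never checked,
        # no matter how rare the bigram is.
        return None
    used = counts.get((prev_word, word), 0)
    other = counts.get((prev_word, alternative), 0)
    # Flag only when the alternative is clearly more common in this context.
    if other > 0 and other >= MIN_FACTOR * max(used, 1):
        return alternative
    return None
```

With this toy data, `check_bigram("выпустил", "шоссе")` suggests `шасси`, while `check_bigram("context_word_2", "verifiable_word")` returns `None` even though the bigram has 0 occurrences, because the word is not part of any confusion pair.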

helz · June 18, 2017, 1:42pm

Got it. Thanks!