Run ConfusionRuleEvaluator from the command line?

To add pairs of words to confusion_sets.txt, a factor is necessary. To find that factor, “Open ConfusionRuleEvaluator.java in your IDE and set TOKEN and TOKEN_HOMOPHONE to the two words that can easily be confused. Then run the main method of the class” (Adding N Gram Data Rules - LanguageTool Wiki).

Is it possible to find the factor using the command line? If yes, how?

I’ve changed the code a bit to make it easier, but it’s still not trivial. Here is how the required JAR can be built with a developer set-up (I will also send you a JAR directly later today):

mvn clean compile assembly:single

Then, in target there’s a file called languagetool-dev-3.6-SNAPSHOT-jar-with-dependencies.jar which can be run like this:

java -cp languagetool-dev-3.6-SNAPSHOT-jar-with-dependencies.jar org.languagetool.dev.bigdata.ConfusionRuleEvaluator

It will print the exact usage. To work, it requires not only the pair of words to be checked (like “their” and “there”) but also a plain text file with example sentences that contain these words. I get these examples by running the Unix command grep on a list of sentences extracted from Wikipedia and Tatoeba. You can also specify a Wikipedia XML file instead, but then the whole XML has to be scanned for example sentences, which is much slower.
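For anyone who prefers a script to grep, the example-gathering step described above can be approximated in Python. This is only a minimal sketch and not part of LanguageTool: the function name filter_sentences and the sample sentences are made up for illustration, and it assumes the input has one sentence per line.

```python
import re

def filter_sentences(lines, words):
    """Keep the lines that contain any of the given words as a whole word.

    Matching is case-insensitive, roughly like `grep -iw` with several patterns.
    """
    alternatives = "|".join(re.escape(w) for w in words)
    pattern = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)
    return [line.rstrip("\n") for line in lines if pattern.search(line)]

# Illustrative input; in practice this would be a large sentence list read from a file.
sample = [
    "Their house is big.",
    "Put it over there.",
    "There is no reason.",
    "Therefore we left.",  # not kept: "Therefore" is a different word
]
examples = filter_sentences(sample, ["their", "there"])
```

The resulting sentences can then be written to a plain text file and passed to the evaluator as the example file.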

So in a nutshell, if you just have a few words you can also send them to me and I’ll run this process.

The word pairs (refer to [en] walkaround and workaround · Issue #502 · languagetool-org/languagetool · GitHub) are:
walkaround/workaround
land/lend
borrow/lend

(Stationary/stationery and desert/dessert are already in en/confusion_sets.txt.)

Although @kostyfisik would like rules in the grammar file, I think that the statistical method will probably give useful messages. For example, with stationery/stationary:
LT gives a message: This car is stationery.
LT does not give a message: This book is stationery.
LT does not give a message: This book is stationary.

I’ll try these word pairs later today. In case you or someone is interested in doing it yourself, I’ve put the JAR here: Daniel Naber (116MB)

Thanks, I’ve added borrow/lend and land/lend to our confusion list. For walkaround/workaround there are not enough examples in Wikipedia and Tatoeba, so I’ve skipped those.

How many example sentences are required for each pair of words?

The more examples, the better the evaluation results. I suggest using at least 50 sentences per word, i.e. 100 per word pair.
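Before running the evaluation, it can help to check whether an example file meets that suggested minimum. The following is a rough sketch under my own assumptions: count_examples and has_enough are hypothetical helper names, not LanguageTool functions, and the input is assumed to have one sentence per line.

```python
import re

MIN_PER_WORD = 50  # the minimum per word suggested in this thread

def count_examples(lines, word_a, word_b):
    """Count how many lines contain each word as a whole word (case-insensitive)."""
    counts = {word_a: 0, word_b: 0}
    for line in lines:
        for word in counts:
            if re.search(r"\b" + re.escape(word) + r"\b", line, re.IGNORECASE):
                counts[word] += 1
    return counts

def has_enough(counts, minimum=MIN_PER_WORD):
    """Return True only if every word reaches the minimum number of examples."""
    return all(n >= minimum for n in counts.values())

# Illustrative data: 50 sentences for "lend" but only 49 for "borrow".
lines = ["I will lend you a book."] * 50 + ["May I borrow a pen?"] * 49
counts = count_examples(lines, "lend", "borrow")
enough = has_enough(counts)
```

A pair that falls short, as walkaround/workaround apparently did in Wikipedia and Tatoeba, would need more source text before the evaluation is worthwhile.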

Hi! I have tried to build the JAR with dependencies for the new 4.2 version of LanguageTool, but mvn clean compile assembly:single fails when building languagetool-parent 4.2-SNAPSHOT:

[ERROR] Failed to execute goal org.apache.maven.plugins:maven-assembly-plugin:2.2-beta-5:single (default-cli) on project languagetool-parent: Error reading assemblies: No assembly descriptors found. -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.apache.maven.plugins:maven-assembly-plugin:2.2-beta-5:single (default-cli) on project languagetool-parent: Error reading assemblies: No assembly descriptors found.

Caused by: org.apache.maven.plugin.MojoExecutionException: Error reading assemblies: No assembly descriptors found.
Caused by: org.apache.maven.plugin.assembly.io.AssemblyReadException: No assembly descriptors found.

If it would be helpful, I can post the whole stack trace. My Maven and Java versions are:

Apache Maven 3.5.4 (1edded0938998edf8bf061f1ceb3cfdeccf443fe; 2018-06-17T21:33:14+03:00)
Java version: 1.8.0_112, vendor: Oracle Corporation

Has anything changed in how to run ConfusionRuleEvaluator since version 3.6? Thanks!

mvn clean compile assembly:single works for me with Maven 3.3.9. Maybe you need to run mvn install -DskipTests in the LT top-level directory first?

I’ve just done a dictionary search on “stationery”. It seems to refer to mass-produced commercial writing material, so a school exercise book is stationery that might be stationary.

Thanks for the quick response!

Everything worked after I did a fresh clone of the repository. Maven version does not appear to be an issue.

I had planned to use it to evaluate Russian confusion pairs, but it seems that ConfusionRuleEvaluator works only for English. Even if I change the source to accept the Russian language model, it complains about the Lucene50 codec, which seems strange since the current Russian ngram model works with the standalone package. I will keep digging, thanks!

This is what I have in the rough data for walkaround and workaround: taaltik.xs4all.nl/transfer/x_around.zip

If this is the level of ease of use for getting the parameters, I would be happy to have this word confusion addition for Dutch. I don’t know how much work it is to add it codewise, and what will be needed to train the AI.