To add pairs of words to confusion_sets.txt, a factor is necessary. To find that factor, “Open ConfusionRuleEvaluator.java in your IDE and set TOKEN and TOKEN_HOMOPHONE to the two words that can easily be confused. Then run the main method of the class” (Adding N Gram Data Rules - LanguageTool Wiki).
Is it possible to find the factor using the command line? If yes, how?
I’ve changed the code a bit to make it easier, but it’s still not trivial. This is how the required JAR can be built with a developer set-up, but I will send you a JAR directly later today:
mvn clean compile assembly:single
Then, in target there’s a file called languagetool-dev-3.6-SNAPSHOT-jar-with-dependencies.jar which can be run like this:
It will print the exact usage. To work, it requires not only the pair of words to be checked (like “their” and “there”) but also a plain text file with example sentences that contain these words. I get these examples by running the Unix command grep on a list of sentences extracted from Wikipedia and Tatoeba. You can also specify a Wikipedia XML dump, but then the whole XML will need to be scanned for example sentences, which is much slower.
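For anyone without grep at hand, the filtering step could be sketched roughly as below. This is only an illustration of the idea, not part of LanguageTool: the function name extract_examples is made up, and it assumes the input is one sentence per line.

```python
import re
from typing import Iterable, List


def extract_examples(sentences: Iterable[str], word_a: str, word_b: str) -> List[str]:
    """Return the sentences containing word_a or word_b as a whole word.

    Hypothetical helper (not a LanguageTool API). Matching is
    case-insensitive and uses word boundaries, so "there" does not
    match inside "therefore".
    """
    pattern = re.compile(
        r"\b(?:%s|%s)\b" % (re.escape(word_a), re.escape(word_b)),
        re.IGNORECASE,
    )
    # Strip the trailing newline so the result can be written out line by line.
    return [s.strip() for s in sentences if pattern.search(s)]
```

Since an open text file yields its lines one at a time, it can be passed in directly, e.g. extract_examples(open("sentences.txt", encoding="utf-8"), "their", "there"), where sentences.txt is a placeholder name for your own sentence list. The result can then be written to the plain text file that the evaluator expects.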
So in a nutshell, if you just have a few words you can also send them to me and I’ll run this process.
(Stationary/stationery and desert/dessert are already in en/confusion_sets.txt.)
Although @kostyfisik would like rules in the grammar file, I think that the statistical method will probably give useful messages. For example, with stationery/stationary:
LT gives a message: This car is stationery.
LT does not give a message: This book is stationery.
LT does not give a message: This book is stationary.
Thanks, I’ve added borrow/lend and land/lend to our confusion list. For walkaround/workaround there are not enough examples in Wikipedia and Tatoeba, so I’ve skipped those.
Hi! I have tried to build the JAR with dependencies for the new 4.2 version of LanguageTool, but mvn clean compile assembly:single fails when building languagetool-parent 4.2-SNAPSHOT:
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-assembly-plugin:2.2-beta-5:single (default-cli) on project languagetool-parent: Error reading assemblies: No assembly descriptors found. -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.apache.maven.plugins:maven-assembly-plugin:2.2-beta-5:single (default-cli) on project languagetool-parent: Error reading assemblies: No assembly descriptors found.
Caused by: org.apache.maven.plugin.MojoExecutionException: Error reading assemblies: No assembly descriptors found.
Caused by: org.apache.maven.plugin.assembly.io.AssemblyReadException: No assembly descriptors found.
If it would be helpful, I can post the whole stack trace. My Maven and Java versions are
I’ve just done a dictionary search on “stationery”.
It seems to refer to mass-produced commercial writing material.
So an exercise schoolbook is stationery that might be stationary.
Everything worked after I did a fresh clone of the repository. Maven version does not appear to be an issue.
I had planned to use it to evaluate Russian confusion pairs, but it seems that ConfusionRuleEvaluator works only for English. Even if I change the source to accept the Russian language model, it complains about the Lucene50 codec, which seems strange since the current Russian ngram model works with the standalone package. Will keep digging, thanks!
If this is the level of ease of use for getting the parameters, I would be happy to have this word confusion addition for Dutch. I don’t know how much work it is to add it codewise, and what will be needed to train the AI.