Hi Daniel,
You said that this improves the detection of some words that are easily confused. I’m very interested in the technical details and have read Finding errors using Big Data - LanguageTool Wiki, but I’m still confused. I found that EnglishConfusionProbabilityRule does not work in the stand-alone version. I want to know how to detect those confused words in the stand-alone version using the big data.
Also, I cannot open the rule editor page today. What’s wrong with it?
The confusion rule currently only works in server mode and in the command-line version (Command-Line Options - LanguageTool Wiki, see option “--languagemodel”).
Yes, the standalone version does not support NGrams by default.
However, I found that if you “hack the code” and add the following (inserting the correct path to your NGram Database)
I’ve added this code to the showOptions() routine, since I don’t want it enabled by default.
Hence, the NGram rule is only enabled when I go to the option dialogue box.
In version 2.8 I fully integrated enabling the NGram database into the settings options; however, I have not yet migrated the code to version 3.
I’m just wondering whether it might be worth integrating the ngram functionality into the core of the LanguageTool code, since I feel it could be useful if some of the n-gram functionality could be used in conjunction with the standard Java/XML rules.
For example, if a rule identifies a possible error, it could use the confusion probability rule to compare the original text with any possible alternatives identified. However, for this to be useful you’ll probably need a bit more than the 3-gram version.
An example application could be detecting a missing definite/indefinite article (determiner).
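To make the idea of comparing the original text with its alternatives a bit more concrete, here is a minimal sketch. It is not LanguageTool code: the class name, the toy trigram counts, and the scoring formula (average of log(count + 1) over all trigrams) are assumptions made up for illustration, and a real implementation would query the ngram database instead.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: rank a phrase and its article-inserted variants by a toy trigram score.
class NgramRankingSketch {

    // Made-up trigram counts standing in for a real n-gram database.
    static final Map<String, Integer> TOY_COUNTS = new HashMap<>();
    static {
        TOY_COUNTS.put("went to the", 900);
        TOY_COUNTS.put("to the store", 400);
        TOY_COUNTS.put("went to a", 300);
        TOY_COUNTS.put("to a store", 60);
        TOY_COUNTS.put("went to store", 2);
    }

    // Average log(count + 1) over all trigrams, so that a longer candidate is not
    // penalised merely for containing more trigrams. Returns 0.0 for fewer than three words.
    static double score(String text) {
        String[] words = text.trim().toLowerCase().split("\\s+");
        if (words.length < 3) {
            return 0.0;
        }
        int trigramCount = words.length - 2;
        double sum = 0.0;
        for (int i = 0; i < trigramCount; i++) {
            String trigram = words[i] + " " + words[i + 1] + " " + words[i + 2];
            sum += Math.log(TOY_COUNTS.getOrDefault(trigram, 0) + 1.0);
        }
        return sum / trigramCount;
    }

    // Returns a copy of the candidates, best-scoring first.
    static List<String> rank(List<String> candidates) {
        List<String> sorted = new ArrayList<>(candidates);
        sorted.sort((a, b) -> Double.compare(score(b), score(a)));
        return sorted;
    }
}
```

With these toy counts, ranking "went to store", "went to a store" and "went to the store" puts the variant with the definite article first and the original text last.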
Yes, one option would be to sort the rule’s suggestions.
However, with a rule that might generate incorrect suggestions, you could reject the rule match if the suggested correction had a low ngram probability.
For example:
Did you mean the ?
Accept the rule if the 3-gram probability is more than, say, 0.4, or if the difference in probability between the original text and the suggested correction is greater than a specified value.
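The acceptance test described above could be sketched roughly as follows. This is only an illustration, not existing LanguageTool code: the class and method names are hypothetical, the 0.2 value for the minimum gain is an arbitrary placeholder, and the probabilities are assumed to be supplied by the caller already normalised to the range 0 to 1.

```java
// Hypothetical sketch of an ngram-based gate for rule matches.
class SuggestionGateSketch {

    // Threshold taken from the example value in the post above.
    static final double MIN_PROBABILITY = 0.4;

    // Arbitrary placeholder for the "specified value" of the probability difference.
    static final double MIN_GAIN = 0.2;

    // Accept the match if the suggestion is probable enough on its own,
    // or if it is sufficiently more probable than the original text.
    static boolean accept(double originalProbability, double suggestionProbability) {
        boolean probableEnough = suggestionProbability > MIN_PROBABILITY;
        boolean clearlyBetter = (suggestionProbability - originalProbability) > MIN_GAIN;
        return probableEnough || clearlyBetter;
    }
}
```

Under these assumptions, `accept(0.1, 0.5)` passes on the absolute threshold, `accept(0.05, 0.3)` passes on the difference alone, and `accept(0.25, 0.3)` is rejected.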
Hi, I just spotted that you made it possible to configure the ngram directory via the GUI back in June.
One comment: if you move your ngram folder, it’s a little tricky to open the GUI, so I’ve added a try/catch block around the activateLanguageModelRules call in reloadLanguageTool:
    if (config.getNgramDirectory() != null) {
      try {
        languageTool.activateLanguageModelRules(config.getNgramDirectory());
      } catch (IOException e) {
        // Report the problem instead of failing to open the GUI
        JOptionPane.showMessageDialog(null, "IO error while loading ngram database.\n" + e.getMessage());
      } catch (RuntimeException e) {
        JOptionPane.showMessageDialog(null, "Error while loading ngram database.\n" + e.getMessage());
      }
    }
<rule id="PREFER_TO_VBG" name="prefer to vbg(vb)">
  <pattern>
    <token>prefer</token>
    <token>to</token>
    <marker>
      <token postag='VBG'></token>
    </marker>
  </pattern>
  <message>Did you mean <suggestion><match no="3" postag="VB"/></suggestion>?</message>
  <example correction='change'>Some other people prefer to <marker>changing</marker> job.</example>
  <example>Some other people prefer to change job.</example>
</rule>
<!-- English rule, 2015-07-08 -->
<rule id="SOME_NEW_FIND" name="some new find(finds)">
  <pattern>
    <token>some</token>
    <token>new</token>
    <marker>
      <token>find</token>
    </marker>
  </pattern>
  <message>Did you mean <suggestion><match no="3" postag="NNS"/></suggestion>?</message>
  <example correction='finds'>They can create some new <marker>find</marker> in theories and factories and achieve success in this job.</example>
  <example>They can create some new finds in theories and factories and achieve success in this job.</example>
</rule>
I’ve pushed my changes to master.
Sorry about the initial formatting mistake; I didn’t spot that Eclipse automatically changes the indentation on pasting. Note to self: change Eclipse’s default indentation settings next time.
Thanks, I’ve added the first rule. About the second one, I’m not so sure. It’s very specific, and I also get a lot of matches for “some new find” in Google, and I don’t think they are all wrong.