We’ve just activated an update for English that improves error detection of some words that are easily confused. The list of words is going to be extended, but for now they are:
accept, except
ate, eight
extent, extend
four, for
know, now
nice, mice
pray, prey
their, there
you, your
rite, right
This doesn’t mean that every error where e.g. ‘their’ and ‘there’ are mixed up is detected, but most of them (>75%) are. As this feature requires huge amounts of disk space, it is available only on https://languagetool.org, not in the stand-alone version (unless you’re willing to download a few GB of data).
Thanks. Where the page “Finding errors using Big Data - LanguageTool Wiki” says “Automatically Create Rules”, I’d like to know: is it like ruleEditor2, or does it mean that feeding in data could automatically generate grammar rules like those in grammar.xml?
Hi,
I’ve been testing this functionality for a few weeks now and I feel it could be quite useful.
One suggestion would be to let the user configure the attributes MIN_SCORE_DIFF and MIN_ALTERNATIVE_SCORE of ConfusionProbabilityRule via the command-line interface.
Another concerns the test for finding alternative suggestions; see the function getBetterAlternativeOrNull of ConfusionProbabilityRule. (Note that I’ve been looking at the version 2.8/2.9 code.)
There’s this if condition: if (alternativeScore >= bestScore + MIN_SCORE_DIFF && alternativeScore >= MIN_ALTERNATIVE_SCORE)
I found that if bestScore is small, i.e. the current word has a very low probability, and alternativeScore never reaches MIN_ALTERNATIVE_SCORE, you don’t get any alternative suggestions.
In my hacked version I’ve changed the line to this, which seems to help in this rare situation: if (alternativeScore >= bestScore + MIN_SCORE_DIFF && (alternativeScore >= MIN_ALTERNATIVE_SCORE || bestScore < 0.01))
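To make the difference between the two conditions above concrete, here is a minimal, self-contained Java sketch. It is not the LanguageTool code: the class name and the two threshold values are hypothetical and chosen only for illustration, and the real constants in ConfusionProbabilityRule may well differ.

```java
// Illustration only: not the actual ConfusionProbabilityRule implementation.
final class AlternativeCheck {

    // Hypothetical values; the real constants may differ.
    static final double MIN_SCORE_DIFF = 0.1;
    static final double MIN_ALTERNATIVE_SCORE = 0.5;
    // Cutoff below which the current word counts as "very unlikely" (0.01 in the post).
    static final double LOW_SCORE_CUTOFF = 0.01;

    // The condition as quoted from the 2.8/2.9 code.
    static boolean originalCheck(double bestScore, double alternativeScore) {
        return alternativeScore >= bestScore + MIN_SCORE_DIFF
            && alternativeScore >= MIN_ALTERNATIVE_SCORE;
    }

    // The relaxed condition: the absolute threshold is skipped
    // when the current word itself has a very low score.
    static boolean relaxedCheck(double bestScore, double alternativeScore) {
        return alternativeScore >= bestScore + MIN_SCORE_DIFF
            && (alternativeScore >= MIN_ALTERNATIVE_SCORE || bestScore < LOW_SCORE_CUTOFF);
    }

    public static void main(String[] args) {
        double bestScore = 0.001;       // current word is very unlikely
        double alternativeScore = 0.2;  // clearly better, but below MIN_ALTERNATIVE_SCORE
        System.out.println(originalCheck(bestScore, alternativeScore)); // prints false
        System.out.println(relaxedCheck(bestScore, alternativeScore));  // prints true
    }
}
```

With these assumed thresholds, the original condition rejects an alternative that is 200 times more likely than the current word, while the relaxed one accepts it.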
Note that I’ve been testing this rule with possible common typos,
e.g. mistyping ‘confirmation’ as ‘conformation’, which happens because the I and O keys are next to each other on the keyboard. This is something a spell-checker would miss, since both are valid words.
I’ve been working on a way to use a remote NGram database and have written my own class ConfusionProbabilityRemoteRule, which I’m using with the standalone interface. However, the issue is that the “score” function can be quite slow in this case. So for this option to be effective I really only want to call the confusion prob rule for changed text which is within the NGram range.
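As a rough sketch of what “changed text which is within the NGram range” could mean in practice: if the longest ngram has n tokens, a change to one token can only affect the scores of tokens at most n-1 positions away. The helper below is hypothetical (it is not part of LanguageTool or of my ConfusionProbabilityRemoteRule) and assumes tokens are indexed from 0.

```java
// Illustration only: a hypothetical helper, not LanguageTool code.
final class NgramRecheck {

    /**
     * Returns {first, last}: the inclusive range of token indexes whose
     * ngram context contains the changed token, clamped to the text bounds.
     */
    static int[] affectedRange(int changedIndex, int tokenCount, int maxNgramLength) {
        int reach = maxNgramLength - 1; // how far an ngram extends to either side
        int first = Math.max(0, changedIndex - reach);
        int last = Math.min(tokenCount - 1, changedIndex + reach);
        return new int[] {first, last};
    }

    public static void main(String[] args) {
        // Token 5 of 10 changed, 3grams are the longest ngrams in use.
        int[] range = affectedRange(5, 10, 3);
        System.out.println(range[0] + ".." + range[1]); // prints 3..7
    }
}
```

Only the tokens in that range would need a new remote lookup; scores for everything else could be kept from the previous check.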
Using a remote NGram database is probably something you wouldn’t use on the server version of LanguageTool.
It’s work in progress and I may post more details in future if I refine it further.
Anyway, I’m looking forward to seeing how this functionality develops in future releases.
Thanks for the feedback. Please have a look at the recent version (in git): it no longer has MIN_SCORE_DIFF and MIN_ALTERNATIVE_SCORE, and the results should be better. The ngram data size has increased, though, as we now also use 1grams in addition to 2grams and 3grams. I’ll also try to see if 4grams help.
OK, thanks. I see that the ConfusionProbabilityRemoteRule rule has been updated quite a bit, around the end of May.
When I have time I’ll take a look and see how this new rule works.
I’ve noticed that the score function is gone and seems to have been replaced with getProbabilityFor.
We’ve extended the list of supported word pairs in the last few days. This is the current list of words that can often (not always) be detected if they are confused:
accept, except
ate, eight
bean, been
but, butt
buy, bye
breathe, breath
dessert, desert
effect, affect
extent, extend
first, fist
full, fill
four, for
know, now
loose, lose
news, new
nice, mice
our, out
pray, prey
proof, prove
rite, right
their, there
then, than
think, thing
to, the
whether, weather
you, your
Awesome. Except that I’m pretty sure not everybody is comfortable sending unpublished text to some web server. I’m pretty sure this is disallowed in many companies and universities.
But still, a nice feature for private users, so thanks a lot for that.