Improved error detection for English

dnaber · May 27, 2015, 9:02am

We’ve just activated an update for English that improves error detection of some words that are easily confused. The list of words is going to be extended, but for now they are:

accept, except
ate, eight
extent, extend
four, for
know, now
nice, mice
pray, prey
their, there
you, your
rite, right

This doesn’t mean that every error where, e.g., ‘their’ and ‘there’ are mixed up is detected, but most (>75%) are. As this feature requires huge amounts of disk space, it is available only on https://languagetool.org, not in the stand-alone version (unless you’re willing to download a few GB of data).

For those interested in the technical details, please see Finding errors using Big Data - LanguageTool Wiki

dnaber · May 27, 2015, 9:18am

Here are some example errors that couldn’t be detected before and now can:

I can’t remember how to go their.
I didn’t now where it came from.
Alabama has for of the world’s largest stadiums.
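
To illustrate the general idea behind catching errors like these, here is a minimal sketch that compares how often a word and its confusable alternative occur in the same context. It is not LanguageTool's actual implementation: the class name, the `betterAlternative` helper, the `factor` threshold, and the counts are all made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class NgramConfusionSketch {

    // Toy trigram counts, invented for this example (not real ngram data).
    static final Map<String, Long> COUNTS = new HashMap<>();
    static {
        COUNTS.put("to go their", 10L);
        COUNTS.put("to go there", 5000L);
    }

    // Look up a count; unseen ngrams count as zero.
    static long count(String ngram) {
        return COUNTS.getOrDefault(ngram, 0L);
    }

    // Returns the alternative if it is at least `factor` times more frequent
    // than the original word in the given left context, otherwise null.
    // Add-one smoothing keeps unseen ngrams from producing a zero comparison.
    static String betterAlternative(String leftContext, String word,
                                    String alternative, long factor) {
        long wordCount = count(leftContext + " " + word) + 1;
        long altCount = count(leftContext + " " + alternative) + 1;
        return altCount >= factor * wordCount ? alternative : null;
    }

    public static void main(String[] args) {
        // "how to go their" -> the context "to go" strongly prefers "there".
        System.out.println(betterAlternative("to go", "their", "there", 10));
        // The reverse direction yields no suggestion.
        System.out.println(betterAlternative("to go", "there", "their", 10));
    }
}
```

With these toy counts, the first call suggests "there" and the second returns null, so a correct sentence is left alone.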

Mility · May 28, 2015, 7:00am

Thanks. The source for the ngram data is not working well for me; I cannot download it. Is it available anywhere else?

dnaber · May 28, 2015, 8:28am

What do you mean when you say you cannot download the files? Is the connection too slow? Until someone mirrors the data, it’s not available elsewhere.

Mility · May 28, 2015, 8:40am

Thanks. I want to use the data to generate XML rules. Does it include grammar rules?

dnaber · May 28, 2015, 8:52am

If you just want to look up occurrence counts, you don’t need to download the data. You can do that at http://corpora.linguistik.uni-erlangen.de/demos/cgi-bin/Web1T5/Web1T5_freq.perl

Mility · May 28, 2015, 9:05am

Thanks. Where it says “Automatically Create Rules” in Finding errors using Big Data - LanguageTool Wiki, I’d like to know: is it like ruleEditor2, or is it something where you input the data and it automatically generates grammar rules like those in grammar.xml?

dnaber · May 28, 2015, 9:15am

I haven’t touched that code for half a year, so it might not even work anymore. You will need to try it yourself.

Mility · May 28, 2015, 10:11am

ok. Thanks anyway.

PeterLawrence · May 28, 2015, 12:36pm

Hi,
I’ve been testing this functionality for a few weeks now and I feel it could be quite useful.
One suggestion would be to let the user configure the MIN_SCORE_DIFF and MIN_ALTERNATIVE_SCORE attributes of ConfusionProbabilityRule via the command-line interface.

Also, regarding the test for finding alternative suggestions, see the function getBetterAlternativeOrNull of ConfusionProbabilityRule (note that I’ve been looking at the version 2.8/2.9 code).
There’s this if condition:
if (alternativeScore >= bestScore + MIN_SCORE_DIFF && alternativeScore >= MIN_ALTERNATIVE_SCORE)

I found that if bestScore is small, i.e. the current word has a very low probability, and alternativeScore never reaches MIN_ALTERNATIVE_SCORE, you don’t get any alternative suggestions.

In my hacked version I’ve changed the line to this, which seems to help in this rare situation.
if (alternativeScore >= bestScore + MIN_SCORE_DIFF && (alternativeScore >= MIN_ALTERNATIVE_SCORE || bestScore < 0.01))
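
For readers who want to see the difference between the two conditions quoted above, here is a small stand-alone sketch. It is not the LanguageTool source: the class and method names are hypothetical, and the values chosen for MIN_SCORE_DIFF and MIN_ALTERNATIVE_SCORE are placeholders, since the real constants in version 2.8/2.9 may differ.

```java
public class AlternativeCheckSketch {

    // Placeholder values, assumed for illustration only.
    static final double MIN_SCORE_DIFF = 1.0;
    static final double MIN_ALTERNATIVE_SCORE = 5.0;
    // Cut-off below which the current word counts as "very unlikely" (from the post).
    static final double LOW_BEST_SCORE = 0.01;

    // The condition as quoted from getBetterAlternativeOrNull.
    static boolean originalCheck(double alternativeScore, double bestScore) {
        return alternativeScore >= bestScore + MIN_SCORE_DIFF
            && alternativeScore >= MIN_ALTERNATIVE_SCORE;
    }

    // The relaxed condition: the absolute threshold is waived
    // when the current word's score is very low.
    static boolean relaxedCheck(double alternativeScore, double bestScore) {
        return alternativeScore >= bestScore + MIN_SCORE_DIFF
            && (alternativeScore >= MIN_ALTERNATIVE_SCORE || bestScore < LOW_BEST_SCORE);
    }

    public static void main(String[] args) {
        double bestScore = 0.001;      // current word is very unlikely
        double alternativeScore = 2.0; // better, but below MIN_ALTERNATIVE_SCORE

        System.out.println("original: " + originalCheck(alternativeScore, bestScore));
        System.out.println("relaxed:  " + relaxedCheck(alternativeScore, bestScore));
    }
}
```

With these placeholder numbers the original check rejects the alternative and the relaxed check accepts it, which is the rare situation described above. When bestScore is not tiny, both checks behave the same.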

Note that I’ve been testing this rule with likely common typos, e.g. mistyping ‘confirmation’ as ‘conformation’, which happens because the I and O keys are next to each other on the keyboard. A spell checker would miss this, because both are valid words.

I’ve been working on a way to use a remote NGram database and have written my own class ConfusionProbabilityRemoteRule, which I’m using with the standalone interface. However, the issue is that the “score” function can be quite slow in this case. So for this option to be effective I really only want to call the confusion prob rule for changed text which is within the NGram range.
Using a remote NGram database is probably something you wouldn’t use on the server version of LanguageTool.
It’s work in progress and I may post more details in future if I refine it further.

Anyway, I’m looking forward to seeing how this functionality develops in future releases.

Thanks

Mility · May 28, 2015, 12:42pm

Thank you very much.

dnaber · May 28, 2015, 3:10pm

Hi Peter,

thanks for the feedback. Please have a look at the most recent version (in git): it no longer has MIN_SCORE_DIFF and MIN_ALTERNATIVE_SCORE, and the results should be better. The ngram data size has increased, though, as we now use 1grams in addition to 2grams and 3grams. I’ll also try to see if 4grams help.

Regards
Daniel

PeterLawrence · May 28, 2015, 4:55pm

OK, thanks. I see that the ConfusionProbabilityRemoteRule rule has been updated quite a bit around the end of May.
When I have time I’ll take a look and see how this new rule works.
I’ve noticed that the score function is gone and seems to have been replaced with getProbabilityFor.

dnaber · May 31, 2015, 7:18pm

We’ve extended the list of supported word pairs over the last few days. This is the current list of words that can often (not always) be detected if they are confused:

accept, except
ate, eight
bean, been
but, butt
buy, bye
breathe, breath
dessert, desert
effect, affect
extent, extend
first, fist
full, fill
four, for
know, now
loose, lose
news, new
nice, mice
our, out
pray, prey
proof, prove
rite, right
their, there
then, than
think, thing
to, the
whether, weather
you, your

emperor · June 12, 2015, 5:24pm

Awesome. Except that I’m pretty sure not everybody is comfortable sending unpublished text to some web server. I’m pretty sure this is disallowed in many companies and universities.

But still, a nice feature for private users, so thanks a lot for that.