
Available Real-word Error Corpus

Hi Dnaber,
You said that Jenny Pedler’s Real-word Error Corpus was available, but I can’t get it (see below).

Is there another real-word error corpus available?


It seems it has been removed… I’ve sent you the corpus via email.

Thanks, I have received it.

Hi Mility, just wondering what your plans are with the Real-word Error Corpus.
I’ve found her thesis here

However, could someone please email me the corpus too?

Ok, please leave your email here. I’ll send you the corpus via email.

Thanks, Daniel sent me the Corpus this morning.
These text files are quite interesting to look through, and I expect, Mility, that you’re going to use them for testing.

I actually found two problems with the Pedler corpus:

  1. It’s too small with only 700 sentences. That means for each word pair like “their/there” you have only very few examples.
  2. As it contains only errors it’s dangerous to use it for optimizing rules. You might get rules that find errors but that also cause a lot of false alarms.

So I don’t use the Pedler corpus but the ConfusionRuleEvaluator class. It can read sentences from Wikipedia and Tatoeba and assumes they are correct (that works quite well, though not 100% of course). It tests if these correct sentences cause an error (thus a false alarm). Then, it replaces a word with its homophone (e.g. there -> their), generating a (probably) wrong sentence and it checks if the error can be found. As output you get figures like this:

  • Precision: 0.998 (3 false positives)
  • Recall: 0.970 (60 false negatives)
  • F-measure: 0.993 (beta=0.5)

This way you can make sure the confusion rule works well for a homophone pair, finding errors without causing a lot of false alarms. ConfusionRuleEvaluator is meant to be used for the confusion rule that works with ngram data, not for standard pattern rules.
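The evaluation scheme described above can be sketched roughly as follows. This is a minimal illustration under my own simplifications, not the actual ConfusionRuleEvaluator implementation; the `check` function is a placeholder for whatever confusion-rule checker is being evaluated:

```python
# Rough sketch of the evaluation loop described above (NOT the real
# ConfusionRuleEvaluator code). `check(sentence)` is a placeholder
# that returns True when the rule flags an error in the sentence.

def evaluate(sentences, pair, check):
    """Evaluate a homophone pair on sentences assumed to be correct."""
    a, b = pair                      # e.g. ("there", "their")
    tp = fp = fn = 0
    for s in sentences:
        # A correct sentence that triggers the rule is a false alarm.
        if check(s):
            fp += 1
        # Swap the homophone to create a (probably) wrong sentence.
        words = s.split()
        if a in words:
            bad = " ".join(b if w == a else w for w in words)
            if check(bad):
                tp += 1              # introduced error was found
            else:
                fn += 1              # introduced error was missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    beta2 = 0.5 ** 2                 # beta = 0.5 weights precision higher
    f = ((1 + beta2) * precision * recall / (beta2 * precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f
```

Run over a large corpus such as Wikipedia or Tatoeba sentences, this yields precision/recall/F-measure figures like the ones listed above.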

Source code:

Yes, the Pedler corpus has only 834 real-word errors; I just use it to test the realWordErrorRule (a rule I wrote myself). In the beginning I wanted to use the Google ngram data, but I don’t know what’s wrong with the download from:, I tried many times and it failed every time. Also, some confusion sets in the Pedler corpus are not one-to-one (e.g. there -> their, they, they’re). How should we deal with this?
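One possible way to handle such one-to-many sets is to expand them into directed pairs and evaluate each pair on its own. This is just a suggestion of mine, not something the corpus format or LT prescribes:

```python
# Sketch (my own suggestion): expand a one-to-many confusion set,
# written as {observed error: [possible corrections]}, into directed
# (error, correction) pairs so each pair can be tested separately.

def expand_pairs(confusion_set):
    return [(err, corr)
            for err, corrs in confusion_set.items()
            for corr in corrs]

# The example set from the discussion above:
pairs = expand_pairs({"there": ["their", "they", "they're"]})
```

Each resulting pair can then be evaluated and tuned independently, e.g. with different confidence thresholds.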

Yes I agree these files are not great for software testing. However, they do provide some examples of typos in relation to dyslexia. I’ve not checked the documents in detail, but I’ve also noticed problems with missing plurals, definite article (the) and past tenses (-ed).
I was kind of wondering if LanguageTool would identify all the errors in this corpus.

LT has specialized rules to check for missing plurals, definite articles (the) and past tenses (-ed); those could be classified as grammar errors.

If the download of the ngram data fails, I suggest you try a download manager; it should be able to resume interrupted downloads.
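Resuming an interrupted download boils down to an HTTP Range request, which is what a resuming download manager sends under the hood. A minimal sketch in Python (the URL and target path are placeholders, and the server must support Range requests):

```python
# Sketch: resume a partial download by asking the server for the
# missing tail of the file via the HTTP Range header.
import os
import urllib.request

def range_header(start):
    """Value of the Range header asking for bytes from `start` onward."""
    return "bytes=%d-" % start

def resume_download(url, path, chunk_size=64 * 1024):
    """Append the missing tail of `url` to the partial file at `path`."""
    start = os.path.getsize(path) if os.path.exists(path) else 0
    req = urllib.request.Request(url)
    if start:
        req.add_header("Range", range_header(start))
    # Append mode, so already-downloaded bytes are kept.
    with urllib.request.urlopen(req) as resp, open(path, "ab") as out:
        while True:
            chunk = resp.read(chunk_size)
            if not chunk:
                break
            out.write(chunk)
```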

Yes, LT is quite good at issues related to past tenses, but I’m not sure it does as well with the missing definite article (the) issue.
However, it would be interesting to see how well LT performs on these texts.

You are right. I used a popular download manager in our country, but it doesn’t seem to work here. Which download manager do you use?

I’ve never needed a download manager so far. But I’ve removed the password protection from the ngram files, you should try again now.

Thanks, it can be downloaded with a download manager now.