Available Real-word Error Corpus

Mility · June 22, 2015, 1:22am

Hi Dnaber,
You said that Jenny Pedler’s Real-word Error Corpus is available at http://www.dcs.bbk.ac.uk/~jenny/resources.html, but I cannot get it (see below).

Is there another real-word error corpus available?

Thanks
Regards
Mility

dnaber · June 22, 2015, 6:40am

Seems it has been removed… I’ve sent you the corpus via email.

Mility · June 22, 2015, 8:13am

Thanks, I have received it.

PeterLawrence · June 23, 2015, 12:25pm

Hi Mility, just wondering what your plans are with the Real-word Error Corpus.
I’ve found her thesis here
http://www.dcs.bbk.ac.uk/research/recentphds/pedler.pdf
However, could someone please email me the CurpusFile.zip too?
Thanks

Mility · June 24, 2015, 3:01am

Ok, please leave your email here. I’ll send you the corpus via email.

PeterLawrence · June 24, 2015, 9:12am

Thanks, Daniel sent me the Corpus this morning.
These text files are quite interesting to look through, and I expect, Mility, that you’re going to use them for testing.

dnaber · June 24, 2015, 9:43am

I actually found two problems with the Pedler corpus:

1. It’s too small, with only 700 sentences. That means that for each word pair like “their/there” you have only a few examples.
2. As it contains only errors, it’s dangerous to use it for optimizing rules. You might get rules that find errors but that also cause a lot of false alarms.

So I don’t use the Pedler corpus but the ConfusionRuleEvaluator class. It can read sentences from Wikipedia and Tatoeba and assumes they are correct (that works quite well, though not 100% of course). It tests if these correct sentences cause an error (thus a false alarm). Then, it replaces a word with its homophone (e.g. there → their), generating a (probably) wrong sentence and it checks if the error can be found. As output you get figures like this:

Precision: 0.998 (3 false positives)
Recall: 0.970 (60 false negatives)
F-measure: 0.993 (beta=0.5)
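
For reference, beta=0.5 means the F-measure weights precision more heavily than recall, so false alarms cost more than missed errors. The following is a minimal sketch using the standard F-beta formula; the function name `f_measure` is an assumption for illustration and this is not taken from ConfusionRuleEvaluator itself.

```python
def f_measure(precision: float, recall: float, beta: float = 0.5) -> float:
    """Standard F-beta score; beta < 1 weights precision more than recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Rounded precision and recall from the output above
score = f_measure(0.998, 0.970, beta=0.5)
print(f"{score:.3f}")
```

With the rounded inputs this gives about 0.992; the small difference from the reported 0.993 presumably comes from the tool working with unrounded values.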

This way you can make sure the confusion rule works well for a homophone pair, finding errors without causing a lot of false alarms. ConfusionRuleEvaluator is meant to be used for the confusion rule that works with ngram data, not for standard pattern rules.
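
The procedure described above can be sketched roughly as follows. This is only an illustration in Python, not the actual Java implementation: `detects_error` is a hypothetical callable standing in for the rule under test, and the exact way ConfusionRuleEvaluator counts its figures may differ.

```python
import re


def swap_word(sentence: str, word: str, replacement: str) -> str:
    """Replace the first whole-word occurrence of `word` (case-sensitive)."""
    return re.sub(rf"\b{re.escape(word)}\b", replacement, sentence, count=1)


def evaluate(sentences, word, homophone, detects_error):
    """Count outcomes for one homophone pair on assumed-correct sentences."""
    tp = fp = fn = 0
    for sentence in sentences:
        if not re.search(rf"\b{re.escape(word)}\b", sentence):
            continue  # sentence does not contain the word under test
        # The original sentence is assumed correct: any alarm is a false positive.
        if detects_error(sentence):
            fp += 1
        # The swapped sentence is (probably) wrong: a miss is a false negative.
        if detects_error(swap_word(sentence, word, homophone)):
            tp += 1
        else:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, fp, fn


# Toy detector (an assumption for the demo): flags "their is" / "their are".
def toy_detector(sentence: str) -> bool:
    return bool(re.search(r"\btheir (is|are)\b", sentence))


corpus = ["I think there is a problem.", "Put it over there.", "No match here."]
result = evaluate(corpus, "there", "their", toy_detector)
print(result)
```

The toy detector catches the swapped first sentence but misses the second, which is exactly the kind of gap the recall figure makes visible.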

Source code: languagetool/ConfusionRuleEvaluator.java at master · languagetool-org/languagetool · GitHub

Mility · June 24, 2015, 10:08am

Yes, the Pedler corpus has only 834 real-word errors. I just use it to test the realWordErrorRule (a rule I wrote myself). At first I wanted to use the Google ngram data, but I don’t know what’s wrong with the large files from Index of /download/ngram-data/: I tried many times to download them and it always failed. Also, some confusion sets in the Pedler corpus are not one-to-one (e.g. there → their, they, they’re). How should we deal with this?
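
One possible way to handle confusion sets that are not one-to-one, offered only as a sketch and not something proposed in this thread, is to expand each set into ordered pairs so that every (written word, intended word) combination can be tested like a simple pair. The function name `expand_confusion_set` is hypothetical.

```python
from itertools import permutations


def expand_confusion_set(words):
    """Turn a confusion set into ordered (written, intended) pairs."""
    return list(permutations(words, 2))


pairs = expand_confusion_set(["there", "their", "they", "they're"])
for written, intended in pairs:
    print(f"{written} -> {intended}")
```

A set of four words yields twelve ordered pairs, so the number of examples needed per pair grows quickly with the size of the set.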

PeterLawrence · June 24, 2015, 10:09am

Yes I agree these files are not great for software testing. However, they do provide some examples of typos in relation to dyslexia. I’ve not checked the documents in detail, but I’ve also noticed problems with missing plurals, definite article (the) and past tenses (-ed).
I was kind of wondering if LanguageTool would identify all the errors in this corpus.

Mility · June 24, 2015, 10:25am

LT has specialized rules to check for missing plurals, definite articles (the) and past tenses (-ed); those could be classified as grammar errors.

dnaber · June 24, 2015, 11:04am

If the download of the ngram data fails, I suggest you try a download manager; it should be able to resume interrupted downloads.
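
Resuming usually relies on the HTTP `Range` header. A minimal sketch of the idea, assuming the server supports range requests; the function names are hypothetical and the URL in the usage comment is a placeholder, not the real ngram location.

```python
import os
import urllib.request


def build_resume_request(url: str, local_path: str) -> urllib.request.Request:
    """Build a GET request that asks only for the bytes not yet on disk."""
    request = urllib.request.Request(url)
    already = os.path.getsize(local_path) if os.path.exists(local_path) else 0
    if already > 0:
        request.add_header("Range", f"bytes={already}-")
    return request


def resume_download(url: str, local_path: str, chunk_size: int = 1 << 20) -> None:
    """Continue a partial download, or start over if the server ignores Range."""
    request = build_resume_request(url, local_path)
    # Note: if the file is already complete, the server may answer 416,
    # which urlopen raises as urllib.error.HTTPError.
    with urllib.request.urlopen(request) as response:
        # 206 = server honoured the Range header; 200 = it sent the whole file.
        mode = "ab" if response.status == 206 else "wb"
        with open(local_path, mode) as out:
            while chunk := response.read(chunk_size):
                out.write(chunk)


# Usage (placeholder URL and file name, not the real ones):
# resume_download("https://example.com/ngram-data.zip", "ngram-data.zip")
```

Calling `resume_download` again after an interruption picks up where the partial file ends, which is essentially what a download manager does.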

PeterLawrence · June 24, 2015, 12:03pm

Yes, LT is quite good at issues related to past tenses, but I’m not sure it does that well with the missing definite article (the) issue.
See https://learnenglish.britishcouncil.org/en/english-grammar/determiners-and-quantifiers/definite-article.
However, it would be interesting to see how well LT performs on these texts.

Mility · June 25, 2015, 12:07pm

You are right. I used a download manager that is popular in our country, but it doesn’t seem to support this download. Which download manager do you use?

dnaber · June 25, 2015, 2:49pm

I’ve never needed a download manager so far. But I’ve removed the password protection from the ngram files, you should try again now.

Mility · June 26, 2015, 11:46am

Thanks, I can now download it with a download manager.