LT Evaluation

Hi LT dev community,

I was wondering if LT has ever been evaluated by the academic community against resources such as the NUCLE or FCE datasets, so that we could know which of the errors caught by LT are true positives and which are false positives. I'd also like to know whether a similar effort has ever been made to evaluate the usefulness of the feedback given with each error correction.

Consider this paper on the evaluation of Grammar Error Correction methods.
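To make the question concrete, here is a minimal sketch (my own, not an official harness) of the kind of measurement I have in mind. It queries the public LanguageTool HTTP API at api.languagetool.org; the span-overlap definition of a true positive and the `gold_spans` format are just assumptions for illustration.

```python
# Minimal sketch: collect the spans LanguageTool flags in a sentence via the public
# HTTP API, then compare them against gold annotations (e.g. from FCE or NUCLE)
# to label each match as a true or false positive.
import requests

LT_ENDPOINT = "https://api.languagetool.org/v2/check"  # public API, rate-limited

def lt_flagged_spans(sentence, language="en-US"):
    """Return (offset, length, rule_id) for every issue LT reports in the sentence."""
    resp = requests.post(LT_ENDPOINT, data={"text": sentence, "language": language})
    resp.raise_for_status()
    return [(m["offset"], m["length"], m["rule"]["id"]) for m in resp.json()["matches"]]

def precision(flagged, gold_spans):
    """gold_spans: (start, end) character offsets taken from the corpus annotations.
    A flagged span that overlaps any gold span counts as a true positive."""
    tp = sum(any(f[0] < g[1] and g[0] < f[0] + f[1] for g in gold_spans) for f in flagged)
    return tp / len(flagged) if flagged else 1.0
```

Recall would follow the same pattern, counting how many gold spans are covered by at least one flagged span.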


Hi,

I’m also interested in evaluation results. Actually, is there any information on how well LanguageTool performs?

It would not be very trustworthy if anyone from our team did this, would it?
Quite some time ago I had a look at a test data set, and it was clear to me that it was not a natural set but a constructed one.
Sets like that reflect the views of their creator.
The best option would be a collection of several real sets from real individuals and organisations, covering several use cases.
There is little chance of getting that.

It would be great, though.

I don’t think there’s any conflict if LanguageTool is evaluated on external resources that were not built by the LT team. In fact, it would be useful if the LT team published LT’s performance on external benchmarks. There are several academically built datasets for Grammar Error Detection and Correction; I list them below, followed by a rough sketch of how their gold annotations could be read.

FCE Datasets (iLexIR)
JFLEG: A Fluency Corpus and Benchmark for Grammatical Error Correction (ACL Anthology)
NUCLE Data (NUS Natural Language Processing Group)
AESW 2016
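To show what I mean by reusing these resources, here is a rough sketch (mine, not anything from the LT project) that reads the M2 annotation format used by NUCLE and the CoNLL shared tasks into gold spans. Note that M2 spans are token indices, so they would still need to be mapped to character offsets before being compared with LT’s matches; the function name and span representation are just for illustration.

```python
# Rough sketch: parse an M2-format file (as used by NUCLE / CoNLL-2014) into
# (sentence_tokens, gold_spans) pairs, where each gold span is a (start, end)
# token range marked by the annotators.
def parse_m2(path):
    with open(path, encoding="utf-8") as fh:
        tokens, spans = [], []
        for line in fh:
            line = line.rstrip("\n")
            if line.startswith("S "):
                tokens, spans = line[2:].split(), []
            elif line.startswith("A "):
                start_end, err_type, *_ = line[2:].split("|||")
                start, end = map(int, start_end.split())
                if err_type != "noop":       # "noop" marks sentences with no edits
                    spans.append((start, end))
            elif not line and tokens:        # blank line ends the sentence block
                yield tokens, spans
                tokens, spans = [], []
        if tokens:                           # handle a missing trailing blank line
            yield tokens, spans
```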

I found a study which compares different tools.

If you are aware of any similar study, please let me know.

This is great! This solves most of my doubts regarding LT evaluation. Thanks for sharing the resource.

Edit: two limitations of this study are the small number of sentences and error types, and the fact that only the free versions of the tools were checked. I know that LanguageTool Premium catches more errors than the community version, but there is no comparable evaluation to decide between Grammarly Premium and LanguageTool Premium, so that remains a blind spot. I still think the best evaluation would be against the academically crafted datasets I mentioned above.