Creating LanguageTool N-Gram Rules


I’m interested in expanding the N-gram rules for English.

I have downloaded the N-gram data set and have set up ConfusionRuleEvaluator in my IDE. I then need to give the arguments.

I am fine with the first 2 arguments, but the third one requires a list of correct sentences. The documentation states that a good start is a combination of Tatoeba and Wikipedia sentences for English. Where can I access/generate this file of example sentences?

Great that someone is working on this! At least for English, the ngram approach has a lot of potential.

Tatoeba is already plain text and can easily be filtered as described here:
Update grammar.xml by Mility · Pull Request #324 · languagetool-org/languagetool · GitHub
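To give an idea of what such filtering can look like, here is a minimal sketch. It is not the procedure from the linked pull request. It assumes the Tatoeba export is a tab-separated file with three columns (sentence id, language code, text) and that English is marked as `eng`; the function name `english_sentences` is made up for this example.

```python
import csv
import io


def english_sentences(lines, lang="eng"):
    """Yield the text column of rows whose language column equals `lang`.

    Assumes tab-separated rows of the form: id, language code, text.
    """
    # QUOTE_NONE keeps quotation marks inside sentences as ordinary characters.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) >= 3 and row[1] == lang:
            text = row[2].strip()
            if text:
                yield text


# Tiny in-memory sample standing in for the real export file.
sample = io.StringIO(
    "1\teng\tThis is a test.\n"
    "2\tdeu\tDas ist ein Test.\n"
    "3\teng\tShe said \"hello\" to me.\n"
)
result = list(english_sentences(sample))
print(result)
```

For the real file you would pass an open file handle (UTF-8, opened with `newline=""`) instead of the sample and write each yielded sentence on its own line.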

Wikipedia can be downloaded from Index of /enwiki/. As the files are huge, you probably only want to download part of them. You can then use an extraction tool to get plain sentences from the dump (well, extraction often isn’t very clean due to the complexity of wikitext).
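Because the extracted text tends to contain leftover markup, a rough post-filter can help before the sentences go into the evaluation file. The following is only a heuristic sketch, assuming one candidate sentence per line; the pattern, the `min_words` threshold, and the name `looks_clean` are my own choices, not part of LanguageTool.

```python
import re

# Leftover wikitext or HTML: templates, links, tags, table pipes, headings.
MARKUP = re.compile(r"\{\{|\}\}|\[\[|\]\]|<[^>]+>|\||={2,}")


def looks_clean(line, min_words=4):
    """Return True if the line looks like an ordinary plain-text sentence."""
    line = line.strip()
    if len(line.split()) < min_words:
        return False  # too short (this also rejects empty lines)
    if MARKUP.search(line):
        return False  # still contains markup residue
    # Require a capitalized start and sentence-final punctuation.
    return line[0].isupper() and line[-1] in ".!?"


sample = [
    "The city lies on the river Elbe.",
    "{{Infobox settlement | name = Dresden",
    "== History ==",
    "It was founded in [[1206]].",
    "See also",
]
kept = [s for s in sample if looks_clean(s)]
print(kept)
```

A filter like this throws away some good sentences too, which is usually acceptable here since the dump is far larger than what the evaluation needs.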