Creating LanguageTool N-Gram Rules

(NickHough) #1


I’m interested in expanding the N-gram rules for English.

I have downloaded the N-gram data set and set up ConfusionRuleEvaluator in my IDE. I now need to supply its arguments.

I am fine with the first two arguments, but the third one requires a list of correct sentences. The documentation states that a good start is a combination of Tatoeba and Wikipedia sentences for English. Where can I find or generate this file of example sentences?

(Daniel Naber) #2

Great that someone is working on this! At least for English, the ngram approach has a lot of potential.

Tatoeba is already plain text and can easily be filtered as described here:

Wikipedia can be downloaded from - as the files are huge, you probably only want to download a part of it. You can then use to get plain sentences from it (well, extraction often isn’t very clean due to the complexity of wikitext).
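
For the Tatoeba part, a minimal sketch of the filtering step might look like the following. It is not taken from the LanguageTool documentation. It assumes the Tatoeba export is a tab-separated file with three columns (sentence id, language code, sentence text), that English is marked as `eng`, and that the evaluator wants one plain sentence per line. The names `filter_tatoeba`, `write_sentences`, `sentences.csv` and `english-sentences.txt` are hypothetical.

```python
def filter_tatoeba(lines, lang="eng"):
    """Yield the sentence text of every row whose language column equals `lang`.

    Assumption: each row is tab-separated as id, language code, sentence text.
    Rows that do not have three columns or have empty text are skipped.
    """
    for line in lines:
        parts = line.rstrip("\n").split("\t", 2)
        if len(parts) == 3 and parts[1] == lang and parts[2].strip():
            yield parts[2].strip()


def write_sentences(in_path, out_path, lang="eng"):
    """Write the filtered sentences to `out_path`, one sentence per line."""
    with open(in_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for sentence in filter_tatoeba(src, lang):
            dst.write(sentence + "\n")


if __name__ == "__main__":
    # Hypothetical file names; adjust to wherever the export was saved.
    write_sentences("sentences.csv", "english-sentences.txt")
```

Sentences extracted from Wikipedia could be appended to the same output file, so that the evaluator gets a single combined list.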