Creating LanguageTool N-Gram Rules

NickHough · October 21, 2015, 4:47am

Hi,

I’m interested in expanding the N-gram rules for English.

I have downloaded the N-gram data set and set up ConfusionRuleEvaluator in my IDE. The next step is to supply its arguments.

I am fine with the first two arguments, but the third one requires a list of correct sentences. The documentation states that a good start is a combination of Tatoeba and Wikipedia sentences for English. Where can I get or generate this file of example sentences?

dnaber · October 21, 2015, 9:05am

Great that someone is working on this! At least for English, the ngram approach has a lot of potential.

Tatoeba (http://tatoeba.org) is already plain text and can easily be filtered as described here:
Update grammar.xml by Mility · Pull Request #324 · languagetool-org/languagetool · GitHub
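
Independent of the linked description, here is a rough sketch of what such a filter could look like. It assumes the Tatoeba sentence export is a tab-separated file with three columns (sentence id, language code, text) and that English is tagged as "eng"; the function name and the file names in the usage part are made up for illustration.

```python
import sys


def english_sentences(lines, lang="eng"):
    """Yield the text column of every row whose language column equals `lang`.

    Assumed row layout (tab-separated): id, language code, sentence text.
    Rows with fewer than three columns are skipped.
    """
    for line in lines:
        parts = line.rstrip("\n").split("\t", 2)
        if len(parts) == 3 and parts[1] == lang:
            text = parts[2].strip()
            if text:
                yield text


if __name__ == "__main__":
    # Hypothetical usage: python filter_tatoeba.py sentences.csv tatoeba-en.txt
    with open(sys.argv[1], encoding="utf-8") as src, \
            open(sys.argv[2], "w", encoding="utf-8") as dst:
        for sentence in english_sentences(src):
            dst.write(sentence + "\n")
```

The output is one sentence per line, which is easy to inspect by hand before using it.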

Wikipedia can be downloaded from the "Index of /enwiki/" page. As the files are huge, you probably only want to download a part of it. You can then use org.languagetool.dev.dumpcheck.WikipediaSentenceExtractor to get plain sentences from it (though the extraction often isn’t very clean due to the complexity of wikitext).
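
Once both sources exist as plain text, combining them could look roughly like the sketch below. It assumes each input has one sentence per line; whether the evaluator needs exactly this layout is not confirmed here. The minimum-length filter (to drop fragments from unclean extraction), the shuffling, and all names are illustrative choices, not part of LanguageTool.

```python
import random


def merge_sentences(*sources, min_words=3, seed=0):
    """Merge several iterables of sentences into one de-duplicated list.

    Whitespace is normalized, sentences shorter than `min_words` words are
    dropped, and the result is shuffled with a fixed seed so that taking
    the first N lines gives a mix of all sources.
    """
    seen = set()
    merged = []
    for source in sources:
        for line in source:
            sentence = " ".join(line.split())
            if len(sentence.split()) < min_words or sentence in seen:
                continue
            seen.add(sentence)
            merged.append(sentence)
    random.Random(seed).shuffle(merged)
    return merged


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    with open("tatoeba-en.txt", encoding="utf-8") as tatoeba, \
            open("wikipedia-en.txt", encoding="utf-8") as wikipedia:
        sentences = merge_sentences(tatoeba, wikipedia)
    with open("correct-sentences.txt", "w", encoding="utf-8") as out:
        out.write("\n".join(sentences) + "\n")
```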