Creating LanguageTool N-Gram Rules


(NickHough) #1

Hi,

I'm interested in expanding the N-gram rules for English.

I have downloaded the N-gram data set and set up ConfusionRuleEvaluator in my IDE. The next step is to supply its arguments.

I am fine with the first two arguments, but the third one requires a list of correct sentences. The documentation states that a good start is a combination of Tatoeba and Wikipedia sentences for English. Where can I access or generate this file of example sentences?


(Daniel Naber) #2

Great that someone is working on this! At least for English, the ngram approach has a lot of potential.

Tatoeba (http://tatoeba.org) is already plain text and can easily be filtered as described here:

Wikipedia can be downloaded from https://dumps.wikimedia.org/enwiki/. As the files are huge, you probably only want to download part of it. You can then use org.languagetool.dev.dumpcheck.WikipediaSentenceExtractor to get plain sentences from it (well, extraction often isn't very clean due to the complexity of wikitext).
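
For what it's worth, here is a rough Python sketch of how the two sources might be combined into one sentence file. It is not the official filtering procedure, just one way to do it under some assumptions: that the Tatoeba export is tab-separated with the columns id, language code, and text; that English is marked as `eng`; and that the Wikipedia sentences have already been extracted to a plain-text file with one sentence per line. The function names and file paths are made up for illustration.

```python
def extract_english(tsv_lines, lang="eng"):
    """Yield the text column of rows whose language column equals `lang`.

    Assumes each row looks like: id <TAB> language code <TAB> sentence text.
    Rows with fewer than three columns are skipped.
    """
    for line in tsv_lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) >= 3 and parts[1] == lang:
            text = parts[2].strip()
            if text:
                yield text


def merge_sentences(*sources):
    """Merge several sentence iterables, dropping blanks and exact duplicates.

    The order of first appearance is preserved.
    """
    seen = set()
    for source in sources:
        for sentence in source:
            sentence = sentence.strip()
            if sentence and sentence not in seen:
                seen.add(sentence)
                yield sentence


def build_corpus(tatoeba_path, wikipedia_path, out_path):
    """Write one sentence per line to `out_path` (all paths are hypothetical)."""
    with open(tatoeba_path, encoding="utf-8") as tatoeba, \
         open(wikipedia_path, encoding="utf-8") as wikipedia, \
         open(out_path, "w", encoding="utf-8") as out:
        for sentence in merge_sentences(extract_english(tatoeba), wikipedia):
            out.write(sentence + "\n")
```

Calling something like `build_corpus("sentences.csv", "wikipedia-sentences.txt", "english-sentences.txt")` would then produce a single file that could be tried as the third argument. The rows are split on tabs by hand rather than with a CSV parser because sentence text may contain quote characters.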