Over the years, I have collected a lot of text from the Internet, around 15 GB. But when I split it into sentences, keeping only those that start and end validly (dropping headings etc.) and contain no unknown words, only 500 MB remained.
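The filtering step could be sketched roughly like this. This is a minimal illustration, not the actual pipeline: the split regex, the tiny `KNOWN_WORDS` set, and the `looks_valid` heuristic (starts with a capital letter, ends with sentence punctuation, every word known) are all assumptions for demonstration.

```python
import re

# Illustrative known-word list; in practice this would be a large dictionary file.
KNOWN_WORDS = {"this", "is", "a", "valid", "sentence"}

def looks_valid(sentence: str) -> bool:
    """Heuristic filter: starts with a capital letter, ends with ., ! or ?,
    and contains no word outside the known-word list."""
    s = sentence.strip()
    if not s or not s[0].isupper():
        return False
    if s[-1] not in ".!?":
        return False
    words = re.findall(r"[A-Za-z']+", s)
    return all(w.lower() in KNOWN_WORDS for w in words)

text = "This is a valid sentence. xyzzy fragment without end"
# Naive sentence split on whitespace following ., ! or ?
sentences = re.split(r"(?<=[.!?])\s+", text)
kept = [s for s in sentences if looks_valid(s)]
print(kept)  # -> ['This is a valid sentence.']
```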
Over the last two days, I ran LT on these sentences to capture the detected errors for further processing, setting aside all sentences in which no error was detected. 330 MB remained.
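The triage step amounts to partitioning sentences by whether the checker reports any matches. A minimal sketch, with a stand-in `detect_errors` function in place of the real LT check (in practice something like `language_tool_python`'s `tool.check(sentence)` would return the match list); the hard-coded error patterns are purely illustrative:

```python
def detect_errors(sentence: str) -> list:
    """Stand-in for a real LanguageTool check; returns a list of matches.
    Here it only flags two hard-coded error patterns for illustration."""
    errors = []
    for bad in ("could of", "alot"):
        if bad in sentence.lower():
            errors.append(bad)
    return errors

sentences = [
    "He could of done better.",
    "This sentence is fine.",
]

# Keep flagged sentences for further processing; set clean ones apart.
flagged, clean = [], []
for s in sentences:
    (flagged if detect_errors(s) else clean).append(s)

print(flagged)  # -> ['He could of done better.']
print(clean)    # -> ['This sentence is fine.']
```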
But looking through those sentences, a lot of errors still remained in them.
I will put those sentences online for manual correction, with the LT plug-in at hand for help. I hope to be able to collect bad/good sentence pairs that way, for later use.