Info [nl] Corpus quality

Ruud_Baars · February 22, 2018, 7:37am

Over the years, I have collected a lot of text from the Internet, around 15 GB. But when I split it into sentences, keeping only those that start and end validly (dropping headings etc.) and contain no unknown words, only 500 MB remains.
Over the last 2 days, I ran LT on those to capture the detected errors for further processing, setting aside all sentences in which no error was detected. 330 MB remained.
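
The filtering described above could be sketched roughly as follows. This is only an illustration, not the actual script used: the sentence splitter, the "valid start and end" rule (uppercase first character, final `.`, `!` or `?`), the `vocabulary` set, and the `has_detected_error` callback (a stand-in for a call to LT) are all assumptions.

```python
import re

# Assumed notion of a "word": letters, optionally joined by - or '
WORD = re.compile(r"[^\W\d_]+(?:[-'][^\W\d_]+)*")


def split_sentences(text):
    """Naive splitter: work line by line, break after . ! ? plus whitespace."""
    sentences = []
    for line in text.splitlines():
        for part in re.split(r"(?<=[.!?])\s+", line.strip()):
            if part:
                sentences.append(part)
    return sentences


def is_clean(sentence, vocabulary):
    """Assumed rule: uppercase start, final punctuation, only known words."""
    if not sentence[0].isupper() or sentence[-1] not in ".!?":
        return False
    return all(w.lower() in vocabulary for w in WORD.findall(sentence))


def filter_corpus(text, vocabulary, has_detected_error=lambda s: False):
    """Keep clean sentences for which the checker reports no error.

    has_detected_error is a hypothetical hook; in practice it would wrap LT.
    """
    return [
        s for s in split_sentences(text)
        if is_clean(s, vocabulary) and not has_detected_error(s)
    ]
```

Headings are dropped here because they lack final punctuation. Sentences that pass both stages can still contain errors the checker does not know about yet, which is why a manual pass is still needed.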

But looking through those sentences, I found that they still contain a lot of errors.

I will put those sentences online for manual correction, with the LT plug-in at hand to help. I hope to collect bad/good sentence pairs that way, for later use.

dnaber · February 22, 2018, 8:08am

That sounds like a good plan.

By “LT plug-in” you mean the browser add-on, don’t you? But it would only be useful so users don’t introduce new mistakes (as the existing mistake, if any, cannot be detected by LT yet), is that correct?

Ruud_Baars · February 22, 2018, 8:15am

Right.