What I have is rather raw material, which is why I am asking: I want to prepare it properly. If plain text (UTF-8) with one sentence per line is a good start, that is something I can produce.
Unfortunately, most of the data still contains (many ‘normal’) errors, even after extensive filtering. But then… an error should also be detected in a sentence that is less than perfect, right?
I assume the token separator is a space.
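To make that format concrete, here is a minimal Python sketch of the preparation step as I understand it. It assumes the input is already split into sentences and only normalizes whitespace, so that every token is separated by exactly one space. The function name `prepare_lines` is made up for illustration and is not part of LT.

```python
def prepare_lines(raw_lines):
    """Return one space-tokenized sentence per entry, skipping empty lines."""
    prepared = []
    for line in raw_lines:
        # str.split() without arguments splits on any run of whitespace
        # and drops leading/trailing whitespace, including the newline.
        tokens = line.split()
        if tokens:
            prepared.append(" ".join(tokens))
    return prepared


if __name__ == "__main__":
    sample = ["Dit  is een zin.\n", "\n", "  Nog een zin. \n"]
    for sentence in prepare_lines(sample):
        print(sentence)
```

When reading from and writing to real files, opening them with `encoding="utf-8"` keeps the output in the plain UTF-8 form described above.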
I do have some correct<->incorrect pairs, but not nearly enough for NL, I think. There are just not enough people helping out for Dutch.
For now, we would like to start by detecting word confusions better, and more of them.
By the way, a long, long time ago I used a Java app to split text into sentences. Is it still around somewhere, and is it fit to use with the LT SRX file?