Preparing for ML

@danielnaber (and others) Can someone tell us what a corpus for ML should look like?
One sentence (paragraph?) per line? Unix line endings? Tokens separated by spaces?
Should the way it is tokenized be exactly like LT does it (Dutch has some specifics, like apostrophes and hyphens in words)?

So in fact, the plain basics for making such a corpus.

Converting between line endings is easy. Other than that, I’d use one sentence per line. What kind of corpus do you have? Is it sentence pairs (incorrect <-> correct)?

What I have is rather raw material. That is why I am asking: this is just preparation. If it is plain text (UTF-8), one sentence per line is a good start, and that is what I can make.

Unfortunately, most of the data contains (many ‘normal’) errors, even after extensive filtering. But then… an error should also be detected in a sentence that is less than perfect, right?

I assume token separator is space.
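For what it is worth, the format discussed so far (UTF-8 plain text, one sentence per line, Unix line endings, single spaces) could be produced with something like the sketch below. It assumes the input already has one sentence per line; it does no sentence splitting and no LT-style tokenization, so apostrophes and hyphens inside Dutch words are left untouched. The function names are made up for illustration.

```python
import re


def normalize_corpus_text(raw: str) -> list[str]:
    """Return cleaned lines: Unix line endings, single spaces, no blank lines."""
    # Convert Windows (\r\n) and old Mac (\r) line endings to Unix (\n).
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    lines = []
    for line in text.split("\n"):
        # Collapse runs of spaces and tabs into one space and trim both ends.
        cleaned = re.sub(r"[ \t]+", " ", line).strip()
        if cleaned:  # skip lines that are empty after cleaning
            lines.append(cleaned)
    return lines


def write_corpus(lines: list[str], path: str) -> None:
    """Write one sentence per line as UTF-8 with Unix line endings."""
    # newline="\n" stops Python from translating "\n" to the platform default.
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for line in lines:
            f.write(line + "\n")
```

Usage would be along the lines of `write_corpus(normalize_corpus_text(raw_text), "corpus.txt")`, where `raw_text` was read with `encoding="utf-8"`.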

I do have a few correct<->incorrect pairs, but not nearly enough for NL, I think. There are just not enough people helping out for Dutch.

For now, we would like to be able to do better and more word confusion detection to start with.

By the way, I used a Java app to split text into sentences a long, long time ago. Is it still around somewhere, and able to use the LT SRX file?

Not sure if this is of any use, but I came across this recently:

Yes, there is a lot going on. But until now, it is all about research and prototypes; nothing that I could integrate with LT.
I have been talking with a university lately to direct their attention to LanguageTool and the possibilities of making research practically useful without much effort.

Maybe one of the results of this CLIN effort could be added to LT. I will try to get some attention there as well.

Addition: I contacted the person mentioned as well, since we met once a long time ago. Who knows what may come out of it.