Preparing for ML

Ruud_Baars · September 6, 2018, 6:35am

@danielnaber (and others) Can someone tell us what a corpus for ML should look like?
One sentence (paragraph?) per line? Unix line ends? Tokens separated by space?
Should the way it is tokenized be exactly like LT does it (Dutch has some specifics, like apostrophes and hyphens in words)?

So in fact, the plain basics for making such a corpus.

dnaber · September 6, 2018, 6:50am

Converting between line endings is easy. Other than that, I’d use one sentence per line. What kind of corpus do you have? Is it sentence pairs (incorrect <> correct)?
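A minimal sketch of that preparation step, assuming the raw text already has one sentence per line and only needs Unix line endings, trimmed whitespace, and UTF-8 output. The function names `to_corpus_lines` and `write_corpus` are made up for this illustration and are not part of LanguageTool.

```python
import re

def to_corpus_lines(raw: str) -> list[str]:
    """Normalize line endings and whitespace; assumes one sentence per input line."""
    # Convert Windows (\r\n) and old Mac (\r) line endings to Unix (\n).
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    lines = []
    for line in text.split("\n"):
        # Collapse runs of spaces/tabs into one space and trim both ends.
        cleaned = re.sub(r"[ \t]+", " ", line).strip()
        if cleaned:  # skip empty lines
            lines.append(cleaned)
    return lines

def write_corpus(lines: list[str], path: str) -> None:
    # newline="\n" stops Python from translating "\n" into the platform default.
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        for line in lines:
            f.write(line + "\n")
```

Text that is not yet split into sentences would need a real sentence splitter first; this sketch does not attempt that.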

Ruud_Baars · September 6, 2018, 7:04am

What I have is rather raw material. That is why I am asking: I want to prepare it properly. If plain text (UTF-8) with one sentence per line is a good start, that is what I can make.

Unfortunately, most of the data includes (many ‘normal’) errors, even after extensive filtering. But then… an error should also be detected in a sentence that is less than perfect, right?

I assume the token separator is a space.
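To make the space-separated format concrete, here is a rough sketch of a tokenizer that keeps word-internal apostrophes and hyphens together, as Dutch needs. The pattern is an assumption for illustration only; it is not LanguageTool's tokenizer and it will differ from it in edge cases (for example, a leading apostrophe as in "'s ochtends" is split off here).

```python
import re

# Hypothetical pattern: a word is a run of letters/digits, optionally joined by
# internal apostrophes or hyphens; any other non-space character is its own token.
TOKEN_RE = re.compile(r"\w+(?:['’\-]\w+)*|[^\w\s]")

def tokenize(sentence: str) -> str:
    """Return the sentence with its tokens separated by single spaces."""
    return " ".join(TOKEN_RE.findall(sentence))

print(tokenize("Zijn hobby's zijn zee-egels, toch?"))
```

If the corpus is later consumed by LT-based tooling, the safer choice is probably to tokenize with LT itself so that both sides agree.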

I do have some correct<->incorrect pairs, but not nearly enough for NL, I think. There are just not enough people helping out for Dutch.

To start with, we would like to detect more word confusions, and detect them better.
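One common way to get training material for word confusion without hand-made pairs is to swap a confusable word in an existing sentence. The sketch below assumes space-separated tokens and a small, hypothetical confusion list (word/wordt, steil/stijl); it is not necessarily how LanguageTool builds its confusion data, and because the raw sentences may themselves contain errors, the "original" side is not guaranteed to be correct.

```python
# Hypothetical confusion pairs, for illustration only.
CONFUSION_PAIRS = [("word", "wordt"), ("steil", "stijl")]

def make_confusion_examples(sentence, pairs=CONFUSION_PAIRS):
    """Return (altered, original) pairs, swapping one confusable token at a time."""
    swap = {}
    for a, b in pairs:
        swap[a] = b
        swap[b] = a
    tokens = sentence.split(" ")
    examples = []
    for i, tok in enumerate(tokens):
        if tok in swap:
            # Replace only token i, leaving the rest of the sentence untouched.
            altered = tokens[:i] + [swap[tok]] + tokens[i + 1:]
            examples.append((" ".join(altered), sentence))
    return examples

print(make_confusion_examples("ik word moe"))
```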

By the way, I used a Java app to split text into sentences a long, long time ago. Is it still around somewhere, and able to use the LT SRX file?

curon · September 6, 2018, 10:11pm

Not sure if this is of any use, but I came across this recently:

github.com

LanguageMachines/CLIN28_ST_spelling_correction/blob/master/README.md

# CLIN 2018 Shared Task: Spelling Correction

## Introduction

This repository harbors the scripts for handling the data that is part of the CLIN28 shared task on spelling correction.

Automatic spell checking and correction has been subject of research for decades. Although state of the art spell checkers perform reasonably well for everyday-life applications, reaching high accuracy remains to be a challenging task. This shared task focuses on the detection and correction of spelling errors in Dutch Wikipedia texts. Wikipedia articles aim to be standard-Dutch texts, which may contain jargon. In particular, this task addresses the detection and correction of the types of spelling errors listed in the next section.

Note the following:
* Submitted spelling correctors will be evaluated for detection and correction of these – and only these – types of errors.
* The spelling errors do not have to be categorized into the categories that are listed below – only detected and corrected.
* In case of officially accepted spelling variation or doubt about the correct spelling, all correct variants are accepted.
* The corrections are evaluated in accordance with the Woordenlijst Nederlandse Taal (http://woordenlijst.org/) and the Leidraad (http://woordenlijst.org/leidraad).

## Errors to detect and correct

* **real-word confusions** (``confusion``), word is confused with a near neighbor (confusion with non-native spelling, homophony, grammatical errors, et cetera):
  * ik wordt → ik word
  * stijl → steil
  * hobbies → hobby’s

This file has been truncated.

Ruud_Baars · September 7, 2018, 9:18am

Yes, there is a lot going on. But until now, it is all about research and prototypes; nothing that I could integrate with LT.
I have been talking with a university lately to direct their attention to LanguageTool and the possibilities for making research practically useful without much effort.

Maybe one of the results of this CLIN effort could be added to LT. I will try to get some attention there as well.

Addition: I contacted the person mentioned as well, since we met once a long time ago. Who knows what may come out of it.