Dutch optional dash

Ruud_Baars · July 18, 2017, 10:22am

In Dutch, a dash can be put on any compounding border anywhere in a word when the reader thinks it is better for readability reasons.
Of course it is possible to add all those forms to the postagging dictionary, but maybe it is quite easy to look up the word without dashes in the postagging dict when the lookup with dashes finds nothing. Is this possible, and if so, are there more languages that have an issue like this?
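A minimal sketch of that fallback, in Python rather than the tagger's actual code. The dictionary contents, the tag name `NOUN` and the function `lookup_tags` are made up for illustration:

```python
# Toy POS dictionary; entries and tag names are illustrative only.
POS_DICT = {
    "fietspad": ["NOUN"],
    "zee-eend": ["NOUN"],
}


def lookup_tags(word, pos_dict=POS_DICT):
    """Return the tags for word, retrying without hyphens on a miss."""
    tags = pos_dict.get(word)
    if tags is not None:
        return tags
    if "-" in word:
        # Second attempt: drop every hyphen and look the word up again.
        return pos_dict.get(word.replace("-", ""), [])
    return []


print(lookup_tags("fiets-pad"))  # found via the hyphen-free form
print(lookup_tags("zee-eend"))   # found directly, hyphen kept
```

The exact form is tried first, so entries that are stored with a hyphen still win over the stripped variant.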

dnaber · July 18, 2017, 1:10pm

German is in the same situation. The tagger has been adapted to try to split words if the word itself is not in the POS dictionary. If it can be split, the last part is used to determine the POS of the whole word. How does Dutch store its compounds in the POS dict, with or without dashes, and is it always done this way?
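The idea of "split, then tag by the last part" can be sketched roughly as follows. This is not the German tagger's code: the part list, the tag names and the restriction to two-part compounds are simplifying assumptions.

```python
# Hypothetical list of known compound parts with their POS tags.
PARTS = {"fiets": "NOUN", "pad": "NOUN", "snel": "ADJ", "weg": "NOUN"}
# Hypothetical POS dictionary for words that are listed as a whole.
POS_DICT = {"huis": ["NOUN"]}


def tag_compound(word):
    """Tag word from the dictionary, else from the last part of a split."""
    if word in POS_DICT:
        return POS_DICT[word]
    # Try every split point; both halves must be known parts.
    for i in range(1, len(word)):
        first, last = word[:i], word[i:]
        if first in PARTS and last in PARTS:
            # The last part decides the POS of the whole word.
            return [PARTS[last]]
    return []


print(tag_compound("snelweg"))  # ADJ + NOUN, tagged as the last part
```

A real splitter would also need to handle compounds with more than two parts.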

Ruud_Baars · July 18, 2017, 1:45pm

Currently, the words are stored the way they are found. But that could be changed. The software that German uses is not fit for Dutch, at least when I last checked. We have some special rules for hyphens. It is required to add a hyphen to compounds that have letter combinations a/e, a/a, i/j (and quite a few more) when they are on the boundary of word parts.
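As a rough illustration of the boundary rule, the sketch below checks only the three combinations named above. The real list is longer, and the names `CLASH_PAIRS`, `needs_hyphen` and `join_parts` are invented for this example.

```python
# Only the letter clashes named in the post; the full rule set is longer.
CLASH_PAIRS = {("a", "e"), ("a", "a"), ("i", "j")}


def needs_hyphen(left, right):
    """True if the letters meeting at the boundary form a listed clash."""
    if not left or not right:
        return False
    return (left[-1].lower(), right[0].lower()) in CLASH_PAIRS


def join_parts(left, right):
    """Join two compound parts, adding a hyphen only when required."""
    return f"{left}-{right}" if needs_hyphen(left, right) else left + right


print(join_parts("gala", "avond"))  # a/a clash at the boundary
print(join_parts("fiets", "pad"))   # no clash, written as one word
```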

Or have there been changes?

SkyCharger001 · July 18, 2017, 2:03pm

I’ve been thinking, perhaps we should have a simple way of differentiating between optional and mandatory hyphens in compound words.
E.g.: single hyphen “-” is optional, double hyphen “--” is mandatory.
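To make the proposal concrete, here is a small sketch that expands such a marked-up entry into its surface forms. It assumes the mandatory marker is written as two ASCII hyphens (`--`); the function name `expand` and the sample entries are hypothetical.

```python
import re
from itertools import product


def expand(entry):
    """Expand 'a--b-c' into every allowed spelling of the compound."""
    # Split on "--" or "-" and keep the separators in the result.
    tokens = re.split(r"(--|-)", entry)
    parts, seps = tokens[0::2], tokens[1::2]
    # Mandatory hyphen: always "-". Optional hyphen: "" or "-".
    choices = [("-",) if sep == "--" else ("", "-") for sep in seps]
    forms = []
    for combo in product(*choices):
        word = parts[0]
        for sep, part in zip(combo, parts[1:]):
            word += sep + part
        forms.append(word)
    return forms


print(expand("gala--avond-jurk"))  # ['gala-avondjurk', 'gala-avond-jurk']
print(expand("fiets-pad"))         # ['fietspad', 'fiets-pad']
```

Each optional boundary doubles the number of forms, so a dictionary built this way grows quickly for long compounds.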

dnaber · July 18, 2017, 2:05pm

No, but the German compound splitter just splits compound words, no matter whether they are correct. For example, some German compound words require an “s” to be added in between the words. However, they get split with or without the “s”. That might be good enough for Dutch, too. The process needs a list of words (compound parts) and this needs to be converted to binary format. So it’s a bit of work.

Ruud_Baars · July 18, 2017, 2:30pm

Using a = or ~ is also an option.

Ruud_Baars · July 18, 2017, 2:33pm

It might be helpful, but also very misleading because of words like =taal and =staal, that need different tags.
Whether there should be an s is mostly decided by the first part, sometimes by the second.
This could be solved by giving those auto split words an ‘unsure’ postag, for the disambiguator to improve …
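The taal/staal problem can be shown with a toy splitter that accepts an optional linking s. Everything here is an assumption for illustration: the part list, the tag names, the `:UNSURE` suffix, and the choice to add that suffix only when the analyses disagree.

```python
# Hypothetical compound parts with made-up tags.
PARTS = {"bedrijf": "NOUN:HET", "taal": "NOUN:DE", "staal": "NOUN:HET"}


def split_analyses(word):
    """Return (first, link, last, tag) for each two-part split that fits."""
    analyses = []
    for i in range(1, len(word)):
        first, rest = word[:i], word[i:]
        if first not in PARTS:
            continue
        if rest in PARTS:
            analyses.append((first, "", rest, PARTS[rest]))
        # Also accept a linking "s" between the two parts.
        if rest.startswith("s") and rest[1:] in PARTS:
            analyses.append((first, "s", rest[1:], PARTS[rest[1:]]))
    return analyses


def guess_tags(word):
    """Tags from all analyses, marked unsure when they disagree."""
    tags = sorted({tag for _, _, _, tag in split_analyses(word)})
    if len(tags) > 1:
        return [tag + ":UNSURE" for tag in tags]
    return tags


print(split_analyses("bedrijfstaal"))  # bedrijf+staal and bedrijf+s+taal
print(guess_tags("bedrijfstaal"))      # two conflicting tags, both unsure
```

Marking every auto-split word as unsure, as the post suggests, would be the more cautious variant; a disambiguator rule could then pick or remove tags from context.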

SkyCharger001 · July 18, 2017, 2:49pm

true, but
A. ‘=’ can be found in at least one actual ‘word’: “C=” (a common abbreviation for Commodore Business Machines that mimics their logo)
B. (more importantly) my version is in the spirit of escape characters (e.g. “\n” for newline, “\\” for just “\”)

Ruud_Baars · July 18, 2017, 4:19pm

I don’t consider that a word, but a case of bad naming.

Ruud_Baars · July 18, 2017, 5:39pm

Apparently, the connecting s is a significant factor. I don’t want to mess up the MS way.
It is quite easy to generate the versions with an optional dash from the hyphenation data I put together. And the compounding is covered quite well in Hunspell. So combining all sources can result in a large postag list with good coverage. Enhancing postag guessing in the disambiguator will do the rest for now. Enhancing the code for the uncompounder is out of my league.