Spellchecking and postagging

Ruud_Baars · February 28, 2018, 7:36am

Because Dutch is a compounding language, it is never possible to postag all words, nor to have all words in a list.
I am thinking about having a compound analyzer (I know German has one, forgot the name) to perform additional spellchecking if a word is not in the fixed list, as well as assign postag(s), since the last part of a compound is decisive in this area.
It would need to have compounding parts (not all valid words) for first, middle and last positions, so
compound_part;positions;postags could simply be the list structure.
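
A minimal sketch of how such a list could be read and used, not a worked-out design. The position codes (F = first, M = middle, L = last), the tag names and the sample entries are assumptions made up for illustration; only the compound_part;positions;postags layout comes from the post above.

```python
from dataclasses import dataclass

# Hypothetical sample data in the compound_part;positions;postags layout.
# Position codes (F/M/L) and tag names are assumptions, not an existing format.
SAMPLE = """\
fiets;FML;NOUN
band;FML;NOUN
zuid;F;
"""

@dataclass(frozen=True)
class Part:
    text: str
    positions: frozenset  # subset of {"F", "M", "L"}
    postags: tuple        # empty if the part never ends a compound


def parse_parts(data):
    """Parse 'compound_part;positions;postags' lines into a dict keyed by part."""
    parts = {}
    for line in data.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        text, positions, postags = line.split(";")
        tags = tuple(t for t in postags.split(",") if t)
        parts[text] = Part(text, frozenset(positions), tags)
    return parts


def postags_for(compound_parts, parts):
    """The last part decides the postag(s) of the whole compound."""
    last = parts.get(compound_parts[-1])
    if last is None or "L" not in last.positions:
        return ()
    return last.postags


PARTS = parse_parts(SAMPLE)
print(postags_for(["fiets", "band"], PARTS))  # ('NOUN',)
```

A part that is not allowed in last position (zuid in the sample) yields no tags, so the compound would be left untagged rather than guessed.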

dnaber · February 28, 2018, 8:58am

The German compound analyzer is this one: danielnaber/jwordsplitter on GitHub, a small Java library for splitting German compound words.

Ruud_Baars · February 28, 2018, 10:49am

Would it be okay if I tried to copy it and adjust for Dutch?

dnaber · February 28, 2018, 10:56am

For testing, yes. In the end, we wouldn’t want to have a copy but a version of jwordsplitter that also supports Dutch. Key to that is a list of words (nouns) which is used to split compounds into parts.

Ruud_Baars · February 28, 2018, 11:26am

I understand. I am quite sure what is needed, but not whether it can be fitted into jwordsplitter. I will look into it. For now, I see a few resource files that are used to add to or remove from a set I have not found yet.
I will have to dive into the code in time to come. For now I will do what I can with the disambiguator to guess tags.

dnaber · February 28, 2018, 11:45am

The main dictionary is this: jwordsplitter/src/main/resources/de/danielnaber/jwordsplitter/languagetool-dict.txt (master branch of danielnaber/jwordsplitter on GitHub)

Ruud_Baars · February 28, 2018, 11:54am

Ah. Of course. But it is all words, no tags.

I see the data does not know about the possible linking characters (infixes). Even though these are considered ‘free to choose’ in Dutch, there are some compound parts that always have them and some that never do.
So in my PHP code I use ‘part + infix s or s-’ for non-last parts, and do not apply infixes at the end (of course). It also means some parts can only occur at the end, some only at the start (rare), and many in any position.
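
The infix and position handling described above could be sketched roughly like this. It is not the PHP code mentioned, and it is not how jwordsplitter works; the lexicon layout, the function name and the position codes (F = first, M = middle, L = last) are hypothetical, and the entries are made up to exercise the code rather than checked against real Dutch usage.

```python
# Hypothetical lexicon: part -> (allowed positions, may take infix "s"/"s-").
# All entries are illustrative assumptions.
LEXICON = {
    "stad": ("FML", True),
    "bestuur": ("FML", True),
    "fiets": ("FML", False),
    "band": ("FML", False),
    "lid": ("L", False),  # assumed last-only, just for the demo
}


def split_compound(word, lexicon=LEXICON, first=True):
    """Return the parts of a compound, or None if no split of two or more parts fits."""
    # The whole remainder may be the last part; no infix is applied there.
    entry = lexicon.get(word)
    if entry and not first and "L" in entry[0]:
        return [word]

    position = "F" if first else "M"
    for end in range(1, len(word)):
        head = word[:end]
        entry = lexicon.get(head)
        if not entry or position not in entry[0]:
            continue
        rest_options = [word[end:]]
        if entry[1]:
            # Non-last parts may be followed by the infix "s-" or "s".
            for infix in ("s-", "s"):
                if word[end:].startswith(infix):
                    rest_options.append(word[end + len(infix):])
        for rest in rest_options:
            if not rest:
                continue
            tail = split_compound(rest, lexicon, first=False)
            if tail:
                return [head] + tail
    return None


print(split_compound("stadsbestuur"))      # ['stad', 'bestuur']
print(split_compound("stadsbestuurslid"))  # ['stad', 'bestuur', 'lid']
print(split_compound("fietssband"))        # None: fiets takes no infix here
```

Returning the first split that fits keeps the sketch short; a real analyzer would probably need to rank alternative splits.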
There is a minimum length, which is great. To be safe for Dutch it should be at least 5, but there are lots of compounds with parts of 3 or even 2 letters (as = axis, das = shawl). To avoid mishaps, a boundary regexp would be a good addition. It could also take care of rejecting auto+onderdeel written as one word in spellchecking, since the hyphen is required: auto-onderdeel. autoonderdeel could still be correctly postagged as a noun, though. The same goes for boundaries with a capital (ZuidDuits should be Zuid-Duits). But I would exclude camelcase words anyway, since these are proper names…
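
A rough sketch of such a boundary check for a two-part split, under stated assumptions: only identical-vowel seams (aa, ee, oo, uu) are treated as requiring a hyphen, which is narrower than the real spelling rule, and the function names and the (analyse, spelling_ok) return shape are invented for illustration.

```python
import re

# Assumption: only identical-vowel seams are checked; the real rule covers more pairs.
SAME_VOWEL_SEAM = re.compile(r"([aeou])\1")
# A lowercase letter directly followed by an uppercase one, as in ZuidDuits.
CAMEL_CASE = re.compile(r"[a-z][A-Z]")


def seam_needs_hyphen(left, right):
    """True if joining left and right without a hyphen gives a doubled vowel."""
    seam = (left[-1] + right[0]).lower()
    return SAME_VOWEL_SEAM.fullmatch(seam) is not None


def check_compound(word, left, right, min_part_length=5):
    """Return (analyse, spelling_ok) for a proposed split of word into left + right."""
    if CAMEL_CASE.search(word):
        return (False, False)  # excluded: treated as a proper name
    if min(len(left), len(right)) < min_part_length:
        return (False, False)  # parts too short to split safely
    hyphenated = word == f"{left}-{right}"
    spelling_ok = hyphenated or not seam_needs_hyphen(left, right)
    # analyse=True with spelling_ok=False: taggable, but flagged by the spellchecker.
    return (True, spelling_ok)


# The examples use 4-letter parts, so the minimum is lowered for the demo.
print(check_compound("auto-onderdeel", "auto", "onderdeel", min_part_length=4))  # (True, True)
print(check_compound("autoonderdeel", "auto", "onderdeel", min_part_length=4))   # (True, False)
print(check_compound("ZuidDuits", "Zuid", "Duits", min_part_length=4))           # (False, False)
```

Keeping the two results separate matches the point above: autoonderdeel fails spellchecking but can still be passed on for tagging by its last part.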