
Spellchecking and postagging

(Ruud Baars) #1

Because Dutch is a compounding language, it is never possible to postag all words, nor to have all words in a list.
I am thinking about adding a compound analyzer (I know German has one, but I forgot the name) to perform additional spellchecking when a word is not in the fixed list, and to assign postag(s), since the last part of a compound decides the tag.
It would need a list of compounding parts (not all of them valid words on their own) for first, middle and last positions, so the list structure could simply be compound_part;positions;postags.
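To make the proposed list structure concrete, here is a minimal Python sketch of reading such a list and taking the postags from the last part. The comma-separated position names (first, middle, last), the sample entries and the tag NOUN are assumptions for illustration only; they are not an existing LanguageTool or jwordsplitter format or tagset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CompoundPart:
    text: str
    positions: frozenset  # assumed values: "first", "middle", "last"
    postags: tuple        # placeholder tags, not a real tagset


def parse_part_line(line):
    """Parse one 'compound_part;positions;postags' line."""
    text, positions, postags = line.strip().split(";")
    return CompoundPart(
        text,
        frozenset(p for p in positions.split(",") if p),
        tuple(t for t in postags.split(",") if t),
    )


# Hypothetical sample data in the proposed format.
SAMPLE = """\
fiets;first,middle,last;NOUN
onderdeel;first,middle,last;NOUN
"""

PART_LIST = {p.text: p for p in map(parse_part_line, SAMPLE.splitlines())}


def guess_postags(split_parts, part_list=PART_LIST):
    """Return the postags of the last part, if it is allowed in last position."""
    last = part_list.get(split_parts[-1])
    if last is None or "last" not in last.positions:
        return ()
    return last.postags
```

With this sample list, guess_postags(["fiets", "onderdeel"]) yields the tags stored for onderdeel, and an unknown last part yields no tags at all.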

(Daniel Naber) #2

The German compound analyzer is this one:

(Ruud Baars) #3

Would it be okay if I tried to copy it and adjust for Dutch?

(Daniel Naber) #4

For testing, yes. In the end, we wouldn’t want to have a copy but a version of jwordsplitter that also supports Dutch. Key to that is a list of words (nouns) which is used to split compounds into parts.

(Ruud Baars) #5

I understand. I am quite sure what is needed, but not whether it can be fitted into jwordsplitter. I will look into it. For now, I see a few resource files that are used to add to or delete from a set I have not found yet.
I will have to dive into the code at some point. For now I will do what I can with the disambiguator to guess tags.

(Daniel Naber) #6

The main dictionary is this:

(Ruud Baars) #7

Ah. Of course. But it is all words, no tags.

I see the data does not know about the possible linking characters. Even though these are considered ‘free to choose’ in Dutch, some compound parts always take them and some never do.
So in my PHP code I use ‘part + infix s or s-’ for non-last parts and do not apply infixes at the end (of course). That still leaves position restrictions: some parts can only occur at the end, some only at the start (rare), and many in any middle position.
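The infix and position handling described here could look roughly like the Python sketch below. It is not the PHP code mentioned above and not how jwordsplitter works; the part list is a hypothetical sample, and the only infixes tried are the empty one plus s and s-.

```python
# Infixes tried after a non-last part; "" means no linking element.
INFIXES = ("", "s", "s-")

# Hypothetical sample: part -> positions in which it may occur.
PARTS = {
    "stad": {"first", "middle", "last"},
    "huis": {"first", "middle", "last"},
    "bestuur": {"first", "middle", "last"},
}


def split_compound(word, parts=PARTS, infixes=INFIXES, _first=True):
    """Return the list of parts if word splits into 2+ known parts, else None."""
    # The remainder may be the last part; no infix is applied at the end.
    if not _first and "last" in parts.get(word, ()):
        return [word]
    position = "first" if _first else "middle"
    for end in range(1, len(word)):
        head = word[:end]
        if position not in parts.get(head, ()):
            continue
        rest = word[end:]
        for infix in infixes:
            # Something must remain after the infix to form the next part.
            if rest.startswith(infix) and len(rest) > len(infix):
                tail = split_compound(rest[len(infix):], parts, infixes, False)
                if tail is not None:
                    return [head] + tail
    return None
```

For example, split_compound("stadsbestuur") finds stad + s + bestuur, while a part that is only allowed in last position is never accepted as the first part of a split.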
There is a minimal length, which is great. For Dutch it should be at least 5 to be safe, yet there are lots of compounds with parts of only 3 or 2 letters (as = axis, das = shawl). To avoid mishaps, a boundary regexp would be a good addition. It could also take care of rejecting auto+onderdeel for spellchecking, since the hyphen is required: auto-onderdeel. The unhyphenated autoonderdeel could still be correctly postagged as a noun, though. The same goes for boundaries with a capital (ZuidDuits should be Zuid-Duits). But I would exclude camelcase words anyway, since these are proper names…
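A boundary regexp of this kind might be sketched in Python as follows. The rule set is a deliberate simplification and an assumption: it only flags identical vowels meeting at the seam (as in auto + onderdeel) and a capital starting the right-hand part (as in Zuid + Duits), so it does not cover the full Dutch hyphenation rules.

```python
import re

# '|' marks the seam between two parts. Simplified, illustrative rules only:
# the same vowel on both sides of the seam, or an uppercase letter right after it.
NEEDS_HYPHEN = re.compile(r"([aeiou])\|\1|\|[A-Z]")


def boundary_ok(left, right, hyphenated):
    """Return True if the seam between left and right is acceptably spelled."""
    seam = left[-1:] + "|" + right[:1]
    # A seam that needs a hyphen is only fine when the hyphen was written.
    return hyphenated or NEEDS_HYPHEN.search(seam) is None
```

Keeping this check separate from the splitter means a word like autoonderdeel can still be split and tagged, while the spellchecker rejects it because boundary_ok("auto", "onderdeel", False) is False.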