Dutch can generate quite long compounds, and they become more and more unreadable as they grow. I guess the easiest way to warn about this is to check word length with a regexp, or more precisely to add the number of compound parts to the noun tag. Any suggestions?
Not super easy, but the correct solution would be to come up with a list of potential compound parts for jWordSplitter, let it split the compound, and then check the number of compound parts. On the other hand, German has the same issue, but we don’t have a rule for that. I’m not sure if it’s actually useful, as the issue in German is rather obvious. So as a first step, a very simple Java rule that checks for word length would probably be enough.
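To make the "simple rule that checks for word length" idea concrete, here is a minimal sketch in plain Java. It does not use the LanguageTool rule API; the class name, the method name, and the threshold of 30 characters are assumptions for illustration only, and a hyphenated word is counted as one word.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LongWordCheck {
    // Hypothetical threshold; the right value for Dutch would need tuning.
    static final int MAX_LENGTH = 30;

    // A word is a run of letters, optionally joined by single hyphens.
    private static final Pattern WORD =
        Pattern.compile("[\\p{L}\\p{M}]+(?:-[\\p{L}\\p{M}]+)*");

    /** Returns all words in the text that are longer than maxLength characters. */
    static List<String> findLongWords(String text, int maxLength) {
        List<String> result = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            if (m.group().length() > maxLength) {
                result.add(m.group());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "De aansprakelijkheidsverzekeringsmaatschappij belde.";
        System.out.println(findLongWords(text, MAX_LENGTH));
    }
}
```

In a real rule the same check would run per token instead of over raw text, but the length comparison itself stays this simple.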
I can think of one type of compound word that should be excluded: the ones that describe (complex) chemical compounds.
I guess warning for too long words might be enough, with some exceptions.
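A rough sketch of "warn for too long words, with some exceptions" could look like this. The allow-list entry and the list of chemical name fragments are made up for illustration and are nowhere near complete; a substring match like this is a crude heuristic, not a reliable chemical-name detector.

```java
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

public class LongWordExceptions {
    // Hypothetical allow-list of long words that should never be flagged.
    static final Set<String> ALLOWED = Set.of("hottentottententententoonstelling");

    // Hypothetical fragments that hint at a chemical name (illustrative only).
    static final Pattern CHEMICAL_HINT =
        Pattern.compile("(methyl|ethyl|propyl|butyl|fenyl|chloor|fluor|benzeen)");

    /** Returns true if the word is too long and matches none of the exceptions. */
    static boolean shouldWarn(String word, int maxLength) {
        if (word.length() <= maxLength) {
            return false;
        }
        String lower = word.toLowerCase(Locale.ROOT);
        if (ALLOWED.contains(lower)) {
            return false;
        }
        // Skip words that look like chemical compound names.
        return !CHEMICAL_HINT.matcher(lower).find();
    }

    public static void main(String[] args) {
        System.out.println(shouldWarn("aansprakelijkheidsverzekeringsmaatschappij", 30));
        System.out.println(shouldWarn("dichloordifenyltrichloorethaan", 20));
    }
}
```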
jWordSplitter would be okay as well, except for the complete flexibility of the linking s. And the hyphen (-) needs to be completely flexible too. For Dutch anyway.
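To show what "flexible" would mean here, below is a naive sketch of counting compound parts where an s and/or a hyphen may optionally sit between any two parts. This is not jWordSplitter's API or algorithm; the class, the tiny part list, and the method names are all hypothetical, and the recursion has no memoization, so it is only suitable as an illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class CompoundPartCounter {
    // Hypothetical, tiny list of compound parts; a real list would be much larger.
    static final Set<String> PARTS =
        Set.of("aansprakelijkheid", "verzekering", "maatschappij", "auto", "deur");

    /** Returns the smallest number of known parts covering the word, or -1 if no split exists. */
    static int countParts(String word) {
        return count(word.toLowerCase(Locale.ROOT), 0);
    }

    private static int count(String w, int start) {
        if (start == w.length()) {
            return 0;
        }
        int best = -1;
        for (int end = start + 1; end <= w.length(); end++) {
            if (!PARTS.contains(w.substring(start, end))) {
                continue;
            }
            for (int next : nextStarts(w, end)) {
                int rest = count(w, next);
                if (rest >= 0 && (best < 0 || rest + 1 < best)) {
                    best = rest + 1;
                }
            }
        }
        return best;
    }

    /** Positions where the next part may begin: directly, after a linking s, and/or after a hyphen. */
    private static List<Integer> nextStarts(String w, int end) {
        List<Integer> starts = new ArrayList<>();
        starts.add(end);
        // A linker only counts if at least one more character follows it.
        if (end < w.length() - 1 && w.charAt(end) == 's') {
            starts.add(end + 1);
        }
        for (int p : List.copyOf(starts)) {
            if (p < w.length() - 1 && w.charAt(p) == '-') {
                starts.add(p + 1);
            }
        }
        return starts;
    }

    public static void main(String[] args) {
        System.out.println(countParts("aansprakelijkheidsverzekeringsmaatschappij"));
        System.out.println(countParts("auto-deur"));
    }
}
```

A rule could then warn when the part count exceeds some limit, and fall back to the plain length check when the word cannot be split at all.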