
[de] Mistakes with compound nouns

(Michael) #1

In Germany, many people write compound nouns as separate words, possibly under the influence of English. LanguageTool should be able to catch such mistakes, but at the moment it does not. Take this sentence, for example:

"Neben Kinder Garten und Grund Schule gibt es in der Straße auch einen Bau Markt."

LanguageTool accepts this sentence as correct, but it should read "Kindergarten", "Grundschule" and "Baumarkt". Is there any way to create a rule that detects such mistakes? Or is the only option to take every compound noun listed in the dictionary and create one rule for each of them?

(Michael) #2

Thinking about it, it seems to me that this kind of mistake can't be corrected automatically at all, because there are often cases where the two separate words can legitimately follow each other:

"Wenn deine Kinder Garten und Wald nur aus dem Bilderbuch kennen, machst du in deiner Erziehung etwas falsch."

Any opinions/suggestions?

(Jan Schreiber) #3

The easiest way to deal with those is adding them to de/compounds.txt. The problem is there are a lot of false positives if the context is not considered. E.g. "Morgen haben wir aus irgendeinem Grund Schule, obwohl Samstag ist."
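To illustrate the false-positive problem, here is a minimal Python sketch of a context-free lookup. It is not how LanguageTool processes compounds.txt: the `COMPOUNDS` set and the function `find_split_compounds` are hypothetical stand-ins, and the set does not follow the real file format.

```python
import re

# Hypothetical stand-in for a compound list (NOT the real compounds.txt format).
COMPOUNDS = {("Kinder", "Garten"), ("Grund", "Schule"), ("Bau", "Markt")}


def find_split_compounds(sentence):
    """Return (first, second, suggestion) for each adjacent word pair in COMPOUNDS.

    No context is considered, so legitimate word sequences are flagged too.
    """
    tokens = re.findall(r"\w+", sentence)
    hits = []
    for first, second in zip(tokens, tokens[1:]):
        if (first, second) in COMPOUNDS:
            # Join the parts and lowercase the second one: "Grund" + "schule".
            hits.append((first, second, first + second.lower()))
    return hits
```

Run on the sentence from post #1, this flags all three split compounds. Run on "Morgen haben wir aus irgendeinem Grund Schule, obwohl Samstag ist.", it also suggests "Grundschule", which is exactly the kind of false alarm described above.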

(Michael) #4

Thanks for the link. The problem you mention already exists for many words that are in compounds.txt. For example:

"Wenn Edison Effekt und Affekt verwechselt hat, bin ich froh, dass er Physiker und nicht Linguist war."

So the question for me is: Should I go ahead and add words to that list because I think that the mistakes do occur more often than the correct uses? Or is it the other way around, and we should delete all words from that list that can be written separately in some contexts?

(Jan Schreiber) #5

I think that is the most pragmatic solution. If you could send us a pull request, that would be great! We run an automated test with 40,000 sentences every night, so we'll probably notice if there are too many false positives. Adding some cases with antipatterns to grammar.xml is also a possibility.

(Michael) #6

Sorry, I'm new. What's a pull request and how do I send it?

(Jan Schreiber) #7

If that is too much hassle, just send the edited file to jan.schreiber ât and I will take care of the rest. Thanks in advance for the help! Really, really appreciated.

(Michael) #8

Thanks for the info. I'm just realizing that I basically suggested that I could make a list of all compound nouns of the German language that may be incorrectly written as separate words. So now I need to decide if I want to quit my job, leave my family and lock myself into a room or if I can find a way to turn this task into something a little more ... manageable. :cold_sweat:

(Jan Schreiber) #9

The problem is far from trivial. A while ago I wrote a rule that searched for
die <noun> <feminine noun>
but it gave false alarms for appositions such as "die Spezies Tulpe".
If you can make a list with a few particularly annoying examples, that would already help.
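A rough Python sketch of why such a pattern (the article "die", then a noun, then a feminine noun) misfires on appositions. This is not the actual rule, which is not shown in this thread; the toy lexicon `NOUN_GENDER` and the function `flag_die_noun_feminine_noun` are made up for illustration.

```python
import re

# Toy lexicon mapping nouns to grammatical gender (hypothetical, for illustration).
NOUN_GENDER = {
    "Grund": "m",
    "Schule": "f",
    "Spezies": "f",
    "Tulpe": "f",
}


def flag_die_noun_feminine_noun(sentence):
    """Suggest a joined compound wherever 'die' is followed by a noun and a feminine noun."""
    tokens = re.findall(r"\w+", sentence)
    suggestions = []
    for article, noun1, noun2 in zip(tokens, tokens[1:], tokens[2:]):
        if (
            article.lower() == "die"
            and noun1 in NOUN_GENDER
            and NOUN_GENDER.get(noun2) == "f"
        ):
            suggestions.append(noun1 + noun2.lower())
    return suggestions
```

For "Wir gehen in die Grund Schule." this suggests "Grundschule", as intended. For "Wir untersuchen die Spezies Tulpe." it suggests "Speziestulpe", although the apposition is correct as written.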

(Ruud Baars) #10

The same problem exists for Dutch. One option is to extract word pairs (bigrams) from a large corpus and compare the frequency of each pair to the frequency of the corresponding compound. This can help identify likely cases.
But for a perfect rule, you would have to account for the entire sentence, which is not practical. And even then, there are cases that are correct both ways.
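The frequency comparison could look roughly like the Python sketch below. It is only a sketch under stated assumptions: the function name `split_compound_candidates`, the threshold `min_ratio`, the restriction to capitalized word pairs, and the simple regex tokenization are all choices made for this illustration, not part of LanguageTool.

```python
import re
from collections import Counter


def split_compound_candidates(corpus_sentences, min_ratio=5.0):
    """List word pairs whose joined form is much more frequent than the split form.

    Returns tuples (first, second, joined, split_count, joined_count).
    """
    tokenized = [re.findall(r"\w+", s) for s in corpus_sentences]
    unigrams = Counter(tok for toks in tokenized for tok in toks)
    bigrams = Counter(pair for toks in tokenized for pair in zip(toks, toks[1:]))

    candidates = []
    for (first, second), split_count in bigrams.items():
        # German nouns are capitalized, so only consider capitalized pairs.
        if not (first[0].isupper() and second[0].isupper()):
            continue
        joined = first + second.lower()
        joined_count = unigrams[joined]  # Counter yields 0 for unseen words
        if joined_count >= min_ratio * split_count:
            candidates.append((first, second, joined, split_count, joined_count))
    return candidates
```

A pair such as "Bau Markt" that occurs once while "Baumarkt" occurs six times would be reported as a candidate. The output is a list for human review, not a rule: as noted above, some pairs are correct both ways, and frequency alone cannot decide a given sentence.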