[de] Compound words consisting of a disambiguation rule and a dictionary word?

martin.von.wittich · December 14, 2016, 9:10pm

Hi,

I’ve noticed that LT automatically understands compound words (I hope this is the right term) that consist of two dictionary words. For example, by default LT doesn’t understand “PNG-Datei”:

www0.iserv.eu ~/LanguageTool-3.5 # echo "PNG-Datei" | java -jar languagetool-commandline.jar -l de-DE
Expected text language: German (Germany)
Working on STDIN...
1.) Line 1, column 1, Rule ID: GERMAN_SPELLER_RULE
Message: Möglicher Rechtschreibfehler gefunden
Suggestion: Pol-Datei
PNG-Datei 
^^^^^^^^^ 
Time: 462ms for 1 sentences (2.2 sentences/sec)

But this is easily fixable by adding “PNG” to hunspell/spelling.txt:

www0.iserv.eu ~/LanguageTool-3.5 # tail -n1 org/languagetool/resource/de/hunspell/spelling.txt
PNG

www0.iserv.eu ~/LanguageTool-3.5 # echo "PNG-Datei" | java -jar languagetool-commandline.jar -l de-DE
Expected text language: German (Germany)
Working on STDIN...
Time: 347ms for 1 sentences (2.9 sentences/sec)

So it understands that “PNG-Datei” is “PNG” combined with “Datei”, and because both are now valid words, “PNG-Datei” is also valid. Unfortunately, hunspell/spelling.txt doesn’t allow multiple words, e.g. I can’t add “Portable Network Graphics” to the list; therefore I use a rule in disambiguation.xml:

    <rule name="Portable Network Graphics" id="PORTABLE_NETWORK_GRAPHICS">
        <pattern>
            <token>Portable</token>
            <token>Network</token>
            <token>Graphics</token>
        </pattern>
        <disambig action="ignore_spelling"/>
    </rule>

This works on its own:

www0.iserv.eu ~/LanguageTool-3.5 # echo "Portable Network Graphics" | java -jar languagetool-commandline.jar -l de-DE
Expected text language: German (Germany)
Working on STDIN...
Time: 352ms for 1 sentences (2.8 sentences/sec)

But it no longer works when I use the phrase as the first part of a compound word:

www0.iserv.eu ~/LanguageTool-3.5 # echo "Portable Network Graphics-Datei" | java -jar languagetool-commandline.jar -l de-DE
Expected text language: German (Germany)
Working on STDIN...
1.) Line 1, column 10, Rule ID: GERMAN_SPELLER_RULE
Message: Möglicher Rechtschreibfehler gefunden
Suggestion: Netbook; Neuwerk
Portable Network Graphics-Datei 
         ^^^^^^^                

2.) Line 1, column 18, Rule ID: GERMAN_SPELLER_RULE
Message: Möglicher Rechtschreibfehler gefunden
Suggestion: Graphits-Datei; Graphems-Datei; Graphik-Datei; Graphisch-Datei; Graphit-Datei; Graphite-Datei; Gryphius-Datei
Portable Network Graphics-Datei 
                 ^^^^^^^^^^^^^^ 
Time: 456ms for 1 sentences (2.2 sentences/sec)
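
The behaviour observed so far can be modelled roughly as follows. This is only a sketch of what the outputs above suggest, not LanguageTool's actual code: the whitespace tokenization, the function names, and the tiny `KNOWN_WORDS` set are all assumptions made for illustration. The idea is that a hyphenated token is accepted when every part is a known word, while the disambiguation rule only applies when all three tokens match exactly, which “Graphics-Datei” does not.

```python
# Illustrative model only; NOT LanguageTool's real implementation.
# Assumption: stand-in for the dictionary plus spelling.txt. "Portable" is
# included because the output above does not flag it.
KNOWN_WORDS = {"PNG", "Datei", "Portable"}

# Assumption: the three tokens from the disambiguation rule, matched exactly.
IGNORE_PATTERN = ["Portable", "Network", "Graphics"]


def is_accepted(token, known_words):
    """Accept a known word, or a hyphenated compound whose parts are all known."""
    if token in known_words:
        return True
    parts = token.split("-")
    return len(parts) > 1 and all(part in known_words for part in parts)


def ignored_positions(tokens, pattern):
    """Return the indices covered by an exact, consecutive match of the pattern."""
    ignored = set()
    for start in range(len(tokens) - len(pattern) + 1):
        if tokens[start:start + len(pattern)] == pattern:
            ignored.update(range(start, start + len(pattern)))
    return ignored


def misspelled_tokens(text, known_words=KNOWN_WORDS, pattern=IGNORE_PATTERN):
    """List tokens that are neither ignored by the pattern nor accepted as words."""
    tokens = text.split()  # assumption: plain whitespace tokenization
    ignored = ignored_positions(tokens, pattern)
    return [
        token
        for index, token in enumerate(tokens)
        if index not in ignored and not is_accepted(token, known_words)
    ]


if __name__ == "__main__":
    for sample in ("PNG-Datei",
                   "Portable Network Graphics",
                   "Portable Network Graphics-Datei"):
        print(sample, "->", misspelled_tokens(sample))
```

In this model the third token of the failing input is “Graphics-Datei” rather than “Graphics”, so the pattern never matches, and both “Network” and “Graphics-Datei” fall through to the spell check, which is consistent with the two errors reported above.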

Is there a way to adapt my disambiguation rule to fix that?

dnaber · December 15, 2016, 8:22am

Actually, it should even be “Portable-Network-Graphics-Datei” (see here). There’s no clean solution I can think of, but please see this thread.