[de] Compound words consisting of a disambiguation rule and a dictionary word?


(Martin von Wittich) #1

Hi,

I've noticed that LT automatically understands compound words (I hope this is the right term) that consist of two dictionary words. For example, by default LT doesn't understand "PNG-Datei":

www0.iserv.eu ~/LanguageTool-3.5 # echo "PNG-Datei" | java -jar languagetool-commandline.jar -l de-DE
Expected text language: German (Germany)
Working on STDIN...
1.) Line 1, column 1, Rule ID: GERMAN_SPELLER_RULE
Message: Möglicher Rechtschreibfehler gefunden
Suggestion: Pol-Datei
PNG-Datei 
^^^^^^^^^ 
Time: 462ms for 1 sentences (2.2 sentences/sec)

But this is easily fixable by adding "PNG" to hunspell/spelling.txt:

www0.iserv.eu ~/LanguageTool-3.5 # tail -n1 org/languagetool/resource/de/hunspell/spelling.txt
PNG

www0.iserv.eu ~/LanguageTool-3.5 # echo "PNG-Datei" | java -jar languagetool-commandline.jar -l de-DE
Expected text language: German (Germany)
Working on STDIN...
Time: 347ms for 1 sentences (2.9 sentences/sec)

So it understands that "PNG-Datei" is "PNG" combined with "Datei", and because both parts are now valid words, "PNG-Datei" is valid too. Unfortunately, hunspell/spelling.txt doesn't allow entries that consist of multiple words, e.g. I can't add "Portable Network Graphics" to the list; therefore I use a rule in disambiguation.xml:

    <rule name="Portable Network Graphics" id="PORTABLE_NETWORK_GRAPHICS">
        <pattern>
            <token>Portable</token>
            <token>Network</token>
            <token>Graphics</token>
        </pattern>
        <disambig action="ignore_spelling"/>
    </rule>

This works on its own:

www0.iserv.eu ~/LanguageTool-3.5 # echo "Portable Network Graphics" | java -jar languagetool-commandline.jar -l de-DE
Expected text language: German (Germany)
Working on STDIN...
Time: 352ms for 1 sentences (2.8 sentences/sec)

But it no longer works when I use that in a compound word:

www0.iserv.eu ~/LanguageTool-3.5 # echo "Portable Network Graphics-Datei" | java -jar languagetool-commandline.jar -l de-DE
Expected text language: German (Germany)
Working on STDIN...
1.) Line 1, column 10, Rule ID: GERMAN_SPELLER_RULE
Message: Möglicher Rechtschreibfehler gefunden
Suggestion: Netbook; Neuwerk
Portable Network Graphics-Datei 
         ^^^^^^^                

2.) Line 1, column 18, Rule ID: GERMAN_SPELLER_RULE
Message: Möglicher Rechtschreibfehler gefunden
Suggestion: Graphits-Datei; Graphems-Datei; Graphik-Datei; Graphisch-Datei; Graphit-Datei; Graphite-Datei; Gryphius-Datei
Portable Network Graphics-Datei 
                 ^^^^^^^^^^^^^^ 
Time: 456ms for 1 sentences (2.2 sentences/sec)
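
One plausible reading of the output above is that the last token is "Graphics-Datei" rather than "Graphics", so the three-token pattern never matches and "Network" stays flagged as well. The Python sketch below is only a toy model of that reading, not LanguageTool's actual code: `KNOWN_WORDS`, `IGNORE_SEQUENCE` and all function names are made up for illustration, whitespace splitting stands in for the real tokenizer, and "Portable" is assumed to be known only because the real run did not flag it.

```python
# Toy model only: this is NOT how LanguageTool is implemented.
# KNOWN_WORDS stands in for the dictionary plus spelling.txt (assumption).
KNOWN_WORDS = {"PNG", "Datei", "Portable"}
# IGNORE_SEQUENCE stands in for the three-token disambiguation rule (assumption).
IGNORE_SEQUENCE = ["Portable", "Network", "Graphics"]


def is_known(token):
    """Accept a token if every hyphen-separated part is a known word."""
    return all(part in KNOWN_WORDS for part in token.split("-"))


def ignored_positions(tokens):
    """Return the indexes of tokens covered by an exact match of IGNORE_SEQUENCE."""
    n = len(IGNORE_SEQUENCE)
    covered = set()
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == IGNORE_SEQUENCE:
            covered.update(range(i, i + n))
    return covered


def misspelled(text):
    """Return the tokens this toy speller would flag, in order."""
    tokens = text.split()  # whitespace split as a stand-in for the tokenizer
    skip = ignored_positions(tokens)
    return [t for i, t in enumerate(tokens) if i not in skip and not is_known(t)]


if __name__ == "__main__":
    for sample in ("PNG-Datei",
                   "Portable Network Graphics",
                   "Portable Network Graphics-Datei"):
        print(sample, "->", misspelled(sample))
```

In this model the third input flags "Network" and "Graphics-Datei", the same two spans the real run reports, because the pattern only matches when the third token is exactly "Graphics".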

Is there a way to adapt my disambiguation rule to fix that?


(Daniel Naber) #2

Actually, it should even be "Portable-Network-Graphics-Datei" (see here). There's no clean solution I can think of, but please see this thread.