Hi,
I would like to use LanguageTool for German stemming and/or lemmatization. I found 3 solutions by searching here and via Google. The results are interesting: the first two solutions do not work correctly for all input words (and I don’t understand why), while solution 3 works but seems way too “complicated”:
Solution 1 uses “DictionaryLookup” and works on top of the underlying dictionary.
Solution 2 uses “AnalyzeText” and works with a LanguageTool instance.
Solution 3 uses a custom HunspellStemmer and works on top of the underlying hunspell data.
The source code is below. The output is:
Solution 1 (DictionaryLookup)
toller: toll toll toll toll toll toll toll toll
zumutbarer:
Solution 2 (AnalyzeText)
toller: null toll toll toll toll toll toll toll toll toll toll
zumutbarer: null null null
Solution 3 (HunspellStemmer)
toller: toll
zumutbarer: zumutbar
As you can see, only solution 3 can stem both German words “toller” and “zumutbarer”. Solutions 1 and 2 can stem/lemmatize “toller”, but for “zumutbarer” solution 1 returns nothing and solution 2 returns only null lemmas. Why is that?
Source code:
package com.mycompany.languagetooltestmaven;
import java.io.IOException;
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;
import morfologik.stemming.Dictionary;
import morfologik.stemming.DictionaryLookup;
import morfologik.stemming.WordData;
import org.languagetool.AnalyzedSentence;
import org.languagetool.AnalyzedToken;
import org.languagetool.AnalyzedTokenReadings;
import org.languagetool.JLanguageTool;
import org.languagetool.language.GermanyGerman;
public class TestMain
{
final static String[] STRINGS_TO_CHECK = new String[]
{
"toller", "zumutbarer"
};
private static void Solution1()
{
System.out.println("Solution 1 (DictionaryLookup)");
try
{
// Load the dictionary once instead of once per word.
Dictionary dictionary = Dictionary.read(JLanguageTool.getDataBroker().getFromResourceDirAsUrl("/de/german.dict"));
DictionaryLookup dictionaryLookup = new DictionaryLookup(dictionary);
for (String str : STRINGS_TO_CHECK)
{
System.out.print(str + ":");
List<WordData> lookup = dictionaryLookup.lookup(str);
for (WordData wd : lookup)
{
System.out.print(" " + wd.getStem());
}
System.out.println("");
}
}
catch (IOException ex)
{
Logger.getLogger(TestMain.class.getName()).log(Level.SEVERE, null, ex);
}
}
private static void Solution2()
{
System.out.println("Solution 2 (AnalyzeText)");
JLanguageTool lt = new JLanguageTool(new GermanyGerman());
for (String str : STRINGS_TO_CHECK)
{
try
{
System.out.print(str + ":");
List<AnalyzedSentence> analyzedSentences = lt.analyzeText(str);
for (AnalyzedSentence analyzedSentence : analyzedSentences)
{
for (AnalyzedTokenReadings analyzedTokens : analyzedSentence.getTokensWithoutWhitespace())
{
for (AnalyzedToken at : analyzedTokens.getReadings())
{
System.out.print(" " + at.getLemma());
}
}
}
System.out.println("");
}
catch (IOException ex)
{
Logger.getLogger(TestMain.class.getName()).log(Level.SEVERE, null, ex);
}
}
}
private static void Solution3()
{
System.out.println("Solution 3 (HunspellStemmer)");
HunspellDictionary hunspellDictionary = null;
try
{
hunspellDictionary = new HunspellDictionary(JLanguageTool.getDataBroker().getFromResourceDirAsStream("/de/hunspell/de_DE.aff"), JLanguageTool.getDataBroker().getFromResourceDirAsStream("/de/hunspell/de_DE.dic"));
for (String str : STRINGS_TO_CHECK)
{
HunspellStemmer hunspellStemmer = new HunspellStemmer(hunspellDictionary);
List<HunspellStemmer.Stem> stem = hunspellStemmer.stem(str);
System.out.print(str + ":");
for (HunspellStemmer.Stem st : stem)
{
System.out.print(" " + st.getStemString());
}
System.out.println("");
}
}
catch (Exception ex)
{
Logger.getLogger(TestMain.class.getName()).log(Level.SEVERE, null, ex);
}
}
public static void main(String[] args) throws IOException
{
Solution1();
Solution2();
Solution3();
}
}
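Side note on the output above: solutions 1 and 2 print the same lemma many times, and solution 2 also prints null entries (I assume these come from readings that carry no lemma). A minimal sketch of how the printed lists could be collapsed to distinct, non-null lemmas is shown here. The class name LemmaDedup and the method distinctLemmas are hypothetical helpers, not part of LanguageTool, and this only tidies the output; it does not address why “zumutbarer” is not resolved.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class LemmaDedup {

    // Hypothetical helper: drops null entries and duplicates while
    // keeping the order in which the lemmas were first seen.
    static List<String> distinctLemmas(List<String> lemmas) {
        Set<String> seen = new LinkedHashSet<>();
        for (String lemma : lemmas) {
            if (lemma != null) {
                seen.add(lemma);
            }
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        // Lemma lists shaped like the Solution 2 output (shortened).
        System.out.println("toller: "
                + distinctLemmas(Arrays.<String>asList(null, "toll", "toll", "toll")));
        System.out.println("zumutbarer: "
                + distinctLemmas(Arrays.<String>asList(null, null, null)));
    }
}
```

In Solution 2, the lemmas from at.getLemma() could be collected into a list first and passed through such a helper before printing; an empty result then makes it obvious that no lemma was found for the word.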