
Stemming and/or lemmatization


(Stefam Blomen) #1

Hi,
I would like to use LanguageTool for German stemming and/or lemmatization. I found three solutions by searching here and on Google. The results are interesting: the first two solutions do not work correctly for all input words (and I don’t understand why), while solution 3 works but seems way too “complicated”:

Solution 1 uses “DictionaryLookup” and works on top of the underlying dictionary.
Solution 2 uses “AnalyzeText” and works with a LanguageTool instance.
Solution 3 uses a custom HunspellStemmer and works on top of the underlying hunspell data.

The source code is below. The output is:

Solution 1 (DictionaryLookup)
toller: toll toll toll toll toll toll toll toll
zumutbarer:
Solution 2 (AnalyzeText)
toller: null toll toll toll toll toll toll toll toll toll toll
zumutbarer: null null null
Solution 3 (HunspellStemmer)
toller: toll
zumutbarer: zumutbar

As you can see, only solution 3 can stem both German words, “toller” and “zumutbarer”, whereas the other solutions can stem/lemmatize “toller” but not “zumutbarer”. Why not?

Source code:

package com.mycompany.languagetooltestmaven;

import java.io.IOException;
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;
import morfologik.stemming.Dictionary;
import morfologik.stemming.DictionaryLookup;
import morfologik.stemming.WordData;
import org.languagetool.AnalyzedSentence;
import org.languagetool.AnalyzedToken;
import org.languagetool.AnalyzedTokenReadings;
import org.languagetool.JLanguageTool;
import org.languagetool.language.GermanyGerman;

public class TestMain
{

    final static String[] STRINGS_TO_CHECK = new String[]
    {
        "toller", "zumutbarer"
    };

    private static void Solution1()
    {
        System.out.println("Solution 1 (DictionaryLookup)");
        try
        {
            // Load the binary dictionary once instead of once per word.
            Dictionary dictionary = Dictionary.read(JLanguageTool.getDataBroker().getFromResourceDirAsUrl("/de/german.dict"));
            DictionaryLookup dictionaryLookup = new DictionaryLookup(dictionary);
            for (String str : STRINGS_TO_CHECK)
            {
                System.out.print(str + ":");
                // One entry per dictionary reading; nothing is printed for a word the dictionary does not contain.
                List<WordData> lookup = dictionaryLookup.lookup(str);
                for (WordData wd : lookup)
                {
                    System.out.print(" " + wd.getStem());
                }
                System.out.println("");
            }
        }
        catch (IOException ex)
        {
            Logger.getLogger(TestMain.class.getName()).log(Level.SEVERE, null, ex);
        }
    }

    private static void Solution2()
    {
        System.out.println("Solution 2 (AnalyzeText)");
        JLanguageTool lt = new JLanguageTool(new GermanyGerman());
        for (String str : STRINGS_TO_CHECK)
        {
            try
            {
                System.out.print(str + ":");
                List<AnalyzedSentence> analyzedSentences = lt.analyzeText(str);
                for (AnalyzedSentence analyzedSentence : analyzedSentences)
                {
                    for (AnalyzedTokenReadings analyzedTokens : analyzedSentence.getTokensWithoutWhitespace())
                    {
                        for (AnalyzedToken at : analyzedTokens.getReadings())
                        {
                            System.out.print(" " + at.getLemma());
                        }
                    }
                }
                System.out.println("");
            }
            catch (IOException ex)
            {
                Logger.getLogger(TestMain.class.getName()).log(Level.SEVERE, null, ex);
            }
        }
    }

    private static void Solution3()
    {
        System.out.println("Solution 3 (HunspellStemmer)");
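        // HunspellDictionary and HunspellStemmer are the custom classes mentioned above;
        // they are not covered by the imports at the top of this file.
        HunspellDictionary hunspellDictionary = null;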
        try
        {
            hunspellDictionary = new HunspellDictionary(JLanguageTool.getDataBroker().getFromResourceDirAsStream("/de/hunspell/de_DE.aff"), JLanguageTool.getDataBroker().getFromResourceDirAsStream("/de/hunspell/de_DE.dic"));
            for (String str : STRINGS_TO_CHECK)
            {
                HunspellStemmer hunspellStemmer = new HunspellStemmer(hunspellDictionary);
                List<HunspellStemmer.Stem> stem = hunspellStemmer.stem(str);
                System.out.print(str + ":");
                for (HunspellStemmer.Stem st : stem)
                {
                    System.out.print(" " + st.getStemString());
                }
                System.out.println("");
            }
        }
        catch (Exception ex)
        {
            Logger.getLogger(TestMain.class.getName()).log(Level.SEVERE, null, ex);
        }
    }

    public static void main(String[] args) throws IOException
    {
        Solution1();
        Solution2();
        Solution3();
    }
}

(Daniel Naber) #2

Solutions 1 and 2 are purely dictionary-based: they won’t do anything with words that are not in the dictionary. The only exception I can think of is German noun compounds, which are often analyzed even when they are not in the dictionary.
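
To illustrate the difference, here is a minimal, self-contained Java sketch. It does not use LanguageTool, morfologik, or Hunspell at all: the class name `DictionaryVsAffixDemo`, the tiny word lists, and the suffix list are all made up for this example. It assumes a full-form dictionary that happens to list “toller” but not “zumutbarer”, and contrasts it with an affix-style approach (base words plus strippable suffixes), which is roughly how Hunspell data is organized.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Toy illustration only: the word lists and the suffix rule are invented for
// this example and are NOT LanguageTool's or Hunspell's real data or API.
class DictionaryVsAffixDemo {

    // Full-form dictionary: every inflected form must be listed explicitly.
    // "zumutbarer" is deliberately absent, mirroring the output in post #1.
    static final Map<String, String> FULL_FORMS = Map.of(
            "toller", "toll",
            "tolle", "toll",
            "tolles", "toll");

    // Affix-style data: base words plus suffixes that may be stripped.
    static final List<String> BASE_WORDS = List.of("toll", "zumutbar");
    static final List<String> SUFFIXES = List.of("er", "es", "en", "em", "e");

    // Dictionary-based lookup: a form that is not listed yields no lemma.
    static List<String> dictionaryLemmas(String word) {
        String lemma = FULL_FORMS.get(word);
        return lemma == null ? List.of() : List.of(lemma);
    }

    // Rule-based stemming: strip a known suffix and keep the remainder
    // only if it is a known base word.
    static List<String> affixStems(String word) {
        List<String> stems = new ArrayList<>();
        for (String suffix : SUFFIXES) {
            if (word.endsWith(suffix)) {
                String candidate = word.substring(0, word.length() - suffix.length());
                if (BASE_WORDS.contains(candidate) && !stems.contains(candidate)) {
                    stems.add(candidate);
                }
            }
        }
        return stems;
    }

    public static void main(String[] args) {
        for (String word : new String[] {"toller", "zumutbarer"}) {
            System.out.println(word + ": dictionary=" + dictionaryLemmas(word)
                    + " affix=" + affixStems(word));
        }
    }
}
```

With these toy lists, the dictionary lookup returns nothing for “zumutbarer”, while the suffix rule reduces it to “zumutbar”. That matches the pattern in the output from post #1, although the real dictionaries and affix rules are of course far larger.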