Retrieving inflected forms of a word

zandersmith · September 5, 2014, 8:45pm

Hi. LanguageTool is great! Thanks to everyone who works on it.

I’m wondering if it’s possible for LanguageTool to tell me all the inflected forms of a word? Alternatively, can it tell me the root of a word or if a word is an inflected form of some base word?

For example, in Spanish, can LanguageTool tell me all the conjugated forms of the verb “ir”? Or, that “voy” is an inflected form of the verb “ir”?

The same question goes for English: can I get the inflected forms of the verb “to look”? Or find out that “foxes” is an inflected form of the noun “fox”?

Alexander

dnaber · September 5, 2014, 9:17pm

LT does both internally, at least for a lot of languages, including Spanish and English. But there’s no way in the user interface to easily access that information. If you need the data, you can export it to text files as described on the LanguageTool Wiki page “Developing a tagger dictionary”.
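Once the data is exported, looking up forms by base word is a small indexing job. The following is a minimal sketch, assuming the export contains one entry per line as three tab-separated columns (inflected form, base form, POS tag). That layout, the class name InflectionIndex and the sample lines are assumptions for illustration, so check them against your actual export.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class InflectionIndex {

    /**
     * Groups exported entries by base form. Each value is a list of
     * "form/TAG" strings. Lines that do not have exactly three
     * tab-separated columns are skipped.
     */
    static Map<String, List<String>> byLemma(List<String> lines) {
        Map<String, List<String>> index = new TreeMap<>();
        for (String line : lines) {
            String[] cols = line.split("\t");
            if (cols.length != 3) {
                continue; // not "form<TAB>lemma<TAB>tag" (assumed layout)
            }
            index.computeIfAbsent(cols[1], k -> new ArrayList<>())
                 .add(cols[0] + "/" + cols[2]);
        }
        return index;
    }

    public static void main(String[] args) {
        // Hand-written sample lines, not real exporter output.
        List<String> sample = List.of(
            "foxes\tfox\tNNS",
            "foxes\tfox\tVBZ",
            "went\tgo\tVBD");
        byLemma(sample).forEach((lemma, forms) ->
            System.out.println(lemma + " -> " + forms));
    }
}
```

With an index like this, "all forms of fox" is a single map lookup, and the reverse question ("what is foxes a form of?") only needs a second map keyed on the first column.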

zandersmith · September 8, 2014, 6:38pm

Thanks for the reply. Perhaps this could be a future enhancement. In the meantime, I’ll see if I can’t use the technique that you outlined.

Cheers.

Dominique_PELLE · September 9, 2014, 8:53pm

If you use the command line with the -v (verbose) option,
LanguageTool tells you the POS tags and lemma of each token,
as well as which disambiguator rules kick in. Example:

$ echo "The foxes" | java -jar languagetool-standalone/target/LanguageTool-2.7-SNAPSHOT/LanguageTool-2.7-SNAPSHOT/languagetool-commandline.jar -c utf-8 -l en-US -v
Expected text language: English (US)
Working on STDIN...
1108 rules activated for language English (US)
<S> The[the/DT,B-NP-plural] foxes[fox/NNS,fox/VBZ,</S>,E-NP-plural]<P/> 
Disambiguator log: 

Time: 2338ms for 1 sentences (0.4 sentences/sec)

Is this what you need?
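If the verbose output is all you have, the lemma/POS pairs can be pulled out of the analysis line with a regular expression. This is only a rough sketch, assuming the token[lemma/TAG,...] layout shown above for 2.7-SNAPSHOT; the class name VerboseOutputParser is made up for the example, and the format may differ in other versions. It skips markers such as </S> and chunk tags such as B-NP-plural, and it does not handle tokens that themselves contain commas or brackets.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class VerboseOutputParser {

    // A token immediately followed by its bracketed readings,
    // e.g. foxes[fox/NNS,fox/VBZ,</S>,E-NP-plural]
    private static final Pattern TOKEN = Pattern.compile("(\\S+?)\\[([^\\]]*)\\]");

    /** Maps each token to its "lemma/TAG" readings, in order of appearance. */
    static Map<String, List<String>> parse(String analysisLine) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        Matcher m = TOKEN.matcher(analysisLine);
        while (m.find()) {
            List<String> readings = result.computeIfAbsent(m.group(1), k -> new ArrayList<>());
            for (String item : m.group(2).split(",")) {
                if (item.startsWith("<")) {
                    continue; // sentence or paragraph marker such as </S>
                }
                if (item.lastIndexOf('/') <= 0) {
                    continue; // no lemma/TAG separator, e.g. a chunk tag
                }
                readings.add(item);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String line = "<S> The[the/DT,B-NP-plural] foxes[fox/NNS,fox/VBZ,</S>,E-NP-plural]<P/>";
        parse(line).forEach((token, readings) ->
            System.out.println(token + " -> " + readings));
    }
}
```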

zandersmith · November 18, 2014, 11:40pm

Dominique, sorry for the late reply. What you wrote was helpful, but I really wanted to do it through Java code. It seems I can, with something like this…

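import java.util.List;
import org.languagetool.AnalyzedSentence;
import org.languagetool.AnalyzedToken;
import org.languagetool.AnalyzedTokenReadings;
import org.languagetool.JLanguageTool;

// "language" is a Language instance created elsewhere
JLanguageTool testTool = new JLanguageTool(language);

try
{
    AnalyzedSentence sentence = testTool.getAnalyzedSentence("The dog went running through the park.");
    AnalyzedTokenReadings[] tokens = sentence.getTokensWithoutWhitespace();

    for (AnalyzedTokenReadings token : tokens)
    {
        // a token can have several readings, one per possible analysis
        List<AnalyzedToken> aTokenList = token.getReadings();

        for (AnalyzedToken atoken : aTokenList)
        {
            System.out.println(atoken.getPOSTag() + " : " + atoken.getTokenInflected());
        }
    }
}
catch (Exception x)
{
    x.printStackTrace();
}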

This produces the following output…

SENT_START :
DT : the
NN : dog
VBD : go
JJ : running
NN:U : running
VBG : run
IN : through
JJ : through
RP : through
DT : the
NN : park
. : .
SENT_END : .

This works for me somewhat, but what I really want is to get all inflected forms for some token.

dnaber · November 19, 2014, 8:04am

Programmatically, you can get the inflected forms with a synthesizer:
https://languagetool.org/development/api/org/languagetool/synthesis/en/EnglishSynthesizer.html
Use “.*” as a regular expression for the POS tag to get all forms.

zandersmith · November 19, 2014, 8:44pm

Thanks! I can certainly use Synthesizers for some things. It would be nice if I could get the POS tag for each of the inflected forms returned by synthesize(). Is there any way to do that? Also, why do some languages (e.g. Portuguese) not have a Synthesizer available?

dnaber · November 20, 2014, 8:00am

There’s no direct way, but instead of getting all forms at once with “.*” you can query the tags one by one, so you know which tag produced each form. All known tags for English are in this file: ./languagetool-language-modules/en/target/classes/org/languagetool/resource/en/english_tags.txt (similar for other languages).
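The tag-by-tag approach could be sketched roughly as follows. To keep the example self-contained, the lookup parameter is a stand-in: in real code it would presumably wrap a call to the synthesizer's synthesize() method, and the tag list would be read from the tags file. The class name TagByTagLookup, the method tagsByForm and the toy data are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

public class TagByTagLookup {

    /**
     * Queries the lookup once per tag and records which tags produced
     * each form. "lookup" takes (lemma, tag) and returns the matching
     * forms; it is a stand-in for a real synthesizer call.
     */
    static Map<String, List<String>> tagsByForm(
            String lemma,
            List<String> allTags,
            BiFunction<String, String, String[]> lookup) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (String tag : allTags) {
            for (String form : lookup.apply(lemma, tag)) {
                result.computeIfAbsent(form, k -> new ArrayList<>()).add(tag);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Toy table standing in for real synthesizer data.
        Map<String, String[]> toy = Map.of(
            "look/VB", new String[] {"look"},
            "look/VBZ", new String[] {"looks"},
            "look/VBD", new String[] {"looked"},
            "look/VBN", new String[] {"looked"},
            "look/VBG", new String[] {"looking"});
        BiFunction<String, String, String[]> lookup =
            (lemma, tag) -> toy.getOrDefault(lemma + "/" + tag, new String[0]);

        // In practice this list would come from the tags file.
        List<String> tags = List.of("VB", "VBZ", "VBD", "VBN", "VBG", "NN");
        System.out.println(tagsByForm("look", tags, lookup));
    }
}
```

A form that matches several tags (here "looked" for both VBD and VBN) ends up with all of them, which is the information a single “.*” query does not give you.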

Some languages have no synthesizer because their maintainers (if there is a maintainer) haven’t added one yet.