Invalid analysed tokens are returned by AnalyzedSentence

jonfogh · February 20, 2017, 12:30pm

Hi

We use LanguageTool for lemmatising through DKPro.

DKPro calls LanguageTool as follows:

// Let LanguageTool analyze the tokens
List<AnalyzedTokenReadings> rawTaggedTokens = lang.getTagger().tag(tokenText);
AnalyzedSentence as = new AnalyzedSentence(rawTaggedTokens.toArray(new AnalyzedTokenReadings[rawTaggedTokens.size()]));
as = lang.getDisambiguator().disambiguate(as);

We have seen several times that “lower” is wrongly lemmatised to “lowe”, and “species” to “specie”. “Lowe” is not a valid word, and “specie” has a very different meaning from “species”.

With input:
“There was a significant increase in phagocytic Activity of WBC as indicated by the lower PI in AD rats compared to that of control and sham-operated rats in both the 15 and 21-day studies.”

The following analysed tokens are returned for “lower”:
lower[lower/VB*,lower/VBP*,lowe/JJR*,low/JJR*]

With input:
“PG activate microglia by binding to their EP receptor. Activated microglia release ROS, reactive nitrogen species and neurotoxic cytokines which cause secondary neurodegeneration resulting in the increased number of plaques, as observed in the 21-day study.”

The following analysed tokens are returned for “species”:
species[species/NN*,specie/NNS*]
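
To make the problem easier to see, here is a minimal sketch that splits the printed readings above into lemma/tag pairs and groups the lemmas by POS tag. It works only on the printed text, not on LanguageTool objects. The class ReadingParser and the method lemmasByTag are hypothetical names made up for this illustration; they are not part of LanguageTool or DKPro. The trailing “*” is treated as an opaque marker and simply stripped.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not a LanguageTool or DKPro class.
class ReadingParser {

    // Parses one printed reading such as "species[species/NN*,specie/NNS*]"
    // and returns a map from POS tag to the lemmas listed for that tag,
    // keeping the order in which tags first appear.
    static Map<String, List<String>> lemmasByTag(String printed) {
        int open = printed.indexOf('[');
        int close = printed.lastIndexOf(']');
        if (open < 0 || close < open) {
            throw new IllegalArgumentException("Unexpected format: " + printed);
        }
        Map<String, List<String>> result = new LinkedHashMap<>();
        String body = printed.substring(open + 1, close);
        if (body.isEmpty()) {
            return result;
        }
        for (String reading : body.split(",")) {
            int slash = reading.lastIndexOf('/');
            if (slash < 0) {
                continue; // not a lemma/tag pair, skip it
            }
            String lemma = reading.substring(0, slash).trim();
            String tag = reading.substring(slash + 1).trim();
            // Strip the trailing '*' marker that appears in the printed output.
            if (tag.endsWith("*")) {
                tag = tag.substring(0, tag.length() - 1);
            }
            result.computeIfAbsent(tag, k -> new ArrayList<>()).add(lemma);
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {VB=[lower], VBP=[lower], JJR=[lowe, low]}
        System.out.println(lemmasByTag("lower[lower/VB*,lower/VBP*,lowe/JJR*,low/JJR*]"));
        // prints {NN=[species], NNS=[specie]}
        System.out.println(lemmasByTag("species[species/NN*,specie/NNS*]"));
    }
}
```

Grouped this way, the JJR tag for “lower” carries two candidate lemmas, “lowe” and “low”, and the only NNS reading for “species” has the lemma “specie”. A consumer that takes the first lemma for a given tag would therefore end up with the forms we are seeing.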

Is there a logical explanation for why “specie” and “lowe” are returned as lemmas, or has this uncovered bugs in LanguageTool?

dnaber · February 20, 2017, 2:03pm

Thanks for the report, these seem to be real bugs. I’ve removed those forms. This change will be in the next snapshot.

jonfogh · February 20, 2017, 10:13pm

Thanks for the quick reply!