Getting the base form of words from corrected input

Tomek · May 23, 2015, 11:21am

Hello,
I have the following problem: I would like to get the base forms of the words returned by LanguageTool's correction. I’m using morfologik for this, but some words are not present there. Is it possible to get the base forms directly from LanguageTool during the correction process?

dnaber · May 23, 2015, 12:30pm

I’m not sure I understand your question, could you maybe give an example? If a word isn’t in the internal dictionary, LT won’t know anything about it…

Tomek · May 23, 2015, 1:27pm

I have a sentence -> “Ala ma kóta”. I need to correct it, because it may contain mistakes, so I’m using LT for that.
The next step is to obtain the base forms (lemmas) for text classification -> here I’m using the external morfologik library. In this step morfologik does not recognize some basic words, e.g. “nie”, but LT has no problem processing them (correcting them or just skipping them).
At the moment I’m just curious whether LT extracts base forms for its own correction purposes, so that I can use that instead of my morfologik step.
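
For reference, my correction step boils down to something like the sketch below. It is simplified: Fix is a made-up stand-in for LT's RuleMatch (which, as far as I can tell, exposes getFromPos(), getToPos() and getSuggestedReplacements()), and choosing which suggestion to apply is left out. The class and method names are mine, not part of LT.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class CorrectionSketch {

    // Hypothetical stand-in for one LT match: character offsets
    // (end exclusive) plus the replacement that was chosen for it.
    static final class Fix {
        final int fromPos;
        final int toPos;
        final String replacement;

        Fix(int fromPos, int toPos, String replacement) {
            this.fromPos = fromPos;
            this.toPos = toPos;
            this.replacement = replacement;
        }
    }

    // Applies the fixes starting from the end of the text, so that the
    // offsets of fixes located earlier in the text remain valid.
    // Assumes the fixes do not overlap.
    static String applyFixes(String text, List<Fix> fixes) {
        List<Fix> sorted = new ArrayList<>(fixes);
        sorted.sort(Comparator.comparingInt((Fix f) -> f.fromPos).reversed());
        StringBuilder sb = new StringBuilder(text);
        for (Fix fix : sorted) {
            sb.replace(fix.fromPos, fix.toPos, fix.replacement);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The misspelled last word occupies offsets 7..11 of the example
        // sentence; the o-acute is escaped to keep the source ASCII-only.
        String text = "Ala ma k\u00f3ta";
        List<Fix> fixes = Arrays.asList(new Fix(7, 11, "kota"));
        System.out.println(applyFixes(text, fixes));
    }
}
```

The corrected string is what I then pass on to the lemma step.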

Hope this helps

dnaber · May 23, 2015, 1:32pm

You can use Text Analysis - LanguageTool to see the internal analysis of LT. LT also uses morfologik for finding the base forms of words. But if LT can process a word, that doesn’t necessarily mean it knows the base form, as the error detection rule may work without the base form.
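
To illustrate the missing-base-form case, here is a rough sketch of how downstream code could cope with it. Reading is a made-up stand-in for a tagger reading, not an LT class; in LT the corresponding accessors would be AnalyzedToken.getToken() and getLemma(), where the lemma is, as far as I know, null for words the dictionary doesn't contain. Falling back to the lower-cased surface form is just one possible policy, and the values in main() are illustrative, not actual LT output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class LemmaFallbackSketch {

    // Hypothetical stand-in for one tagger reading: the surface form and
    // its lemma, where the lemma is null if the word is unknown.
    static final class Reading {
        final String token;
        final String lemma;

        Reading(String token, String lemma) {
            this.token = token;
            this.lemma = lemma;
        }
    }

    // Returns one entry per reading: the lemma when it is known,
    // otherwise the lower-cased surface form as a fallback.
    static List<String> lemmasOrSurface(List<Reading> readings) {
        List<String> result = new ArrayList<>();
        for (Reading r : readings) {
            if (r.lemma != null) {
                result.add(r.lemma);
            } else {
                result.add(r.token.toLowerCase(Locale.ROOT));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Reading> readings = Arrays.asList(
                new Reading("kota", "kot"),   // known word with a base form
                new Reading("Xyzzy", null));  // unknown word, no base form
        System.out.println(lemmasOrSurface(readings));
    }
}
```

For text classification you might prefer to drop unknown words instead of keeping the surface form; that depends on your data.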

Tomek · May 23, 2015, 1:46pm

Ok, many thanks for the help.

I have found that there is a function “getAnalyzedSentence” that returns basic info about the analyzed string. What I’m going to do is “analyze” the corrected sentence to obtain the lemmas and then process them further. What do you think of this idea?

dnaber · May 23, 2015, 1:53pm

That’s a valid approach. You could also use getTagger() of your language class (e.g. “new English().getTagger()”) and then call tag() but the result should be the same.

Tomek · May 23, 2015, 1:56pm

Thanks