
Article usage for abbreviations

(Jestha) #1

This is my very first post on the forum, and I am very new to this platform. Please help me understand the algorithm that sets tags at the word level and the phrase level. While debugging the LT code, I found that the class EnglishChunker has PosTags for each token and a ChunkTag for phrase-level tagging.

During functional spike testing, I submitted the sentence "A new ROS, reactive oxygen species found." to analyzeText and found no posTag for the token ROS. On the other hand, the PosTags in the Java code assign the NNS tag to the token ROS.

Is the analyzeText endpoint different from EnglishChunker/analyzeText?
Is there any scope for improving the tagging? I understand from several posts in this forum that tagging is not 100% accurate.
How does LT handle abbreviations? Please refer to the sentence above, where ROS is an abbreviation.
I am very curious to learn more about article usage for abbreviations, so please share any details you have.

Thank you in advance.

(Daniel Naber) #2

Hi Jestha,

Thanks for your interest in LT.

For the same version it should be the same. The online version usually gets updated automatically every 24 hours, so it should usually be almost the same as the latest version from git.

Tagging should be mostly correct (even though not 100%), but not all ambiguities are always resolved. It can be improved by adding words and their tags to en/added.txt or by fixing/extending disambiguation.xml. Chunking is done by an external library (OpenNLP) on which we don't have much influence.

There are no special cases for abbreviations; they can be added to en/added.txt like other words.
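
As a rough sketch of what such an entry might look like: this assumes en/added.txt uses one entry per line with three tab-separated columns (word form, lemma, POS tag) and treats lines starting with # as comments, so please check the header comment of the file in your checkout before relying on it. The tag NN below is only an illustrative choice for the abbreviation, not a recommendation.

```text
# Hypothetical entry (assumed format: word form<TAB>lemma<TAB>POS tag)
ROS	ROS	NN
```

If the entry is picked up, the token ROS should get a POS tag from the tagger instead of relying on whatever the dictionary or disambiguator would otherwise produce. Chunk tags would still come from OpenNLP.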