Creating a tagger dict for a new language

braincell · February 9, 2021, 3:40pm

I’m trying to create a tagger dict for a new language, following the “Developing a Tagger Dictionary” guide on dev.languagetool.org. The language is highly inflected, especially its verbs. A single verb can have 100 or more inflected forms, since verbs vary across 5 moods, 3 tenses, 3 persons, 2 numbers and 2 voices, plus some other inflections combined with pronouns.

I’ve devised some tags like:
VB = Verb, IND = Indicative, PS = Present, I = First person, S = Singular
Do I create a separate tag for each combination, like VB_IND_PS_I_S for the indicative present first-person singular form, or is there a way to combine them somehow?

Also, some parts of speech are multi-word entries or compound words. How do I represent them in the dict input file?

Your help is very much appreciated,

Thanks

Ruud_Baars · February 10, 2021, 8:47am

That is how it is usually done, using : as the separator. With so many inflected forms, if they are regular, there may be software solutions, such as creating or adding POS tags in the disambiguator.
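To illustrate the colon-separated scheme, here is a minimal Python sketch that builds one tag per feature combination. The feature lists are placeholders for illustration (only VB, IND, PS, I and S come from the question above); a real tagset would list only the combinations that actually occur in the language.

```python
from itertools import product

# Hypothetical feature inventories; only VB, IND, PS, I and S appear in the thread.
MOODS = ["IND", "SUBJ"]
TENSES = ["PS", "AOR"]
PERSONS = ["I", "II", "III"]
NUMBERS = ["S", "P"]


def build_tags(pos="VB", separator=":"):
    """Return one tag string per combination of the feature values above."""
    return [
        separator.join((pos, *features))
        for features in product(MOODS, TENSES, PERSONS, NUMBERS)
    ]


tags = build_tags()
print(len(tags))  # 2 * 2 * 3 * 2 = 24 combinations
print(tags[0])    # VB:IND:PS:I:S
```

Generating the tags this way keeps the tagset consistent, and adding a feature such as voice only means adding one more list to the product.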

Which language, by the way?

braincell · February 10, 2021, 2:46pm

Thank you. I noticed that the German tagset has hundreds of combinations; although I don’t know any German, I noticed the pattern.
I have already created a tree of inflections with “metadata”, so converting it with Python into the input file that LT expects is not a complex matter.
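A conversion along those lines might look like the sketch below. It assumes the input file is plain text with three tab-separated columns (inflected form, lemma, POS tag), which is my reading of the linked guide and should be checked against it. The tree layout, the `to_dict_lines` helper and the exact tag values are made up for illustration.

```python
# Hypothetical inflection tree: lemma -> list of (feature path, inflected form).
INFLECTIONS = {
    "mendoj": [
        (("IND", "PS", "I", "S"), "mendoj"),
        (("IND", "AOR", "I", "S"), "mendova"),
    ],
}


def to_dict_lines(tree, pos="VB"):
    """Yield 'form<TAB>lemma<TAB>tag' lines (assumed three-column layout)."""
    for lemma, entries in tree.items():
        for features, form in entries:
            tag = ":".join((pos, *features))
            yield f"{form}\t{lemma}\t{tag}"


lines = list(to_dict_lines(INFLECTIONS))
print("\n".join(lines))
```

The resulting lines could then be written to a UTF-8 text file and passed to the dictionary-building step described in the guide.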

However, some parts are not clear. I’m trying to understand the disambiguator, which seems a bit complex.
The language is Albanian.
Example verb:
“mendoj” (think) (Present active voice)

“mendova” (Aorist active voice)
“u mendova” (Aorist passive voice)

For the aorist tense, the only difference is the ‘u’ that marks the passive voice. It looks like there should be a UNIFY rule in the disambiguator for ‘u’ + VB:AORIST:PASSIVE which tags the two tokens with a potentially new tag(??)
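The logic of such a rule can be sketched in Python as follows. This is not LanguageTool code: in LanguageTool the rule would be written in the language's XML disambiguation rules, and the tags PART and VB:AORIST:ACTIVE as well as the `retag_passive_aorist` helper are assumptions for illustration. For simplicity the sketch only retags the verb and leaves ‘u’ unchanged.

```python
def retag_passive_aorist(tokens):
    """Retag an active aorist verb as passive when it directly follows 'u'.

    tokens is a list of (word, tag) pairs; a new list is returned.
    """
    result = list(tokens)
    for i in range(1, len(result)):
        prev_word, _ = result[i - 1]
        word, tag = result[i]
        if prev_word.lower() == "u" and tag == "VB:AORIST:ACTIVE":
            result[i] = (word, "VB:AORIST:PASSIVE")
    return result


# Hypothetical tagger output for "u mendova" before disambiguation.
sentence = [("u", "PART"), ("mendova", "VB:AORIST:ACTIVE")]
print(retag_passive_aorist(sentence))
```

With this approach the dictionary only needs the form “mendova” once, and the passive reading is derived from context.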

Ruud_Baars · February 11, 2021, 2:51pm

Yes, rules like that could be made.