[sr] Dictionary naming convention

puramoca021 · November 27, 2017, 11:36pm

Hi,

I have a question regarding the dictionary naming convention.

The Serbian language has two main dialects (Ekavian and Jekavian), for which I generated the appropriate POS and synthesizer dictionaries. However, I noticed that currently no LT language module has more than one pair of files, <language>_synth.dict and <language>_synth.info, in the directory org/languagetool/resource/<language>. For Hunspell dictionaries, the situation seems clear.

The question is: is it possible to have more than one pair of dictionary files (*.dict and *.info) in the directory org/languagetool/resource/sr? If not, should the *.dict file be generated to include all words from both dialects?

Thanks and regards,
Zoltan

dnaber · November 28, 2017, 9:49am

sr/serbian.dict is hard-coded in SerbianTagger.java, which is created from Serbian.java. If you have a subclass like EkavianSerbian, which extends Serbian but returns a different Tagger, that would work. Having these subclasses is what other languages do (although usually the spelling dict is different, not the POS tagging).
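
The subclass idea can be sketched roughly as follows. This is a minimal, self-contained illustration and not LanguageTool's actual code: the Tagger interface and DictionaryTagger class are stand-in types invented for the sketch, JekavianSerbian is a hypothetical counterpart to EkavianSerbian, and the two dialect dictionary paths are assumptions. Only sr/serbian.dict comes from the existing module.

```java
// Stand-in types for illustration only; these are NOT LanguageTool's real classes.
interface Tagger {
    String dictionaryPath();
}

// Hypothetical tagger that only remembers which dictionary it would load.
final class DictionaryTagger implements Tagger {
    private final String dictionaryPath;

    DictionaryTagger(String dictionaryPath) {
        this.dictionaryPath = dictionaryPath;
    }

    @Override
    public String dictionaryPath() {
        return dictionaryPath;
    }
}

// Base language: keeps the dictionary path mentioned in the thread.
class Serbian {
    Tagger getTagger() {
        return new DictionaryTagger("sr/serbian.dict");
    }
}

// Variant subclass: same language, different tagger dictionary.
// The path below is an assumption, not an existing file.
class EkavianSerbian extends Serbian {
    @Override
    Tagger getTagger() {
        return new DictionaryTagger("sr/ekavian/serbian.dict");
    }
}

// Hypothetical second variant; the path is again an assumption.
class JekavianSerbian extends Serbian {
    @Override
    Tagger getTagger() {
        return new DictionaryTagger("sr/jekavian/serbian.dict");
    }
}

class SerbianVariantsSketch {
    public static void main(String[] args) {
        Serbian[] variants = { new Serbian(), new EkavianSerbian(), new JekavianSerbian() };
        for (Serbian variant : variants) {
            // Each variant reports the dictionary its own tagger would use.
            System.out.println(variant.getClass().getSimpleName()
                    + " -> " + variant.getTagger().dictionaryPath());
        }
    }
}
```

Because each variant overrides only getTagger(), everything else defined in the base class would be shared between the dialects.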

tiagosantos · November 28, 2017, 9:58am

Dialects are usually addressed via the country code, but with the Serbian Cyrillic and Latin variants it is tougher.
You may have to go the long route, creating a separate language like Simple German (…/languagetool-language-modules/de-DE-x-simple-language).

Alternatively, given that they use different characters, you can merge both languages: use one huge POS tagger and synthesiser dictionary, and one huge speller dictionary and rule set, like a usual country dialect. You can even split the rule file (grammar.xml) in two. See the Ukrainian module for a good example.

Anyway, the second option may have fewer conflicts when interacting with country codes in other programs.

P.S. Apologies Daniel, I hadn’t seen your reply.

puramoca021 · November 28, 2017, 12:25pm

No, there is no Latin script anywhere. The entire LT support for the Serbian language (and its dialects) uses the Serbian Cyrillic alphabet only. Hence, the two alphabets are not the issue.

If dialects are addressed at the country level, then I think I did it right: sr_RS should use the Ekavian dictionary, and sr_HR, sr_BA and sr_ME should use the Jekavian dictionary.

If I understand correctly, the solution would be to derive different taggers from the Serbian.java class and use them? If so, then the problem is solved. Please just confirm that this is the right approach.

Big thanks,
Zoltan

dnaber · November 28, 2017, 12:33pm

Yes, but as you need to know which tagger is requested, you also need subclasses of Serbian.

puramoca021 · November 28, 2017, 12:58pm

Tagger selection is pretty straightforward: the Ekavian tagger for sr_RS and the Jekavian tagger for the other supported Serbian language codes. I will try it and shout if I get stuck.
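
For reference, the selection rule described here could look something like the sketch below. It is only an illustration of the mapping (sr_RS to Ekavian; sr_HR, sr_BA and sr_ME to Jekavian). The Dialect enum and the SerbianDialectSelector class are hypothetical names, and in the real module the choice would presumably follow from which Serbian subclass is requested rather than from a lookup like this.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical names for illustration; not part of LanguageTool.
enum Dialect { EKAVIAN, JEKAVIAN }

final class SerbianDialectSelector {
    // Country-to-dialect mapping as described in the thread.
    private static final Map<String, Dialect> BY_COUNTRY = Map.of(
            "RS", Dialect.EKAVIAN,
            "HR", Dialect.JEKAVIAN,
            "BA", Dialect.JEKAVIAN,
            "ME", Dialect.JEKAVIAN);

    private SerbianDialectSelector() {
    }

    // Accepts codes such as "sr_RS" or "sr-RS" and returns the dialect to use.
    static Dialect forCode(String code) {
        String[] parts = code.split("[_-]");
        if (parts.length != 2 || !parts[0].equalsIgnoreCase("sr")) {
            throw new IllegalArgumentException("Not a Serbian language code: " + code);
        }
        Dialect dialect = BY_COUNTRY.get(parts[1].toUpperCase(Locale.ROOT));
        if (dialect == null) {
            throw new IllegalArgumentException("Unsupported country in code: " + code);
        }
        return dialect;
    }
}
```

Unsupported codes are rejected rather than silently mapped to a default dialect, so a missing variant shows up early.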