Adding Tibetan Hunspell files

eroux · July 12, 2017, 9:36am

I would like to add Tibetan support to LanguageTool. I have built Hunspell files for it (GitHub - eroux/hunspell-bo: Hunspell files for Tibetan (syllable level only)), and I think a good first step would be to add these to LanguageTool. The problem is that I cannot find any documentation on how to add such files. How can I do this? Also, note that Tibetan script doesn’t use spaces between words, but tshegs (U+0F0B and U+0F0C) between syllables, so LanguageTool has to be able to handle this in order to support Tibetan. I don’t know whether that is the case.
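
To make the tsheg point concrete, here is a minimal Python sketch of syllable-level splitting. It is not taken from hunspell-bo or LanguageTool; the function name `split_syllables` and the sample string are illustrative assumptions, and real Tibetan text also contains punctuation such as the shad that this sketch does not treat specially.

```python
import re

# U+0F0B (tsheg) and U+0F0C (non-breaking tsheg) separate Tibetan syllables.
# The capturing group makes re.split keep the separators in the result.
TSHEG_PATTERN = re.compile("([\u0F0B\u0F0C])")


def split_syllables(text):
    """Split text on tshegs, keeping each tsheg as its own token.

    Hypothetical helper for illustration only.
    """
    return [tok for tok in TSHEG_PATTERN.split(text) if tok]


# Two syllables, each followed by a tsheg: བོད་སྐད་
sample = "\u0F56\u0F7C\u0F51\u0F0B\u0F66\u0F90\u0F51\u0F0B"

tokens = split_syllables(sample)
# Drop the separators to get only the syllables a spell checker would look up.
syllables = [t for t in tokens if t not in ("\u0F0B", "\u0F0C")]
```

Keeping the tshegs as tokens means the original text can be rebuilt by joining the list, which matters if positions in the text have to be reported back to the user.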

dnaber · July 12, 2017, 10:13am

Welcome to LanguageTool! Here’s the documentation on how to add a language. As LT is mostly about style and grammar checking, we wouldn’t want to add a language that supports only spell checking, though. To add style and grammar rules for a language without spaces, you’ll need a so-called tokenizer that splits the text into tokens. There might be Open Source tokenizers available for Tibetan already.

eroux · July 12, 2017, 10:27am

Thank you for your quick answer! Let’s forget about Tibetan then. Out of curiosity, what kind of tokenizers do you use? I’ve worked on a tokenizer for Tibetan in Lucene (GitHub - buda-base/lucene-bo: Lucene analyzer for Tibetan), and I’m wondering if it could be used directly.

dnaber · July 12, 2017, 11:13am

For most languages, the tokenizer is rather trivial. For Chinese, we use an Open Source library called cjftransform; for Japanese, we use lucene-gosen. So a Lucene tokenizer cannot be used directly, but wrapping it for use in LT should be rather easy.
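
As a rough illustration of what such a wrapper has to do, here is a Python sketch (the real wrapper would be written in Java against LT's tokenizer interface, which is not shown in this thread). It assumes the wrapped analyzer reports each token as a `(start, end)` character span, sorted and non-overlapping, and that the consumer wants a flat list of strings covering the whole input, separators included. The function name `cover_with_tokens` is hypothetical.

```python
def cover_with_tokens(text, spans):
    """Turn (start, end) token offsets into a list of strings covering text.

    Hypothetical adapter: assumes spans are sorted and do not overlap.
    Characters the analyzer skipped (separators) become tokens of their own.
    """
    tokens = []
    pos = 0
    for start, end in spans:
        if start > pos:
            # Gap before this token: text the analyzer did not emit.
            tokens.append(text[pos:start])
        tokens.append(text[start:end])
        pos = end
    if pos < len(text):
        # Trailing text after the last token.
        tokens.append(text[pos:])
    return tokens


# Example with made-up offsets for a two-word input.
example = cover_with_tokens("ab cd", [(0, 2), (3, 5)])
```

The gap filling is the main design choice here: search-oriented analyzers usually drop separators, while a checker that reports error positions needs every character accounted for.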