I would like to add Tibetan support to LanguageTool. I have built Hunspell rules for it at https://github.com/eroux/hunspell-bo, and I think a good first step would be to add these to LanguageTool. The problem is that I cannot find any documentation on how to add these files to LanguageTool. How can I do this? Also, note that Tibetan script doesn’t put spaces between words; it puts tshegs (U+0F0B and U+0F0C) between syllables. LanguageTool has to be able to handle this in order to support Tibetan, and I don’t know whether it can.
Welcome to LanguageTool! Here’s the documentation on how to add a language. As LT is mostly about style and grammar checking, we wouldn’t want to add a language that supports only spell checking, though. To add style and grammar rules for a language written without spaces, you’ll need a so-called tokenizer that splits the text into tokens. There might already be open-source tokenizers available for Tibetan.
Thank you for your quick answer! Let’s forget about Tibetan then. Out of curiosity, what kind of tokenizers do you use? I’ve worked on a tokenizer for Tibetan in Lucene at https://github.com/BuddhistDigitalResourceCenter/lucene-bo and I’m wondering whether it could be used directly.
For most languages, the tokenizer is rather trivial. For Chinese, we use an open-source library called cjftransform; for Japanese, we use lucene-gosen. So a Lucene tokenizer cannot be plugged in directly, but wrapping it for use in LT should be rather easy.
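To make the tokenization point in this thread concrete, here is a minimal sketch of syllable-level splitting on the two tsheg characters mentioned above. It is written in Python purely for illustration (LT itself is Java), the function name `tokenize_syllables` is hypothetical, and it is not taken from LanguageTool or lucene-bo. It assumes that delimiters should be kept as tokens so that joining the tokens reproduces the input. Note that it yields syllables, not words: Tibetan words can span several syllables, so a real tokenizer would need more than this.

```python
import re

# The two tsheg characters named in the original question.
TSHEG = "\u0f0b"      # TIBETAN MARK INTERSYLLABIC TSHEG
TSHEG_NB = "\u0f0c"   # TIBETAN MARK DELIMITER TSHEG BSTAR (non-breaking)

# The capturing group makes re.split return the delimiters as well.
_TSHEG_SPLIT = re.compile("([" + TSHEG + TSHEG_NB + "])")


def tokenize_syllables(text):
    """Split text at tshegs, keeping every tsheg as its own token.

    Other punctuation (for example the shad, U+0F0D) is not handled
    and stays attached to the neighbouring syllable.
    """
    # re.split can produce empty strings at the edges; drop them.
    return [token for token in _TSHEG_SPLIT.split(text) if token]


if __name__ == "__main__":
    sample = "\u0f56\u0f40\u0fb2\u0f0b\u0f64\u0f72\u0f66"  # two syllables, one tsheg
    tokens = tokenize_syllables(sample)
    print(len(tokens), "tokens; round-trips:", "".join(tokens) == sample)
```

Keeping the tshegs in the output is a design choice made here so that character offsets can be recovered by summing token lengths; whether a Tibetan tokenizer for LT should behave this way is an assumption, not something stated in the thread.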