Adding Tibetan Hunspell files

(Elie Roux) #1

I would like to add Tibetan support to LanguageTool. I have built Hunspell rules for it, and I think a good first step would be to add them to LanguageTool. The problem is that I cannot find any documentation on how to add these files. How can I do this? Also, note that Tibetan script doesn't put spaces between words; it uses tshegs (U+0F0B and U+0F0C) between syllables, so LanguageTool has to be able to handle this in order to support Tibetan. I don't know whether that is the case.
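
To make the tsheg point concrete, here is a minimal Python sketch of splitting text into syllables on U+0F0B and U+0F0C. The function name `split_syllables` is made up for illustration and is not part of LanguageTool or Hunspell. Note that this only yields syllables; grouping them into words would still need a dictionary or a real tokenizer.

```python
import re

# U+0F0B (tsheg) and U+0F0C (non-breaking tsheg) separate Tibetan syllables.
TSHEG_PATTERN = re.compile("[\u0f0b\u0f0c]+")


def split_syllables(text):
    """Split text on tshegs, dropping the tshegs and any empty pieces.

    Hypothetical helper for illustration only.
    """
    return [piece for piece in TSHEG_PATTERN.split(text) if piece]
```

Calling `split_syllables` on a string of two syllables, each followed by a tsheg, returns a list with the two syllables and no delimiters.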

(Daniel Naber) #2

Welcome to LanguageTool! Here's the documentation on how to add a language. As LT is mostly about style and grammar checking, we wouldn't want to add a language that supports only spell checking, though. To add style and grammar rules for a language without spaces, you'll need a so-called tokenizer that splits the text into tokens. There might be open-source tokenizers available for Tibetan already.
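
As a rough illustration of what such a tokenizer might do for a language without spaces, the Python sketch below groups syllables into words by greedy longest match against a word list. This is one common dictionary-based approach, not necessarily what any existing Tibetan tokenizer does; the name `segment_words`, the `max_len` parameter, and the lexicon format (a set of syllable tuples) are all assumptions made for the example.

```python
def segment_words(syllables, lexicon, max_len=4):
    """Group syllables into words by greedy longest match.

    `lexicon` is a set of tuples of syllables (a hypothetical format).
    A syllable that starts no known word becomes a one-syllable token.
    """
    words = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate first, then shrink down to one syllable.
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = tuple(syllables[i:i + n])
            if n == 1 or candidate in lexicon:
                words.append(candidate)
                i += n
                break
    return words
```

Greedy matching is simple but can pick the wrong split when words overlap, which is one reason a dedicated tokenizer is usually preferable.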

(Elie Roux) #3

Thank you for your quick answer! Let's forget about Tibetan then. Out of curiosity, what kind of tokenizers do you use? I've worked on a tokenizer for Tibetan in Lucene; I'm wondering whether it could be used directly.

(Daniel Naber) #4

For most languages, the tokenizer is rather trivial. For Chinese, we use an open-source library called cjftransform; for Japanese, we use lucene-gosen. So a Lucene tokenizer cannot be used directly, but wrapping it for use in LT should be rather easy.
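
The wrapping idea can be sketched language-independently. Lucene tokenizers report tokens with start and end offsets, while a wrapper would need to hand back plain string tokens. The Python sketch below converts offset spans into a list of strings and keeps the skipped text (such as delimiters) as separate tokens, assuming the consumer wants the tokens to concatenate back to the original input. The function `wrap_offsets` is hypothetical, and the real wrapper would be written in Java against the actual LT and Lucene interfaces.

```python
def wrap_offsets(text, spans):
    """Turn (start, end) spans into a gap-free list of string tokens.

    Assumes `spans` is sorted and non-overlapping (an assumption of this
    sketch). Text between spans is emitted as its own token.
    """
    tokens = []
    pos = 0
    for start, end in spans:
        if start > pos:
            tokens.append(text[pos:start])  # text the tokenizer skipped
        tokens.append(text[start:end])
        pos = end
    if pos < len(text):
        tokens.append(text[pos:])  # trailing text after the last span
    return tokens
```

With this shape, joining the returned tokens reproduces the input text, so character positions stay aligned with the original.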