Build n-grams for a new language

Hi everyone,
I’m a student from Vietnam. I’m trying to use LT to check spelling for Vietnamese.
My current result looks like this.


Now I can detect that a word is an error if it doesn’t exist in my dictionary.
I also have a grammar.xml file for Vietnamese that contains more than 700 rules.
Next, I want to use n-grams to detect real-word errors, i.e. grammar errors. I tried the n-grams to detect errors in English like this.
But I can’t find n-grams for my language anywhere. Can anyone tell me where I can download them, or how I can build n-grams for Vietnamese like the ones for en, de, fr and es that are supported by LT?

Finally, I’m sorry for my bad English.

Thanks a lot for your help.

D.Duc

You will either need a huge amount of text to create the ngrams yourself, or it might be easier to look for existing ngram sets. You might want to contact universities that do linguistic research. A Google search for “vietnamese ngram” also gets you some results.
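
To give a rough idea of what “creating the ngrams yourself” involves, here is a simple sketch (not LanguageTool code) that counts word 3-grams in a plain-text corpus using a naive whitespace split; for a corpus of several GB you would normally count in chunks or with an external tool rather than a single in-memory map, and a real Vietnamese setup would need a proper tokenizer:

```java
// Rough illustration only (not LanguageTool code): count word 3-grams in a
// plain-text corpus. Tokenization is a naive whitespace split.
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Stream;

public class TrigramCounter {
  public static Map<String, Long> count(String corpusFile) throws IOException {
    Map<String, Long> counts = new HashMap<>();
    try (Stream<String> lines = Files.lines(Paths.get(corpusFile), StandardCharsets.UTF_8)) {
      lines.forEach(line -> {
        String[] tokens = line.trim().split("\\s+");
        // slide a window of three tokens over each line and count occurrences
        for (int i = 0; i + 2 < tokens.length; i++) {
          String ngram = tokens[i] + " " + tokens[i + 1] + " " + tokens[i + 2];
          counts.merge(ngram, 1L, Long::sum);
        }
      });
    }
    return counts;
  }
}
```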

Dear Daniel,
Thanks a lot for your suggestion. But I still wonder how LT reads my ngram set, and whether there are any requirements for it. I mean the structure the ngrams need to have so that they can be used by LanguageTool.
Here is what I see in the 3grams directory:

It’s a Lucene index with a simple structure: one document per ngram. The ngram field is called ngram, and the occurrence count field is called count. There’s also a document with a field totalTokenCount that contains the total token count. You should be able to open the index and look inside it using Luke (GitHub: DmitryKey/luke, the Lucene Toolbox Project).
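
As a rough sketch of what that layout looks like if you build it with Lucene yourself (this is not the official LanguageTool indexer, and the exact Lucene field types LanguageTool expects are an assumption on my part, so check its own indexer code), something like the following writes one document per ngram plus the extra totalTokenCount document:

```java
// Minimal sketch, not the official LanguageTool indexer: one document per
// ngram with "ngram" and "count" fields, plus one document carrying
// "totalTokenCount". Field types are an assumption.
import java.nio.file.Paths;
import java.util.Map;
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class NgramIndexWriter {
  public static void write(String indexDir, Map<String, Long> ngramCounts,
                           long totalTokenCount) throws Exception {
    IndexWriterConfig config = new IndexWriterConfig(new KeywordAnalyzer());
    try (FSDirectory dir = FSDirectory.open(Paths.get(indexDir));
         IndexWriter writer = new IndexWriter(dir, config)) {
      for (Map.Entry<String, Long> entry : ngramCounts.entrySet()) {
        Document doc = new Document();
        // the ngram itself and how often it occurred in the corpus
        doc.add(new StringField("ngram", entry.getKey(), Field.Store.YES));
        doc.add(new LongField("count", entry.getValue(), Field.Store.YES));
        writer.addDocument(doc);
      }
      // one extra document that only carries the corpus-wide token count
      Document totalDoc = new Document();
      totalDoc.add(new LongField("totalTokenCount", totalTokenCount, Field.Store.YES));
      writer.addDocument(totalDoc);
    }
  }
}
```

Opening the resulting directory in Luke should then show documents with exactly the ngram/count fields described above, which is a quick way to check that your index matches what LanguageTool expects.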

Thank you very much. You helped me a lot.

Hi Daniel,

Now I have a huge amount of text (~2 GB).
I saw a reply from you in the discussion below: En n-gram data


Does that mean I can use Lucene 5.2.1 to create the index?
By the way, can you tell me the next steps to create the ngrams myself?

Thanks a lot.

Please see Finding errors using Big Data - LanguageTool Wiki

Thank you :relaxed: