
Build n-gram for a new language


(Alex) #1

Hi everyone,
I'm a student from Vietnam. I'm trying to use LT to check spelling for Vietnamese.
This is my current result:


Right now, I can flag a word as an error if it doesn't exist in my dictionary.
I also have a grammar.xml file for Vietnamese that contains more than 700 rules.
Next, I want to use n-grams to detect real-word errors. I mean grammar errors. I tried using n-grams to detect errors in English like this.
But I can't find n-grams for my language anywhere. Can anyone tell me where I can download them, or how I can build n-grams for Vietnamese like the ones LT supports for en, de, fr, and es?

Finally, I'm sorry for my bad English.

Thanks a lot for your help.

D.Duc


(Daniel Naber) #2

You will either need a huge amount of text to create the ngrams yourself, or you can look for existing ngram sets, which might be easier. You might want to contact universities that do linguistic research. A Google search for "vietnamese ngram" also gets you some results.
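
To give a rough idea of what "creating the ngrams yourself" involves, here is a minimal sketch of counting n-grams over a text corpus. It assumes plain whitespace tokenization, which for Vietnamese splits on syllables rather than full words, and the function name `count_ngrams` is made up for illustration. It is not LanguageTool's own tooling, which may tokenize differently.

```python
from collections import Counter


def count_ngrams(sentences, n):
    """Count word n-grams over an iterable of sentences.

    Assumption: tokens are separated by whitespace. A real pipeline
    would use the same tokenizer that the checker uses at runtime.
    """
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        # Slide a window of length n over the token list.
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


# Tiny made-up corpus, only to show the shape of the result.
corpus = ["tôi là sinh viên", "tôi là giáo viên"]
bigrams = count_ngrams(corpus, 2)
for ngram, count in bigrams.most_common():
    print(ngram, count)
```

With a real corpus you would run this for n = 1, 2, and 3 and keep the total number of tokens as well, since the counts only become useful once the corpus is large.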


(Alex) #3

Dear Daniel,
Thanks a lot for your suggestion. But I still wonder how LT reads my ngram set, and whether it has any requirements for ngram sets. I mean the structure the ngrams need so that LanguageTool can use them.
Here is what I see in the 3grams directory:


(Daniel Naber) #4

It's a Lucene index with a simple structure, one document per ngram. The ngram field is called ngram, the occurrence count field is called count. There's also a document with a field totalTokenCount that contains the total token count. You should be able to open the index and look inside using https://github.com/DmitryKey/luke.
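
As a sketch of that layout, the records can be pictured like this. This uses plain Python dictionaries to show the logical structure only; it is not the Lucene API, and the helper name `to_index_documents` is hypothetical. How each field is typed and stored inside Lucene is not covered here, so check an existing index with Luke before writing your own.

```python
def to_index_documents(ngram_counts, total_token_count):
    """Model the index as a list of documents (plain dicts).

    One document per ngram with the fields "ngram" and "count",
    plus one extra document holding "totalTokenCount".
    """
    docs = [
        {"ngram": ngram, "count": count}
        for ngram, count in sorted(ngram_counts.items())
    ]
    docs.append({"totalTokenCount": total_token_count})
    return docs


# Made-up counts, only to show the shape of the documents.
docs = to_index_documents({"tôi là": 2, "là sinh": 1}, total_token_count=8)
for doc in docs:
    print(doc)
```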


(Alex) #5

Thank you very much. You helped me a lot.


(Alex) #6

Hi Daniel,

Now I have a huge amount of text (~2 GB).
I saw one of your replies in the discussion below: http://forum.languagetool.org/t/en-n-gram-data/663


Does that mean I can use Lucene 5.2.1 to create the index?
By the way, can you tell me the next steps for creating the ngrams myself?

Thanks a lot.


(Daniel Naber) #7

Please see http://wiki.languagetool.org/finding-errors-using-big-data#toc3


(Alex) #8

Thank you :relaxed: