I think 8000m is nearly the minimum. Loading the ngram file costs a lot of memory; according to this page, BerkeleyLM usually takes 4~6GB to read the files. Improving memory usage is a main task of my third phase. I don’t have any good ideas yet. I need to do some research and then I will post solutions here. If you have any good ideas, please tell me. Thank you!
For ngrams, we use Lucene in some cases, as mentioned here: http://wiki.languagetool.org/finding-errors-using-big-data#toc3 - it means you need a fast hard disk (SSD), but memory usage will be very low, as only the index needs to be in memory. While Lucene is a full-text search engine, we basically use it for lookup: we provide the ngram as a search term and get back its occurrence count. That plus some calculation and you have a very basic language model. Let me know if you need to know more.
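In essence, the lookup side looks like this (a rough sketch against the Lucene 5-era API; the field names "ngram" and "count" are only illustrative — see LuceneLanguageModel for the real schema):

```java
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;

public class NgramLookupSketch {
  public static void main(String[] args) throws Exception {
    try (DirectoryReader reader =
             DirectoryReader.open(FSDirectory.open(Paths.get("/path/to/ngram-index")))) {
      IndexSearcher searcher = new IndexSearcher(reader);
      // exact-match lookup: the ngram is the search term, the count is a stored field
      TopDocs topDocs = searcher.search(new TermQuery(new Term("ngram", "the nice house")), 1);
      long count = topDocs.totalHits > 0
          ? Long.parseLong(searcher.doc(topDocs.scoreDocs[0].doc).get("count"))
          : 0;
      System.out.println("occurrence count: " + count);
    }
  }
}
```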
I read the code of the LuceneLanguageModel class. It can calculate the probability of a complete sentence or return the occurrence count of a series of words when using the appropriate ngram data format. But in my code, the ngram probabilities are calculated by a back-off model and saved in the ngram data. So, if I want to use Lucene, do I need to write a helper class around it?
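For context, a back-off model scores an ngram roughly like this (a minimal sketch using "stupid backoff" with a constant factor of 0.4 as a stand-in; count() and totalTokens() are hypothetical lookups into the ngram data — my actual model stores the precomputed probabilities and back-off weights in the binary file):

```java
import java.util.Arrays;
import java.util.List;

public class BackoffSketch {

  // Hypothetical lookups into the ngram data:
  static long count(List<String> ngram) { return 0; }  // occurrence count of the ngram
  static long totalTokens() { return 1; }              // total token count of the corpus

  // "Stupid backoff": if the full ngram was never seen, back off to a
  // shorter ngram, discounted by a constant factor (here 0.4).
  static double score(List<String> ngram) {
    if (ngram.size() == 1) {
      return (double) count(ngram) / totalTokens();
    }
    long c = count(ngram);
    if (c > 0) {
      return (double) c / count(ngram.subList(0, ngram.size() - 1));
    }
    return 0.4 * score(ngram.subList(1, ngram.size()));
  }

  public static void main(String[] args) {
    System.out.println(score(Arrays.asList("the", "nice", "house")));
  }
}
```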
So your question is how to build such a Lucene index, is that correct? You can check out org.languagetool.dev.bigdata.AggregatedNgramToLucene for some code that creates an index. Actually, all the classes that use Lucene’s IndexWriter do this.
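The indexing side is essentially this (again a simplified sketch with illustrative field names; AggregatedNgramToLucene is the authoritative code):

```java
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;

public class NgramIndexerSketch {
  public static void main(String[] args) throws Exception {
    IndexWriterConfig config = new IndexWriterConfig(new KeywordAnalyzer());
    try (IndexWriter writer =
             new IndexWriter(FSDirectory.open(Paths.get("/path/to/ngram-index")), config)) {
      // one document per ngram: the ngram itself as an exact-match term,
      // its occurrence count as a stored field
      Document doc = new Document();
      doc.add(new StringField("ngram", "the nice house", Field.Store.YES));
      doc.add(new StoredField("count", 12345L));
      writer.addDocument(doc);
    }
  }
}
```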
So I need to compile your fork first with Maven, is that right? And then copy the two files into the result? For me, that doesn’t work yet, e.g. your second example doesn’t find an error yet. Any idea?
I created the jar by typing mvn package in the language-module/zh directory. Then installation is the same as last time; I followed the same steps and it worked. Or you can try the following steps.
- Download the code from my GitHub repository.
- Run mvn install -DskipTests in the root directory.
- Download data.zip and extract it to
- Download word_trigram.binary and char_unigram.binary and copy them to
Thanks, that works for zh-TW. For zh-CN, I get:
Exception in thread "main" java.lang.RuntimeException: Path zh/char_unigram.binary not found in class path at /org/languagetool/resource/zh/char_unigram.binary
I’ve added the link above. You should download the file and copy it to the same path as word_trigram.binary.
Okay, it’s working now, I think. It’s slow only because of the one-time setup, isn’t it? Have you checked the performance per sentence (e.g. in “sentences per second”), not counting the setup time?
Setup: about 7s.
Check: about 120 sentences per second.
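For reference, one way to measure this while keeping setup and checking separate (a sketch only — the test sentence is a placeholder; real measurements need a larger corpus):

```java
import org.languagetool.JLanguageTool;
import org.languagetool.language.Chinese;

import java.util.Arrays;
import java.util.List;

public class SpeedCheck {
  public static void main(String[] args) throws Exception {
    long t0 = System.currentTimeMillis();
    JLanguageTool lt = new JLanguageTool(new Chinese());  // one-time setup
    System.out.println("setup: " + (System.currentTimeMillis() - t0) + "ms");

    List<String> sentences = Arrays.asList("这是一个测试句子。");  // placeholder test data
    long t1 = System.currentTimeMillis();
    for (String sentence : sentences) {
      lt.check(sentence);
    }
    long elapsed = Math.max(1, System.currentTimeMillis() - t1);
    System.out.println(sentences.size() * 1000.0 / elapsed + " sentences/sec");
  }
}
```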
GSoC Phase 3
- Make ChineseNgramProbabilityRule available for zh-TW.
- Optimize the checking speed of the rule and reduce memory usage.
- Fix bugs.
What about memory usage, are you working on lowering that?
Great - please also remember to post short but daily reports here.
I have made a comparison of my new rule between the Lucene-based solution and the BerkeleyLM-based solution.
| Name | Rule setup time | Sec per sentence | Memory usage | Ngram data size |
|------|-----------------|------------------|--------------|-----------------|
| BerkeleyLM | 3s | 0.1s | 4G | 1.7G (hash-based LM binary) |
What kind of hard disk did you use for this test? An SSD?
SSD. I trained the language model again to improve accuracy and found a bug in my test code. The bug was that I activated SimplifiedChinese and then created an instance of ChineseNgramProbabilityRule again, so the ngram data was loaded into memory twice and it took 8G to run. After I fixed the bug, I can run java -jar languagetool-commandline -l zh-CN <text> without
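The general pattern for avoiding such a double load is to share the loaded model across rule instances, e.g. in a lazily initialized static field (a sketch of the pattern, not necessarily my exact fix; loadModel() is a hypothetical stand-in for reading the BerkeleyLM binary):

```java
public class ChineseNgramProbabilityRule /* extends Rule */ {

  private static volatile Object model;  // the loaded LM, shared across all instances

  private static Object getModel() {
    if (model == null) {
      synchronized (ChineseNgramProbabilityRule.class) {
        if (model == null) {
          model = loadModel();  // expensive: reads the binary ngram data only once
        }
      }
    }
    return model;
  }

  private static Object loadModel() {
    // hypothetical: e.g. reading word_trigram.binary with BerkeleyLM's reader
    return new Object();
  }
}
```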
How many lookups are you running per sentence? I’m just surprised that Lucene is that much slower than BerkeleyLM.
In order to find the right characters in a sentence, the rule replaces each character with its candidates from a confusion dictionary and calculates the probability of each resulting sentence (the sentence with the maximum probability is regarded as the correct one). So the longer the sentence is, the more queries are run.
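In code terms, the loop is roughly this (a sketch; sentenceProb() is a hypothetical stand-in for the actual ngram scoring, and the confusion map comes from the confusion dictionary):

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class CandidateSketch {

  // hypothetical stand-in for the actual ngram-based sentence scoring
  static double sentenceProb(String sentence) { return 0.0; }

  // Try replacing each character with its confusion-set candidates and
  // keep the variant with the highest probability.
  static String correct(String sentence, Map<Character, List<Character>> confusion) {
    String best = sentence;
    double bestProb = sentenceProb(sentence);
    for (int i = 0; i < sentence.length(); i++) {
      List<Character> candidates =
          confusion.getOrDefault(sentence.charAt(i), Collections.emptyList());
      for (char candidate : candidates) {
        String variant = sentence.substring(0, i) + candidate + sentence.substring(i + 1);
        double p = sentenceProb(variant);  // one more round of ngram queries
        if (p > bestProb) {
          bestProb = p;
          best = variant;
        }
      }
    }
    return best;
  }
}
```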
- Ngram Rule supports zh-TW now.
As I said above, in order to find replacements for error characters in my ngram rule, we can’t avoid querying. Unlike English, which has only 26 characters, Chinese has more than 7,000 characters, so the size of the query table is completely different.
I tried to make the Lucene-based approach run faster, but it turns out that the fastest speed is 1800ms per sentence, while the BerkeleyLM one is 80ms per sentence. I also tried to train a smaller language model to make the BerkeleyLM one use less memory. However, the results showed that while a smaller LM reduces memory usage, it greatly decreases checking accuracy.
What do you think?