Hi Ze Dang, Lena ran some tests and found some errors that are not detected. Will these errors be detected in the future? Is the fact that they are not detected a bug or by design, or are we testing it wrong? Sorry for not having a better overview, but Lena and I have been distracted by other projects recently:
我很西欢喝咖啡。(should be: 我很喜欢喝咖啡。 "I really like drinking coffee.") - error not detected
金天天气不好,说以我不出去了。(should be: 今天天气不好,所以我不出去了。 "The weather is bad today, so I'm not going out.") - only the second error is detected
Having binary files often means increased complexity. Just make sure the file only gets loaded once (e.g. by using a cache or making the variable static); that should be enough.
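The load-once idea above can be sketched like this. This is only an illustration: the class name, field, and file path are hypothetical, not part of the actual Chinese module; it just shows the static-cache pattern with double-checked locking.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical holder that reads a binary model file only once and
// caches it in a static field, so repeated checks don't re-load it.
final class NgramModelHolder {
  private static volatile byte[] modelData;  // cached after first load

  static byte[] getModelData(Path file) throws IOException {
    byte[] data = modelData;
    if (data == null) {
      synchronized (NgramModelHolder.class) {
        data = modelData;
        if (data == null) {
          data = Files.readAllBytes(file);  // expensive: happens once
          modelData = data;
        }
      }
    }
    return data;
  }
}
```

Every later call returns the same cached array instead of touching the disk again.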
Copy the files from language-zh.jar into target\LanguageTool-4.2-SNAPSHOT\LanguageTool-4.2-SNAPSHOT, keeping the same directory layout.
Copy word_trigram.binary to LanguageTool-4.2-SNAPSHOT/org/languagetool/resource/zh too.
Modify the root entry (line 3) in target/LanguageTool-4.2-SNAPSHOT/LanguageTool-4.2-SNAPSHOT/hanlp.properties to point to the correct directory. (In my case, it is /target/LanguageTool-4.2-SNAPSHOT/LanguageTool-4.2-SNAPSHOT/org/languagetool/resource.)
3. Usage
Run java -jar languagetool-commandline.jar -l zh-CN (or -l zh-TW) <text>.
Since the ngram data is very large, this command may cause a Java heap space error. You need to add the -Xmx option; in my case, -Xmx8000m works fine.
4. New Feature
1. Using ngram data to check Simplified Chinese text.
Have you tried what the minimum is? 8 GB would be too much for use in production. Where exactly does the memory usage come from? Do all ngrams get loaded into memory? If so, are there any plans to improve memory usage?
I think 8000m is close to the minimum. Loading the ngram file costs a lot of memory; according to this page, BerkeleyLM usually takes 4-6 GB to read the files. Improving memory usage is a main task of my third phase. I don't have any good ideas yet; I need to do some research, and then I will post solutions here. If you have any good ideas, please tell me. Thank you!
For ngrams, we use Lucene in some cases, as mentioned here: Finding errors using Big Data - LanguageTool Wiki. It means you need a fast hard disk (SSD), but memory usage will be very low, as only the index needs to be in memory. While Lucene is a full-text search engine, we basically use it for lookup: we provide the ngram as a search term and get back its occurrence count. That plus some calculation and you have a very basic language model. Let me know if you need to know more.
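The "occurrence count plus some calculation" idea can be sketched without Lucene. In this illustration a HashMap stands in for the index lookup (in LanguageTool the counts would come from the Lucene ngram index), and the class name and counts are made up:

```java
import java.util.HashMap;
import java.util.Map;

// Toy language model: occurrence counts plus a division give a
// conditional trigram probability. The HashMap stands in for the
// Lucene index lookup (ngram string -> occurrence count).
final class CountLanguageModel {
  private final Map<String, Long> counts = new HashMap<>();

  void addCount(String ngram, long count) {
    counts.put(ngram, count);
  }

  long getCount(String ngram) {
    return counts.getOrDefault(ngram, 0L);
  }

  // P(w3 | w1 w2) = count(w1 w2 w3) / count(w1 w2)
  double trigramProbability(String w1, String w2, String w3) {
    long bigramCount = getCount(w1 + " " + w2);
    if (bigramCount == 0) {
      return 0.0;  // no data for this context
    }
    return (double) getCount(w1 + " " + w2 + " " + w3) / bigramCount;
  }
}
```

An unusually low probability for a trigram in the input text is then a hint that it may contain an error.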
I read the code in languagetool/languagemodel. The LuceneLanguageModel class can calculate the probability of a complete sentence, or return the occurrence count of a sequence of words, when given ngram data in the appropriate format. But in my code, the ngram probabilities are calculated by a back-off model and saved in the ngram data. So, if I want to use Lucene, do I need to write a helper class based on it?
So your question is how to build such a Lucene index, is that correct? You can check out org.languagetool.dev.bigdata.AggregatedNgramToLucene for some code that creates an index. Actually, all the classes that use Lucene’s IndexWriter do this.
So I need to compile your fork first with Maven, is that right? And then copy the two files into the result? For me, that doesn't work yet; e.g. your second example doesn't find an error. Any idea?
I created the jar by running mvn package in the language-module/zh directory. Then the installation should be the same as last time; I followed the same steps, and it worked. Or you can try the following steps:
Download data.zip and then extract it to languagetool-standalone/target/LanguageTool-4.2-SNAPSHOT/LanguageTool-4.2-SNAPSHOT/org/languagetool/resource.
Download word_trigram.binary and char_unigram.binary. Copy them to languagetool-standalone/target/LanguageTool-4.2-SNAPSHOT/LanguageTool-4.2-SNAPSHOT/org/languagetool/resource/zh.
Thanks, that works for zh-TW. For zh-CN, I get: Exception in thread "main" java.lang.RuntimeException: Path zh/char_unigram.binary not found in class path at /org/languagetool/resource/zh/char_unigram.binary