Chinese part development daily record

t0iiz · July 12, 2018, 9:18am

I added the link above. Download the file and copy it into the same directory as word_trigram.binary.

dnaber · July 12, 2018, 11:52am

Okay, it’s working now I think. It’s slow only because of the one-time setup, isn’t it? Have you checked the performance per sentence (e.g. in “sentences per second”), not counting the setup time?

t0iiz · July 12, 2018, 12:17pm

Set up: about 7s.
Check: about 120 sentences per second.
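A minimal sketch of how the two figures can be measured separately, so that the one-time setup does not distort the per-sentence rate. `ThroughputTimer` and its methods are hypothetical helpers, not LanguageTool code; `check` stands in for whatever checks a single sentence.

```java
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical helper: times one-time setup and per-sentence checking separately. */
final class ThroughputTimer {

  /** Converts a sentence count and an elapsed time in nanoseconds into sentences per second. */
  static double sentencesPerSecond(int sentences, long elapsedNanos) {
    if (elapsedNanos <= 0) {
      throw new IllegalArgumentException("elapsedNanos must be positive");
    }
    return sentences * 1_000_000_000.0 / elapsedNanos;
  }

  /** Runs a task once (e.g. the rule setup) and returns the elapsed nanoseconds. */
  static long timeNanos(Runnable task) {
    long start = System.nanoTime();
    task.run();
    return System.nanoTime() - start;
  }

  /** Checks every sentence and reports throughput; setup time is not included. */
  static double measure(List<String> sentences, Consumer<String> check) {
    long elapsed = timeNanos(() -> sentences.forEach(check));
    return sentencesPerSecond(sentences.size(), Math.max(elapsed, 1));
  }
}
```

The setup would be timed once with `timeNanos`, and `measure` would then be called on a warmed-up rule with a fixed list of sentences.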

t0iiz · July 18, 2018, 6:17am

GSoC Phase 3

Plan

Make ChineseNgramProbabilityRule available for zh-TW.
Optimize the checking speed of the rule and lower memory usage.
Fix bugs.

dnaber · July 24, 2018, 2:43pm

What about memory usage, are you working on lowering that?

t0iiz · July 25, 2018, 7:09am

Yes.

dnaber · July 25, 2018, 7:11am

Great - please also remember to post short but daily reports here.

t0iiz · July 27, 2018, 5:53am

Hi, dnaber

I have compared my new rule using the Lucene-based solution and the BerkeleyLM-based solution.

Name	Rule Setup Time	Seconds per Sentence	Memory Usage	Ngram Data Size
Lucene	8s	4s	2G	3.65G (Lucene index)
BerkeleyLM	3s	0.1s	4G	1.7G (hash-based LM binary)

dnaber · July 27, 2018, 6:31am

What kind of hard disk did you use for this test? An SSD?

t0iiz · July 27, 2018, 6:45am

SSD. I retrained the language model to improve accuracy and found a bug in my test code: I had activated ChineseNgramProbabilityRule in SimplifiedChinese and then created a second instance of ChineseNgramProbabilityRule. As a result, the ngram data was loaded into memory twice, so it took 8G to run.
After fixing the bug, I can run java -jar languagetool-commandline -l zh-CN <text> without -Xmx8000m.
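One way to guard against this kind of double loading is to share a single loaded model per data path. The sketch below is only an illustration: `SharedModelCache` is a hypothetical class, and it is not necessarily how the fix was implemented.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical cache: loads each n-gram model at most once per path and shares it. */
final class SharedModelCache<M> {
  private final Map<String, M> models = new ConcurrentHashMap<>();
  private final Function<String, M> loader;

  SharedModelCache(Function<String, M> loader) {
    this.loader = loader;
  }

  /** Returns the already loaded model for this path, loading it only on first use. */
  M get(String path) {
    return models.computeIfAbsent(path, loader);
  }
}
```

With a cache like this, two rule instances that point at the same data file would hold the same model object instead of two copies.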

dnaber · July 27, 2018, 6:55am

How many lookups are you running per sentence? I’m just surprised that Lucene is that much slower than BerkeleyLM.

t0iiz · July 27, 2018, 7:11am

To find the right character at each position in the sentence, the rule replaces every char with the chars listed for it in a confusion dictionary and calculates the probability of the resulting sentence. (The candidate with the highest probability is regarded as the correct sentence.) So the longer the sentence is, the more queries are run.
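The procedure can be sketched roughly as follows. This is a simplified illustration, not the actual rule code: `ConfusionCorrector` is a hypothetical class, the scoring function stands in for the language model query, and the sketch assumes that only one position is replaced per candidate.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.function.ToDoubleFunction;

/** Hypothetical sketch of confusion-set correction; not the actual rule code. */
final class ConfusionCorrector {
  private final Map<Character, List<Character>> confusions;
  private final ToDoubleFunction<String> sentenceProbability;

  ConfusionCorrector(Map<Character, List<Character>> confusions,
                     ToDoubleFunction<String> sentenceProbability) {
    this.confusions = confusions;
    this.sentenceProbability = sentenceProbability;
  }

  /** Tries every single-character replacement and returns the most probable sentence. */
  String bestCandidate(String sentence) {
    String best = sentence;
    double bestProb = sentenceProbability.applyAsDouble(sentence);
    for (int i = 0; i < sentence.length(); i++) {
      List<Character> alternatives =
          confusions.getOrDefault(sentence.charAt(i), Collections.emptyList());
      for (char alternative : alternatives) {
        String candidate = sentence.substring(0, i) + alternative + sentence.substring(i + 1);
        // One language model query per candidate sentence.
        double prob = sentenceProbability.applyAsDouble(candidate);
        if (prob > bestProb) {
          best = candidate;
          bestProb = prob;
        }
      }
    }
    return best;
  }
}
```

Under these assumptions the number of queries is one for the original sentence plus one for every confusable alternative at every position, which is why it grows with sentence length.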

t0iiz · July 28, 2018, 2:44pm

July 28th

Ngram Rule supports zh-TW now.

Discussion
As I said above, my ngram rule can’t avoid querying in order to find replacements for wrong characters. Unlike English, which has only 26 letters, Chinese has more than 7000 characters, so the size of the query table is completely different.

I tried to make the Lucene-based approach run faster, but it turns out that the fastest it gets is 1800ms per sentence, while the Berkeley one takes 80ms per sentence. I also tried training a smaller language model so that the Berkeley one uses less memory. However, the results showed that although a smaller LM reduces memory usage, it greatly decreases checking accuracy.

What do you think?

dnaber · July 29, 2018, 6:52pm

I see. BerkeleyLM’s memory use might make it difficult to get this into production. Could you have both versions in the code, so one can switch between them (doesn’t need to be at runtime, a small code switch would be enough)?

dnaber · August 1, 2018, 7:48am

Ping… did you see my reply? I’d prefer if you could send a short daily report…

t0iiz · August 1, 2018, 8:03am

Sorry. I have added the switch feature. I am now writing tests to evaluate the whole system: finding resources, preprocessing the data set, and then writing the code.
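For illustration, a compile-time switch between the two backends could look roughly like this. All names here (`SentenceScorer`, `LmBackend`, `ScorerFactory`) are hypothetical and do not describe the actual implementation.

```java
import java.util.function.Supplier;

/** Hypothetical abstraction over the two language model backends discussed here. */
interface SentenceScorer {
  double probability(String sentence);
}

enum LmBackend { LUCENE, BERKELEY_LM }

final class ScorerFactory {
  /** The small code switch: change this constant and rebuild to use the other backend. */
  static final LmBackend BACKEND = LmBackend.LUCENE;

  /** Builds only the selected backend, so the other one's data is never loaded. */
  static SentenceScorer create(LmBackend backend,
                               Supplier<SentenceScorer> lucene,
                               Supplier<SentenceScorer> berkeleyLm) {
    switch (backend) {
      case LUCENE:
        return lucene.get();
      case BERKELEY_LM:
        return berkeleyLm.get();
      default:
        throw new IllegalArgumentException("Unknown backend: " + backend);
    }
  }
}
```

Passing suppliers instead of ready-made scorers means that only the chosen backend is constructed, which matters here because of the memory each one needs.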

dnaber · August 2, 2018, 9:37am

When running with -adl (language auto detection), I get this error:

Exception in thread "main" java.lang.IllegalStateException: A language profile for language zh-CN was added already!
	at com.optimaize.langdetect.LanguageDetectorBuilder.withProfile(LanguageDetectorBuilder.java:146)
	at com.optimaize.langdetect.LanguageDetectorBuilder.withProfiles(LanguageDetectorBuilder.java:162)
	at org.languagetool.language.LanguageIdentifier.<init>(LanguageIdentifier.java:67)
	at org.languagetool.commandline.Main.detectLanguageOfString(Main.java:475)
	at org.languagetool.commandline.Main.runOnFile(Main.java:178)
	at org.languagetool.commandline.Main.main(Main.java:457)

Could you see if you can fix this?

t0iiz · August 2, 2018, 12:01pm

Fixed now. I have modified some code (github) in LanguageIdentifier.java to make it work.
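The exception is thrown when the same language is registered twice with the detector builder. A generic way to avoid that is to drop duplicates before the profiles are passed on. The helper below is a hypothetical sketch of that idea and is not necessarily what the commit does; the strings in the usage note stand in for real profile objects.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical helper: removes items that share a key, e.g. profiles with the same language code. */
final class ProfileDeduplicator {

  /** Keeps the first item for each key and preserves the original order. */
  static <T, K> List<T> distinctByKey(List<T> items, Function<T, K> keyOf) {
    Map<K, T> firstSeen = new LinkedHashMap<>();
    for (T item : items) {
      firstSeen.putIfAbsent(keyOf.apply(item), item);
    }
    return new ArrayList<>(firstSeen.values());
  }
}
```

For example, given the entries "zh-CN:builtin", "en:builtin" and "zh-CN:extra" keyed by the part before the colon, only the first two would be kept.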

dnaber · August 2, 2018, 12:34pm

Thanks for the fast fix. ChineseNgramProbabilityRule.java still seems to have a hard coded path (C:\Dev\ngramDemo\data\test\index) so the tests fail for me, could you fix that, too?
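A minimal sketch of resolving the index location from a configurable base directory instead of an absolute path. The class name, the `ngram.data.dir` system property and the `<language>/index` layout are assumptions made for this example only.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper: derives the index directory from a configurable data directory. */
final class NgramIndexLocator {
  /** Assumed system property naming the n-gram data directory. */
  static final String PROPERTY = "ngram.data.dir";

  /** Resolves <dataDir>/<languageCode>/index without any machine-specific prefix. */
  static Path indexDir(Path dataDir, String languageCode) {
    return dataDir.resolve(languageCode).resolve("index");
  }

  /** Reads the data directory from the system property, falling back to a default. */
  static Path dataDirFromProperty(String defaultDir) {
    return Paths.get(System.getProperty(PROPERTY, defaultDir));
  }
}
```

A test could then point the property at a temporary directory, so it no longer depends on one developer's machine.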

t0iiz · August 2, 2018, 12:44pm

Yes. Also, I haven’t uploaded the Lucene index data yet, so you can’t run the rule right now.