
Chinese part development daily record


(Daniel Naber) #92

How often are you running a lookup per sentence? I just wonder why Lucene is so much slower than BerkeleyLM.


(Ze Dang) #93

In order to find the right character at each position of a sentence, the rule replaces every character with its candidates from a confusion dictionary and calculates the probability of the resulting sentence (the sentence with the highest probability is taken as the correct one). So the longer the sentence is, the more queries it runs.
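
Purely for illustration, here is a minimal Java sketch of that search loop; ConfusionDictionary, LanguageModel and the method names are hypothetical stand-ins, not the actual classes in the rule:

    import java.util.List;

    // Hypothetical interfaces standing in for the real confusion dictionary and language model.
    interface ConfusionDictionary {
      List<Character> candidatesFor(char c);   // similar-looking or similar-sounding characters
    }
    interface LanguageModel {
      double logProb(String sentence);         // one LM query (Lucene index or BerkeleyLM)
    }

    class ConfusionSearchSketch {
      // Try every confusion candidate at every position and keep the most probable sentence.
      static String bestCorrection(String sentence, ConfusionDictionary dict, LanguageModel lm) {
        String best = sentence;
        double bestScore = lm.logProb(sentence);
        for (int i = 0; i < sentence.length(); i++) {
          for (char candidate : dict.candidatesFor(sentence.charAt(i))) {
            String variant = sentence.substring(0, i) + candidate + sentence.substring(i + 1);
            double score = lm.logProb(variant);   // this is why longer sentences mean more queries
            if (score > bestScore) {
              bestScore = score;
              best = variant;
            }
          }
        }
        return best;
      }
    }

With several candidates per position, the number of logProb calls grows with sentence length, so the per-query cost of the backend dominates the total check time.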


(Ze Dang) #94

July 28th

  • Ngram Rule supports zh-TW now.

Discussion
As I said above, in order to find replacements for error characters in my ngram rule, we can’t avoid querying. Unlike English, which has only 26 letters, Chinese has more than 7,000 characters, so the size of the query table is on a completely different scale.

I tried to make the Lucene-based approach run faster, but it turns out that the best I can get is 1800ms per sentence, while the BerkeleyLM one takes 80ms per sentence. I also tried to train a smaller language model so that the BerkeleyLM version uses less memory. However, the results showed that although a smaller LM reduces memory usage, it greatly decreases checking accuracy.

What’s your idea?


(Daniel Naber) #95

I see. BerkeleyLM’s memory use might make it difficult to get this into production. Could you have both versions in the code, so one can switch between them (doesn’t need to be at runtime, a small code switch would be enough)?
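
A minimal sketch of such a switch, assuming a simple system property; the property name and the two wrapper classes below are made up for the example and are not the classes in the repo:

    // Hypothetical abstraction; the real rule would put its Lucene and BerkeleyLM code behind it.
    interface NgramBackend {
      double logProb(String sentence);
    }

    // Stub implementations just to keep the sketch self-contained.
    class LuceneBackend implements NgramBackend {
      public double logProb(String sentence) { return 0; /* query the Lucene index here */ }
    }
    class BerkeleyBackend implements NgramBackend {
      public double logProb(String sentence) { return 0; /* query the BerkeleyLM binary here */ }
    }

    class BackendFactory {
      // Pick the backend at startup via -Dzh.ngram.backend=lucene|berkeley (default: berkeley).
      // A compile-time constant or a config entry would work just as well, since the switch
      // does not need to happen at runtime.
      static NgramBackend create() {
        String choice = System.getProperty("zh.ngram.backend", "berkeley");
        return "lucene".equalsIgnoreCase(choice) ? new LuceneBackend() : new BerkeleyBackend();
      }
    }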


(Daniel Naber) #96

Ping… did you see my reply? I’d prefer if you could send a short daily report…


(Ze Dang) #97

Sorry. I have added the switch feature. I am now writing tests to evaluate the whole system: find resources, preprocess the data set, and then write the code.


(Daniel Naber) #98

When running with -adl (language auto detection), I get this error:

Exception in thread "main" java.lang.IllegalStateException: A language profile for language zh-CN was added already!
	at com.optimaize.langdetect.LanguageDetectorBuilder.withProfile(LanguageDetectorBuilder.java:146)
	at com.optimaize.langdetect.LanguageDetectorBuilder.withProfiles(LanguageDetectorBuilder.java:162)
	at org.languagetool.language.LanguageIdentifier.<init>(LanguageIdentifier.java:67)
	at org.languagetool.commandline.Main.detectLanguageOfString(Main.java:475)
	at org.languagetool.commandline.Main.runOnFile(Main.java:178)
	at org.languagetool.commandline.Main.main(Main.java:457)

Could you see if you can fix this?


(Ze Dang) #99

Fixed now. I have modified some code (github) in LanguageIdentifier.java to make it work.
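
The actual change is in the linked commit; purely as an illustration of the kind of guard that avoids the "profile … was added already" error, duplicate profiles could be filtered out before they reach the detector builder (the Profile class and field below are hypothetical stand-ins):

    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Hypothetical stand-in for the langdetect language profile used by LanguageIdentifier.
    class Profile {
      final String languageCode;
      Profile(String languageCode) { this.languageCode = languageCode; }
    }

    class ProfileDedupSketch {
      // Keep only the first profile per language code, so zh-CN is never added twice.
      static List<Profile> dedupe(List<Profile> profiles) {
        Set<String> seen = new HashSet<>();
        List<Profile> unique = new ArrayList<>();
        for (Profile p : profiles) {
          if (seen.add(p.languageCode)) {
            unique.add(p);
          }
        }
        return unique;
      }
    }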


(Daniel Naber) #100

Thanks for the fast fix. ChineseNgramProbabilityRule.java still seems to have a hard-coded path (C:\Dev\ngramDemo\data\test\index), so the tests fail for me; could you fix that, too?
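
Not the actual fix, but one generic way to avoid a developer-specific absolute path is to read the index location from a system property and skip gracefully when it is not set; the property name zh.ngram.index is made up for this sketch:

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    class IndexLocationSketch {
      // Resolve the Lucene index directory from -Dzh.ngram.index=/path/to/index instead of a
      // hard-coded path. Returns null if the property is unset or the directory is missing,
      // so the rule (and its tests) can skip instead of failing.
      static Path resolveIndexDir() {
        String configured = System.getProperty("zh.ngram.index");
        if (configured == null) {
          return null;
        }
        Path dir = Paths.get(configured);
        return Files.isDirectory(dir) ? dir : null;
      }
    }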


(Ze Dang) #101

Yes. And I haven’t uploaded the Lucene index data yet, so you can’t run the rule at the moment.


(Daniel Naber) #102

The data was too large, wasn’t it? Is there a way you can create a subset of it, just enough to make the tests work? If that’s not possible, how large was the data that’s still missing?


(Ze Dang) #103

About 8 GB. I am testing the different language models I trained to see which one is best. After that, I will upload my data.


(Ze Dang) #104

Summary of GSoC with LanguageTool

Author: Ze Dang

Email: 4649tz@gmail.com

Chinese is the most widely spoken language in the world. Thanks to China's long history and special cultural charm, there are more and more learners of Chinese. Thus, I have worked on maintaining and improving Chinese support in LanguageTool over the past three months.

If you run into any difficulties or have a good idea, please post it here or send me an email :)

Downloads

Github repository: https://github.com/hyousi/languagetool

Tokenization data: https://drive.google.com/open?id=1OMBIlXnBAelIAIT4pws85GZj1tt8vFNB

Trigram data

Note: LuceneIndex vs BerkeleyLM

                                Lucene       BerkeleyLM
    Setup time                  3 s          9 s
    Memory usage                as normal    8 GB
    Check speed (per sentence)  1 s          27 ms

Conclusion:

  • The Lucene index slows checking down because the rule makes many on-disk queries per sentence.
  • BerkeleyLM runs much faster but uses far more memory.
  • KenLM is smaller and faster than BerkeleyLM, but it is written in C++. Reference: https://kheafield.com/code/kenlm/benchmark/

Installation

  • Download the code from my GitHub repo.
  • Run mvn install -DskipTests in the root directory.
  • Download the tokenization data and extract it to languagetool/resource.
  • Choose the trigram data format you prefer, then download and extract it to languagetool/resource/zh.
  • Modify hanlp.properties in languagetool-standalone\target\LanguageTool-4.2-SNAPSHOT\LanguageTool-4.2-SNAPSHOT so that root= points to languagetool/resource.

TODO

  • Add more rules in grammar.xml.
  • Make the ngram rule check faster. Ideas:
    • Rewrite the rule with another algorithm.
    • Implement KenLM in pure Java.
    • Use JNI to call KenLM's native functions.

(Daniel Naber) #105

I’m trying the latest version, but even though it only works with 6GB (which means I’m using BerkeleyLM, right?), it’s still slow. I start a server with this command:

java -Xmx6000m -cp languagetool-server.jar org.languagetool.server.HTTPServer --port 8081

Then I run some checks as a warmup. When I then profile with jvisualvm, the result looks like this:

readObjFile sounds to me as if something gets initialized over and over. Can you reproduce this?


(Ze Dang) #106

Thanks. I will fix it ASAP.


(Ze Dang) #107

It isn’t initialized over and over. It just takes a long time because word_trigram.binary is large.


(Daniel Naber) #108

LmReaders.readLmBinary in RuleHelper gets called for every request. That means it takes several seconds even for short sentences. Can you add some caching there? See the cache in, for example, ConfusionProbabilityRule for how we do that in other parts of LT.
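
As an illustration (RuleHelper's real structure may differ), caching the loaded binary so that it is read from disk only once per JVM could look roughly like this; the cache shape is an assumption, while LmReaders.readLmBinary(String) is the BerkeleyLM loader mentioned above:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    import edu.berkeley.nlp.lm.NgramLanguageModel;
    import edu.berkeley.nlp.lm.io.LmReaders;

    class LmCacheSketch {
      // One entry per binary LM file; kept across requests so readLmBinary runs only once per path.
      private static final Map<String, NgramLanguageModel<String>> CACHE = new ConcurrentHashMap<>();

      static NgramLanguageModel<String> getLm(String binaryPath) {
        return CACHE.computeIfAbsent(binaryPath, path -> LmReaders.readLmBinary(path));
      }
    }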


(Ze Dang) #109

Done. You can pull it now. Should I also add caches for the unigram data (531 KB) and similarDictionary (237 KB)?


(Daniel Naber) #110

Yes, please. Caching is important so short sentences are checked fast. Users often submit short text.


(Ze Dang) #111

Fixed it now.