Chinese support development log

The data was too large, wasn’t it? Is there a way you can create a subset of it, just enough to make the tests work? If that’s not possible, how large was the data that’s still missing?

About 8 GB. I am testing the different language models I trained to see which one is best. After that, I will upload my data.

Summary of GSoC with LanguageTool

Author: Ze Dang

Email: 4649tz@gmail.com

Chinese is the most widely spoken language in the world. Thanks to China's long history and rich culture, there are more and more learners of Chinese. For this reason, I have worked on maintaining and improving Chinese language support in LanguageTool over the past three months.

If you run into any difficulties or have ideas for improvement, please post them here or send me an email. :)

Downloads

GitHub repository: GitHub - hyousi/languagetool: Style and Grammar Checker for 25+ Languages

Tokenization data: data.zip - Google Drive

Trigram data

Note: LuceneIndex vs BerkeleyLM

                            Lucene    BerkeleyLM
Setup time                  3 s       9 s
Memory usage                normal    8 GB
Check speed (per sentence)  1 s       27 ms

Conclusion:

  • The Lucene index slows down checking because the rule makes many disk queries per sentence.
  • BerkeleyLM runs faster but uses much more memory.
  • KenLM is smaller and faster than BerkeleyLM, but it is written in C++. Reference: benchmark . kenlm . code . Kenneth Heafield
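For context, this is roughly how the BerkeleyLM variant works: the trigram binary (the word_trigram.binary file discussed later in this thread) is loaded once, and the rule then queries trigram probabilities from it. The demo class below is only a sketch; the file path follows the installation steps further down:

import java.util.Arrays;
import java.util.List;

import edu.berkeley.nlp.lm.NgramLanguageModel;
import edu.berkeley.nlp.lm.io.LmReaders;

// Sketch: load a BerkeleyLM binary once, then query a trigram.
public class TrigramDemo {
  public static void main(String[] args) {
    // Loading the large binary is what dominates the 9 s setup time.
    NgramLanguageModel<String> lm =
        LmReaders.readLmBinary("languagetool/resource/zh/word_trigram.binary");
    // A low log-probability suggests an unlikely word sequence, which the
    // n-gram rule can flag as a possible error.
    List<String> trigram = Arrays.asList("我", "是", "学生");
    System.out.println("log P = " + lm.getLogProb(trigram));
  }
}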

Installation

  • Download the code from my GitHub repo.
  • Run mvn install -DskipTests in the root directory.
  • Download the tokenization data and extract it to languagetool/resource.
  • Choose the trigram data format you prefer, then download and extract it to languagetool/resource/zh.
  • Modify hanlp.properties in languagetool-standalone\target\LanguageTool-4.2-SNAPSHOT\LanguageTool-4.2-SNAPSHOT and set root= to languagetool/resource (an example follows this list).
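The edit in hanlp.properties is a single line; the absolute path below is only an example, so point it at wherever your languagetool/resource directory actually lives:

root=/path/to/languagetool/resource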

TODO

  • Add more rules to grammar.xml.
  • Make the n-gram rule check faster. Ideas:
    • Rewrite the rule with a different algorithm.
    • Implement KenLM in pure Java.
    • Use JNI to call KenLM's native functions (see the sketch after this list).
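To make the JNI idea concrete, a bridge could look roughly like the sketch below. This is purely hypothetical: the class, the native method names, and the kenlm_jni library do not exist yet and would need C++ glue code around KenLM's model API.

// Hypothetical JNI bridge to KenLM; all names here are placeholders.
public class KenLmWrapper {

  static {
    // Loads the native glue library (e.g. libkenlm_jni.so), which would
    // wrap KenLM's C++ model loading and scoring functions.
    System.loadLibrary("kenlm_jni");
  }

  // Opens the KenLM model file on the native side and returns a handle.
  private native long openModel(String path);

  // Scores one n-gram via the native model; returns a log-probability.
  private native float logProb(long modelHandle, String[] ngram);

  // Frees the native model behind the handle.
  private native void closeModel(long modelHandle);
}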

I’m trying the latest version, but even though it only works with 6 GB of heap (which means I’m using BerkeleyLM, right?), it’s still slow. I start a server with this command:

java -Xmx6000m -cp languagetool-server.jar org.languagetool.server.HTTPServer --port 8081
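A warmup check against that server can be sent like this (using the standard LanguageTool HTTP API; the sample text is arbitrary):

curl --data "language=zh-CN&text=我是一个学生" http://localhost:8081/v2/check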

Then I run some checks as a warmup. When I then profile with jvisualvm, the result shows most of the time being spent in readObjFile.

readObjFile sounds to me as if something gets initialized over and over. Can you reproduce this?

Thanks. I will fix it ASAP.

It isn’t initialized over and over. word_trigram.binary is large, so loading it takes a long time.

LmReaders.readLmBinary in RuleHelper gets called for every request. That means it takes several seconds even for short sentences. Can you add some caching there? See the cache in, for example, ConfusionProbabilityRule for how we do that in other parts of LT.
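A minimal sketch of the kind of caching being asked for, assuming RuleHelper currently calls LmReaders.readLmBinary on each request (the class and method names below are illustrative):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import edu.berkeley.nlp.lm.NgramLanguageModel;
import edu.berkeley.nlp.lm.io.LmReaders;

// Sketch: cache the loaded model per file so the expensive readLmBinary
// call happens once per JVM rather than once per request.
class LmCache {
  private static final Map<String, NgramLanguageModel<String>> CACHE =
      new ConcurrentHashMap<>();

  static NgramLanguageModel<String> get(String path) {
    return CACHE.computeIfAbsent(path, LmReaders::readLmBinary);
  }
}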

Done. You can pull it now. Should I also add a cache for the unigram data (531 KB) and similarDictionary (237 KB)?

Yes, please. Caching is important so short sentences are checked fast. Users often submit short text.

Fixed it now.