Chinese support development daily record

May 30th

  • Added more comments to my work.
  • Searched for Chinese ngram resources.

Thanks for the zip! The download took a while, but I’ll look into it :)

May 31st

  • Downloaded the zhwiki corpus and cleaned the data.
  • Installed KenLM to train language models.

I plan to train two models, one character-based and one word-based, and figure out which works better.
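To illustrate the difference between the two, here is a minimal sketch (not the actual preprocessing script) of how the same sentence would be tokenized for each model. HanLP, which this project later depends on, is used for the word segmentation, but any segmenter would do:

```java
import java.util.List;
import java.util.stream.Collectors;

import com.hankcs.hanlp.HanLP;
import com.hankcs.hanlp.seg.common.Term;

// Illustrative only: shows how one sentence is tokenized for the two model types.
public class TokenizeDemo {
    public static void main(String[] args) {
        String sentence = "我今天去图书馆看书";

        // Character-based: each CJK character becomes one token.
        List<String> chars = sentence.codePoints()
                .mapToObj(cp -> new String(Character.toChars(cp)))
                .collect(Collectors.toList());
        System.out.println("char tokens: " + chars);

        // Word-based: a segmenter decides the word boundaries.
        List<String> words = HanLP.segment(sentence).stream()
                .map(term -> term.word)
                .collect(Collectors.toList());
        System.out.println("word tokens: " + words);
    }
}
```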

June 1st

  • Training the models.

June 5th

I trained on part of the zhwiki corpus (about 1 GB) and got three models:

  1. Character-based bigram, 34 MB
  2. Character-based trigram, 230 MB
  3. Word-based bigram, 160 MB

I think it is better to combine them to detect errors.

Plan:

  • Implement the detection algorithm.

June 6th

  • Evaluating Language Models

June 6th

I have trained the language models and written a Python prototype of the detection to evaluate my idea and the models.
My idea is to combine bigram, trigram and 4-gram scores for every character in the sentence to find potential errors. It works well. Now I plan to implement it in Java.
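To make the idea concrete, here is a rough Java sketch of the scoring scheme (not the actual Python script): `CharLm` is a hypothetical wrapper around the trained character models, and the weights and threshold are made-up placeholders.

```java
import java.util.ArrayList;
import java.util.List;

// Rough sketch of combining n-gram scores per character to flag suspicious positions.
public class NgramScorer {

    /** Hypothetical wrapper around one character-based n-gram model. */
    interface CharLm {
        /** log10 probability of the given character n-gram. */
        double logProb(List<String> ngram);
    }

    private final CharLm bigram;
    private final CharLm trigram;
    private final CharLm fourgram;

    NgramScorer(CharLm bigram, CharLm trigram, CharLm fourgram) {
        this.bigram = bigram;
        this.trigram = trigram;
        this.fourgram = fourgram;
    }

    /** Positions whose combined score falls below the threshold are flagged. */
    List<Integer> findSuspiciousChars(List<String> chars, double threshold) {
        List<Integer> suspicious = new ArrayList<>();
        // Start at 1 so that every position has at least one character of left context.
        for (int i = 1; i < chars.size(); i++) {
            double score = 0.5 * logProb(bigram, chars, i, 2)
                         + 0.3 * logProb(trigram, chars, i, 3)
                         + 0.2 * logProb(fourgram, chars, i, 4);
            if (score < threshold) {
                suspicious.add(i);
            }
        }
        return suspicious;
    }

    /** Score the n-gram of the given order that ends at position i (shorter near the start). */
    private double logProb(CharLm lm, List<String> chars, int i, int order) {
        int from = Math.max(0, i - order + 1);
        return lm.logProb(chars.subList(from, i + 1));
    }
}
```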

Do you have an evaluation, i.e. how “good” does it work?

I tested it with two approaches:

  1. Data from the <category name="成语错误" id="CN_CAT2"> ("idiom errors") category of grammar.xml. All the test sentences passed.

  2. A misspelling-error subset of the SIGHAN 2015 data. Tested with 28 sentences; 24 passed.

I read through the LT code base today. Here are my questions:

  • I can’t directly use the Google ngram corpus. The Chinese ngram data is word-based, but I mainly need character-based data. That means I need to train my own model, as I did before.
  • I read the code in languagetool-core/src/main/java/org/languagetool/languagemodel/BaseLanguageModel.java. I didn’t find a way to read ARPA files, which contain the ngram data.
  • edu.berkeley.nlp.lm provides a way to read them, but it is a little slow (see the sketch below).
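For reference, this is roughly how I load and query an ARPA model with berkeleylm; the method names are written from memory, so treat them and the file name as assumptions to verify:

```java
import java.util.Arrays;
import java.util.List;

import edu.berkeley.nlp.lm.NgramLanguageModel;
import edu.berkeley.nlp.lm.io.LmReaders;

// Minimal sketch: load a character-based ARPA model and score one n-gram.
public class ArpaDemo {
    public static void main(String[] args) {
        // Parse the textual ARPA file directly (this is the slow part).
        NgramLanguageModel<String> lm =
                LmReaders.readArrayEncodedLmFromArpa("zh_char_trigram.arpa", false);

        // Score one character trigram; tokens must match the training vocabulary.
        List<String> ngram = Arrays.asList("图", "书", "馆");
        System.out.println("log prob: " + lm.getLogProb(ngram));
    }
}
```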

I want to know how I should implement this rule (detector and corrector). Can I call the Python script from Java to get suggestions, or do I need to write a Java version of the rule instead?

Calling Python is not really an option (other than for testing). The rule can, like any rule, just extend Rule and implement the match method I think. How exactly the rule gets its data is up to you, feel free to use the approach we use for ngram data, because its Lucene index is both fast and memory friendly. But if you find a better approach, let us know.
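For reference, a bare-bones sketch of what such a rule could look like (class name, rule ID and description are placeholders, and the model loading and scoring are left out):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.languagetool.AnalyzedSentence;
import org.languagetool.rules.Rule;
import org.languagetool.rules.RuleMatch;

// Skeleton of an n-gram based error detection rule; all names are placeholders.
public class ChineseNgramProbabilityRule extends Rule {

  @Override
  public String getId() {
    return "ZH_NGRAM_PROBABILITY";  // placeholder rule ID
  }

  @Override
  public String getDescription() {
    return "Detects unlikely characters using character n-gram probabilities";
  }

  @Override
  public RuleMatch[] match(AnalyzedSentence sentence) throws IOException {
    List<RuleMatch> matches = new ArrayList<>();
    String text = sentence.getText();
    // TODO: score each character of `text` with the n-gram models and, for every
    // suspicious span, add a RuleMatch with start/end offsets and a message.
    return toRuleMatchArray(matches);
  }
}
```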

June 11th

berkeleylm provides a way to convert an ARPA-format language model to binary. That makes loading it much faster.
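Roughly the conversion step (berkeleylm method names and file names are from memory, so treat them as assumptions): convert the ARPA file once offline, then load the binary at runtime.

```java
import edu.berkeley.nlp.lm.NgramLanguageModel;
import edu.berkeley.nlp.lm.io.LmReaders;

// One-time conversion from ARPA to berkeleylm's binary format, then a fast load.
public class ConvertToBinary {
    public static void main(String[] args) {
        // Slow: parse the textual ARPA file once.
        NgramLanguageModel<String> lm =
                LmReaders.readArrayEncodedLmFromArpa("zh_char_trigram.arpa", false);

        // Write the compact binary representation.
        LmReaders.writeLmBinary(lm, "zh_char_trigram.binary");

        // Later runs load the binary directly, which is much faster.
        NgramLanguageModel<String> fastLm =
                LmReaders.readLmBinary("zh_char_trigram.binary");
        System.out.println("order: " + fastLm.getLmOrder());
    }
}
```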

Updated the skeleton code of the new rule.

I was just going to try this, but it didn’t work. When I unzip this over LT 4.1 and replace all files except META-INF/org/languagetool/language-module.properties, I get java.lang.NoClassDefFoundError: com/hankcs/hanlp/utility/SentencesUtil. I could probably fix this myself, but maybe there’s an easy solution. Have you tried the JAR yourself?

Fixed now.

So I can re-download the JAR from the same link and it’s a fixed version?

In my case, I copy the files from language-zh.jar into target\LanguageTool-4.2-SNAPSHOT\LanguageTool-4.2-SNAPSHOT. Then I can use java -jar languagetool-commandline.jar -l zh-CN/zh-TW <text> to check text.

It doesn’t work for you now?

How can I use this? java -jar language-zh-4.2-SNAPSHOT-jar-with-dependencies.jar just prints Error: Could not find or load main class it.bitrack.main.Main for me.

June 15th

  • Completed the detection module.
  • Started to implement the correction module.

June 20th

  • Working on the correction module.

June 21st

  • Completed the correction module, i.e. I have finished the RuleMatch ruleMatch = new RuleMatch(this, sentence, startPos, endPos, ""); part of the match method.
  • Plan: add suggestions next (see the sketch after this list).
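Concretely, inside the rule’s match method (same skeleton as sketched earlier) the suggestion step should look roughly like this; the offsets and the candidate "馆" are placeholders for whatever the detection and correction modules actually produce:

```java
// Belongs inside the n-gram rule class; needs java.util.Arrays in addition to the imports above.
@Override
public RuleMatch[] match(AnalyzedSentence sentence) throws IOException {
  List<RuleMatch> matches = new ArrayList<>();
  // Placeholder offsets: in the real rule they come from the detection module.
  int startPos = 0;
  int endPos = 1;
  RuleMatch ruleMatch = new RuleMatch(this, sentence, startPos, endPos,
      "This character is unlikely in this context.");
  // Attach the candidates produced by the correction module as suggestions.
  ruleMatch.setSuggestedReplacements(Arrays.asList("馆"));  // placeholder candidate
  matches.add(ruleMatch);
  return toRuleMatchArray(matches);
}
```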

June 25th

  • Started to write the suggestion part.