Chinese support development daily record

May 30th

  • Added more comments to my work.
  • Searched for Chinese ngram resources.

Thanks for the zip! The download took a while, but I’ll look into it :)

May 31st

  • Downloaded the zhwiki corpus and cleaned the data.
  • Installed KenLM to train language models.

I plan to train two models, one character-based and one word-based, and figure out which works better.
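To illustrate the difference between the two, here is a minimal sketch (not the actual preprocessing script) of how the same sentence would be tokenized for each model. HanLP, which this project later depends on, is used for the word segmentation, but any segmenter would do:

```java
import java.util.List;
import java.util.stream.Collectors;

import com.hankcs.hanlp.HanLP;
import com.hankcs.hanlp.seg.common.Term;

// Illustrative only: shows how one sentence is tokenized for the two model types.
public class TokenizeDemo {
    public static void main(String[] args) {
        String sentence = "我今天去图书馆看书";

        // Character-based: each CJK character becomes one token.
        List<String> chars = sentence.codePoints()
                .mapToObj(cp -> new String(Character.toChars(cp)))
                .collect(Collectors.toList());
        System.out.println("char tokens: " + chars);

        // Word-based: a segmenter decides the word boundaries.
        List<String> words = HanLP.segment(sentence).stream()
                .map(term -> term.word)
                .collect(Collectors.toList());
        System.out.println("word tokens: " + words);
    }
}
```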

June 1st

  • Training the models.

June 5th

I trained on part of the zhwiki corpus (about 1 GB) and got three models:

  1. Character-based bigram, 34 MB
  2. Character-based trigram, 230 MB
  3. Word-based bigram, 160 MB

I think it is better to combine them to detect errors.

Plan:

  • Implement the detection algorithm.

June 6th

  • Evaluating Language Models

June 6th

I have trained the language models and written a Python prototype of the detection to evaluate my idea and the models.
My idea is to combine bigram, trigram and 4-gram scores for every character in the sentence to find potential errors. It works well. Now I plan to implement it in Java.
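To make the idea concrete, here is a rough Java sketch of the scoring scheme (not the actual Python script): `CharLm` is a hypothetical wrapper around the trained character models, and the weights and threshold are made-up placeholders.

```java
import java.util.ArrayList;
import java.util.List;

// Rough sketch of combining n-gram scores per character to flag suspicious positions.
public class NgramScorer {

    /** Hypothetical wrapper around one character-based n-gram model. */
    interface CharLm {
        /** log10 probability of the given character n-gram. */
        double logProb(List<String> ngram);
    }

    private final CharLm bigram;
    private final CharLm trigram;
    private final CharLm fourgram;

    NgramScorer(CharLm bigram, CharLm trigram, CharLm fourgram) {
        this.bigram = bigram;
        this.trigram = trigram;
        this.fourgram = fourgram;
    }

    /** Positions whose combined score falls below the threshold are flagged. */
    List<Integer> findSuspiciousChars(List<String> chars, double threshold) {
        List<Integer> suspicious = new ArrayList<>();
        // Start at 1 so that every position has at least one character of left context.
        for (int i = 1; i < chars.size(); i++) {
            double score = 0.5 * logProb(bigram, chars, i, 2)
                         + 0.3 * logProb(trigram, chars, i, 3)
                         + 0.2 * logProb(fourgram, chars, i, 4);
            if (score < threshold) {
                suspicious.add(i);
            }
        }
        return suspicious;
    }

    /** Score the n-gram of the given order that ends at position i (shorter near the start). */
    private double logProb(CharLm lm, List<String> chars, int i, int order) {
        int from = Math.max(0, i - order + 1);
        return lm.logProb(chars.subList(from, i + 1));
    }
}
```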

Do you have an evaluation, i.e. how “good” does it work?

I tested it with two approaches:

  1. Data from the <category name="成语错误" id="CN_CAT2"> ("idiom errors") category of grammar.xml. All the test sentences passed.

  2. A misspelling-error subset of the SIGHAN 2015 data. Tested with 28 sentences; 24 passed.

I read through the LT code base today. Here are my questions:

  • I can’t directly use the Google ngram corpus. The Chinese ngram data is word-based, but I mainly need character-based data. That means I need to train my own model, as I did before.
  • I read the code in languagetool-core/src/main/java/org/languagetool/languagemodel/BaseLanguageModel.java. I didn’t find a way to read ARPA files, which contain the ngram data.
  • edu.berkeley.nlp.lm provides a way to read them, but it is a little slow (see the sketch below).
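For reference, this is roughly how I load and query an ARPA model with berkeleylm; the method names are written from memory, so treat them and the file name as assumptions to verify:

```java
import java.util.Arrays;
import java.util.List;

import edu.berkeley.nlp.lm.NgramLanguageModel;
import edu.berkeley.nlp.lm.io.LmReaders;

// Minimal sketch: load a character-based ARPA model and score one n-gram.
public class ArpaDemo {
    public static void main(String[] args) {
        // Parse the textual ARPA file directly (this is the slow part).
        NgramLanguageModel<String> lm =
                LmReaders.readArrayEncodedLmFromArpa("zh_char_trigram.arpa", false);

        // Score one character trigram; tokens must match the training vocabulary.
        List<String> ngram = Arrays.asList("图", "书", "馆");
        System.out.println("log prob: " + lm.getLogProb(ngram));
    }
}
```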

I want to know how I should implement this rule (detector and corrector). Can I call the Python script from Java to get suggestions, or do I need to write a Java version of the rule instead?

Calling Python is not really an option (other than for testing). The rule can, like any rule, just extend Rule and implement the match method I think. How exactly the rule gets its data is up to you, feel free to use the approach we use for ngram data, because its Lucene index is both fast and memory friendly. But if you find a better approach, let us know.
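For reference, a bare-bones sketch of what such a rule could look like (class name, rule ID and description are placeholders, and the model loading and scoring are left out):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.languagetool.AnalyzedSentence;
import org.languagetool.rules.Rule;
import org.languagetool.rules.RuleMatch;

// Skeleton of an n-gram based error detection rule; all names are placeholders.
public class ChineseNgramProbabilityRule extends Rule {

  @Override
  public String getId() {
    return "ZH_NGRAM_PROBABILITY";  // placeholder rule ID
  }

  @Override
  public String getDescription() {
    return "Detects unlikely characters using character n-gram probabilities";
  }

  @Override
  public RuleMatch[] match(AnalyzedSentence sentence) throws IOException {
    List<RuleMatch> matches = new ArrayList<>();
    String text = sentence.getText();
    // TODO: score each character of `text` with the n-gram models and, for every
    // suspicious span, add a RuleMatch with start/end offsets and a message.
    return toRuleMatchArray(matches);
  }
}
```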

June 11th

berkeleylm provides a way to convert an ARPA-format language model to binary. That makes loading it much faster.
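Roughly the conversion step (berkeleylm method names and file names are from memory, so treat them as assumptions): convert the ARPA file once offline, then load the binary at runtime.

```java
import edu.berkeley.nlp.lm.NgramLanguageModel;
import edu.berkeley.nlp.lm.io.LmReaders;

// One-time conversion from ARPA to berkeleylm's binary format, then a fast load.
public class ConvertToBinary {
    public static void main(String[] args) {
        // Slow: parse the textual ARPA file once.
        NgramLanguageModel<String> lm =
                LmReaders.readArrayEncodedLmFromArpa("zh_char_trigram.arpa", false);

        // Write the compact binary representation.
        LmReaders.writeLmBinary(lm, "zh_char_trigram.binary");

        // Later runs load the binary directly, which is much faster.
        NgramLanguageModel<String> fastLm =
                LmReaders.readLmBinary("zh_char_trigram.binary");
        System.out.println("order: " + fastLm.getLmOrder());
    }
}
```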

Updated the skeleton code of the new rule.

I was just going to try this, but it didn’t work. When I unzip this over LT 4.1 and replace all files except META-INF/org/languagetool/language-module.properties, I get java.lang.NoClassDefFoundError: com/hankcs/hanlp/utility/SentencesUtil. I could probably fix this myself, but maybe there’s an easy solution. Have you tried the JAR yourself?

Fixed now.

So I can re-download the JAR from the same link and it’s a fixed version?

In my case, I copy the files from language-zh.jar into target\LanguageTool-4.2-SNAPSHOT\LanguageTool-4.2-SNAPSHOT. Then I can use java -jar languagetool-commandline.jar -l zh-CN/zh-TW <text> to check text.

It doesn’t work for you now?

How can I use this? java -jar language-zh-4.2-SNAPSHOT-jar-with-dependencies.jar just prints Error: Could not find or load main class it.bitrack.main.Main for me.

June 15th

  • Completed the detection module.
  • Started to implement the correction module.

June 20th

  • Working on the correction module.

June 21st

  • Completed the correction module, i.e. I have finished the RuleMatch ruleMatch = new RuleMatch(this, sentence, startPos, endPos, ""); part of the match method.
  • Plan: add suggestions next (see the sketch after this list).
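Concretely, inside the rule’s match method (same skeleton as sketched earlier) the suggestion step should look roughly like this; the offsets and the candidate "馆" are placeholders for whatever the detection and correction modules actually produce:

```java
// Belongs inside the n-gram rule class; needs java.util.Arrays in addition to the imports above.
@Override
public RuleMatch[] match(AnalyzedSentence sentence) throws IOException {
  List<RuleMatch> matches = new ArrayList<>();
  // Placeholder offsets: in the real rule they come from the detection module.
  int startPos = 0;
  int endPos = 1;
  RuleMatch ruleMatch = new RuleMatch(this, sentence, startPos, endPos,
      "This character is unlikely in this context.");
  // Attach the candidates produced by the correction module as suggestions.
  ruleMatch.setSuggestedReplacements(Arrays.asList("馆"));  // placeholder candidate
  matches.add(ruleMatch);
  return toRuleMatchArray(matches);
}
```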

June 25th

  • Started to write the suggestion part.