Chinese module development daily record

Hi Ze Dang, Lena did some tests and found some errors that are not detected. Will these errors be detected in the future? Is their not being detected a bug or by design, or are we testing incorrectly? Sorry for not having a better overview, but Lena and I have been distracted by other projects recently:

  1. 我很西欢喝咖啡。(should be: 我很喜欢喝咖啡。) - error not detected
  2. 金天天气不好,说以我不出去了。(should be: 今天天气不好,所以我不出去了。) - only second error is detected
  3. 我门是好朋友。(should be: 我们是好朋友。)- error not detected

These are misspelling errors that should be detected by the ngram rule, which I am developing now.

Thanks! Could you elaborate a little on which error classes should already be detected, and which are still on the todo list?

For Simplified Chinese:

  • ChineseConfusionProbabilityRule (works the same as in the old version)
  • TODO: NgramRule (in development now)

For Traditional Chinese:

  • ChineseConfusionProbabilityRule
  • AmbiguityRule (now complete)
    e.g. “公曆”, which is converted from “公历”, can be corrected to “西曆”.

June 28th

  • Completed NgramProbabilityRule for zh-CN (the main task of my proposal).

Feature:
This rule detects misspelling errors in sentences composed of Simplified Chinese (zh-CN) characters and suggests corrections at the same time.

e.g.

Input → Output
我很西欢喝咖啡。 → 我很喜欢喝咖啡。
金天天气不好,说以我不出去了。 → 今天天气不好,所以我不出去了。
我门是好朋友。 → 我们是好朋友。

TODO:

  • Suggestions are currently drawn only from characters with a similar pronunciation (pinyin).
    • Extend this to characters with a similar shape.
  • Suggestions are loaded from a txt file and stored in a HashMap<String, List> (see the loading sketch after this list).
    • Convert the txt file to binary to make reading faster.
  • Optimize the algorithm for faster detection.
  • Make it available for zh-TW.
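As a rough illustration, loading such a suggestion table could look like the minimal sketch below; the class name and the line format (character, tab, comma-separated candidates) are assumptions for illustration, not necessarily the exact format the rule uses.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Minimal sketch: load a suggestion table from a txt file into a
    // HashMap<String, List<String>>. Assumed line format (illustrative):
    //   character<TAB>candidate1,candidate2,...
    public class SuggestionTableLoader {
      public static Map<String, List<String>> load(String path) throws IOException {
        Map<String, List<String>> table = new HashMap<>();
        try (BufferedReader reader = Files.newBufferedReader(Paths.get(path), StandardCharsets.UTF_8)) {
          String line;
          while ((line = reader.readLine()) != null) {
            String[] parts = line.split("\t");
            if (parts.length == 2) {
              table.put(parts[0], new ArrayList<>(Arrays.asList(parts[1].split(","))));
            }
          }
        }
        return table;
      }
    }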

How large is that text file and how long does loading it take? In many cases, we just cache such files in LT to make sure it’s just loaded once.

About 300 KB.

Having binary files often means increased complexity. Just make sure the file only gets loaded once (e.g. by using a cache or by making the variable static); that should be enough.
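To make the "load once" advice concrete, a lazily initialized static field is usually enough. A minimal sketch, with a hypothetical class name and resource path, reusing the SuggestionTableLoader sketch from the earlier post:

    import java.io.IOException;
    import java.util.List;
    import java.util.Map;

    // Load the suggestion table once and reuse it; double-checked locking
    // keeps the lazy initialization thread-safe.
    public class NgramRuleData {
      private static volatile Map<String, List<String>> suggestionTable;

      static Map<String, List<String>> getSuggestionTable() throws IOException {
        if (suggestionTable == null) {
          synchronized (NgramRuleData.class) {
            if (suggestionTable == null) {
              // Hypothetical resource path, for illustration only.
              suggestionTable = SuggestionTableLoader.load("/org/languagetool/resource/zh/suggestions.txt");
            }
          }
        }
        return suggestionTable;
      }
    }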

Thank you.

July 4th

  • Extended the NgramRule suggestion table with character shape information.
  • Optimized the algorithm for faster detection.

TODO

  • Evaluate the algorithm (precision, accuracy, recall, etc.; standard definitions below).
  • Make it available for zh-TW.
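(For reference, the standard definitions such an evaluation would use, in terms of true/false positives and negatives:)

    \[
    \text{precision} = \frac{TP}{TP + FP}, \qquad
    \text{recall} = \frac{TP}{TP + FN}, \qquad
    \text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
    \]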

Could you please post an update here with examples of what’s detected now in the latest version (like the one you sent via private message)?

GSoC Phase 2 Evaluation

1. Deliverable

jar (containing data.zip): https://drive.google.com/file/d/128sydaJzrji1-JRI8nsUAtoSTZWobPB2/view?usp=sharing
word_trigram.binary: https://drive.google.com/file/d/1ImtawiTxXzt2-YcPa1iWZ-uc1YERQITr/view?usp=sharing
char_unigram.binary: https://drive.google.com/file/d/1XcBdrKeWieAApNLCHhU4d-mUMfrL7iRg/view?usp=sharing
data.zip: data.zip - Google Drive

2. Installation

  • Download jar and data.
  • Copy the files from language-zh.jar into the corresponding places under target/LanguageTool-4.2-SNAPSHOT/LanguageTool-4.2-SNAPSHOT.
  • Copy word_trigram.binary to LanguageTool-4.2-SNAPSHOT/org/languagetool/resource/zh as well.
  • Modify the root entry (line 3) in target/LanguageTool-4.2-SNAPSHOT/LanguageTool-4.2-SNAPSHOT/hanlp.properties to point to the correct directory. (In my case, it is /target/LanguageTool-4.2-SNAPSHOT/LanguageTool-4.2-SNAPSHOT/org/languagetool/resource.)
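For illustration, assuming the usual key=value properties format, line 3 of hanlp.properties would then look like this with the directory from my setup:

    root=/target/LanguageTool-4.2-SNAPSHOT/LanguageTool-4.2-SNAPSHOT/org/languagetool/resource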

3. Usage

Run java -jar languagetool-commandline.jar -l zh-CN (or zh-TW) <text>.
Since the ngram data is very large, this command may cause a Java heap space error; you need to add the -Xmx option. In my case, -Xmx8000m works fine.
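For example, to check the file 1.txt from the examples below:

    java -Xmx8000m -jar languagetool-commandline.jar -l zh-CN 1.txt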

4. New Features

1. Using ngram data to check Simplified Chinese text.

e.g.

file_name: 1.txt
content: 我门是好朋友。
# correct: 我们是好朋友。

before: No errors.
now: 门 → 们

file_name: 2.txt
content: 假书抵万金。
# correct: 家书抵万金。

before: No errors.
now: 假 → 家

file_name: 3.txt
content: 戎边的战士门真的很辛苦。
# correct: 戍边的战士们真的很辛苦。

before: 戎边 → 戍边
now: 戍 → 龙, 戎边 → 戍边, 门 → 们

2. Using a word dictionary to check Traditional Chinese text.

file_name: 1.txt
content: 我的打印機壞了。
# correct: 我的印表機壞了。

before: No errors.
now: 打印機 → 印表機

file_name: 2.txt
content: 他的互聯網公司解散後,生計並無著落,簡直是走頭無路。
# correct: 他的網際網路公司解散後,生計並無著落,簡直是走投無路。

before: 走頭無路 → 走投無路
now: 互聯網 → 網際網路, 走頭無路 → 走投無路

Thanks!

Have you tried what the minimum is? 8 GB would be too much for use in production. Where exactly does the memory usage come from; do all ngrams get loaded into memory? If so, any plans to improve memory usage?

I think 8000m is close to the minimum; loading the ngram file costs a lot of memory. According to this page, BerkeleyLM usually takes 4–6 GB to read such files. Improving memory usage is a main task of my third phase. I don't have a good idea yet; I need to do some research, and then I will post solutions here. If you have any good ideas, please tell me. Thank you!

For ngrams, we use Lucene in some cases, as mentioned here: Finding errors using Big Data - LanguageTool Wiki. It means you need a fast hard disk (SSD), but memory usage will be very low, as only the index needs to be in memory. While Lucene is a full-text search engine, we basically use it for lookup: we provide the ngram as a search term and get back its occurrence count. That plus some calculation and you have a very basic language model. Let me know if you need to know more.
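If it helps, the lookup side is only a few lines. A hedged sketch, assuming the "ngram"/"count" field names that LanguageTool's ngram indexes use and the Lucene 5.x-era API that LT 4.2 builds against:

    import java.nio.file.Paths;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;

    // Use Lucene as a pure key-value lookup: the ngram is the search term,
    // the stored "count" field is its occurrence count.
    public class NgramLookupSketch {
      public static long lookupCount(IndexSearcher searcher, String ngram) throws Exception {
        TopDocs docs = searcher.search(new TermQuery(new Term("ngram", ngram)), 1);
        return docs.totalHits == 0 ? 0
            : Long.parseLong(searcher.doc(docs.scoreDocs[0].doc).get("count"));
      }

      public static void main(String[] args) throws Exception {
        try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(args[0])))) {
          System.out.println(lookupCount(new IndexSearcher(reader), "我们 是 好"));
        }
      }
    }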

I read the code in languagetool/languagemodel. The LuceneLanguageModel class can calculate the probability of a complete sentence, or return the occurrence count of a sequence of words, when the ngram data is in the appropriate format. But in my code, the ngram probabilities are calculated by a back-off model and saved in the ngram data. So if I want to use Lucene, do I need to write a helper class for it?
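To make the distinction concrete, here is a minimal sketch of a count-based back-off score ("stupid backoff"); the in-memory counts table and the 0.4 back-off factor are illustrative assumptions, since my rule actually precomputes the probabilities and stores them in the ngram data:

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Count-based back-off: use the full ngram if it was observed,
    // otherwise back off to the shorter ngram with a fixed penalty.
    public class BackoffSketch {
      static Map<String, Long> counts = new HashMap<>(); // ngram -> occurrence count
      static long totalUnigrams = 1000L;                 // corpus size for the base case

      static double score(List<String> ngram) {
        String key = String.join(" ", ngram);
        if (ngram.size() == 1) {
          return (double) counts.getOrDefault(key, 0L) / totalUnigrams;
        }
        long count = counts.getOrDefault(key, 0L);
        if (count > 0) {
          String context = String.join(" ", ngram.subList(0, ngram.size() - 1));
          return (double) count / counts.getOrDefault(context, 1L);
        }
        return 0.4 * score(ngram.subList(1, ngram.size())); // back off with penalty
      }

      public static void main(String[] args) {
        counts.put("我们", 50L);
        counts.put("我们 是", 10L);
        System.out.println(score(Arrays.asList("我们", "是"))); // 10 / 50 = 0.2
      }
    }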

So your question is how to build such a Lucene index, is that correct? You can check out org.languagetool.dev.bigdata.AggregatedNgramToLucene for some code that creates an index. Actually, all the classes that use Lucene’s IndexWriter do this.
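For a condensed view, the core of such an indexer boils down to the sketch below (following AggregatedNgramToLucene's approach; the "ngram"/"count" field names and the Lucene 5.x-era API are assumptions based on the LT 4.2 toolchain):

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.core.KeywordAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.LongField;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;

    // Write one document per ngram: the ngram string as an exact-match
    // term, its occurrence count as a stored field.
    public class NgramIndexerSketch {
      public static void main(String[] args) throws Exception {
        try (IndexWriter writer = new IndexWriter(FSDirectory.open(Paths.get("ngram-index")),
            new IndexWriterConfig(new KeywordAnalyzer()))) {
          Document doc = new Document();
          doc.add(new StringField("ngram", "我们 是 好", Field.Store.YES));
          doc.add(new LongField("count", 12345L, Field.Store.YES));
          writer.addDocument(doc);
        }
      }
    }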

So I need to compile your fork first with Maven, is that right? And then copy the two files into the result? For me, that doesn’t work yet, e.g. your second example doesn’t find an error yet. Any idea?

I created the jar by running mvn package in the language-module/zh directory. The installation is then the same as last time; I followed the same steps and it worked. Or you can try the following steps (also summarized as shell commands after the list):

  • Download the code from my GitHub repository.
  • Run mvn install -DskipTests in the root directory.
  • Download data.zip and extract it to languagetool-standalone/target/LanguageTool-4.2-SNAPSHOT/LanguageTool-4.2-SNAPSHOT/org/languagetool/resource.
  • Download word_trigram.binary and char_unigram.binary, and copy them to languagetool-standalone/target/LanguageTool-4.2-SNAPSHOT/LanguageTool-4.2-SNAPSHOT/org/languagetool/resource/zh.
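On a Unix shell, steps two to four amount to roughly:

    mvn install -DskipTests
    unzip data.zip -d languagetool-standalone/target/LanguageTool-4.2-SNAPSHOT/LanguageTool-4.2-SNAPSHOT/org/languagetool/resource
    cp word_trigram.binary char_unigram.binary languagetool-standalone/target/LanguageTool-4.2-SNAPSHOT/LanguageTool-4.2-SNAPSHOT/org/languagetool/resource/zh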

Thanks, that works for zh-TW. For zh-CN, I get: Exception in thread "main" java.lang.RuntimeException: Path zh/char_unigram.binary not found in class path at /org/languagetool/resource/zh/char_unigram.binary