Yesterday, I found that LT's error detection and correction for Chinese is not good enough. As a Chinese college student studying AI who wants to participate in GSoC 2018, I would like to share my observations to help make LT better.
In English, the unit of spell checking is the word, but the “word” is not a clearly defined unit in Chinese, because there is no explicit delimiter between words. A Chinese word consists of characters, known as “汉字” (Chinese characters). Thus, the unit of spell checking in Chinese is the character within a sentence. Because a Chinese input method engine only lets the user enter legal characters that are already stored in the computer, an individual Chinese character can never be misspelled the way an English word can. Therefore, Chinese spell checking requires deeper linguistic analysis.
Here are three example sentences that illustrate Chinese spelling errors.
| | eg1 | eg2 | eg3 |
|---|---|---|---|
| Right | 传统/美德 | 好好/地/出去玩 | 我/对/心理/研究/有/兴趣 |
| Wrong | 穿/统/美德 | 好好/的/出去玩 | 我/对/心里/研究/有/兴趣 |
| Pinyin | chuan tong mei de | hao hao de chu qu wan | wo dui xin li yan jiu you xing qu |
| Translation | traditional virtues | enjoy yourself outside | I'm interested in psychological research. |
The example on the left (eg1) shows a nonword error.
The middle example (eg2) shows a single-character error.
The example on the right (eg3) shows an error in which the misused character happens to be segmented into a legal word together with its neighbor.
I think we should combine several approaches to solve these problems.
For nonword errors, we should use a word segmentation algorithm to build a directed acyclic graph (DAG) from the input sentence. The spelling error detection and correction problem is then transformed into a single-source shortest-path (SSSP) problem.
However, the current algorithm for solving the SSSP problem in LT for Chinese is basically ineffective.
This is the most urgent problem to be solved.
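To make the DAG idea concrete, here is a minimal Python sketch. It is not LT's implementation. The lexicon, the frequencies, the confusion set of same-pinyin characters, and the cost constants are all made up for illustration. Each edge in the graph covers a span of the sentence and emits either the span itself or a variant with one confusable character replaced, at an extra penalty. Because edges only go forward, the shortest path can be found with a simple left-to-right pass.

```python
import math

# Toy lexicon with made-up frequencies (assumption, for illustration only).
LEXICON = {"传统": 500, "美德": 300, "穿": 200, "统": 20, "美": 400, "德": 100}
# Toy confusion set of same-pinyin characters (assumption).
CONFUSION = {"穿": ["传"], "传": ["穿"]}

TOTAL = sum(LEXICON.values())
UNKNOWN_COST = 20.0       # cost of a character that is not in the lexicon
SUBSTITUTION_COST = 3.0   # penalty for replacing one character
MAX_WORD_LEN = 4


def word_cost(word):
    """Negative log relative frequency: frequent words are cheaper."""
    return -math.log(LEXICON[word] / TOTAL)


def variants(span):
    """Yield the span itself, then every one-character substitution of it."""
    yield span, 0.0
    for i, ch in enumerate(span):
        for alt in CONFUSION.get(ch, []):
            yield span[:i] + alt + span[i + 1:], SUBSTITUTION_COST


def build_dag(sentence):
    """edges[i] holds (j, word, cost): an edge from position i to j emitting word."""
    n = len(sentence)
    edges = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, min(n, i + MAX_WORD_LEN) + 1):
            for candidate, penalty in variants(sentence[i:j]):
                if candidate in LEXICON:
                    edges[i].append((j, candidate, word_cost(candidate) + penalty))
        if sentence[i] not in LEXICON:
            # Fallback edge so that every position can always be left.
            edges[i].append((i + 1, sentence[i], UNKNOWN_COST))
    return edges


def correct_sentence(sentence):
    """Return the words along the cheapest path from position 0 to the end."""
    n = len(sentence)
    edges = build_dag(sentence)
    best = [math.inf] * (n + 1)
    back = [None] * (n + 1)
    best[0] = 0.0
    # Positions are already in topological order, so one forward pass suffices.
    for i in range(n):
        for j, word, cost in edges[i]:
            if best[i] + cost < best[j]:
                best[j] = best[i] + cost
                back[j] = (i, word)
    words = []
    pos = n
    while pos > 0:
        pos, word = back[pos]
        words.append(word)
    return words[::-1]


print(correct_sentence("穿统美德"))  # ['传统', '美德']
```

With these toy numbers, the path through the corrected word “传统” is cheaper than the path through the two single characters “穿” and “统”, so eg1 is detected and corrected in one step. In practice the costs would have to come from a real corpus.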
For single-character pronoun errors and other coreference or collocation errors, we can set up a series of rules.
Example: usage errors involving “她” (she; pinyin: ta) and “他” (he; pinyin: ta)
This is relatively easy work, but it will be very helpful to people who learn Chinese as a second or third language.
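As a rough illustration of what such a rule could look like, the Python sketch below flags a pronoun that disagrees with the nearest preceding gendered noun. The noun lists and the function name `check_pronouns` are hypothetical, and the heuristic is deliberately naive: the pronoun may legitimately refer to someone else, so a real rule would need more context and much broader word lists.

```python
# Hypothetical word lists; a real rule set would need far broader coverage.
FEMALE_NOUNS = ("妈妈", "姐姐", "妹妹", "女儿")
MALE_NOUNS = ("爸爸", "哥哥", "弟弟", "儿子")


def check_pronouns(sentence):
    """Return (index, found, suggestion) for each pronoun that disagrees
    with the nearest gendered noun before it."""
    # Collect (position, expected pronoun) for every gendered noun mention.
    mentions = []
    for nouns, pronoun in ((FEMALE_NOUNS, "她"), (MALE_NOUNS, "他")):
        for noun in nouns:
            start = sentence.find(noun)
            while start != -1:
                mentions.append((start, pronoun))
                start = sentence.find(noun, start + 1)
    mentions.sort()

    matches = []
    for i, ch in enumerate(sentence):
        if ch not in ("他", "她"):
            continue
        preceding = [expected for pos, expected in mentions if pos < i]
        if preceding and preceding[-1] != ch:
            matches.append((i, ch, preceding[-1]))
    return matches


print(check_pronouns("我妈妈说他很累"))  # [(4, '他', '她')]
```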
For the remaining errors, like eg2 and eg3, we should implement a supervised learning approach such as a CRF (conditional random field) to handle the cases that the approaches above cannot.
In eg2, the character “的” (of; pinyin: de) should be corrected to “地” (-ly, adverb-forming particle; pinyin: de).
In eg3, the word “心里” (in mind, at heart; pinyin: xin li) is a legal word that no word segmenter will split, so “里” (pinyin: li) has no chance of being corrected to “理” (pinyin: li) without a CRF.
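To show how eg2 and eg3 could be framed as a sequence labeling task for a CRF, here is a small Python sketch that only prepares the training data. The feature names, the `KEEP` label, and the helper names are my own assumptions. Each character gets a dictionary of context features and a label that is either `KEEP` or the character it should be replaced with. The CRF training itself is left out.

```python
def char_features(chars, i):
    """Context features for the character at position i."""
    prev_ch = chars[i - 1] if i > 0 else "<BOS>"
    next_ch = chars[i + 1] if i < len(chars) - 1 else "<EOS>"
    return {
        "char": chars[i],
        "prev": prev_ch,
        "next": next_ch,
        "prev+char": prev_ch + chars[i],
        "char+next": chars[i] + next_ch,
    }


def make_example(wrong, right):
    """Turn an aligned (wrong, right) sentence pair into features and labels."""
    if len(wrong) != len(right):
        raise ValueError("this sketch only handles same-length substitutions")
    chars = list(wrong)
    features = [char_features(chars, i) for i in range(len(chars))]
    # Label each character with its correction, or KEEP if it is already right.
    labels = ["KEEP" if w == r else r for w, r in zip(wrong, right)]
    return features, labels


features, labels = make_example("我对心里研究有兴趣", "我对心理研究有兴趣")
print(labels)
# ['KEEP', 'KEEP', 'KEEP', '理', 'KEEP', 'KEEP', 'KEEP', 'KEEP', 'KEEP']
```

The bigram feature “里研” at the erroneous position is the kind of context a trained model could use to prefer “理” even though “心里” is a legal word on its own.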