[GSoC] Idea - Improve Spelling Errors Detection for Chinese Language

Hi,

Yesterday, I found that the detection and correction of spelling errors for Chinese in LT is not good enough. As a Chinese college student studying AI who wants to participate in GSoC 2018, I would like to share my ideas to make LT better.

PROBLEM INTRODUCTION:
The object of spell checking in English is the word, but “word” is not a clearly defined unit in Chinese, as there is no explicit delimiter between words. In Chinese, a word consists of characters, known as “汉字”. Thus, the object of spell checking in Chinese is the characters in a sentence. Because a Chinese input method engine only lets you enter legal characters that are stored in the computer, Chinese characters themselves can never be misspelled the way English words can. Therefore, Chinese spell checking requires deeper linguistic analysis.

Here are three example sentences with Chinese spelling errors.

eg1
  Right:       传统/美德
  Wrong:       穿/统/美德
  Pinyin:      chuan tong mei de
  Translation: traditional virtues

eg2
  Right:       好好/地/出去玩
  Wrong:       好好/的/出去玩
  Pinyin:      hao hao de chu qu wan
  Translation: enjoy yourself outside

eg3
  Right:       我/对/心理/研究/有/兴趣
  Wrong:       我/对/心里/研究/有/兴趣
  Pinyin:      wo dui xin li yan jiu you xing qu
  Translation: I’m interested in psychological research.

eg1 shows a nonword error.
eg2 shows a single-character error.
eg3 shows an error where the misused characters happen to be segmented into a legal word.

SOLUTION OVERVIEW
I think we should combine several approaches to solve these problems.

  • For nonword errors, we should use a word segmentation algorithm to build a directed acyclic graph (DAG) from the input sentence. The spelling error detection and correction problem is then transformed into the single-source shortest-path problem (SSSP); see the sketch after this list.

    However, the current algorithm for solving the SSSP for Chinese in LT is basically ineffective.
    This is the most urgent problem to be solved.

  • For single-character pronoun errors and other coreference or collocation errors, we can set up a series of rules to solve them.

    Example: usage errors for “她” (she) (pinyin: ta) vs. “他” (he) (pinyin: ta).
    This is relatively easy work, but it will be very helpful to people who learn Chinese as a second/third language.

  • For the rest of the errors, like eg2 and eg3, we should implement a supervised learning approach such as a CRF to overcome these weaknesses.

    In eg2, the character “的” (of) (pinyin: de) should be corrected to “地” (-ly, adverb-forming particle) (pinyin: de).

    In eg3, the word “心里” (in mind, at heart) (pinyin: xin li) will not be split by any word segmenter, so “里” (pinyin: li) has no chance of being corrected to “理” (pinyin: li) without a CRF.
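
To make the DAG idea concrete, here is a minimal sketch. The dictionary, edge weights, and class names are toy assumptions for illustration, not LT’s actual code:

```java
import java.util.*;

// Minimal sketch of segmentation as a shortest-path problem over a DAG.
// The dictionary and weights here are toy assumptions, not LT's real data.
public class DagSegmenter {
    static final Set<String> DICT = new HashSet<>(Arrays.asList("传统", "美德"));

    public static List<String> segment(String sentence) {
        int n = sentence.length();
        double[] cost = new double[n + 1];  // cost[i] = best cost to reach position i
        int[] prev = new int[n + 1];        // prev[i] = start of last word on best path
        Arrays.fill(cost, Double.POSITIVE_INFINITY);
        cost[0] = 0;
        for (int i = 0; i < n; i++) {
            if (cost[i] == Double.POSITIVE_INFINITY) continue;
            for (int j = i + 1; j <= n; j++) {
                String word = sentence.substring(i, j);
                // every single character is a fallback edge; longer spans must be dictionary words
                if (j - i > 1 && !DICT.contains(word)) continue;
                double edge = 1.0;          // toy weight; a real system would use -log(P(word))
                if (cost[i] + edge < cost[j]) {
                    cost[j] = cost[i] + edge;
                    prev[j] = i;
                }
            }
        }
        // walk back from n to 0 to recover the segmentation
        LinkedList<String> words = new LinkedList<>();
        for (int i = n; i > 0; i = prev[i]) {
            words.addFirst(sentence.substring(prev[i], i));
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(segment("传统美德"));  // [传统, 美德]
    }
}
```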

Thanks for the well-structured description of your idea. For a final proposal, these issues should also be addressed:

  • A roadmap with detailed planning, to make sure everything fits into three months.
  • How will the supervised learning approach be selected? Do you have experience with that?
  • A description of how the system will be trained and tested - do you have data for that? Where could you get it? Although LT collects some data, we don’t have many Chinese users, so our data will not be enough.

Also, the final proposal will need to be submitted at http://summerofcode.withgoogle.com. But that’s only possible starting 2018-03-12 (and the period ends 2018-03-27). To increase your chances of success, I suggest you start working on one of these issues now, so that you can already present a patch or prototype before 2018-03-27.

Hi,
I read the following articles http://wiki.languagetool.org/developing-chinese-rules and http://wiki.languagetool.org/gsoc-2011.

I am writing my proposal now. I am a little confused: the Chinese checker is currently basically rule based, but the idea I described above is machine learning based. I think it is very hard for me to fit everything into three months… That would mean abandoning ictclas4j (the Chinese word splitter and POS tagger) and then training a multi-layer machine-learning-based model for the Chinese checker.

Thus, I think the suitable idea is to use an n-gram model together with the existing rule-based method for Chinese spelling check. Because word segmentation in Chinese and English is totally different, there should be a lot of work to do.

I also have an idea of how to use an n-gram model to detect spelling errors for Chinese. I would like to show you a detailed description in my proposal in the next few days, if you don’t mind.
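
As a rough teaser of the idea: a character-bigram model could flag improbable transitions like this. The counts, smoothing, and threshold here are made-up assumptions:

```java
import java.util.*;

// Rough sketch: score a sentence with a character bigram model and flag
// low-probability transitions. Counts and threshold are made-up assumptions.
public class BigramChecker {
    private final Map<String, Integer> bigramCounts = new HashMap<>();
    private final Map<Character, Integer> unigramCounts = new HashMap<>();

    public void train(String corpus) {
        for (int i = 0; i < corpus.length(); i++) {
            unigramCounts.merge(corpus.charAt(i), 1, Integer::sum);
            if (i + 1 < corpus.length()) {
                bigramCounts.merge(corpus.substring(i, i + 2), 1, Integer::sum);
            }
        }
    }

    // P(c2 | c1) with add-one smoothing over a notional vocabulary size
    private double prob(char c1, char c2) {
        int bi = bigramCounts.getOrDefault("" + c1 + c2, 0);
        int uni = unigramCounts.getOrDefault(c1, 0);
        return (bi + 1.0) / (uni + 5000.0);  // 5000 = assumed vocabulary size
    }

    public List<Integer> suspiciousPositions(String sentence, double threshold) {
        List<Integer> positions = new ArrayList<>();
        for (int i = 0; i + 1 < sentence.length(); i++) {
            if (prob(sentence.charAt(i), sentence.charAt(i + 1)) < threshold) {
                positions.add(i + 1);  // the second character of an unlikely bigram
            }
        }
        return positions;
    }
}
```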

Looking forward to your proposal! The last time LT took part in GSoC was already 7 years ago; we didn’t have any ngram or ML approaches back then. But the good news is that all approaches can work in parallel. There’s no reason to remove XML rules as long as they work. Even if an error gets found by two different approaches at the same time, that’s usually not something to worry about.

Hi, I have already finished my proposal. Can you take a look and give me some advice?

Here it is: Proposal for GSoC2018

Thank you.

Thanks for the rather detailed proposal. Some things are still not clear to me, though:

  • “Rule based pattern matcher. It finds the grammatical errors.” - I’m not sure what “rule” is here. Is it a rule as in LT’s grammar.xml or is it something different?
  • It’s not so clear how exactly this fits into LT. For example:
    • Will new external libraries be required?
    • Can the dependency on ictclas4j be removed when this project is done?
  • Where will the ngram data and the dictionary you mention be taken from? Is it open source?
  • I’m not too happy with the approach of doing the integration at the end. I’d prefer to work in LT directly from day one (in your own fork), to avoid surprises later.

The Neural Network Model has done a great job of correcting sentences in some Western languages. However, it doesn’t do much for East Asian languages like Chinese, Japanese, and Korean.

Could you explain this in more detail? Even though Chinese uses “characters” in a different way from Western languages, it still has sequences that are more or less probable, doesn’t it? Otherwise an ngram model couldn’t do much.

Thanks for your advice! I will add all these things to my proposal, and here are the answers.

  • Yes, the “Rule based pattern matcher” is a rule as in LT’s grammar.xml.
  • A library for converting Chinese to pinyin is needed. jpinyin is a good one we can use (see the sketch after this list).
  • The key point for detection is segmenting the sentence into words first. So ictclas4j is still needed.
  • I have already found a dictionary and a corpus. Both are open source, so we can get the data there.
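
Here is a small sketch of the pinyin step, using the jpinyin API as I understand it (worth double-checking the exact method names). It shows that the correct 传 of eg1 and the misused 穿 share the same pinyin, which lets us build confusion sets of same-sounding candidates:

```java
import com.github.stuxuhai.jpinyin.PinyinFormat;
import com.github.stuxuhai.jpinyin.PinyinHelper;

// Sketch of pinyin conversion with jpinyin (API names as I understand them;
// please verify against the library version we would actually use).
public class PinyinDemo {
    public static void main(String[] args) throws Exception {
        String right = PinyinHelper.convertToPinyinString("传统", " ", PinyinFormat.WITHOUT_TONE);
        String wrong = PinyinHelper.convertToPinyinString("穿统", " ", PinyinFormat.WITHOUT_TONE);
        System.out.println(right);  // chuan tong
        System.out.println(wrong);  // chuan tong - same pronunciation, different characters
    }
}
```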

The process of inputting Chinese characters is as follows. For example, suppose I want to input “传统”.

  1. I type the letters c-h-u-a-n in order.
  2. The keyboard sends the letters to the input method editor (IME). The editor then shows all characters with the same pinyin chuan as options.
  3. I type 1 to select the first character as output.
  4. I repeat steps 1-3 by typing t-o-n-g.
    [Screenshot: the IME candidate list for the pinyin “chuan”]

If I type 2 instead and get “串” as output, a spelling error occurs.
This kind of error accounts for about 70% of all Chinese spelling errors.

Now, suppose we use an NN model like seq2seq, which is very popular in the field of NLP.
For example:

I love LanguegeTool.

This network will find that the sequence L-a-n-g-u-e-g-e-T-o-o-l might contain an error,
because after millions of training examples, it has learned that the first e must be replaced by a in this unique sequence.

However, it is not the same in Chinese.
For example:

Wrong:   我正在地铁亦号线上。
Gold:    我正在地铁一号线上。
Pinyin:  wo zheng zai di tie yi hao xian shang
English: I am on Metro Line 1.

The network will find that the character 亦 is wrong.
However, after millions of training examples, it has learned that 一号线 (Line One), 二号线 (Line Two), 三号线 (Line Three), 四号线 (Line Four), etc. are all valid sequences.
Therefore, it cannot make the right suggestion.

Thanks for the explanation.

This is under GPL, so it cannot be used in LT, as it’s not a compatible license. You’d need to find an alternative (e.g. under LGPL, Apache License, BSD, etc - almost any open source license other than GPL and AGPL). You could also ask the authors whether they might change the license.

OK, I found another, more powerful library, HanLP, which is under the Apache 2 license. It not only provides all the features of ictclas4j but also others, including the one we need. I think I should get familiar with HanLP now. If it proves to be better than ictclas4j, we should consider removing that one.
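
For example, segmentation with HanLP looks roughly like this (API as documented for HanLP 1.x; I will verify it against the version we would depend on):

```java
import com.hankcs.hanlp.HanLP;
import com.hankcs.hanlp.seg.common.Term;
import java.util.List;

// Quick sanity check of HanLP's segmenter and POS tagger.
public class HanLpDemo {
    public static void main(String[] args) {
        List<Term> terms = HanLP.segment("我对心理研究有兴趣");
        for (Term term : terms) {
            // term.word is the segmented word, term.nature its POS tag
            System.out.println(term.word + "\t" + term.nature);
        }
    }
}
```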

That would be great, as ictclas4j isn’t maintained anymore.

That’s for sure. Is there anything I have to finish before the application period begins? I also wonder whether I should submit the proposal as soon as possible.

Nothing you have to do, but if you can already show progress, pull requests, or prototypes during the application period (2018-03-12 to 2018-03-27), that’s a big plus.

Yes, you can edit it until the application period ends.

One more thing about the application that came to my mind: a good way to evaluate the result (and progress) is needed. Just manual “testing” is not enough - we’ll need some corpus of real errors to see whether we can detect them with your approach.

Sure. We can evaluate it with the data provided by SIGHAN.
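
Detection-level scoring in the style of the SIGHAN bake-offs could look like this sketch; the real SIGHAN evaluation also covers correction, which is not shown here:

```java
import java.util.*;

// Sketch of detection-level evaluation: compare predicted error positions
// against gold positions per sentence, then report precision/recall/F1.
public class DetectionEval {
    public static double[] evaluate(List<Set<Integer>> gold, List<Set<Integer>> predicted) {
        int tp = 0, fp = 0, fn = 0;
        for (int i = 0; i < gold.size(); i++) {
            for (int pos : predicted.get(i)) {
                if (gold.get(i).contains(pos)) tp++; else fp++;
            }
            for (int pos : gold.get(i)) {
                if (!predicted.get(i).contains(pos)) fn++;
            }
        }
        double precision = tp == 0 ? 0 : (double) tp / (tp + fp);
        double recall = tp == 0 ? 0 : (double) tp / (tp + fn);
        double f1 = (precision + recall) == 0 ? 0 : 2 * precision * recall / (precision + recall);
        return new double[]{precision, recall, f1};
    }
}
```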

Hi, I have modified my proposal. I added references and changed my schedule. Please take a look, if you don’t mind.
I am writing the prototype now. I expect the results to be good. I will show it soon. :smile:


The proposal suggests using n-gram data. Maybe using word2vec would be more effective?

I have considered it and talked about it above. Since word2vec turns words into vectors, cos&lt;vec1,vec2&gt; can express the similarity between two words. This technique is often used for semantic analysis.

For example,

  • 月亮<好像>弯弯的小船。(The moon is like a curved little boat.)
  • 月亮<宛若>弯弯的小船。(The moon is like a curved little boat.)
    The value cos<好像,宛若> is close to 1.

However, if there is a spelling error in the word 好像 (pinyin: hao xiang, seem like), such as 好想 (pinyin: hao xiang, want) or 好香 (pinyin: hao xiang, good smell),
the value cos<好想,好像> will be nearly 0. Thus, we can never use it to find the single character to correct in such a mistake.