Thanks for your advice! I will add all of these points to my proposal. Here are my answers.
- Yes, the “Rule based pattern matcher” is a rule as in LT’s
- A library for converting Chinese to pinyin is needed. jpinyin is a good one we can use.
- The key point for detection is segmenting the sentence into words first, so ictclas4j is still needed.
- I have already found the dictionary and corpus. Both are open source, so we can get the data from them.
The process of inputting Chinese characters is as follows. For example, suppose I want to input “传统”.
- I type the letters c-h-u-a-n in order.
- The keyboard sends the letters together to the input method editor (IME). The editor then shows all characters with the pinyin chuan as options.
- I type 1 to select the first character as output.
- I repeat steps 1-3 for the next character.
As shown above, if I type 2 instead, “串” is output and a spelling error occurs.
This kind of error accounts for 70% of all spelling errors.
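A minimal sketch of the input steps above, just to make the failure mode concrete. The candidate table and the function names are hypothetical, and the ordering of candidates is an assumption, not data taken from a real IME.

```python
# Toy candidate table: pinyin -> characters in the order an IME might list them.
# Contents and ordering are illustrative assumptions, not real IME data.
CANDIDATES = {
    "chuan": ["传", "串", "川", "船"],
    "tong": ["统", "同", "通", "痛"],
}


def ime_select(pinyin: str, choice: int) -> str:
    """Return the character picked by its 1-based position in the candidate list."""
    return CANDIDATES[pinyin][choice - 1]


def type_word(keystrokes) -> str:
    """Simulate typing a word given as a list of (pinyin, choice) pairs."""
    return "".join(ime_select(pinyin, choice) for pinyin, choice in keystrokes)


# Intended input: pick candidate 1 for both syllables.
intended = type_word([("chuan", 1), ("tong", 1)])
# Slip of the finger: candidate 2 is picked for the first syllable.
slipped = type_word([("chuan", 2), ("tong", 1)])

print(intended, slipped)
```

Both outputs have exactly the same pinyin, which is why this kind of mistake cannot be seen from the typed letters alone.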
So, suppose we use a NN model like seq2seq, which is very popular in the field of NLP, and give it this sentence:
I love LanguegeTool.
The network will find that the sequence L-a-n-g-u-e-g-e-T-o-o-l might contain an error, because after training millions of times it knows that the first e must be replaced by a in this unique sequence.
However, the situation is different in Chinese.
Pinyin: wo zheng zai di tie yi hao xian shang
English: I am on Metro Line 1
The network will find that the character 亦 is wrong.
However, after training millions of times, it has learned that 一号线 (Line One), 四号线 (Line Four), etc. are all correct sequences.
Therefore, it cannot make the right suggestion.
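To illustrate the ambiguity, here is a small sketch. It also shows one possible way pinyin could narrow the candidates; that part is my assumption about how the pinyin library might be used, not a settled design. The lookup table, the word list, and the function name are all hypothetical toy data.

```python
# Hypothetical toneless pinyin lookup for a few characters; a real system
# would take this from a pinyin library rather than a hand-written table.
PINYIN = {"一": "yi", "亦": "yi", "四": "si", "二": "er"}

# Toy word list standing in for a dictionary; contents are assumptions.
KNOWN_WORDS = {"一号线", "二号线", "四号线"}


def suggest(context_suffix: str, wrong_char: str):
    """Return (all characters that form a known word with context_suffix,
    the subset that shares the pinyin of wrong_char)."""
    target = PINYIN[wrong_char]
    fits_context = [w[0] for w in sorted(KNOWN_WORDS) if w[1:] == context_suffix]
    same_sound = [c for c in fits_context if PINYIN.get(c) == target]
    return fits_context, same_sound


all_fits, filtered = suggest("号线", "亦")
print(all_fits)   # every character that yields a valid word
print(filtered)   # only those pronounced like the typed character
```

Judging by the sequence alone, every entry in the first list is equally valid. Only the extra constraint that the error came from a same-pinyin selection leaves a single candidate.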