Chinese module development daily record

5/4/2018

  • Add new word tokenizer and write tests.

8/5/2018

  • Add new tag tokenizer.

Question:
Is it fine to put the 1 GB of dictionary and model data provided by HanLP in the resource folder?

GitHub doesn’t like such big files, I think. When I last checked, 1 GB was the total limit for the whole repo, so you should find another place for the files.

The language module I am writing will use this data to check for errors, so I need to find another way to make the data available. Is that right?

There is one more thing: since I have finished refactoring the sentence tokenizer, word tokenizer, and tagger, I can start working on zh/src/main/java/org/languagetool/language/Chinese.java. I would like to split Chinese.java into SimplifiedChinese.java and TraditionalChinese.java.

For now, yes. Once everything is more stable (i.e. doesn’t often change), we can host the data on languagetool.org.

No problem. I guess Chinese should not be deleted, so that SimplifiedChinese and TraditionalChinese can extend it?

Yes, I think so. Now I’m trying to make them extend Chinese.java.

Hi dnaber

I have finished that. Since the member variable wordTokenizer in Chinese.java is private, and SC and TC use different word tokenizers, I think the code is quite ugly when they extend Chinese. Here is my code. What do you think?

One more thing: I have already refactored the current code to use the new library. The existing grammar.xml only covers SC and it conflicts with the new library; I will fix that next week. After that, I need to write a separate grammar.xml for TC. Do you have any advice?

P.S.
SC means Simplified Chinese.
TC means Traditional Chinese.

Is there a reason to still use Chinese, or will everybody use the subclasses? If the latter, please make Chinese and its getWordTokenizer method abstract.
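A minimal, self-contained sketch of that structure (illustration only, not the actual LanguageTool Language API; the class and method names follow the discussion above) could look like this:

interface Tokenizer {
  java.util.List<String> tokenize(String text);
}

// Chinese keeps the shared parts (sentence tokenizer, tagger, ...) and leaves
// the word tokenizer to the variants.
abstract class Chinese {
  public abstract Tokenizer getWordTokenizer();
}

class SimplifiedChinese extends Chinese {
  @Override
  public Tokenizer getWordTokenizer() {
    // placeholder tokenizer; the real one would wrap the new library for SC
    return text -> java.util.Arrays.asList(text.split("\\s+"));
  }
}

class TraditionalChinese extends Chinese {
  @Override
  public Tokenizer getWordTokenizer() {
    // placeholder tokenizer; the real one would wrap the new library for TC
    return text -> java.util.Arrays.asList(text.split("\\s+"));
  }
}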

Please check English as an example, where we have a grammar.xml for en, en-US, and en-GB. The en-GB one, for example, extends the en grammar.xml, so that all rules from en and en-GB are active. Could we use the same approach for Chinese?

BTW, please indent your code with 2 spaces (not 4, and not tabs), as we use that everywhere in LT.

Thank you

I think the grammar.xml files for SC and TC are independent. That means neither of them should extend the other, and I need two separate grammar.xml files, because the characters used to express the same meaning are different in most cases, e.g. 电视 vs. 電視 (TV).

Or maybe I can create three files: one for the common grammar cases, one for SC-specific cases, and one for TC-specific cases.

Maybe you could still use the same approach as for English - it’s just that the Chinese grammar.xml would be empty and only SC and TC would actually have rules?
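If that approach is taken, a hedged sketch of how a variant could list both the shared rule file and its own (mirroring en / en-GB) might look like the following; the zh paths and the class name here are only assumptions for illustration, not the real implementation:

import java.util.Arrays;
import java.util.List;

// Not the real class; only shows the idea of a shared rule file plus a
// variant-specific one, the way en-GB also picks up the en grammar.xml.
class SimplifiedChineseRuleFiles {
  public List<String> getRuleFileNames() {
    return Arrays.asList(
      "/org/languagetool/rules/zh/grammar.xml",       // shared file, could stay empty
      "/org/languagetool/rules/zh/zh-CN/grammar.xml"  // SC-specific rules
    );
  }
}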

May 20th

Completed the first part of my proposal.
Feature

  • LT can now check both Simplified Chinese and Traditional Chinese.

Command-line Usage

  • Check Simplified Chinese text with the option -l zh-CN.
  • Check Traditional Chinese text with the option -l zh-TW.

e.g.

No. | Option | Input | Output | Description
1 | zh-CN | 简直是走投无路。 | No error. | Correct sentence.
2 | zh-CN | 简直是走头无路。 | 走头无路 -> 走投无路 | ConfusionProbabilityRule
3 | zh-CN | 簡直是走投無路。 | No error. | Correct sentence written in TC characters.
4 | zh-CN | 簡直是走頭無路。 | No error. | TC sentence containing an error (走頭無路 should be 走投無路), but nothing is reported.

And vice versa for zh-TW.
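For reference, a rough programmatic equivalent of the zh-CN command-line checks above might look like this; SimplifiedChinese and TraditionalChinese are the classes discussed in this log, and JLanguageTool.check is the standard LanguageTool entry point:

import java.util.List;
import org.languagetool.JLanguageTool;
import org.languagetool.language.SimplifiedChinese;
import org.languagetool.rules.RuleMatch;

public class ZhCheckDemo {
  public static void main(String[] args) throws Exception {
    // Programmatic equivalent of `-l zh-CN`; TraditionalChinese would
    // correspond to `-l zh-TW`.
    JLanguageTool lt = new JLanguageTool(new SimplifiedChinese());
    List<RuleMatch> matches = lt.check("简直是走头无路。");  // example no. 2 above
    for (RuleMatch match : matches) {
      System.out.println(match.getFromPos() + "-" + match.getToPos()
          + ": " + match.getSuggestedReplacements());
    }
  }
}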

[Question:]
How can I write a grammar.xml message, suggestion, or correction that substitutes part of the characters in a token?
For example:

<rule>
    <pattern>
        <marker>
            <token regexp="yes">(夜|春|通|元)(霄)</token>
        </marker>
    </pattern>
    <message>您的意思是"\1宵"吗?</message>
    <example correction="\1宵">然而,这顿<marker>夜霄</marker>吃得并不开心。</example>
    <example correction="\1宵">今天我准备<marker>通霄</marker>。</example>
</rule>

The rule above doesn’t actually work. Is there any syntax that can represent something like \1宵?

Your <message> has no <suggestion>, so it will not work, i.e. it will not create a suggestion for the user. You also cannot use \1 in <example>, you need to specify the complete suggestion there in the correction attribute.

What I mean is: is there a syntax in correction or <suggestion> that uses a symbol to represent a specific group within a token?

Have you tried regexp_match and regexp_replace? (documentation)

Thank you!

May 27th

[Discussion]
To improve the output for the cases discussed above:

3 | zh-CN | 簡直是走投無路。 | No error. | Correct sentence written in TC characters.
4 | zh-CN | 簡直是走頭無路。 | No error. | TC sentence containing an error (走頭無路 should be 走投無路).

I added a new feature for the case where a user inputs TC sentences but checks them with zh-CN, or the opposite situation. I have mostly completed it, with one tiny annoying issue remaining.

For example, say the feature is named ChineseCharactersConversionRule:

No. | Option | Input | Suggestions
3 | zh-CN | 簡直是走投無路。 | 簡直 -> 简直; 走投無路 -> 走投无路
4 | zh-CN | 簡直是走頭無路。 | 簡直 -> 简直; 走頭無路 -> 走头无路

The tiny issue is that only if the user first converts the characters to SC and checks again will he get the result that 走头无路 should be 走投无路.

In my opinion, there are two solutions.

  1. Abandon zh-CN’s grammar.xml and zh-TW’s grammar.xml and combine them into one
    file in the root folder. (I don’t actually like this; I think there would be some cultural
    collisions when only one rule set is used.)

  2. Directly tell the user who chooses the SC checker but inputs TC characters that he
    should use the TC checker for his input (and vice versa). Once the input and the chosen
    checker match, the results should be fine. (This idea may be the safest one; we can avoid
    many potential risks.) Conversion between TC and SC is essentially a one-to-one character
    mapping, so if a user selects the wrong checker by mistake, simply telling him to use the
    other one, rather than correcting the characters, is a win for both the user and us
    (see the sketch after this list).
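As a toy illustration of the point in option 2 (the two sample mappings below are made up for the example, not taken from any real rule):

import java.util.HashMap;
import java.util.Map;

public class VariantHint {
  // Tiny sample of a TC -> SC character mapping; a real mapping would cover
  // the full character set.
  private static final Map<Character, Character> TC_TO_SC = new HashMap<>();
  static {
    TC_TO_SC.put('簡', '简');
    TC_TO_SC.put('無', '无');
  }

  // True if the text contains characters from the other variant, i.e. the user
  // probably picked the wrong checker and should be told to switch.
  static boolean looksLikeTraditionalChinese(String text) {
    for (char c : text.toCharArray()) {
      if (TC_TO_SC.containsKey(c)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    System.out.println(looksLikeTraditionalChinese("簡直是走投無路。"));  // true
  }
}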

What do you think? Is there another way to solve the problem?

Do you think users will mix SC and TC input in one text? If not, we could maybe use our language identifier (optimaize/language-detector on GitHub) and see if it can detect both variants reliably. This would be useful for all languages; e.g. someone might check a German text but still have the setting on “English”. Currently we don’t give the user a useful hint in those cases.

I don’t think they will do that.

Then could you try if org.languagetool.language.LanguageIdentifier can reliably tell SC and TC apart?
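A minimal sketch of that experiment (assuming LanguageIdentifier offers a no-argument constructor and detectLanguage(String), as in LanguageTool at the time) might be:

import org.languagetool.Language;
import org.languagetool.language.LanguageIdentifier;

public class DetectVariantDemo {
  public static void main(String[] args) {
    LanguageIdentifier identifier = new LanguageIdentifier();
    // Whether this can reliably tell zh-CN and zh-TW apart is exactly what
    // needs to be verified.
    Language detected = identifier.detectLanguage("簡直是走投無路。");
    System.out.println(detected == null ? "could not detect" : detected.getName());
  }
}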

What’s wrong when I create an instance of LanguageIdentifier? I get this exception:

java.lang.IllegalStateException: A language profile for language zh-CN was added already!