I have finished that. Since the member variable wordTokenizer in Chinese.java is private, and SC and TC use different word tokenizers, the code gets quite ugly when they extend Chinese. Here is my code. What do you think?
And one more thing: I have already refactored the current code to use the new library. The existing grammar.xml only serves SC and it conflicts with the new library; I will fix that next week. After that, I need to write an extra grammar.xml for TC. Do you have any advice?
P.S.
SC means Simplified Chinese.
TC means Traditional Chinese.
Is there a reason to still use Chinese directly, or will everybody use the subclasses? If the latter, please make Chinese and its getWordTokenizer method abstract.
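A minimal sketch of the abstract-base-class shape being suggested; all interface and class bodies below are placeholders, not the actual LT code:

```java
import java.util.Arrays;
import java.util.List;

// Placeholder standing in for LT's real tokenizer interface.
interface WordTokenizer {
  List<String> tokenize(String text);
}

// If nobody instantiates Chinese directly, it can be abstract, and each
// variant supplies its own tokenizer instead of sharing a private field.
abstract class Chinese {
  public String getShortCode() {
    return "zh";
  }
  public abstract WordTokenizer getWordTokenizer();
}

class SimplifiedChinese extends Chinese {
  @Override
  public WordTokenizer getWordTokenizer() {
    // Stand-in for the SC-specific tokenizer.
    return text -> Arrays.asList(text.split(""));
  }
}

class TraditionalChinese extends Chinese {
  @Override
  public WordTokenizer getWordTokenizer() {
    // Stand-in for the TC-specific tokenizer.
    return text -> Arrays.asList(text.split(""));
  }
}
```

This avoids the problem above: there is no private wordTokenizer field in the base class that the subclasses have to work around.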
Please check English as an example: we have a grammar.xml for en, en-US, and en-GB. en-GB, for example, extends the en grammar.xml, so that all rules from en and en-GB will be active. Could we use the same approach for Chinese?
BTW, please indent your code with 2 spaces (not 4, and not tab), as we use that everywhere in LT.
I think the grammar.xml files for SC and TC are independent. That means neither of them should extend the other, and I need two separate grammar.xml files, because the characters used to express the same meaning differ in most cases, e.g. 电视 vs. 電視 (TV).
Or maybe I can create three files: one for the common grammar cases, one for the SC-specific cases, and one for the TC-specific cases.
Maybe you could still use the same approach as for English - it’s just that the Chinese grammar.xml would be empty and only SC and TC would actually have rules?
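If that works like the English setup, the resource layout might look roughly like this (the paths are my assumption based on how en/en-US/en-GB are organized, not verified):

```
.../rules/zh/grammar.xml        <- shared rules (possibly empty)
.../rules/zh/zh-CN/grammar.xml  <- SC-only rules
.../rules/zh/zh-TW/grammar.xml  <- TC-only rules
```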
Your <message> has no <suggestion>, so it will not work, i.e. it will not create a suggestion for the user. You also cannot use \1 in <example>; you need to specify the complete suggestion there in the correction attribute.
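For reference, a sketch of the shape such a rule might take (the rule id and tokens are made up for illustration; the point is the <suggestion> inside <message> and the correction attribute on the incorrect <example>):

```xml
<rule id="ZOU_TOU_WU_LU" name="走头无路 -> 走投无路">
  <pattern>
    <token>走头无路</token>
  </pattern>
  <message>Did you mean <suggestion>走投无路</suggestion>?</message>
  <example correction="走投无路">他<marker>走头无路</marker>了。</example>
  <example>他走投无路了。</example>
</rule>
```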
[Discussion]
I want to improve the output for the cases I talked about above. Currently the checker reports:
| No. | Option | Input | Result | Comment |
|-----|--------|-------|--------|---------|
| 3 | zh-CN | 簡直是走投無路。 | No error. | Correct sentence with TC characters |
| 4 | zh-CN | 簡直是走頭無路。 | No error. | TC sentence containing an error (should be 走投無路) |
I have added a new feature for the case where a user inputs TC sentences but checks with zh-CN, or the other way round, and I have mostly completed it, apart from one small annoying issue. For example, say the feature is named ChineseCharactersConversionRule; it now reports:
| No. | Option | Input | Suggestion |
|-----|--------|-------|------------|
| 3 | zh-CN | 簡直是走投無路。 | 簡直 -> 简直, 走投無路 -> 走投无路 |
| 4 | zh-CN | 簡直是走頭無路。 | 簡直 -> 简直, 走頭無路 -> 走头无路 |
The small issue is that the user only gets told that 走头无路 should be 走投无路 if he first corrects the characters to SC and then checks again.
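A minimal sketch of how such a conversion check might work; the class name and the tiny mapping table are illustrative assumptions, not the actual implementation, which would need a complete TC-to-SC table:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative TC -> SC character conversion. Only a handful of mappings
// are listed here; a real rule would load a full conversion table.
class TcToScConverter {

  private static final Map<Character, Character> TC_TO_SC = new HashMap<>();
  static {
    TC_TO_SC.put('簡', '简');
    TC_TO_SC.put('無', '无');
    TC_TO_SC.put('頭', '头');
    TC_TO_SC.put('電', '电');
    TC_TO_SC.put('視', '视');
  }

  // Replaces every known TC character with its SC counterpart and
  // leaves everything else (SC characters, punctuation) unchanged.
  static String toSimplified(String text) {
    StringBuilder sb = new StringBuilder(text.length());
    for (char c : text.toCharArray()) {
      sb.append(TC_TO_SC.getOrDefault(c, c));
    }
    return sb.toString();
  }
}
```

With a conversion like this, the checker could in principle convert the TC input to SC internally and then run the SC rules on the result, instead of requiring a second check after the user applies the conversion.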
In my opinion, there are two solutions:

1. Abandon zh-CN's grammar.xml and zh-TW's grammar.xml and combine them in the root folder. (I don't actually like this one; I think there would be cultural collisions when only one table is used.)
2. Directly tell the user that he has chosen the SC checker but input TC characters, so he should use the TC checker for his input; once he inputs text and chooses the checker correctly, the result will be fine. (This idea may be the safest one; we avoid many potential risks.) Conversion between TC and SC is largely a one-to-one mapping, so if a user selects the wrong checker by mistake, simply telling him to use the other one, rather than converting the characters, is a win-win for the user and us.
What do you think? Is there another way to solve the problem?
Do you think users will mix SC and TC input in one text? If not, we could maybe use our language identifier (optimaize/language-detector on GitHub, a language detection library for Java) and see if it can detect both variants reliably. This would be useful for all languages; e.g. someone might check German text but still have the setting on “English”. Currently we don’t give the user a useful hint in those cases.
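This is not the optimaize library, but as a toy illustration of why SC/TC detection is feasible: many characters exist in only one of the two scripts, so even naive counting separates them (the character sets below are a tiny hand-picked sample; a real detector would use n-gram profiles):

```java
// Toy SC/TC guesser based on characters specific to one script.
class ChineseVariantGuesser {

  // Tiny hand-picked samples of script-specific characters.
  private static final String TC_ONLY = "簡無頭電視聽書會國";
  private static final String SC_ONLY = "简无头电视听书会国";

  static String guess(String text) {
    int tc = 0;
    int sc = 0;
    for (char c : text.toCharArray()) {
      if (TC_ONLY.indexOf(c) >= 0) {
        tc++;
      } else if (SC_ONLY.indexOf(c) >= 0) {
        sc++;
      }
    }
    if (tc > sc) {
      return "zh-TW";
    }
    if (sc > tc) {
      return "zh-CN";
    }
    return "unknown";
  }
}
```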
I’ve been following this conversation and I’m impressed with your progress so far.
Could you (in a free minute, no hurry) provide me with a current build of the version you are developing (mvn package)?
Then I can join in the testing efforts…
I tested a design in which a SimplifiedChinese class and a TraditionalChinese class, which have no getShortCode() method of their own, extend the Chinese class, since zh-CN and zh-TW use the same tokenizer, tagger and part of the grammar.xml. (You can see the code on my GitHub.)
In this case, something differs from the other languages:
```java
private static List<String> getLanguageCodes() {
  List<String> langCodes = new ArrayList<>();
  for (Language lang : Languages.get()) {
    String langCode = lang.getShortCode();
    // langCode will return "zh" for Chinese (Simplified) and Chinese (Traditional);
    // lang.getShortCodeWithCountryAndVariant() returns zh-CN and zh-TW respectively.
    boolean ignore = lang.isVariant() || ignoreLangCodes.contains(langCode) || externalLangCodes.contains(langCode);
    if (ignore) {  // ignore will be true for zh-CN and zh-TW
      continue;
    }
    if ("zh".equals(langCode)) {
      langCodes.add("zh-CN");
      langCodes.add("zh-TW");
    } else {
      langCodes.add(langCode);
    }
  }
  return langCodes;
}
```
After editing the code above, LanguageIdentifier.detectLanguage() still only ever returns Chinese (Simplified) as the Language reference type, no matter what kind of Chinese characters I input.
However, when I change languageDetector in LanguageIdentifier to public, languageDetector can reliably distinguish TC and SC; it was correct in all of 100 tests.
I am uploading it now. Link.
And you can also see my code on GitHub. If you download the code from GitHub, you have to download extra data in addition; here is the link. Then you need to unzip it somewhere and, in resources/hanlp.properties, change root=G:/languagetool/languagetool-language-modules/zh/src/main/resources/org/languagetool/resource/ to the place you unzipped to.
Great, then let’s assume that future versions of LT will tell users when they have selected the wrong variant or language (without you needing to implement anything for that). As mentioned, this will also be useful for all other languages. But I cannot tell yet when I’ll be able to implement it. It’s not that much work, but the UI will also be affected a bit.