LanguageTool lang code details

Visioneer · August 23, 2016, 3:59am

Hi,
Below is my attempt to provide a “one of each” choice for “lang” code.
Is this complete and accurate?

I saw several other codes such as,
[{code: 'ES', name: 'General'}, {code: 'ES-Valencia', name: 'Valencian'}]

ast-ES, be-BY, br-FR, ca-ES, ca-ES-valencia, da-DK, de, de-AT, de-CH, de-DE, de-DE-x-simple-language, el-GR, en, en-AU, en-CA, en-GB, en-NZ, en-US, en-ZA,
eo, es, fa, fr, gl-ES, is-IS, it, ja-JP, km-KH, lt-LT, ml-IN, nl, pl-PL, pt, pt-BR, pt-PT, ro-RO, ru-RU, sk-SK, sl-SI, sv, ta-IN, tl-PH, uk-UA, zh-CN.

So for example, what is the difference between:
zh and zh-CN?
ca and ca-ES and ca-ES-Valencia?

What are all the Spanish possibilities?

(used <pre> for this)



English American
English British
English Canada
English Australia
English New Zealand
English South Africa

Auto-detect
Asturian
Belarusian
Breton

Catalan
Catalan Valencia

Chinese
Danish
Dutch
Esperanto
French
Galician

German Germany
German Austria
German Switzerland
Greek
Icelandic
Italian
Japanese
Khmer
Lithuanian
Malayalam
Persian
Polish

Portuguese Portugal
Portuguese Brazil

Romanian
Russian
Slovak
Slovenian

Spanish

Swedish
Tamil
Tagalog
Ukrainian
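
One way to keep the "one of each" names and their codes in sync is a single table that the select is generated from. The sketch below is only an illustration: the pairing of each name with a code is a by-hand reading of the two lists in this post, not an official LanguageTool list, and `build_options` is a hypothetical helper. "Auto-detect" is left out because no code for it is given here.

```python
from html import escape

# (display name, code) pairs, matched by hand from the lists in this post.
# This pairing is an assumption, not an official LanguageTool table.
LANG_CHOICES = [
    ("English American", "en-US"),
    ("English British", "en-GB"),
    ("English Canada", "en-CA"),
    ("English Australia", "en-AU"),
    ("English New Zealand", "en-NZ"),
    ("English South Africa", "en-ZA"),
    ("Asturian", "ast-ES"),
    ("Belarusian", "be-BY"),
    ("Breton", "br-FR"),
    ("Catalan", "ca-ES"),
    ("Catalan Valencia", "ca-ES-valencia"),
    ("Chinese", "zh-CN"),
    ("Danish", "da-DK"),
    ("Dutch", "nl"),
    ("Esperanto", "eo"),
    ("French", "fr"),
    ("Galician", "gl-ES"),
    ("German Germany", "de-DE"),
    ("German Austria", "de-AT"),
    ("German Switzerland", "de-CH"),
    ("Greek", "el-GR"),
    ("Icelandic", "is-IS"),
    ("Italian", "it"),
    ("Japanese", "ja-JP"),
    ("Khmer", "km-KH"),
    ("Lithuanian", "lt-LT"),
    ("Malayalam", "ml-IN"),
    ("Persian", "fa"),
    ("Polish", "pl-PL"),
    ("Portuguese Portugal", "pt-PT"),
    ("Portuguese Brazil", "pt-BR"),
    ("Romanian", "ro-RO"),
    ("Russian", "ru-RU"),
    ("Slovak", "sk-SK"),
    ("Slovenian", "sl-SI"),
    ("Spanish", "es"),
    ("Swedish", "sv"),
    ("Tamil", "ta-IN"),
    ("Tagalog", "tl-PH"),
    ("Ukrainian", "uk-UA"),
]


def build_options(choices):
    """Return one HTML <option> line per (name, code) pair."""
    return "\n".join(
        f'<option value="{escape(code)}">{escape(name)}</option>'
        for name, code in choices
    )


if __name__ == "__main__":
    print(build_options(LANG_CHOICES))
```

Adding or removing a variant then means editing one row, and the select cannot drift away from the codes.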

pep.bofarull · August 23, 2016, 6:41am

The Spanish keyboard of course includes the ñ (n with tilde), used only in the Spanish language, plus Ç, ç (c-cedilla), the grave accent (à) and the interpunct (l·l), which are not used in Spanish but are used in other languages in Spain.
I think the code ‘ES’, name: ‘General’ is for Spanish languages or for Spanish computers. I remember a discussion in another forum about Occitan under the ES code and the FR code.
Jaume, do you know about that?

dnaber · August 23, 2016, 7:04am

There’s no difference between zh and zh-CN. Specifying the country code (CN in this case) only makes a difference for languages that have special rules for that country variant. Typically, it is the spell checking that differs.

jaumeortola · August 23, 2016, 10:39am

ca or ca-ES is Catalan. ca-ES-valencia is the variant of Catalan spoken in Valencia, called Valencian. As most speakers of Catalan are in Spain, the country code is ES for both variants.

jaumeortola · August 23, 2016, 10:48am

Where did you get these (‘ES’ and ‘ES-Valencia’)? Both are missing the language code, which is ca (Catalan).

ca-ES for Catalan (general), but most of the time we drop the adjective “general”
ca-ES-valencia for Catalan (Valencian)
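
To make the structure of these codes concrete, here is a minimal sketch that splits a code into language, country and variant. It is a simplification written for this thread, not LanguageTool's own parsing and not a full BCP 47 parser (script subtags such as the one in zh-Hans-CN are not handled); `LangCode` and `split_lang_code` are hypothetical names.

```python
from typing import NamedTuple, Optional


class LangCode(NamedTuple):
    language: str
    country: Optional[str]
    variant: Optional[str]


def split_lang_code(code: str) -> LangCode:
    """Split e.g. 'ca-ES-valencia' into language, country and variant."""
    # Split at most twice so that everything after the country code
    # (e.g. 'x-simple-language') stays together as the variant.
    parts = code.split("-", 2)
    parts += [None] * (3 - len(parts))
    return LangCode(*parts)


if __name__ == "__main__":
    for code in ("ca", "ca-ES", "ca-ES-valencia", "de-DE-x-simple-language"):
        print(code, "->", split_lang_code(code))
```

With this split, ca and ca-ES share the language part, and ca-ES-valencia differs from ca-ES only in the variant part.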

Visioneer · August 29, 2016, 8:12am

I saw it in a “variable” bookmarklet on the LT main page, which is how I learned of “Valencia”.
So should I add “ca > Catalan” and “ca-ES-Valencia > Catalan Valencia” to the “one of each” select shown
at the top of this post?
(Would ca-ES then not be necessary?)
Are there any other missing or incorrect codes?

Thanks

Visioneer · August 29, 2016, 8:23am

I am a little surprised that there is only one Spanish (es) needed. There are so many different
Spanish spell checkers for different countries out there.

linuxscout · March 23, 2017, 12:01pm

Hello,
I am trying to add an Arabic language module; the Arabic language code is “ar”.
There are many country codes for Arabic, for example ar-DZ, ar-TN, ar-EG and ar-SA. All of those codes share the same tokenizer, dictionary and spell checker. How can I configure them in the language module?
When I call the spell checker with the ‘ar’ code, it asks me for a country code. How can I set up an alias for this?
The forked LT project for Arabic is on GitHub: linuxscout/languagetool.
Thanks

dnaber · March 23, 2017, 12:54pm

I’m not sure what you mean by that, what’s the exact message you get? In general, LT doesn’t care about the country codes unless there are differences in spelling. For example, for French there’s only fr without any country code, as the spelling dictionary is always the same for all the countries in which French is spoken.
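
The kind of alias asked about above can be sketched on the caller's side: try the exact code first, then fall back to the bare language code. This is purely illustrative and is not how LanguageTool resolves codes internally; `SUPPORTED` is a hypothetical subset of codes mentioned in this thread, and `resolve_code` is not an LT API.

```python
# Hypothetical subset of supported codes, taken from lists in this thread.
SUPPORTED = {
    "ar", "ca-ES", "ca-ES-valencia", "de-DE", "en-GB", "en-US", "fr", "zh-CN",
}


def resolve_code(requested, supported=SUPPORTED):
    """Return the supported code to use for `requested`.

    Tries an exact, case-insensitive match first, then the bare
    language code. Raises ValueError if neither is supported.
    """
    by_lower = {code.lower(): code for code in supported}
    key = requested.lower()
    if key in by_lower:
        return by_lower[key]
    base = key.split("-", 1)[0]
    if base in by_lower:
        return by_lower[base]
    raise ValueError(f"{requested!r} is not a supported language code")


if __name__ == "__main__":
    for code in ("ar-DZ", "fr-CA", "en-gb"):
        print(code, "->", resolve_code(code))
```

Under these assumptions ar-DZ, ar-TN and the other Arabic country codes would all map to ar, in the same way that any French country code would map to fr.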

linuxscout · March 24, 2017, 9:36pm

ok,
when I do tests with regression-test:
1- When I do
./regression-test.sh ar tests/tests 1000 semantic_errors

LT works and I get:
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 01:14 min
[INFO] Finished at: 2017-03-24T22:19:09+01:00
[INFO] Final Memory: 44M/392M
[INFO] ------------------------------------------------------------------------
3.01kB 0:00:00 [40.3MB/s] [========================================================================================>] 100%
Expected text language: Arabic (no spell checking active, specify a language variant like 'en-GB' if available)
Working on STDIN...

2- When I use ar-DZ,
./regression-test.sh ar-DZ tests/tests 1000 semantic_errors
LT runs, and I get:

`[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 01:46 min
[INFO] Finished at: 2017-03-24T22:33:51+01:00
[INFO] Final Memory: 44M/382M
[INFO] ------------------------------------------------------------------------
3.01kB 0:00:00 [ 70MB/s] [========================================================================================>] 100%
java.lang.IllegalArgumentException: 'ar-DZ' is not a language code known to LanguageTool. Supported language codes are: ar, ast-ES, be-BY, br-FR, ca-ES, ca-ES-valencia, da-DK, de, de-AT, de-CH, de-DE, de-DE-x-simple-language, el-GR, en, en-AU, en-CA, en-GB, en-NZ, en-US, en-ZA, eo, es, fa, fr, gl-ES, it, ja-JP, km-KH, nl, pl-PL, pt, pt-AO, pt-BR, pt-MZ, pt-PT, ro-RO, ru-RU, sk-SK, sl-SI, sv, ta-IN, tl-PH, uk-UA, zh-CN. The list of languages is read from META-INF/org/languagetool/language-module.properties in the Java classpath. See Java API - LanguageTool Wiki for details.
`

My spell checker is Hunspell, and it is configured for ar-DZ.
How can I solve this problem?

dnaber · March 25, 2017, 9:23am

That’s strange - does spell checking actually work or not? You might need to debug this, the message comes from org.languagetool.commandline.Main, line 441.