Simple language selection

jaumeortola · October 1, 2019, 4:19pm

The auto-detect language feature in the Chrome/Firefox add-on is useful, and it works quite well. But it makes a lot of mistakes with short sentences, and this can be very annoying in some contexts, for example on translation platforms, where most of the sentences are very short.

It would be very useful indeed to have a simple language selection option. Is it possible to implement it?

pep.bofarull · October 2, 2019, 6:59am

Dear Jaume, for me the current behavior is fine: the language is auto-detected, and you can change it manually when a short sentence is not matched correctly. I’m not sure I understand your proposal: is it to have the option of configuring the checker for just one language?

Ecron · October 2, 2019, 7:42am

Hi there, I’ve been using the Chrome extension myself, and it’s amazing how it detects the language automatically and checks for mistakes. I have configured 4 different languages (Catalan, Spanish, English and Polish), and in long texts there’s no chance of mistakes because, in the end, the languages are different from one another, so the autodetect functionality works like a charm.

Anyway, a few days ago I started translating apps using the Weblate service. There, the strings to be translated consist of sentences of 1 to 4 words (on average). In this scenario, the autodetect feature goes crazy and detects the text in the text field as the wrong language. For example, the string “Color” is detected as an English word and marked in red because it suggests “Colour”; “Finestra” is detected as an English word and it suggests “Fenestra”; “Descarrega paquets de pinzells del wiki del programa” is detected as Spanish and 4 words are marked in red; etc.

So, I’m with @jaumeortola on this topic: it would be great to be able to set a preferred language in the LT extension configuration, which you could enable or disable when you need to check the spelling of short sentences in only one language across a large number of websites (like translation platforms).

Ruud_Baars · October 2, 2019, 9:24am

Colour is the right word for British English, by the way.

tiff · October 2, 2019, 12:44pm

Hi,

We are aware of these problems and are planning to improve this soon.
Something similar to a “preferred languages” feature will be implemented.

Currently, for the first 50 characters of a text it only tries to detect the languages that are likely spoken by the user (selected based on the languages of his country, language of the website, languages configured in his browser). I guess in your case @jaumeortola it struggles in the detection of Catalan vs Spanish because they are quite similar.

We use the common_words.txt of each language supported by LT:
Spanish: languagetool/common_words.txt at master · languagetool-org/languagetool · GitHub
Catalan: languagetool/common_words.txt at master · languagetool-org/languagetool · GitHub
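
To make the idea concrete, here is a minimal sketch of common-word detection restricted to a set of candidate languages. It is not the LanguageTool implementation: the word sets are tiny made-up samples (not the real common_words.txt contents), and `tokenize` and `detect` are hypothetical names.

```python
import re

# Tiny illustrative samples, NOT the real common_words.txt contents.
COMMON_WORDS = {
    "ca": {"el", "la", "de", "del", "amb", "els", "per", "finestra", "programa"},
    "es": {"el", "la", "de", "del", "con", "los", "por", "ventana", "programa"},
    "en": {"the", "of", "and", "with", "for", "window", "color", "program"},
}


def tokenize(text):
    # Hypothetical tokenizer: lowercase runs of letters (accented ones included).
    return re.findall(r"[^\W\d_]+", text.lower())


def detect(text, candidates):
    """Return the candidate with the most common-word hits, or None if undecided."""
    if not candidates:
        return None
    tokens = tokenize(text)
    scores = {
        lang: sum(token in COMMON_WORDS[lang] for token in tokens)
        for lang in candidates
    }
    top = max(scores.values())
    # No hits at all, or a tie for first place, counts as undecided.
    if top == 0 or list(scores.values()).count(top) > 1:
        return None
    return max(scores, key=scores.get)
```

With these sample lists, a short string such as “Descarrega paquets de pinzells del wiki del programa” ties between Catalan and Spanish, because every word it matches is shared by both lists. That is the kind of case where short texts are hard to tell apart.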

It would be helpful to know the text for which a wrong language was detected. So, next time it happens, please share your text with us, and we will try to improve the language detection.

jaumeortola · October 2, 2019, 1:48pm

Thanks for the answer.

I was not aware of the method used for detecting the language. I will look at the code.

Perhaps these lists for Spanish and Catalan can be improved a little. What word tokenization is used in the language detection?

I think that a few regexp rules could be very effective for differentiating Spanish and Catalan, more effective than these lists of words.
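
A rough sketch of what such rules might look like, in Python for brevity. Every pattern here is an illustrative assumption, not a rule taken from LanguageTool, and `classify` is a hypothetical name. A real rule set would need testing against real texts.

```python
import re

# Illustrative hints only; none of these come from the LanguageTool code base.
CATALAN_HINTS = [
    re.compile(r"l·l"),                  # geminated l, as in "instal·lar"
    re.compile(r"ç"),                    # c with cedilla
    re.compile(r"[àèò]"),                # grave accents
    re.compile(r"\b[ldmnst]['’]\w"),     # elisions such as l'home, d'aigua
    re.compile(r"(?:ts|lls|ny)\b"),      # typical word endings: paquets, pinzells, any
    re.compile(r"\b(?:amb|i|els)\b"),    # very frequent function words
]
SPANISH_HINTS = [
    re.compile(r"ñ"),
    re.compile(r"[¿¡]"),                 # inverted punctuation
    re.compile(r"ción\b"),               # nouns such as "detección"
    re.compile(r"\b(?:con|y|los|las)\b"),  # very frequent function words
]


def classify(text):
    """Return "ca" or "es" when one side has more matching hints, else None."""
    lowered = text.lower()
    ca = sum(bool(p.search(lowered)) for p in CATALAN_HINTS)
    es = sum(bool(p.search(lowered)) for p in SPANISH_HINTS)
    if ca > es:
        return "ca"
    if es > ca:
        return "es"
    return None
```

For example, `classify("Descarrega paquets de pinzells del wiki del programa")` returns `"ca"` here only because of the word-ending hint, while a string like “Color” matches nothing and stays undecided.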

dnaber · October 2, 2019, 2:00pm

It’s a bit hacky… https://github.com/languagetool-org/languagetool/blob/master/languagetool-core/src/main/java/org/languagetool/language/CommonWords.java#L94-L96 (note this code only runs when FastText is enabled but its result is not reliable enough).

dnaber · October 2, 2019, 10:24pm

These would be very welcome (if coming as a separate class and with unit tests, it would be perfect).

pep.bofarull · October 3, 2019, 7:18am

I have taken a look at the two lists and saw strange words, numbers and punctuation. I think that if it is possible to eliminate those, along with the words written identically in the two languages, the lists can improve dramatically. The regexp rules will also be welcome. In my case, the confusion was sometimes not between Catalan and Spanish but between Catalan and Romanian.

dnaber · October 3, 2019, 11:29am

Items with numbers are ignored automatically already (here).

tiff · November 7, 2019, 11:42am

I noticed that the Catalan common_words.txt contains very Spanish words like “como” and “se” that, according to catalandictionary.org (Open Source English-Catalan Dictionary Project), don’t exist in the Catalan language. Can we regenerate the common_words.txt for Catalan with better input?

tiff · November 7, 2019, 11:45am

Same should be done for Galician, I guess.

jaumeortola · November 7, 2019, 12:26pm

“Como” is a possible verbal form in Catalan, but extremely unusual. It should be removed. “Se”, on the other hand, is a very common pronoun (a form of “es/se/s’…”).

I will generate a better list of words with the same word tokenization used here.

dnaber · November 7, 2019, 12:31pm

Please also make sure the list is about as long as it is now, to make sure no language gets a better chance of being detected just because there are more words for it.
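
A small sketch of how the regenerated lists could be cleaned and kept the same length. It assumes each raw list is ordered from most to least frequent, which may not hold for the real files, and `clean` and `balance` are hypothetical helper names.

```python
def clean(words):
    """Lowercase, drop entries with digits or without letters, and de-duplicate."""
    seen = set()
    result = []
    for raw in words:
        word = raw.strip().lower()
        has_digit = any(ch.isdigit() for ch in word)
        has_letter = any(ch.isalpha() for ch in word)
        if has_letter and not has_digit and word not in seen:
            seen.add(word)
            result.append(word)
    return result


def balance(lists_by_lang):
    """Clean every list, then cut all of them to the length of the shortest."""
    cleaned = {lang: clean(words) for lang, words in lists_by_lang.items()}
    limit = min((len(words) for words in cleaned.values()), default=0)
    # Assumes most frequent words come first, so truncation keeps the best ones.
    return {lang: words[:limit] for lang, words in cleaned.items()}
```

Truncating to the shortest list is only one way to keep the lists comparable. Padding the shorter lists with more words from a corpus would work as well.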