Speller options to detect words with numbers and UPPERCASE/CamelCase words

DArek · October 15, 2018, 7:00pm

I know that, for some languages, the speller can be configured not to ignore words with numbers, UPPERCASE words and CamelCase words by using the fsa.dict.speller options:

fsa.dict.speller.ignore-all-uppercase=false
fsa.dict.speller.ignore-camel-case=false
fsa.dict.speller.ignore-numbers=false

I can make these changes for most languages, but not for Danish, German, Portuguese and Ukrainian. Is it possible to stop ignoring such words for these languages too?

For Russian I see that words with Latin characters are ignored. Is it possible to configure the Russian speller to not ignore Latin words?

dnaber · October 15, 2018, 7:58pm

Do you really mean to not ignore them? Because for German, a made-up word like “SpulMasdef” is already not ignored, i.e. it is detected as an error.

DArek · October 15, 2018, 8:58pm

Actually, for German I only lack the option not to ignore words with numbers.

dnaber · October 15, 2018, 9:22pm

I can’t think of an option for that. The problem with numbers in German is that they get filtered out by some internal tokenization (nonWordPattern in HunspellRule). Maybe someone else has a better idea, but I think you might need to debug the LT code and make changes to it.
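To illustrate why a speller option alone cannot help here, below is a minimal Python sketch of tokenization that splits on a non-word pattern. The pattern and the function name `tokens_for_speller` are assumptions made up for this example; they are not copied from HunspellRule, whose actual nonWordPattern may differ.

```python
import re

# Hypothetical pattern, NOT the real nonWordPattern from HunspellRule:
# any run of characters that are not (German) letters acts as a separator.
NON_WORD = re.compile(r"[^a-zA-ZäöüÄÖÜß]+")

def tokens_for_speller(text: str) -> list[str]:
    """Split text on non-word runs and drop the empty pieces."""
    return [t for t in NON_WORD.split(text) if t]

# Digits are consumed as separators, so the speller never sees
# "F16" or "Haus2bau" as whole words, only their letter fragments.
print(tokens_for_speller("Die F16 und Haus2bau"))
```

Under this assumption, a word with digits is broken apart before the dictionary lookup happens, which is why the behavior would have to be changed in the tokenization code rather than through a dictionary option.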

Ruud_Baars · October 16, 2018, 5:47am

These options have a large impact, so please do not touch them for Dutch. CamelCase words are often proper names that are also frequently written without CamelCase, which is a good reason to have them in the speller. The same applies to words like F16 (the plane). All-uppercase words are not a problem, except for longer words, for which there is a rule.

DArek · October 16, 2018, 7:16am

Do not worry. My question was not a request for a change; I just wanted to know whether end users can change the spell-checker behavior themselves.

I’m preparing a new version of the plugin for SDL Trados Studio, and it would be useful for me not to ignore some misspellings, because the detected misspellings will be further processed and compared with the source in the bilingual text. So I am not worried about too many false positives.

After all, I think that the most efficient way to catch words with numbers and in uppercase letters will be to create dedicated rules for each language.
However, it would be nice to include the default speller behavior in the languages overview table.

Ruud_Baars · October 16, 2018, 7:47am

Okay, I understand. I was a bit worried, because I noticed someone, at some time, switched off a lot of rules I sweated on, without any comments in the XML.

arysin · October 16, 2018, 5:08pm

I would think the fsa.dict* options are common to all languages that use the Morfologik speller, so I am not sure why they can’t be changed for Ukrainian. Did you find that those options don’t work for Ukrainian?
Note: we also have some special handling of such words in tagging and grammar rules, so I am not sure how useful changing just the speller would be.

DArek · October 17, 2018, 5:44am

Actually, I did not find these options for Ukrainian. I will have to manage in a different way, probably by using rules. I need rules to detect Latin-character words, words with numbers, UPPERCASE words and camelCase words. I found that there is already a rule for camelCase words.
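As a rough illustration of the four categories listed above, here is a small Python sketch that classifies a single token outside of LanguageTool, for example in a plugin's own post-processing step. It is not LanguageTool rule syntax, and the function names and the "at least two letters" threshold for uppercase words are assumptions chosen for this example.

```python
import unicodedata

def has_digit(word: str) -> bool:
    """True if the word contains at least one digit."""
    return any(ch.isdigit() for ch in word)

def is_all_upper(word: str) -> bool:
    """True if the word has two or more letters and all are uppercase."""
    letters = [ch for ch in word if ch.isalpha()]
    return len(letters) >= 2 and all(ch.isupper() for ch in letters)

def is_camel_case(word: str) -> bool:
    """True if a lowercase letter is directly followed by an uppercase one."""
    return any(a.islower() and b.isupper() for a, b in zip(word, word[1:]))

def has_latin(word: str) -> bool:
    """True if any letter belongs to the Latin script (by Unicode name)."""
    return any(
        ch.isalpha() and unicodedata.name(ch, "").startswith("LATIN")
        for ch in word
    )

def classify(word: str) -> list[str]:
    """Return the sorted list of categories that apply to the word."""
    checks = {
        "camelcase": is_camel_case,
        "latin": has_latin,
        "number": has_digit,
        "uppercase": is_all_upper,
    }
    return sorted(name for name, check in checks.items() if check(word))

for token in ["F16", "NATO", "SpulMasdef", "привет"]:
    print(token, classify(token))
```

The Latin check is only meaningful for languages written in another script, such as Russian or Ukrainian; for Latin-script languages it would flag nearly every word.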