Characters to ignore?

dnaber · April 11, 2021, 8:57pm

I’ve added code to ignore a character, namely \uFEFF (zero-width non-breaking space). If you think other characters should be ignored when checking text, please let me know.
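As a rough illustration only (this is not the code from the commit), ignoring a character before checking could be sketched as below. The class name `IgnoredChars` and the method `strip` are hypothetical, and the set contains only U+FEFF because that is the only character named here.

```java
import java.util.Set;

// Hypothetical helper for illustration; not LanguageTool's actual implementation.
final class IgnoredChars {

  // Characters dropped before checking. Only U+FEFF is named in this thread;
  // any further entries would be an assumption.
  private static final Set<Character> IGNORED = Set.of('\uFEFF');

  private IgnoredChars() {}

  /** Returns the text with every ignored character removed. */
  static String strip(String text) {
    StringBuilder sb = new StringBuilder(text.length());
    for (int i = 0; i < text.length(); i++) {
      char c = text.charAt(i);
      if (!IGNORED.contains(c)) {
        sb.append(c);
      }
    }
    return sb.toString();
  }
}
```

Note that removing characters shifts all later offsets, so a real checker would also have to map match positions back to the original text; this sketch does not attempt that.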

The commit was this one: “new attempt to better deal with zero-width non-breaking space, namely…” (languagetool-org/languagetool@046ca9e on GitHub)

Ruud_Baars · April 12, 2021, 4:58pm

The soft hyphen is already ignored, isn’t it?

dnaber · April 12, 2021, 7:40pm

There are some special cases for the soft hyphen in some places in the code, yes. If you have a case where it should be ignored but it’s not, please let me know.

arysin · April 12, 2021, 9:01pm

Is there a reason why the handling of “\uFEFF” is different from that of “\u00AD”?

dnaber · April 12, 2021, 9:03pm

There shouldn’t be, but when I tried to use the existing way to handle “\uFEFF”, some speller tests started to fail.

Ruud_Baars · April 13, 2021, 2:46pm

In general, the character sits in the middle of words that look okay and, apart from that character, are okay. So it should be ignored when spell-checking, and for POS tagging as well.
Having words that contain it in the dictionary would not be great; better to throw a warning for these when building a dictionary, just in case.
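The build-time warning suggested above might look roughly like the following sketch. The class `DictionaryEntryCheck` and its method are hypothetical and not part of LanguageTool's dictionary tooling; checking U+00AD alongside U+FEFF is an assumption based on the soft hyphen being discussed earlier in the thread.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical build-time check for illustration; not an existing LanguageTool class.
final class DictionaryEntryCheck {

  private DictionaryEntryCheck() {}

  /** Returns one warning per entry that contains U+FEFF or U+00AD. */
  static List<String> findSuspiciousEntries(List<String> words) {
    List<String> warnings = new ArrayList<>();
    for (String word : words) {
      if (word.indexOf('\uFEFF') >= 0 || word.indexOf('\u00AD') >= 0) {
        warnings.add("Warning: entry contains an invisible character: " + escape(word));
      }
    }
    return warnings;
  }

  // Replaces the invisible characters with a readable marker for the warning text.
  private static String escape(String word) {
    return word.replace("\uFEFF", "<U+FEFF>").replace("\u00AD", "<U+00AD>");
  }
}
```

Returning the warnings as a list, rather than failing the build, matches the "just in case" spirit: the dictionary still gets built, but the odd entries become visible.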