We’re receiving a lot of user suggestions for the German dictionary that include a hyphen, such as ‘SNOWBOARD-WM’.
AFAIK, most spell checkers treat the hyphen as a word separator. Is there a reason why we are not doing it that way? Treating the hyphen as part of the word certainly has some advantages, but the downside is a huge number of false positives. Any ideas?
There must be some special case, because Snowboard-WM is accepted. Feel free to open an issue about that.
SNOWBOARD-WM came out of the OCR tool, so I suppose it has something to do with the “-”; maybe it is a different character than the normal hyphen.
Unfortunately most spellcheckers only do the treat-as-separate-words thing when no user-defined words are involved.
@dnaber That’s true. I tested the following four sentences:
- Die US-NOTENBANK sagt ja.
- Die US-Notenbank sagt ja.
- Eine andere NOTENBANK sagt nein.
- Die XCFG-Notenbank reagiert gelassen.
Only the first one yields an unexpected result, namely a spelling error. So apparently the gist is that the spell checker can’t handle hyphenated compounds written as all-caps. Do you think this is worth opening an issue?
Generally yes, but that doesn’t mean I’m going to work on it.
Thank you, that is a very good suggestion. There are a lot of characters that are visually indistinguishable from the normal hyphen. But in this case I made sure that that is not the problem at hand by typing it out manually.
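For anyone who wants to rule this out on OCR output rather than by retyping, here is a minimal Python sketch. The set of look-alike characters is my own selection and certainly not complete, and the function names are made up for illustration. I left the en dash out on purpose, because it is legitimate punctuation in German text.

```python
# Characters that render like the ASCII hyphen-minus (U+002D) but are
# different code points. This selection is an assumption, not a full list.
HYPHEN_LOOKALIKES = {
    "\u2010": "-",  # HYPHEN
    "\u2011": "-",  # NON-BREAKING HYPHEN
    "\u2012": "-",  # FIGURE DASH
    "\u2212": "-",  # MINUS SIGN
    "\u00ad": "",   # SOFT HYPHEN (normally invisible, so drop it)
}
_TABLE = str.maketrans(HYPHEN_LOOKALIKES)


def find_lookalikes(text):
    """Return (index, code point) for every look-alike found in text."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if ch in HYPHEN_LOOKALIKES
    ]


def normalize_hyphens(text):
    """Replace the look-alikes with a plain hyphen-minus."""
    return text.translate(_TABLE)
```

If `find_lookalikes` returns an empty list for the flagged word, the hyphen character itself is probably not the culprit.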
OK, let’s bury that. Now that I understand the underlying mechanism it seems far less important than I thought it was.
Could you please explain in a nutshell how I could address this problem when preprocessing the texts for spellchecking? I think most of the uppercase words come out of my tool.
Thank you in advance
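One possible preprocessing step, going by the four test sentences above (all-caps hyphenated compounds are flagged, the capitalized form is not), is to re-case such compounds before sending the text to the checker. This is only a sketch: the function name and the length threshold for "probably an abbreviation" are my own assumptions, and it will mis-case longer acronyms.

```python
import re

# An all-caps token made of two or more parts joined by hyphens,
# e.g. "US-NOTENBANK" or "SNOWBOARD-WM".
_ALLCAPS_COMPOUND = re.compile(r"\b[A-ZÄÖÜ]+(?:-[A-ZÄÖÜ]+)+\b")

# Assumption: parts this short are abbreviations (US, WM) and stay upper-case.
MAX_ABBREV_LEN = 3


def _recase_part(part):
    if len(part) <= MAX_ABBREV_LEN:
        return part
    return part[0] + part[1:].lower()


def recase_allcaps_compounds(text):
    """Turn e.g. 'US-NOTENBANK' into 'US-Notenbank'; leave other text alone."""
    def fix(match):
        return "-".join(_recase_part(p) for p in match.group(0).split("-"))

    return _ALLCAPS_COMPOUND.sub(fix, text)
```

Words that are all-caps but not hyphenated are left untouched, since those did not produce an error in the test.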
Hi all,
Just resuming this convo. I think hyphenated compounds are fine and Langtool works well on these. The issue I seem to be facing is with initial hyphens:
-Kann sein.
-Oder ihr macht es nie.
-Denkst du, das ist hier der Fall?
-Was?
Is there any rule we could set to ignore a hyphen directly before the first letter of a line?
Thank you!
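I don't know of a built-in setting for this, so as a workaround on the preprocessing side, something like the following Python sketch might do. It replaces a line-initial hyphen with a space instead of deleting it, so the character offsets of any reported errors still line up with the original text. The function name is made up, and I haven't checked whether the leading space triggers some other rule.

```python
import re

# A hyphen at the very start of a line, directly followed by a letter.
# "- item" (hyphen plus space) is deliberately not matched.
_LEADING_HYPHEN = re.compile(r"^-(?=[^\W\d_])", flags=re.MULTILINE)


def mask_leading_hyphens(text):
    """Replace each line-initial dialogue hyphen with a single space."""
    return _LEADING_HYPHEN.sub(" ", text)
```

For example, `mask_leading_hyphens("-Kann sein.\n-Was?")` gives `" Kann sein.\n Was?"`, which has the same length as the input.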
I was wondering about something similar.
Would it be possible to ignore ‘words’ that start with ellipsis+hyphen?
E.g.:
Turning on the TV, she heard the 8 o’clock news anchor say “…-ven-Four-Seven has crashed at Charles de Gaulle. While the emergency services have already responded, there is little they can do unt-…” before she tuned in to her intended channel.
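Until there is an answer on the rule side, one way to handle this outside the checker is to locate such cut-off fragments yourself and discard any spelling errors that fall inside them. A rough Python sketch, assuming a fragment is either an ellipsis plus hyphen followed by word characters, or word characters followed by hyphen plus ellipsis (the function name is hypothetical):

```python
import re

# Ellipsis is either the single character U+2026 or three ASCII dots.
_ELLIPSIS = r"(?:\u2026|\.{3})"

# "…-ven" (cut off at the start) or "unt-…" (cut off at the end).
_TRUNCATED = re.compile(rf"{_ELLIPSIS}-\w+|\w+-{_ELLIPSIS}")


def truncated_fragment_spans(text):
    """Return (start, end) offsets of fragments cut off by an ellipsis."""
    return [m.span() for m in _TRUNCATED.finditer(text)]
```

Only the cut-off piece itself is covered, so in “…-ven-Four-Seven” the span is “…-ven” and “Four-Seven” would still be checked normally.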