TL;DR: Should the English tokenizer split on hyphens?
I’m new to LanguageTool; thank you, everyone, for such a great tool and resource! It’s very possible I’m doing something wrong, but I’m having trouble modifying the existing DASH_RULE to follow what is, to the best of my knowledge, correct British typography (Wikipedia is obviously far from authoritative, but it repeats what is available elsewhere and is easy to link to).
I have no problem swapping em dashes for en dashes in all the rules of this group. I struggle, though, with adapting the one rule that deals with numerical ranges or time ranges. It should catch incorrectly written ranges (separated with hyphens or em dashes, with or without spaces around them) and correct them all to use en dashes without spaces. I have no problem correcting both spaced and unspaced em dashes and a spaced hyphen to an en dash (–).
The one thing I cannot do is replace an unspaced hyphen (-) with an unspaced en dash. I’ve tried swapping
‐, but it’s not accepted in
Simply put, this reduced test case:
    <rule>
      <pattern>
        <token regexp='yes'>\d+</token>
        <token regexp='yes'>-</token>
        <token regexp='yes'>\d+</token>
      </pattern>
      <message>Consider using an en dash, if you want to indicate numerical ranges or time ranges.</message>
      <suggestion>\1–\3</suggestion>
      <short>Use an en dash.</short>
      <example correction='1901–1978'>Vitorino Nemésio (<marker>1901-1978</marker>) – writer and university teacher.</example>
    </rule>
doesn’t work. That is, it doesn’t work when I select English. It does, though, when I select Portuguese (the original language of DASH_RULE).
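If the cause is that the English tokenizer keeps `1901-1978` as a single token (an assumption I have not verified), one possible workaround might be a single-token variant of the rule. The sketch below is untested. It matches the whole range as one token and relies on the `regexp_match` and `regexp_replace` attributes of `<match>` to rewrite the hyphen inside that token:

```xml
<rule>
  <pattern>
    <!-- Assumption: with English selected, "1901-1978" arrives as ONE token -->
    <token regexp='yes'>\d+-\d+</token>
  </pattern>
  <message>Consider using an en dash, if you want to indicate numerical ranges or time ranges.</message>
  <!-- Rewrite the matched token itself: keep both numbers, swap the hyphen for an en dash -->
  <suggestion><match no='1' regexp_match='(\d+)-(\d+)' regexp_replace='$1–$2'/></suggestion>
  <short>Use an en dash.</short>
  <example correction='1901–1978'>Vitorino Nemésio (<marker>1901-1978</marker>) – writer and university teacher.</example>
</rule>
```

This only sidesteps the tokenizer question rather than answering it, and it would presumably stop matching in any language whose tokenizer does split on the hyphen.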
- Do you think the English tokenizer should be modified?
- How would I modify it locally?
- Is there another, better way?
I’ve been going in circles for hours… and that’s before trying to create a rule that replaces the hyphen with a proper minus sign (U+2212) in all mathematical formulas…
Thank you for any insights and for pointing out my mistakes.