[en GB] Adjusting DASH_RULE for British typography

paperboyo · October 16, 2020, 6:33am

Hello,

TL;DR Should English tokenizer split on hyphen?

I’m new to LanguageTool; thank you, everyone, for such a great tool and resource! It’s very possible I’m doing something wrong, but I’m having trouble modifying the existing DASH_RULE to follow what is, to the best of my knowledge, correct British typography (Wikipedia is, obviously, far from authoritative, but it repeats what is available elsewhere and is easy to link to).

I have no problem swapping em dashes for en dashes in all the rules of this group. I struggle, though, with adapting the one rule that deals with numerical ranges or time ranges. It should catch incorrectly written ranges (separated with hyphens or em dashes, either with spaces around them or without) and correct them all to use en dashes without spaces. I have no problem correcting “ — ”, “—” and “ - ” to “–” (spaced and unspaced em dashes and a spaced hyphen)…
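To make the target behaviour concrete, here is a minimal Python sketch of the normalisation described above. It works on plain strings with an ordinary regex rather than on LanguageTool tokens, and the names `RANGE` and `normalise_ranges` are made up for illustration.

```python
import re

EN_DASH = "\u2013"
EM_DASH = "\u2014"

# digits, optional spaces, a hyphen or an em dash, optional spaces, digits
RANGE = re.compile(rf"(\d+)\s*[-{EM_DASH}]\s*(\d+)")

def normalise_ranges(text):
    """Rewrite every numeric range in text as digits + unspaced en dash + digits."""
    return RANGE.sub(rf"\g<1>{EN_DASH}\g<2>", text)

for sample in ["1901 - 1978", "1901-1978", f"1901 {EM_DASH} 1978", f"1901{EM_DASH}1978"]:
    print(repr(sample), "->", repr(normalise_ranges(sample)))
```

All four spellings come out as an unspaced en-dash range. This naive version would also touch the first hyphen of an ISO date such as 2020-10-16, so a real rule needs an exception for dates.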

One thing I cannot do is replace an unspaced hyphen (-) with an unspaced en dash. I’ve tried swapping - for &hyphen;, but it’s not accepted in <marker> inside <example>.

Simply, this (reduced test case):

            <rule>
                <pattern>
                    <token regexp='yes'>\d+</token>
                    <token regexp='yes'>-</token>
                    <token regexp='yes'>\d+</token>
                </pattern>
                <message>Consider using an en dash, if you want to indicate numerical ranges or time ranges.</message>
                <suggestion>\1–\3</suggestion>
                <short>Use an en dash.</short>
                <example correction='1901–1978'>Vitorino Nemésio (<marker>1901-1978</marker>) – writer and university teacher.</example>
            </rule>

doesn’t work. That is, it doesn’t when I select English. It does, though, when I select Portuguese (the original language of DASH_RULE).

Could it be down to the difference between the Portuguese tokenizer and the English one (which splits on an en dash but, somehow, not on a hyphen)?

Do you think the English tokenizer should be modified?
How would I modify it locally?
Is there another, better way?
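A toy Python model may help picture the suspected difference. It is not LanguageTool's tokenizer code: both tokenizer functions are hypothetical simplifications, one keeping an unspaced hyphen inside the token and one splitting on every hyphen.

```python
import re

def tokenize_keep_hyphen(text):
    """Toy model: a hyphen directly between word characters stays inside the token."""
    return re.findall(r"\w+(?:-\w+)*|[^\w\s]", text)

def tokenize_split_hyphen(text):
    """Toy model: every hyphen becomes a token of its own."""
    return re.findall(r"\w+|[^\w\s]", text)

def matches_three_token_pattern(tokens):
    """True if three consecutive tokens look like: digits, '-', digits."""
    for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
        if a.isdigit() and b == "-" and c.isdigit():
            return True
    return False

text = "(1901-1978)"
kept = tokenize_keep_hyphen(text)    # the range is a single token here
split = tokenize_split_hyphen(text)  # the range is three tokens here
print(kept, matches_three_token_pattern(kept))
print(split, matches_three_token_pattern(split))
```

Under this model the three-token pattern can only fire when the hyphen is split off, which would also explain why the spaced form "1901 - 1978" is caught either way.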

I’ve been going in circles for hours… and that’s before trying to create a rule to replace the hyphen with a proper minus (U+2212) in all mathematical formulas…

Thank you for any insights and pointing out my mistakes.

Regards
Mateusz

Mike_Unwalla · October 16, 2020, 7:32am

@paperboyo,

Try this:

<rule id="DASH_TEST" name="Dash test">
    <pattern>
        <token regexp='yes'>\d+-\d+</token>
    </pattern>
    <message>Consider using an en dash, if you want to indicate numerical ranges or time ranges.</message>
    <suggestion><match no="1" regexp_match="(\d+)-(\d+)" regexp_replace="$1–$2"/></suggestion>
    <short>Use an en dash.</short>
    <example correction='1901–1978'>Vitorino Nemésio (<marker>1901-1978</marker>) – writer and university teacher.</example>
    <example>Vitorino Nemésio (<marker>1901–1978</marker>) – writer and university teacher.</example>
    <example>Today's date is 2020-10-16.</example>
</rule>

Refer to the Development Overview on dev.languagetool.org.
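The rule above treats the whole range as one token and rewrites it with a regex replacement. Here is a rough Python equivalent, assuming (as far as I understand it) that a regexp token has to match the entire token. The function name `suggest` is hypothetical, and Python writes back-references as `\1` where the Java-style `regexp_replace` uses `$1`.

```python
import re

EN_DASH = "\u2013"
TOKEN = re.compile(r"\d+-\d+")  # same regex as the <token> in the rule

def suggest(token):
    """Return the en-dash suggestion, or None if the rule would not fire."""
    if TOKEN.fullmatch(token) is None:
        return None
    # Counterpart of regexp_match="(\d+)-(\d+)" with regexp_replace="$1-$2" (en dash)
    return re.sub(r"(\d+)-(\d+)", r"\1" + EN_DASH + r"\2", token)

for token in ["1901-1978", "1901" + EN_DASH + "1978", "2020-10-16"]:
    print(repr(token), "->", repr(suggest(token)))
```

Under that whole-token assumption, only the hyphenated range gets a suggestion. The en-dash version and the date 2020-10-16 return None, which is consistent with the two examples in the rule that have no correction.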

paperboyo · October 16, 2020, 10:25am

I knew I was missing something! Thanks, Mike!

Now I need to think hard about how to bring day-of-the-week and month ranges back into this approach (also from the original):

                    <token regexp='yes'>\d+|&months;|&abbrevMonths;|&weekdays;|&abbrevWeekdays;</token>
                    <token regexp='yes'>-|–</token>
                    <token regexp='yes'>\d+|&months;|&abbrevMonths;|&weekdays;|&abbrevWeekdays;</token>

without writing multiple rules…
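One way to picture a single-rule version is to build the "one side of a range" alternative once and use it on both sides of the hyphen. This Python sketch is an illustration only: the `MONTHS` and `WEEKDAYS` lists are short made-up subsets, not the contents of the real entities, and `fix_range` is a hypothetical name.

```python
import re

EN_DASH = "\u2013"

# Illustrative subsets only; the real entities cover every month and weekday form.
MONTHS = ["January", "February", "March"]
WEEKDAYS = ["Monday", "Tuesday", "Wednesday"]

# One side of a range: digits or any listed name.
SIDE = r"\d+|" + "|".join(MONTHS + WEEKDAYS)
RANGE = re.compile(rf"({SIDE})-({SIDE})")

def fix_range(token):
    """Swap the hyphen for an en dash when the whole token is a range, else None."""
    m = RANGE.fullmatch(token)
    return f"{m.group(1)}{EN_DASH}{m.group(2)}" if m else None

for token in ["1901-1978", "Monday-Wednesday", "January-March", "well-known"]:
    print(repr(token), "->", repr(fix_range(token)))
```

Because the match is anchored to the whole token, ordinary hyphenated words and three-part dates fall through unchanged.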

Is the Portuguese tokenizer’s approach (split on a hyphen unless the word is in the dictionary) not appropriate for English? Or might it be bad for performance?

Mike_Unwalla · October 17, 2020, 9:14am

I don’t know. That is a question for one of the Java developers. Maybe @danielnaber can answer.

dnaber · October 17, 2020, 9:24am

I’m not sure; I think it would require touching quite a few rules (and maybe also code).

jaumeortola · October 17, 2020, 9:47am

In my experience with other languages, it is the best approach. It is the current approach in at least Portuguese, Spanish, French and Catalan. Related to this, I recently opened this issue: [en] spelling suggestions for words with hyphen · Issue #3707 · languagetool-org/languagetool · GitHub.

It would require touching some rules, as Daniel says. I do not know how much effort it would involve.

paperboyo · October 17, 2020, 4:24pm

Thanks so much, all! I may add a description of my use case to the GitHub issue. Because the hyphen is what any keyboard/OS enters by default (unless one uses a clever text editor that swaps it for the correct character live), it can mean literally any type of dash, a proper minus, etc.

Here is my current modification of the English DASH_RULE. I haven’t tested it extensively yet, and I also need to ask around whether people who actually understand regex have any opinions (it took me a good while to understand that <suggestion> cannot include disjunctions (|) or entities).

        <rulegroup id="DASH_RULE" name='Dashes en-GB'>
            <!-- Created by Tiago F. Santos, 2017-01-23 -->
            <!-- Localised to English by Marco A.G.Pinto, 2017-04-02 -->
            <!-- Localised to English GB by paperboyo, 2020/10/17 -->
            <url>https://en.wikipedia.org/wiki/Dash#En_dash</url>
            <antipattern>
                <token regexp='yes'>-</token>
                <token regexp='yes'>-</token>
                <token regexp='yes'>-</token>
            </antipattern>
            <rule>
                <antipattern>
                    <!-- typical signature delimiter in emails -->
                    <token>-</token>
                    <token>-</token>
                </antipattern>
                <pattern>
                    <token postag='SENT_START'/>
                    <token min='0' regexp='yes'>["«»“”]</token>
                    <marker>
                        <token regexp='yes'>-|—</token>
                    </marker>
                </pattern>
                <message>Consider using an en dash in dialogues and enumerations.</message>
                <suggestion>–</suggestion>
                <short>Use an en dash.</short>
                <example correction='–'><marker>-</marker> What is that, mother?</example>
                <example correction='–'>« <marker>-</marker> What is that, mother?</example>
                <example>– It's your birthday present, my daughter.</example>
                <example>---------------------------------------</example>
            </rule>
            <rule>
                <antipattern>
                    <token regexp='yes'>\d+|&months;|&abbrevMonths;|&weekdays;|&abbrevWeekdays;</token>
                    <token regexp='yes'>-|—</token>
                    <token regexp='yes'>\d+|&months;|&abbrevMonths;|&weekdays;|&abbrevWeekdays;</token>
                </antipattern>
                <pattern>
                    <marker>
                        <token spacebefore="yes" regexp='yes'>-|—</token>
                    </marker>
                    <token spacebefore="yes"/>
                </pattern>
                <message>Consider using an en dash if you do not want to join two words.</message>
                <suggestion>–</suggestion>
                <short>Use an en dash.</short>
                <example correction='–'>In these educational establishments there were enrollments <marker>-</marker> mostly from elementary school — and a total of teachers.</example>
                <example correction='–'>Institute Ricci de Macau <marker>-</marker> Association of cultural promotion of the Company of Jesus in Macau</example>
                <example>In the Midwest and Northwest portion are higher elevations, reaching 500 meters above sea level, highlighting Serra do Tumucumaque and Sierra Lombarda.</example>
            </rule>
            <rule>
				<antipattern><!-- XXX YYYY-XX-XX date formats and XX-XX-YYYY -->
					<token regexp="yes">\d{2}|\d{4}</token>
					<token regexp="yes" spacebefore='no'>–|-</token>
					<token regexp="yes" spacebefore='no'>\d{2}</token>
					<token regexp="yes" spacebefore='no'>–|-</token>
					<token regexp="yes" spacebefore='no'>\d{2}|\d{4}</token>
				</antipattern>
                <pattern>
                    <token regexp='yes'>\d+|&months;|&abbrevMonths;|&weekdays;|&abbrevWeekdays;</token>
                    <token regexp='yes'>-|—</token>
                    <token regexp='yes'>\d+|&months;|&abbrevMonths;|&weekdays;|&abbrevWeekdays;</token>
                </pattern>
                <message>Consider using an en dash (half dash), if you want to indicate numerical ranges or time ranges.</message>
                <suggestion>\1–\3</suggestion>
                <short>Use an en dash.</short>
                <example correction='1901–1978'>Vitorino Nemésio (<marker>1901 - 1978</marker>) – writer and university teacher.</example>
            </rule>
            <rule><!-- duplicate of the above rule to work around the current tokenizer’s hyphen behaviour, see https://github.com/languagetool-org/languagetool/issues/3707 -->
				<antipattern><!-- XXX YYYY-XX-XX date formats and XX-XX-YYYY -->
					<token regexp="yes">\d{2}|\d{4}</token>
					<token regexp="yes" spacebefore='no'>–|-</token>
					<token regexp="yes" spacebefore='no'>\d{2}</token>
					<token regexp="yes" spacebefore='no'>–|-</token>
					<token regexp="yes" spacebefore='no'>\d{2}|\d{4}</token>
				</antipattern>
				<pattern>
					<token regexp='yes'>(\d+|&months;|&abbrevMonths;|&weekdays;|&abbrevWeekdays;)-(\d+|&months;|&abbrevMonths;|&weekdays;|&abbrevWeekdays;)</token>
				</pattern>
				<message>Consider using an en dash, if you want to indicate numerical ranges or time ranges.</message>
				<suggestion><match no="1" regexp_match="(\d+|January|February|March|April|May|June|July|August|September|October|November|December|Jan|Feb|Mar|Apr|Jun|Jul|Aug|Sept?|Oct|Nov|Dec|Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday|Mon?|Tu|Tue|Tues|We|Wed|Weds|Th|Thu|Thur|Thurs|Fri?|Sat?|Sun?)-(\d+|January|February|March|April|May|June|July|August|September|October|November|December|Jan|Feb|Mar|Apr|Jun|Jul|Aug|Sept?|Oct|Nov|Dec|Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday|Mon?|Tu|Tue|Tues|We|Wed|Weds|Th|Thu|Thur|Thurs|Fri?|Sat?|Sun?)" regexp_replace="$1–$2"/></suggestion>
				<short>Use an en dash.</short>
				<example correction='1901–1978'>Vitorino Nemésio (<marker>1901-1978</marker>) – writer and university teacher.</example>
            </rule>
			<rule>
				<antipattern>
					<token regexp='yes'>-</token>
					<token regexp='yes'>-</token>
					<token regexp='yes'>-</token>
					<token regexp='yes'>,</token>
				</antipattern>
				<pattern>
					<marker>
						<token regexp='yes' spacebefore='yes'>-|—</token>
					</marker>
					<token spacebefore="no">,</token>
				</pattern>
				<message>If this is a parenthetical clause, finish it with an en dash.</message>
				<suggestion>–</suggestion>
				<short>Use an en dash.</short>
				<example correction='–'>…– as aforementioned <marker>-</marker>,…</example>
				<example>A better or more prestigious quality or status: A-, A or A+</example>
				<example>Sometimes the syndrome is divided into low-, medium- or high-functioning autism</example>
				<example>--------, Schopenhauer, The Human Character.</example>
			</rule>
        </rulegroup>
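The date antipattern in the two range rules is meant to exempt strings shaped like YYYY-XX-XX or XX-XX-YYYY. Here is a small Python check of that shape. It is a sketch under one assumption: the five separate tokens are modelled as a single regex, which is a simplification. `DATE_SHAPE` and `looks_like_date` are hypothetical names.

```python
import re

EN_DASH = "\u2013"
SEP = f"[{EN_DASH}-]"  # en dash or hyphen, as in the antipattern's separator tokens

# (2 or 4 digits) sep (2 digits) sep (2 or 4 digits)
DATE_SHAPE = re.compile(rf"(?:\d{{2}}|\d{{4}}){SEP}\d{{2}}{SEP}(?:\d{{2}}|\d{{4}})")

def looks_like_date(text):
    """True if text has the date shape that the antipattern is meant to skip."""
    return DATE_SHAPE.fullmatch(text) is not None

for sample in ["2020-10-16", "16-10-2020", "1901-1978"]:
    print(sample, looks_like_date(sample))
```

The two dates fit the shape and the plain range 1901-1978 does not, so only the range should remain a candidate for the en-dash suggestion.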

Not that I feel confident proposing it for en-GB formally (I don’t!), but just out of curiosity: what is the proper way for a language-variant rule to override a general-language one? Or would every language variant (en-GB, en-CA) need to contain its own (even unmodified) version, with the rule removed from general en?

Thank you again for all your help and for being so responsive. Have a nice weekend!

Mike_Unwalla · October 20, 2020, 7:27am

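For a variant-specific rule: put the rule in the grammar.xml file that is in the language subfolder. Thus, your rule goes in ..\org\languagetool\rules\en\en-GB\grammar.xml.

Would every language variant need to contain its own version? Yes.

Should the rule be removed from general en? Yes, if the general rule is applicable only to some language variants.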