Checking the white space character

jaumeortola · December 14, 2019, 9:18am

I would like to write some advanced typographical rules for checking the white space used in some contexts.

To make the implementation usable across languages, each token should store the whitespace character that precedes it, alongside the existing “isWhitespaceBefore” flag. A shared rule filter could then check that preceding whitespace character, and each language could write its own XML rules that use the filter.

What do you think? Is this approach reasonable?

dnaber · December 14, 2019, 9:52am

Could you maybe post some examples of such rules? I’m not sure if I understood your implementation idea.

jaumeortola · December 14, 2019, 10:15am

The XML rule would be something like this:

<rule id="USE_NBSP" name="require non-breaking space">
    <pattern>
        <token regexp="yes">\d+</token>
        <token postag="Y"></token> <!--units-->
    </pattern>
    <filter class="org.languagetool.rules.checkWhitespaceFilter" args="whitespaceChar:&#160; position:2"/>
    <message>Use a non-breaking space between number and units.</message>
    <suggestion>\1&#160;\2</suggestion>
    <example correction="3&#160;km"><marker>3 km</marker></example>
</rule>
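
To make the idea concrete, here is a rough, standalone Java sketch of what such a check could look like. It is not the LanguageTool API: the `Token` class, `tokenize`, and `needsFix` are hypothetical names used only for illustration, and the sketch assumes that `position` is 1-based and that the filter compares the whole whitespace run before the token with the expected character.

```java
public class WhitespaceSketch {

    /** Hypothetical token: its text plus the whitespace that preceded it ("" if none). */
    public static final class Token {
        public final String text;
        public final String whitespaceBefore;

        Token(String text, String whitespaceBefore) {
            this.text = text;
            this.whitespaceBefore = whitespaceBefore;
        }

        public boolean isWhitespaceBefore() {
            return !whitespaceBefore.isEmpty();
        }
    }

    /** Character.isWhitespace() excludes no-break spaces, so U+00A0 and U+202F are added explicitly. */
    static boolean isSpace(char c) {
        return Character.isWhitespace(c) || c == '\u00A0' || c == '\u202F';
    }

    /** Naive tokenizer (assumption): splits on whitespace and remembers the whitespace run before each token. */
    public static java.util.List<Token> tokenize(String text) {
        java.util.List<Token> tokens = new java.util.ArrayList<>();
        StringBuilder pendingSpace = new StringBuilder();
        StringBuilder word = new StringBuilder();
        String spaceBeforeWord = "";
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isSpace(c)) {
                if (word.length() > 0) {
                    tokens.add(new Token(word.toString(), spaceBeforeWord));
                    word.setLength(0);
                }
                pendingSpace.append(c);
            } else {
                if (word.length() == 0) {
                    spaceBeforeWord = pendingSpace.toString();
                    pendingSpace.setLength(0);
                }
                word.append(c);
            }
        }
        if (word.length() > 0) {
            tokens.add(new Token(word.toString(), spaceBeforeWord));
        }
        return tokens;
    }

    /** Returns true when the token at the 1-based position is not preceded by exactly the expected whitespace. */
    public static boolean needsFix(java.util.List<Token> matched, int position, String expected) {
        return !matched.get(position - 1).whitespaceBefore.equals(expected);
    }

    public static void main(String[] args) {
        String nbsp = "\u00A0";
        System.out.println(needsFix(tokenize("3 km"), 2, nbsp));        // ordinary space before "km"
        System.out.println(needsFix(tokenize("3" + nbsp + "km"), 2, nbsp)); // non-breaking space before "km"
    }
}
```

With an ordinary space before the unit the check reports that a fix is needed; with a non-breaking space it does not. A real filter would receive the matched tokens from the rule pattern instead of tokenizing the text itself.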

jaumeortola · December 14, 2019, 3:21pm

There are a few other contexts in which this kind of rule would be used. It will be very useful, especially in French. See the French Wikipedia article "Espace insécable" (non-breaking space).

dnaber · December 14, 2019, 6:12pm

Yes, that sounds like a good idea.

jaumeortola · December 16, 2019, 1:25pm

Done here in the simplest way: https://github.com/languagetool-org/languagetool/commit/463d06cd6b14cfbe42c12a68802ca2a3bffcf906