Checking the white space character

I would like to write some advanced typographical rules that check which white space character is used in certain contexts.

To make the implementation usable across languages, each token should store the actual white space character that precedes it, alongside the existing “isWhitespaceBefore” flag. A shared rule filter could then check that character, and each language could write its own XML rules that use the filter.

What do you think? Is this approach reasonable?

Could you maybe post some examples of such rules? I’m not sure if I understood your implementation idea.

The XML rule would be something like this:

<rule id="USE_NBSP" name="require non-breaking space">
    <pattern>
        <token regexp="yes">\d+</token>
        <token postag="Y"></token> <!--units-->
    </pattern>
    <filter class="org.languagetool.rules.checkWhitespaceFilter" args="whitespaceChar:&#160; position:2"/>
    <message>Use a non-breaking space between number and units.</message>
    <suggestion>\1&#160;\2</suggestion>
    <example correction="3&#160;km"><marker>3 km</marker></example>
</rule>
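
To make the idea more concrete, here is a rough, self-contained Java sketch of the check such a filter might perform. It does not use the LanguageTool API: the class WhitespaceBeforeSketch, the Token class, tokenize() and hasWhitespaceBefore() are all hypothetical names invented for illustration, and the real tokenizer and filter would look different. It only shows each token remembering the white space before it, and a 1-based position check like the "position:2" argument above.

```java
import java.util.ArrayList;
import java.util.List;

public class WhitespaceBeforeSketch {

    /** Hypothetical token: its text plus the white space that preceded it ("" if none). */
    public static final class Token {
        public final String text;
        public final String whitespaceBefore;

        Token(String text, String whitespaceBefore) {
            this.text = text;
            this.whitespaceBefore = whitespaceBefore;
        }

        public boolean isWhitespaceBefore() {
            return !whitespaceBefore.isEmpty();
        }
    }

    /** Splits on white space and records which white space came before each token. */
    public static List<Token> tokenize(String sentence) {
        List<Token> tokens = new ArrayList<>();
        StringBuilder pendingWhitespace = new StringBuilder();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < sentence.length(); i++) {
            char c = sentence.charAt(i);
            // Character.isWhitespace() is false for non-breaking spaces,
            // so Character.isSpaceChar() is checked as well.
            boolean isSpace = Character.isWhitespace(c) || Character.isSpaceChar(c);
            if (isSpace) {
                if (word.length() > 0) {
                    // A word just ended: store it with the white space seen before it.
                    tokens.add(new Token(word.toString(), pendingWhitespace.toString()));
                    word.setLength(0);
                    pendingWhitespace.setLength(0);
                }
                pendingWhitespace.append(c);
            } else {
                word.append(c);
            }
        }
        if (word.length() > 0) {
            tokens.add(new Token(word.toString(), pendingWhitespace.toString()));
        }
        return tokens;
    }

    /** True if the token at the given 1-based position is preceded by exactly the expected white space. */
    public static boolean hasWhitespaceBefore(List<Token> tokens, int position, String expected) {
        return tokens.get(position - 1).whitespaceBefore.equals(expected);
    }

    public static void main(String[] args) {
        String nbsp = "\u00A0";
        // Ordinary space before "km": the check fails.
        System.out.println(hasWhitespaceBefore(tokenize("3 km"), 2, nbsp));            // false
        // Non-breaking space before "km": the check succeeds.
        System.out.println(hasWhitespaceBefore(tokenize("3" + nbsp + "km"), 2, nbsp)); // true
    }
}
```

Presumably a rule like the one above would report a match only when this check fails, i.e. when the white space before the second token is something other than the non-breaking space.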

There are a few other contexts in which this kind of rule would be used. It would be especially useful in French. See the “Espace insécable” article on the French Wikipedia.

Yes, that sounds like a good idea.

Done here in the simplest way: https://github.com/languagetool-org/languagetool/commit/463d06cd6b14cfbe42c12a68802ca2a3bffcf906