How to find the tab character?

Mike_Unwalla · June 8, 2019, 6:29pm

I would like to make rules that find the tab character in some contexts. Is it possible, and if yes, how?

This rule partly works:

<rule id="TAB_CHARACTER" name="Find a tab character">
    <regexp>&#x9;</regexp>
    <message>Found a tab</message>
    <short>Tab</short>
    <example correction="">A tab character: a<marker>&#x9;</marker>b between a and b.</example>
    <example>No tab character.</example>
</rule>

But testrules fails with this error:

Running pattern rule tests for English... Exception in thread "main" java.lang.AssertionError: English rule TAB_CHARACTER[1]:
"A tab character: ab between a and b."
Errors expected: 1
Errors found   : 0
Message: Found a tab
Analyzed token readings: [/SENT_START*] A[a/DT*,B-NP-singular]  [ /null*] tab[tab/NN,I-NP-singular]  [ /null*] character[character/NN:UN,E-NP-singular] :[:/:*,O]  [ /null*] ab[ab/null,B-NP-singular|E-NP-singular]  [ /null*] between[between/IN,B-PP]  [ /null*] a[a/NNP,B-NP-singular]  [ /null*] and[and/CC,I-NP-singular]  [ /null*] b[b/null,E-NP-singular] .[./.*,./SENT_END*,O]
Matches: []
Regexp:
        at org.junit.Assert.fail(Assert.java:88)

Also, the right-click menu does not work fully.

W3Schools (W3Schools Tryit Editor) shows three ways to represent the tab character in HTML. If I use &Tab;, testrules fails with this error:

Exception in thread "main" java.io.IOException: Cannot load or parse '/org/languagetool/rules/en/grammar.xml'
        at org.languagetool.XMLValidator.validateWithXmlSchema(XMLValidator.java:123)
        at org.languagetool.rules.patterns.PatternRuleTest.validatePatternFile(PatternRuleTest.java:214)
        at org.languagetool.rules.patterns.PatternRuleTest.runTestForLanguage(PatternRuleTest.java:146)
        at org.languagetool.rules.patterns.PatternRuleTest.runGrammarRulesFromXmlTestIgnoringLanguages(PatternRuleTest.java:141)
        at org.languagetool.rules.patterns.PatternRuleTest.main(PatternRuleTest.java:579)
Caused by: org.xml.sax.SAXParseException; lineNumber: 118; columnNumber: 26; The entity "Tab" was referenced, but not declared.
        at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
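
The `&Tab;` failure comes from XML itself rather than from LanguageTool. XML predefines only five named entities (`&lt;`, `&gt;`, `&amp;`, `&apos;`, `&quot;`), so an HTML name such as `&Tab;` has to be declared before the parser will accept it. The numeric references `&#x9;` and `&#9;` need no declaration, and both resolve to U+0009. Below is a minimal standalone sketch of such a declaration. It assumes the file is allowed to carry an internal DTD subset, and the rule id `TAB_ENTITY_SKETCH` is hypothetical. It is not a schema-valid grammar.xml.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE rule [
    <!-- Declare the HTML-style name; the replacement text is U+0009 (tab). -->
    <!ENTITY Tab "&#x9;">
]>
<!-- Hypothetical rule id, for illustration only. -->
<rule id="TAB_ENTITY_SKETCH" name="Find a tab character">
    <!-- After parsing, this element holds a single tab character,
         exactly as if &#x9; or &#9; had been written directly. -->
    <regexp>&Tab;</regexp>
    <message>Found a tab</message>
</rule>
```

A declaration like this would only remove the "entity was referenced, but not declared" parse error. The parsed rule is then identical to the `&#x9;` version above, so it would not by itself change whether the rule matches a tab in the checked text.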

TheICTLiker4 · July 8, 2019, 6:22pm

Just press the Tab key on the keyboard and you’ll be fine.

Mike_Unwalla · October 23, 2020, 1:12pm

@udomai, update 2020-10-23

I changed the rule to use the Unicode code point: Unicode Character 'CHARACTER TABULATION' (U+0009). Also, I used tokens rather than regexp. Unicode code points work fine in NON_STANDARD_ALPHABETIC_CHARACTERS.

<rule id="TAB_CHARACTER2" name="Find a tab character">
    <pattern>
        <token/>
        <token regexp="yes">\b(\u0009|\u043E)\b</token>
        <token/>
    </pattern>
    <message>Found a tab v2</message>
    <short>Tab</short>
    <example correction="">Cyrillic small <marker>letter о is</marker> found.</example>
    <example correction="">Tab character between <marker>two	words</marker>.</example>
    <example>No tab character.</example>
</rule>

Testrules fails with this error:
Exception in thread "main" org.languagetool.rules.patterns.PatternRuleTest$PatternRuleTestFailure: Test failure for rule TAB_CHARACTER2[1] in file /org/languagetool/rules/en/grammar.xml: "Tab character between twowords."
Errors expected: 1
Errors found : 0
Message: Found a tab v2

Note: the test output shows "twowords", with no tab between the two words.

LT does not ‘see’ the tab.

udomai · October 23, 2020, 2:31pm

@Mike_Unwalla, I think this might actually be worth an issue. Just to make sure this doesn’t cause any build or test problems…

Mike_Unwalla · October 23, 2020, 4:18pm

@udomai, there is an issue already: "Let the grammar rules find any Unicode character" (Issue #1755, languagetool-org/languagetool on GitHub).