A little help with rule syntax please

Irvine · August 18, 2014, 6:23am

Sorry to bother you again, but I am having problems with the syntax for ‘skip backwards’. I would like to find a comma followed by an adverb, and then look backwards up to 7 tokens for SENT_START, as in the following:

		 <antipattern>
			 <token postag=','></token>
			 <token postag='RB|RBS|RP|WH' postag_regexp='yes' skip="-7"></token>
			 <token postag='SENT_START|;|:' postag_regexp='yes'></token>
		 </antipattern>

In a different rule, which found an adverb after SENT_START and looked forward up to 7 tokens, the following worked perfectly:

		 <antipattern>
			 <token postag='SENT_START'> </token> 
			 <token postag='RB|RBS|RP|WH' postag_regexp='yes' skip="7"></token>
			 <token postag=','></token>
		 </antipattern>

What is the difference and what am I doing wrong?

dnaber · August 18, 2014, 6:57am

I don’t think negative skips are possible (other than -1, which is a special case for “no limit”). I’m not sure I understand what you’re trying to do, but what about this? It should match a sentence with a comma followed by a token with the given tags, not more than 7 tokens after the start of a sentence.

  <token postag='SENT_START' skip="7"> </token> 
  <token postag=','></token>
  <token postag='RB|RBS|RP|WH' postag_regexp='yes'></token>
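
For comparison, here is a minimal, untested sketch of the skip="-1" special case mentioned above. The words "if" and "then" are placeholders assumed purely for illustration; they are not part of the rule under discussion. The first token may be followed by any number of other tokens before the second one is found, up to the end of the sentence:

    <pattern>
        <!-- skip="-1": no limit on how many tokens may be skipped before the next token matches -->
        <token skip="-1">if</token>
        <token>then</token>
    </pattern>

A positive value such as skip="7" works the same way but caps the number of skipped tokens at 7. In both cases the search only runs forwards from the token that carries the attribute.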

Irvine · August 18, 2014, 10:52am

Not exactly what I had in mind. The antipattern is a simple test to distinguish between a phrase and a clause. In English, for the most part, introductory conjunctive adverbial (or adjectival) phrases are less than about 7 tokens long. After that, they tend to be clauses in their own right (it makes a kind of logical sense if you think about it).

If I just scan for a comma followed by an adverb, then the number of false positives is quite high. On the other hand, if I eliminate any hits that are preceded by a phrase no longer than 7 tokens, I might miss a few genuine hits, but I cut the number of false positives drastically. In fact, any surviving false positives are frequently indicative of some other type of error, e.g. a misplaced comma, a missing article or pronoun, or simply poor phrasing and punctuation.

Example 1:

If the adverb forms a conjunction between two clauses, then the comma should normally be replaced by a semicolon.

Two clauses:

A) the adverb forming a conjunction between two clauses
B) the comma being replaced

So, the corrected sentence reads:

If the adverb forms a conjunction between two clauses; then, the comma should normally be replaced by a semicolon.

Example 2:

Meanwhile, still beaming, Cindy dropped the plate.

Two introductory phrases with a single clause.
A) Meanwhile, an introductory adverbial phrase.
B) still beaming, an adjectival phrase that starts with the adverb ‘still’
C) Cindy dropped the plate, the main clause.

As a result, in this second example, even though the adverb ‘still’ follows a comma, it is not a conjunction between two clauses, and the comma is the correct punctuation. Actually, it would be completely wrong to use a semicolon in example 2.

In summary, what I am trying to do is find possible conjunctive adverbs and eliminate false positives by looking backwards up to 7 tokens for clausal punctuation, thus verifying that it is a conjunction of two clauses and not the conjunction of two phrases or of a phrase and a clause.

I hope that makes sense. It does to me, but at the moment I am so deeply involved in it that many of my friends politely reply: “Huh?”

dnaber · August 18, 2014, 11:22am

What about this? Not sure now why it doesn’t work when SENT_START is explicitly specified…

    <rule id="ID" name="my rule">
        <pattern>
            <token min="7" />
            <marker>
                <token>,</token>
                <token postag='RB|RBS|RP|WH' postag_regexp='yes'></token>
            </marker>
        </pattern>
        <message>My message</message>
        <example type="correct">Meanwhile, still beaming, Cindy dropped the plate.</example>
        <example type="incorrect">If the adverb forms a conjunction between two clauses<marker>, then</marker> the comma should normally be replaced by a semicolon.</example>
    </rule>

dnaber · August 18, 2014, 11:48am

BTW, LanguageTool comes with a testrules.sh (or testrules.bat for Windows) script that does some checks on the rule. It will notice illegal values like skip="-7".

Irvine · August 19, 2014, 6:38am

Thank you, I think I can work with that. Used with a logical “OR”, this could solve many problems.

Although at the moment I am concentrating on building a set of rules dealing with the special case of conjunctive adverbs, if I can minimise the number of false positives in a way that is limited to genuine punctuation errors, then the principles could be applied to a wide variety of other types of conjunction.

A couple of questions though:

The Development Overview page of the LanguageTool Wiki says: “Note than min only accepts values 0 or 1.” Has this now changed?

The Development Overview page of the LanguageTool Wiki mentions a logical “OR”. I have found some examples in my grammar.xml file, but could you explain whether I have to “OR” patterns, rules or rule segments? For example, should this work in principle?

    <pattern>
        <token min="7" />
        <marker>
            <token postag=','></token>
            <token postag='RB|RBRR|RBS|WH' postag_regexp='yes'></token>
        </marker>
        <or>
            <token postag=';|:' postag_regexp='yes'></token>
            <token min="7" />
            <marker>
                <token postag=','></token>
                <token postag='RB|RBRR|RBS|WH' postag_regexp='yes'></token>
            </marker>
        </or>
    </pattern>

Thanks again for your help

Irvine

dnaber · August 19, 2014, 7:34am

Yes, thanks for letting us know. I have updated the documentation now.

You can only use “or” around “token” elements, so it cannot wrap a marker, a whole pattern or a rule. Your (XML) editor should also indicate an error if you try something else, and if you use code completion, it should only suggest “token” inside “or”.
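
As a rough, untested sketch of what that looks like: the example below assumes “or” may appear directly inside “pattern”, and it uses the literal word “so” as a purely hypothetical alternative to the POS-tag test. The two tokens inside “or” are alternatives for the same position, so the pattern matches a comma followed by either of them:

    <pattern>
        <token min="7" />
        <token>,</token>
        <or>
            <!-- alternative 1: any token carrying one of the adverb tags used earlier in this thread -->
            <token postag='RB|RBS|RP|WH' postag_regexp='yes'/>
            <!-- alternative 2: a literal word, chosen only for illustration -->
            <token>so</token>
        </or>
    </pattern>

Alternatives that differ only in their POS tag can stay in a single token with postag_regexp='yes', as in the rules above; “or” is mainly useful when the alternatives test different things. Running testrules.sh on the file should confirm whether the structure is accepted.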