Phrases as regular expressions

I am looking at my problem of wordiness in a slightly different way and trying to build a set of punctuation rules. At the moment, I am working on conjunctive adverbs (see http://en.wikipedia.org/wiki/Conjunctive_adverb).

It’s still a work in progress, but currently I have two rules that test well: the first scans for a comma followed by an adverb, while the second scans for a comma followed by one of a list of adverbs commonly used as conjunctions. The problem is that many phrases are also used as adverbs to join clauses. (They are called ‘conjunctive adverbial phrases’, ugh!)

For example: |on the other hand|in fact|as a result|in comparison|in contrast|just as|in addition|that is|…

Is there a way to turn such a list of phrases into a regular expression that can be used in a single rule?

Also, is there a limit to how many tokens can be used in a regular expression? For testing purposes I am just working with the examples from Wikipedia, but I have a spreadsheet with a fuller list of over 100 entries (both adverbs and phrases) that combines the Wikipedia and Cambridge dictionary examples.

Irvine

For information only, the two rules I have so far are:

<rule id="conjunctive_adverbs_1" name="punctuation_of_conjunctive_adverbs_1">    
	<pattern>
		<marker>
			<token regexp='yes'>,</token>
			<token postag='RB|RBR|RBS' postag_regexp='yes'></token>
		</marker>
	</pattern>
	<message>Adverbs, when used as a conjunction between two clauses of a sentence, are normally preceded by a semicolon.</message>
	<url>http://en.wikipedia.org/wiki/Conjunctive_adverb</url>
	<short>Adverbs, when used as a conjunction between two clauses, are normally preceded by a semicolon.</short>
	<example type='incorrect'>He can leap tall buildings in a single bound<marker>, furthermore</marker>, Dwight Schrute is a hog.</example>
	<example type='correct'>He can leap tall buildings in a single bound; furthermore, Dwight Schrute is a hog.</example>
	<example type='incorrect'>Oh, there's a butterfly.</example>
	<example type='correct'>Oh. There's a butterfly.</example>
</rule>




<rule id="conjunctive_adverbs_2" name="punctuation_of_conjunctive_adverbs_2">    
	<pattern>
		<marker>
			<token>,</token>
			<token regexp='yes'>accordingly|additionally|again|almost|although|anyway|besides|certainly|comparatively|consequently|contrarily|conversely|elsewhere|equally|eventually|finally|further|furthermore|hence|henceforth|however|incidentally|indeed|instead|likewise|meanwhile|moreover|namely|nevertheless|next|nonetheless|notably|now|otherwise|rather|similarly|still|subsequently|then|thereafter|therefore|thus|undoubtedly|uniquely</token>
		</marker>
	</pattern>
	<message>Conjunctive adverbs should be preceded by a semicolon.</message>
	<url>http://en.wikipedia.org/wiki/Conjunctive_adverb</url>
	<short>Conjunctive adverbs should be preceded by a semicolon.</short>
	<example type='incorrect'>He can leap tall buildings in a single bound<marker>, furthermore</marker>, Dwight Schrute is a hog.</example>
	<example type='correct'>He can leap tall buildings in a single bound; furthermore, Dwight Schrute is a hog.</example>
</rule>

They are not finished yet, nor tested in my OpenOffice installation, and they still require supplementary rules to catch missing terminating commas.

We have the rarely used “phrases” element, which you could try using to define the phrases, and then “phraseref” to refer to them. See the Polish rules for an example.

https://raw.githubusercontent.com/languagetool-org/languagetool/master/languagetool-language-modules/pl/src/main/resources/org/languagetool/rules/pl/grammar.xml

Rules can sometimes be merged by using the “…” tags around tokens, although I’m not sure if that works well together with “phraseref”.

No, there is no limit. In theory, large regular expressions might get slow, but this is not something to worry about at this stage of development.
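
If the full list lives in a spreadsheet, it may be easier to generate the token markup than to type it by hand. Below is a rough Python sketch, not part of LanguageTool itself. The function names (`split_entries`, `word_token`, `phrase_tokens`) and the sample list are made up for illustration. It assumes, as far as I understand the pattern syntax, that each `<token>` matches a single word, so the single-word adverbs can share one regexp token while every phrase needs one token per word.

```python
import re
from xml.sax.saxutils import escape


def split_entries(entries):
    """Separate single-word adverbs from multi-word phrases."""
    words, phrases = [], []
    for entry in entries:
        tokens = entry.lower().split()
        if len(tokens) == 1:
            words.append(tokens[0])
        elif tokens:
            phrases.append(tokens)
    # Deduplicate and sort the single words so the output is stable.
    return sorted(set(words)), phrases


def word_token(words):
    """Build one regexp token that matches any of the single-word adverbs."""
    alternation = "|".join(re.escape(word) for word in words)
    return f"<token regexp='yes'>{escape(alternation)}</token>"


def phrase_tokens(phrase):
    """Build one plain <token> element per word of a phrase."""
    return [f"<token>{escape(word)}</token>" for word in phrase]


# Hypothetical sample; in practice this would be exported from the spreadsheet.
entries = ["however", "therefore", "on the other hand", "in fact", "as a result"]

words, phrases = split_entries(entries)
print(word_token(words))
for phrase in phrases:
    print("\n".join(phrase_tokens(phrase)))
    print()
```

The printed token sequences still have to be wrapped in a pattern by hand, for example one rule per phrase, or one “phrase” definition each if the “phraseref” route works out.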