Sentence Fragment rule, completed and tested

Irvine · September 16, 2014, 2:11pm

I have attached a zip file containing a rule that searches for “sentence fragments”, plus the full test results (see below for a summary).

Finding 52 errors in 311 articles may seem excessive, but while the rule only looks for sentence fragments, it touches upon one of the least understood aspects of sentence construction in English: subordinate conjunctions and their relationship to an independent clause. I am working on a fuller set of rules that will deal specifically with conjunctions of all types.

Unfortunately, pinning the exact location of an error is nearly impossible; the rule can only say that the error exists. The error message reflects this:

R0.1B: This can be a subtle error: “\2 \3” introduce a subordinate clause; however, there is neither punctuation nor coordinating conjunction to indicate a main clause. Alternatively, if the SC is subordinate to the preceding sentence, the two sentences should be joined. A subordinate conjunction ending a sentence would normally not be punctuated: Though a colon may be used for emphasis. Another possibility, with narrative or rhetorical styles, is a missing question or exclamation mark. Finally, correct punctuation is very sensitive to phrasing.

I hope you find this useful.

Irvine

Summary of results:
“Rule 0.1 Sentence fragments”. The test sample was 311 articles from Wikipedia: 11.5 MB excluding footnotes and references.

There are 97 recorded errors, of which 22 were caused by converting tables into pure text, 12 were quotes (2 of which were deliberate puns), and 5 were subheadings. 1 error was caused by an abbreviation with an incorrect full stop (I marked the appropriate link), 1 error was caused by the text conversion being unable to handle subscripts, and another error was an intentional example. This leaves a total of 55 errors. Of these, 3 involved mathematical texts, the problem being that it is fairly common, when writing mathematical formulae, to spread a sentence over several lines. I suggested bodge solutions, but the truth is, my punctuation rules do not handle this type of notation very well.

The other 52 hits are all valid punctuation errors, and I have inserted the corrected sentence below each error report. Often, the error is tied up with poor phrasing and sentence construction (using ‘That’ instead of ‘This’ is very common). Where this happens, I have offered a basic correction, followed by a fuller correction of the entire passage.
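
The tally above can be checked with a few lines of arithmetic. The sketch below only re-adds the figures quoted in the summary; the category labels are shorthand, not names taken from the test report.

```python
# Figures quoted in the summary; the dictionary keys are shorthand labels.
recorded = 97
excluded = {
    "tables converted to text": 22,
    "quotes": 12,
    "subheadings": 5,
    "abbreviation with full stop": 1,
    "subscripts lost in conversion": 1,
    "intentional example": 1,
}
mathematical = 3

# Errors left after removing the conversion and quotation artefacts.
remaining = recorded - sum(excluded.values())
# Hits that are treated as valid punctuation errors.
valid_hits = remaining - mathematical

print(remaining)   # 55
print(valid_hits)  # 52
```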

dnaber · September 19, 2014, 1:59pm

Thanks, I have pointed our mailing list to this thread, hoping some native speaker can comment. BTW, where does that name “R0.1B” come from? Is that an identifier from some style guide?

Irvine · September 19, 2014, 3:20pm

No, it is for debugging!

I have noticed that some of the more sophisticated rules, for example the one for “missing articles”, have as many as twenty or thirty sub-rules forming rule-groups. Unfortunately, English is a very dynamic language and, depending on fashion, we can use nouns as verbs. This sometimes leads to annoying error messages, such as with “on autopilot”: the error message is technically correct and the article would normally be missing, except that it’s being used as a verb.

Anyway, trying to figure out which missing article rule is causing the problem is an exercise in futility: they all use the same error message and there are twenty-nine of them! I would rather my rules were useful and easily repaired, hence each error message has an identifying code which, at the moment, is:

0.- : means it is tidying up basic punctuation by finding common errors based on word lists.

1.- : means it is working directly with PoS and coordinating conjunctions

2.- : means it is working directly with PoS and subordinate conjunctions

3.- : means it is working directly with PoS and conjunctive adverbs

The second number identifies the particular rule-group, and the letter code identifies the sub-rule inside the rule-group. I have most of the rules roughly mapped out; it is just testing and refining that is taking the time.
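
To illustrate how such a code breaks down, here is a minimal Python sketch that splits a marker like “R0.1B” into its three parts. The layout it assumes (“R”, category number, dot, rule-group number, one capital letter) is inferred from the examples in this thread, and the names `RuleCode` and `parse_rule_code` are made up for the illustration; nothing here is part of LT.

```python
import re
from typing import NamedTuple

# Category meanings as listed in the post.
CATEGORIES = {
    0: "basic punctuation, common errors based on word lists",
    1: "PoS and coordinating conjunctions",
    2: "PoS and subordinate conjunctions",
    3: "PoS and conjunctive adverbs",
}


class RuleCode(NamedTuple):
    category: int   # first number
    group: int      # second number: the rule-group
    sub_rule: str   # letter: the sub-rule inside the rule-group


# Assumed layout: "R", category digits, ".", rule-group digits, one capital letter.
CODE_RE = re.compile(r"R(\d+)\.(\d+)([A-Z])")


def parse_rule_code(code: str) -> RuleCode:
    """Split a debugging marker such as 'R0.1B' into its parts."""
    match = CODE_RE.fullmatch(code)
    if match is None:
        raise ValueError(f"not a rule code: {code!r}")
    return RuleCode(int(match.group(1)), int(match.group(2)), match.group(3))


code = parse_rule_code("R0.1B")
print(code.category, code.group, code.sub_rule)
print(CATEGORIES[code.category])
```

With a marker like this in every message, a false alarm can be traced to one sub-rule without reading through the whole rule-group.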

By the way, if I eventually get them all finished and want to support the rules, is there any way I can get user feedback on errors or annoyances?

Irvine

milek_pl · September 19, 2014, 4:12pm

I beg to differ. It’s used after a preposition, and I have yet to see a verb after a preposition. It’s definitely a noun.

It’s just a set phrase, or rather an idiomatic phrase:

Best,
Marcin

Irvine · September 19, 2014, 4:24pm

I gracefully bow to your greater knowledge, though it is still an annoying error message.

dnaber · September 19, 2014, 8:27pm

Users can report issues here in the forum or in the bug tracker (Issues · languagetool-org/languagetool · GitHub). You might want to follow both (although the stuff in the bug tracker is mostly very technical).

dnaber · September 20, 2014, 8:20am

The next release of LT is planned for 2014-09-29. I cannot promise that your changes will make it into that version, but could you confirm that we can release the rules and your future contributions under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 2.1 or later? (GNU Lesser General Public License v2.1 - GNU Project - Free Software Foundation)

Irvine · September 20, 2014, 9:48am

I always assumed they would be released under the same GNU licence as LT, so I have no problems with that.

For the record: the “Sentence Fragment rule” referred to above is released under the GNU Lesser General Public License as published by the Free Software Foundation; either version 2.1 or later.

I will put this statement into any further rules I post.

Irvine

milek_pl · September 21, 2014, 1:27pm

I have already fixed the false alarm in our repository (before even replying to your post).

Best regards,
Marcin

Irvine · September 21, 2014, 2:16pm

It was not meant as a criticism of the rules; it was just an example of how difficult it can be to track down the source of an error message. I find the rules in question to be some of the most useful in LT and a source of great inspiration. Could you point me at a decent tutorial/list of regular expressions as used by your rules in LT, and, if possible, some working examples of how filters are used?

Irvine

Mike_Unwalla · September 21, 2014, 2:38pm

These rules are useful. Thank you.

In Full-Test-Wikipedia–Rule-0.1–Sentence-Fragment.txt, some sentences are incorrectly identified as subordinate clauses. Examples:
That’s the sort of thing we were trained to do. [1]
That internal state is initially set up using the secret key material.
That bill died in committee.
Since then the economy of Dresden has been recovering.
Since the demise of these dams the Colorado River has carved a maximum of about 160 feet (49 m) into the rocks of the Colorado Plateau

From a grammatical perspective only, the first 3 examples have no errors. For the last 2 examples, a comma is necessary after the adverbial phrase [2]. Thus, I suggest that you change the error message to something like this: “X” appears to introduce a subordinate clause…

Irvine wrote: 1 error was caused by an abbreviation with an incorrect full stop, (I marked the appropriate link,)
I do not agree that the abbreviation has an incorrect full stop. The link (Oxford Languages | The Home of Language Data) gives guidelines. The use (or not) of full stops in abbreviations is a style preference.

[1] The Global English Style Guide (http://support.sas.com/publishing/authors/kohl.html), section 5.2, recommends that ‘that’ is not used as a pronoun, but nevertheless, the sentence is not grammatically incorrect.
[2] The Economist Style Guide says that a comma is not necessary (http://www.economist.com/style-guide/commas).

Irvine · September 21, 2014, 4:36pm

Thanks for the positive review, I also feel they are useful.

I do not disagree with your criticisms of the first three examples, though there was an implicit assumption in the offered correction that the error should be more properly considered an independent clause of either the preceding or following sentence. I stated this explicitly with the correction for “That bill died in committee.” As you pointed out, it may be more a question of style, but I think the corrected passages read a lot easier.

The biggest problem is: It is difficult to pin down the exact nature of the error. The rule only says that an error of some kind is highly likely. The other conjunction rules I am working on have a similar problem; they catch errors, but where and why can be very difficult to explain. Often, the errors are more to do with lousy sentence construction than anything else.

I have been sweating blood trying to succinctly reflect this in the error messages. By the way, you do realise that, to make navigating the error report easier, I cut out the bulk of the message? In retrospect, this was probably a mistake (it made sense at the time). The full error message, as reported above, is currently:

R0.1B: This can be a subtle error: “\2 \3” introduce a subordinate clause; however, there is neither punctuation nor coordinating conjunction to indicate a main clause. Alternatively, if the SC is subordinate to the preceding sentence, the two sentences should be joined. A subordinate conjunction ending a sentence would normally not be punctuated: Though a colon may be used for emphasis. Another possibility, with narrative or rhetorical styles, is a missing question or exclamation mark. Finally, correct punctuation is very sensitive to phrasing.

I would be grateful for feedback on how I can improve this.

Irvine

Irvine · September 22, 2014, 6:30am

I have given more thought to the basic outline of the error message. Two variations:

R0.1A: This can be a subtle error; when “\2” is used to begin a sentence there is an implication of a 2nd clause; however, the rule can find no punctuation or coordinating conjunction. The exact cause may be poor style, but is more likely to be a hard grammatical error, such as: A missing comma, question or exclamation mark. Another possibility is splitting a sentence with closely linked clauses; for example, using a full stop instead of a colon. Also, the error can be caused by poor phrasing, or excessive and unclear wordage. Confusing closely related words may also be a cause of this error, e.g. using 'that' instead of 'this'.

R0.1B: This can be a subtle error; when “\2 \3” is used to begin a sentence there is an implication of a 2nd clause; however, the rule can find no punctuation or coordinating conjunction. The exact cause may be poor style, but is more likely to be a hard grammatical error, such as: A missing comma, question or exclamation mark. Another possibility is splitting a sentence with closely linked clauses; for example, using a full stop instead of a colon. Also, the error can be caused by poor phrasing, or excessive and unclear wordage.

I would be grateful for your comments before I submit a revised set of rules.

Irvine

Mike_Unwalla · September 22, 2014, 7:45am

I think that the new messages are better than the original messages.

Will the audience understand the terms ‘clause’ and ‘coordinating conjunction’?

In a previous message, you wrote:
The biggest problem is: It is difficult to pin down the exact nature of the error. The rule only says that an error of some kind is highly likely. The other conjunction rules I am working on have a similar problem; they catch errors, but where and why can be very difficult to explain.

As an alternative to using Finding and Fixing Fragments | Grammar Bytes! as the URL, you could write a web page that is specific to the rules. In the error message, give each possible error a number (or a letter), and on the web page, give a detailed explanation and examples.

dnaber · September 22, 2014, 12:19pm

I have committed these rules now, so we still have one week to improve them before we release LT 2.7. I had to remove your marker (“R0.1H”) from the messages, as these would confuse users. Other than that, I only made small formatting changes (e.g. no line breaks after “”, just putting the tags and the sentence in the same line).

To make updates easier for me, it would be nice to get only the changes to the rules, not the complete rules. Have you used GitHub before, or are you familiar with the “diff” and “patch” commands?

About the message: should the short message maybe simply be “Sentence fragment or missing comma”?

Irvine · September 22, 2014, 6:36pm

Sentence-fragment.diff (40.8 KB)

Attached is a diff file with the new messages. It was not made on a *nix system, but using XP with GnuWin32.

To get the diff file I used:

"C:\Program Files\GnuWin32\bin\diff.exe" -Naur old new > Sentence-fragment.diff
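
For readers who have not seen the format: in the command above, -u asks GNU diff for a unified diff, -r recurses into directories, -N treats missing files as empty, and -a treats all files as text. The sketch below shows roughly what a unified diff looks like, using Python's standard difflib module on two made-up three-line files. The file names and contents are placeholders for illustration, not the real rule file.

```python
import difflib

# Placeholder contents, for illustration only; not taken from the real rules.
old = [
    "first line, unchanged\n",
    "short message: old wording\n",
    "last line, unchanged\n",
]
new = [
    "first line, unchanged\n",
    "short message: new wording\n",
    "last line, unchanged\n",
]

# unified_diff yields the header lines, a hunk marker, then the hunk body.
diff = list(
    difflib.unified_diff(old, new, fromfile="old/rules.txt", tofile="new/rules.txt")
)
print("".join(diff))
```

Lines starting with “-” are removed, lines starting with “+” are added, and the unmarked lines are context that lets the patch command find the right place.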

If there are any problems, let me know.

Changes:
1) I have changed the main error message, as outlined.

2) The short message I changed to: “Punctuation error, sentence looks like a fragment”. I didn’t want to be too specific; there can be a variety of causes, though they are all related to punctuation.

3) I reformatted it in line with your suggestions, though in Notepad++ this makes code-folding useless. Is this your preferred format for rules?

4) I made a slight change to the pattern (it makes absolutely no difference to the previous results). Originally, after the key word or phrase, I skipped a token before searching for punctuation. My reasons were fuzzy, and, while I have not found any specific examples to say this was wrong, I realized I was in danger of catching a totally different set of errors which really need their own separate rules. For example: “After, we went to dinner.” The sentence is incorrect, but it is not the error I’m looking for. The change means this would no longer be flagged by the sentence fragment rule.

As far as GitHub goes, I am familiar with the idea, but have never used it. I use Subversion and TortoiseSVN to save download time when updating some of my other programs and am looking into installing Maven, though I find the instructions are not that clear. So, I will get back to you on that one.

Irvine

PS
On the subject of testing the revisions to ‘skip’ with an antipattern, I have just finished downloading the revised snapshot. Unfortunately, I have other commitments and will not be able to test it for a few hours.

dnaber · September 22, 2014, 7:31pm

GitHub can also be used from SVN: you can check out the languagetool-org/languagetool repository (GitHub - languagetool-org/languagetool: Style and Grammar Checker for 25+ Languages) with SVN. It would be great if you could make future diffs against that SVN checkout, as that makes it easier for me to apply the changes.

dnaber · September 22, 2014, 9:20pm

The results of the automatic tests are now available at https://languagetool.org/regression-tests/20140922/result_en_20140922.html

Most matches are from Tatoeba, which uses a less formal style than Wikipedia. If we want the rule to be enabled by default, we should try to fix some of those matches like “That’s too much.” and “That’s creepy.” and probably others.

Irvine · September 23, 2014, 6:28am

I see what you mean. It is actually very difficult to talk about correct punctuation when the sentence stands in isolation. The fact that many of these examples are common in informal writing styles, while style guides suggest they are best avoided in formal prose, complicates the issue even more.

I think the best thing to do is modify the rule so that it excludes ‘that’ when it is used as a pronoun, or at least with conjugations of ‘to be’.

For easy reference, I have isolated the occurrences of ‘that’ in the report, and you can see such a change will take out the vast majority of hits. Later today, I will see about isolating the other errors, and how they relate to the problem.

That won’t happen
That’s the stupidest thing I’ve ever said.
That is somewhat explained at the end.
That is intriguing.
That wasn’t my intention.
That was the best day of my life.
That’s the absolute truth.
That way I kill two birds with one stone.
That sounds interesting. What did you tell her?
That’s the thing about people who think they hate computers. What they really hate is lousy programmers.
That’s the snag.
That is rather unexpected.
That was probably what influenced their decision.
That was an evil bunny.
That’s very sweet of you.
You’ve got it in one. That’s right.
That person may find another person willing to donate a substantial sum.
That is no business of yours.
Your opinion is off the mark. That’s plain to anyone.
That part of the brain is called the posterior superior temporal cortex (pSTC).
That you don’t believe me is a great pity.
That was well worth the trouble.
That sort of thing can happen when you are in haste.
That church on the hill is very old.
That was a close call.
That which is easily acquired is easily lost.
That Art Thou (1945)
That Art Thou II (1945)
That won’t change anything.
That’s Tom’s house with the red roof.
That boy who is speaking English is taller than I.
That’s something like a movie scenario.
That’s a class act.
That might be the most painful experience in my life.
That’s the kind of person I want you to become.
That’s just a shot in the dark. How do you think you’ll succeed by just acting on…
That’ll do you no good.
That which is evil is soon learned.
That’s the part I liked best.
That would be lovely.
That prediction was claimed confirmed by observations made by a British expedition led by Sir Arthur Eddington during the solar eclipse of 29 May 1919.
That’s too much.
That’s my name.
That’s about it.
That is well said.
That very tune reminded me of my adolescence.
That’s fairly reasonable.
That is not my idea of him.
That’s the way the ball bounces.
That’s a little out of focus.
That last comment was pushing it.
That’s how I feel.
That’s creepy.
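
The narrower version of the exclusion proposed above (skip ‘that’ when it is followed by a form of ‘to be’) can be sketched as a simple pre-filter. This is a plain Python regular expression, not LanguageTool rule syntax, and the word list (is, was, ’s and their negations) is an assumption; an actual rule would presumably rely on PoS tags instead. The function name `is_that_plus_be` is invented for the example.

```python
import re

# Sentence-initial "That" followed by "'s", "is", "was", "isn't" or "wasn't".
# Both straight and typographic apostrophes are accepted.
THAT_PLUS_BE = re.compile(r"That(?:[’']s|\s+(?:is|was)(?:n[’']t)?)\b")


def is_that_plus_be(sentence: str) -> bool:
    """True if the sentence opens with 'That' plus a form of 'to be'."""
    return THAT_PLUS_BE.match(sentence) is not None


# A few sentences taken from the list above and from earlier in the thread.
sentences = [
    "That’s too much.",
    "That is intriguing.",
    "That wasn’t my intention.",
    "That was a close call.",
    "That won’t happen",
    "That bill died in committee.",
    "That way I kill two birds with one stone.",
]

# Sentences the fragment rule would still look at after the exclusion.
still_checked = [s for s in sentences if not is_that_plus_be(s)]
for sentence in still_checked:
    print(sentence)
```

Modal forms such as “That won’t” or “That would be” pass straight through this filter, so the broader pronoun case would still need separate handling.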

Irvine · September 23, 2014, 6:51am

By the way, what do you want the diff file against? The entire English language rule set, or just the sentence fragment rule?

Ideally, since I am not a frequent user of patch and diff, if you could specify exactly what options (e.g. -Naur) you want, that would be a great help.

Irvine