English:PRP_MD_CD_IN missing verb

PeterLawrence · June 7, 2015, 8:27pm

Here’s a rule for spotting a missing verb in a PRP, MD, CD, IN tag sequence.
I’ve tested this rule on the rather large enwiki-latest-pages-articles.xml file.
So far it has been tested on over 46 million sentences over the past couple of days, before I had to reboot the machine.
How long does it normally take to run this test? Also, is there a more efficient method, since this does seem rather excessive?

<rule id="PRP_MD_CD_IN" name="PRP_MD_CD_IN missing verb">
  <pattern>
    <token postag='PRP'><exception scope="previous" postag="IN|VBN" postag_regexp="yes"></exception></token>
    <marker>
      <token postag='MD'></token>
      <token postag='CD'></token>
    </marker>
    <token postag='IN'></token>
  </pattern>
  <message>Did you mean <suggestion><match no="2"/> use <match no="3"/></suggestion> or <suggestion><match no="2"/> be <match no="3"/></suggestion>?</message>
  <example correction='could use one|could be one'>We <marker>could one</marker> of them.</example>
  <example correction='can use two|can be two'>We <marker>can two</marker> of these.</example>
  <example correction='may use one|may be one'> That she <marker>may one</marker> of the Tyr-Ridan</example>
  <example type="correct">I'm not convinced we need one for that.</example>
  <example type="correct">Any more than I would one of NAFTA medals.</example>
</rule>
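
For anyone who wants to see what the pattern does without loading LanguageTool, here is a rough Python sketch of the same matching logic. It is only an illustration: the tokens are tagged by hand rather than by LanguageTool's tagger, and `find_missing_verb` is a made-up helper name, not part of any LanguageTool API.

```python
import re

# Hypothetical helper, not LanguageTool code. Each token is a pair
# (word, set_of_POS_tags); the tags are assigned by hand for this sketch.
def find_missing_verb(tokens):
    """Return (first_marked, last_marked, suggestions) per PRP MD CD IN match."""
    blocked = re.compile(r"IN|VBN")  # same regexp as the rule's exception
    hits = []
    for i in range(len(tokens) - 3):
        prp, md, cd, prep = tokens[i:i + 4]
        if not ("PRP" in prp[1] and "MD" in md[1]
                and "CD" in cd[1] and "IN" in prep[1]):
            continue
        # Mirrors <exception scope="previous">: skip the match when the
        # token before the pronoun is tagged IN or VBN.
        if i > 0 and any(blocked.fullmatch(tag) for tag in tokens[i - 1][1]):
            continue
        # Only MD and CD sit inside <marker>, so report their indices.
        suggestions = [f"{md[0]} use {cd[0]}", f"{md[0]} be {cd[0]}"]
        hits.append((i + 1, i + 2, suggestions))
    return hits


incorrect = [("We", {"PRP"}), ("could", {"MD"}), ("one", {"CD"}),
             ("of", {"IN"}), ("them", {"PRP"}), (".", {"."})]
correct = [("Any", {"DT"}), ("more", {"JJR"}), ("than", {"IN"}),
           ("I", {"PRP"}), ("would", {"MD"}), ("one", {"CD"}),
           ("of", {"IN"}), ("NAFTA", {"NNP"}), ("medals", {"NNS"}), (".", {"."})]

print(find_missing_verb(incorrect))
print(find_missing_verb(correct))
```

The second sentence contains the PRP MD CD IN sequence too, but "than" is tagged IN, so the exception on the previous token suppresses the match. That is why it can serve as a "correct" example in the rule.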

dnaber · June 8, 2015, 6:41am

Thanks, I’ve added the rule. There’s no need to run the test on the entire Wikipedia; I usually stop after 50,000 sentences or so.
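
If you prepare your own test corpus, one simple way to cap it is to take only the first N sentences from whatever source you read. The snippet below is a minimal sketch of that idea in Python, assuming the input is any iterable of sentence strings (for example, a file with one sentence per line); `sample_sentences` is a hypothetical helper and not one of LanguageTool's own tools.

```python
from itertools import islice

# Hypothetical helper: assumes `sentences` is any iterable of strings,
# e.g. an open text file with one sentence per line.
def sample_sentences(sentences, limit=50_000):
    """Return an iterator over at most `limit` non-empty, stripped sentences."""
    non_empty = (s.strip() for s in sentences if s.strip())
    # islice stops reading the source as soon as the limit is reached,
    # so a very large dump is never read in full.
    return islice(non_empty, limit)


demo = ["We could one of them.\n", "\n", "  We can two of these.  ", "Third."]
print(list(sample_sentences(demo, limit=2)))
```

Because the helper is lazy, passing it an open file handle reads only as many lines as are needed to collect the requested number of sentences.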