Help on rule with optional token

BebraiPuola · May 3, 2016, 9:22am

Can someone explain to me why this rule doesn’t find the expected error in ‘L is an extreme points.’?

<rule id="IS_A_AN_PLURAL" name="IS_A_AN_PLURAL">
 <pattern>
  <token postag='VBZ'></token>
  <token regexp='yes' postag='DT' chunk='B-NP-plural'>a|an</token>
  <token min='0' max='3'></token>
  <marker>
  <token chunk='E-NP-plural'><exception postag='NN'></exception></token>
  </marker>
 </pattern>
 <message>Please check verb-subject agreement. Verb: "<match no="1"/>", subject: "<match no="4"/>"</message>
 <example correction=''>L is an extreme <marker>points</marker>.</example>
 <example>L is an extreme point.</example>
</rule>

Thanks

dnaber · May 3, 2016, 9:33am

You can use the Text Analysis page on the LanguageTool site to see how a sentence gets analysed. In this case, “extreme points” gets tagged incorrectly because of disambiguation rules (from disambiguation.xml), which the site shows under “Disambiguator log”.

BebraiPuola · May 3, 2016, 9:56am

But shouldn’t this match anyway? ‘extreme’ should fall within <token min='0' max='3'></token>, and ‘points’ has chunk='E-NP-plural', which is what I am looking for.

dnaber · May 3, 2016, 10:04am

You’re right - does it work if you use <token></token> instead of <token min='0' max='3'></token>?

BebraiPuola · May 3, 2016, 10:07am

Yes, it does match…

dnaber · May 3, 2016, 10:16am

You could submit a bug report in the languagetool-org/languagetool issue tracker on GitHub, but I wouldn’t hold my breath for this to get fixed. That part of the code is rather convoluted.

oriain · May 7, 2016, 7:41am

The two sentences and two examples for the min and max attributes from the Development Overview wiki page don’t give a lot of info, but I think the max attribute is the reason why the rule does not match the test sentence.

Max appears to operate greedily. I am relatively new to LanguageTool, but I think the max attribute tells LT to try to match up to the next three words in a sentence with the token in question. In this case, “is” matches the POS tag “VBZ”. Then “an” matches via the regular expression. The blank token then matches both “extreme” and “points,” which leaves nothing for the final token and ruins the rest of the pattern.

If you try your rule with the following sentence, you should see that it matches: “This is an extremely large onerous points.” But if you try it again with a slightly longer sentence, the rule once again loses its match: “This is an extremely large onerous tiresome points.”

oriain · May 7, 2016, 3:39pm

If you add chunk='I-NP-plural' to the third token in the pattern, you can keep it from matching the last token of the plural noun phrase. This won’t cover noun phrases longer than five tokens (1 article + 0-3 adjectives + 1 noun), but it’s still a small improvement.
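
As a sketch of that suggestion, the rule from the first post could be adjusted as follows. This is untested. It assumes the chunker tags the words between the article and the final noun as I-NP-plural, which the Text Analysis page can confirm for a given sentence. Only the chunk attribute on the third token and the comment are new; everything else is copied from the original rule.

```xml
<rule id="IS_A_AN_PLURAL" name="IS_A_AN_PLURAL">
 <pattern>
  <token postag='VBZ'></token>
  <token regexp='yes' postag='DT' chunk='B-NP-plural'>a|an</token>
  <!-- Assumption: words inside the noun phrase carry chunk I-NP-plural,
       so this optional token cannot consume the final E-NP-plural noun. -->
  <token min='0' max='3' chunk='I-NP-plural'></token>
  <marker>
  <token chunk='E-NP-plural'><exception postag='NN'></exception></token>
  </marker>
 </pattern>
 <message>Please check verb-subject agreement. Verb: "<match no="1"/>", subject: "<match no="4"/>"</message>
 <example correction=''>L is an extreme <marker>points</marker>.</example>
 <example>L is an extreme point.</example>
</rule>
```

If the disambiguator assigns different chunk tags to the words in the middle of the phrase, the optional token will not match them and the rule will miss the sentence, so it is worth checking the analysis output before relying on this.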