Hi, I’m writing a rule which catches an error if “these” is followed by a singular expression.
For example…
<rule id="THESE-NN" name="These singular">
<pattern>
<marker>
<token>these</token>
</marker>
<token postag='NN'><exception postag='NNS|CD' postag_regexp='yes'></exception></token>
<token postag='MD'></token>
</pattern>
<message>Did you mean <suggestion>this</suggestion>?</message>
<example type="incorrect" correction='This'><marker>These</marker> author should rewrite this point.</example>
<example type="incorrect" correction='This'><marker>These</marker> writer should rewrite this point.</example>
<example>These two should do ok.</example>
</rule>
However, for some reason the word “author” doesn’t get assigned a part-of-speech tag if it is preceded by the word “these”.
I’ve checked the rulebase and found a few other rules which deal with “This” and “These”, for example ‘this’ vs. ‘these’ and these/those ones (these/those), but nothing which could affect the word “author”. Any idea what’s special about the word “author”?
This seems to be caused by rule DT_plural_VBNNN_VBN in disambiguation.xml. You can find that out easily by running the sentence through Text Analysis - LanguageTool.
I’m not sure. As changes in disambiguation.xml may have an effect on several rules, it’s necessary to test them carefully. But you might get more helpful replies on the mailing list. For example, Marcin has worked a lot on the English disambiguation.
OK, I will look into this. I think the solution might be simply to ensure that this disambig pattern never removes all postags from a word. Hopefully this disambig pattern only affects rules that use “these”.
Hi, I assume you mean the mailing list on SourceForge?
I’ve also found the same issue with the word “man”.
For now I’ve found a solution using “chunks”…
<rule id="THESE-NN-BLOCK" name="These singular block">
<pattern>
<marker>
<token chunk="B-NP-singular">these</token>
</marker>
<token chunk="E-NP-singular"><exception postag='NNS|JJ' postag_regexp='yes'></exception><exception regexp="yes">are|yours|alone</exception></token>
<token postag='MD'></token>
</pattern>
<message>Did you mean <suggestion>this</suggestion>?</message>
<example type="incorrect" correction='This'><marker>These</marker> author should rewrite this point.</example>
</rule>
Hi, just letting you know I’ve had no replies from the mailing list on this issue.
It’s a tricky one to solve since if a disambig rule removes all postags it kind of limits the scope for developing other rules.
I imagine the intention of this disambig rule was not to remove all postags. Hence, is it possible to catch the point when all postags are removed and then re-assign the word a postag selected from its original list?
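To make the idea concrete, here is a minimal sketch of the fallback I have in mind. It is plain Python and not LanguageTool's actual disambiguator code: the function name `filter_tags_with_fallback`, the `keep` predicate, and the tag list for “author” are all assumptions made up for illustration.

```python
from typing import Callable, List


def filter_tags_with_fallback(tags: List[str], keep: Callable[[str], bool]) -> List[str]:
    """Keep only the tags accepted by `keep`.

    If the filter would remove every tag, return the original tags
    instead, so the word is never left without a postag.
    """
    kept = [tag for tag in tags if keep(tag)]
    return kept if kept else list(tags)


# Hypothetical readings for "author"; the real tagger output may differ.
author_tags = ["NN", "VB", "VBP"]

# Hypothetical filter: after "these", keep only plural-noun readings.
def plural_only(tag: str) -> bool:
    return tag == "NNS"

print(filter_tags_with_fallback(author_tags, plural_only))
```

With a fallback like this, “author” would keep its original readings after “these”, so a rule matching postag='NN' could still fire, while words that do have a plural reading would still be filtered as before.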
Hi Mility, yes I do check my rules using the Wikipedia data.
It’s while running the wiki test that I identify most issues.
I also tend to not submit rules until I’ve tested them in my version for a number of months first, which is how I spotted this disambiguation issue.
Well, if it were easy to write the ngram rules as XML rules, we’d do that. But it’s not: the ngram rule covers all kinds of cases that cannot simply be listed in an XML rule.