English These - singular

PeterLawrence · August 18, 2015, 9:25am

Hi, I’m writing a rule that flags an error when “these” is followed by a singular expression.
For example…

<rule id="THESE-NN" name="These singular">
  <pattern>
    <marker>
      <token>these</token>
    </marker>
    <token postag='NN'><exception postag='NNS|CD' postag_regexp='yes'></exception></token>
    <token postag='MD'></token>
  </pattern>
  <message>Did you mean <suggestion>this</suggestion>?</message>
  <example type="incorrect" correction='This'><marker>These</marker> author should rewrite this point.</example>
  <example type="incorrect" correction='This'><marker>These</marker> writer should rewrite this point.</example>
  <example>These two should do ok.</example>
</rule>

However, for some reason the word “author” doesn’t get assigned a part-of-speech tag when it is preceded by the word “these”.
I’ve checked the rulebase and found a few other rules which deal with “This” and “These”, for example ‘this’ vs. ‘these’ and these/those ones (these/those), but nothing which could impact on the word “author”. Any idea what’s special about the word author?

Thanks

dnaber · August 18, 2015, 9:42am

This seems to be caused by rule DT_plural_VBNNN_VBN in disambiguation.xml. You can find that out easily by running the sentence through Text Analysis - LanguageTool.

PeterLawrence · August 18, 2015, 10:42am

Do you think the DT_plural_VBNNN_VBN needs to be modified so it doesn’t remove all postags?

Would changing the exception line in DT_plural_VBNNN_VBN to include MD, like this…

… be a possible solution?

dnaber · August 18, 2015, 11:37am

I’m not sure. As changes in disambiguation.xml may have an effect on several rules, it’s necessary to test them carefully. But you might get more helpful replies on the mailing list. For example, Marcin has worked a lot on the English disambiguation.

PeterLawrence · August 18, 2015, 1:01pm

OK, I’ll look into this. I think the solution might be simply to ensure that this disambiguation pattern never removes all postags from a word. Hopefully it only affects rules that use “these”.

PeterLawrence · August 18, 2015, 3:23pm

Hi, I assume you mean the mailing list on SourceForge?
I’ve also found the same issue with the word “man”.
For now I’ve found a solution using “chunks”…

<rule id="THESE-NN-BLOCK" name="These singular block">
  <pattern>
    <marker>
      <token chunk="B-NP-singular">these</token>
    </marker>
    <token chunk="E-NP-singular"><exception postag='NNS|JJ' postag_regexp='yes'></exception><exception regexp="yes">are|yours|alone</exception></token>
    <token postag='MD'></token>
  </pattern>
  <message>Did you mean <suggestion>this</suggestion>?</message>
  <example type="incorrect" correction='This'><marker>These</marker> author should rewrite this point.</example>
</rule>

…but I feel this is not a great solution.
Thanks

dnaber · August 18, 2015, 3:33pm

Yes, I mean https://languagetool.org/development/mailing-list.php - it’s still hosted at SourceForge because it’s not easy to find a good mailing list host (GitHub etc. don’t offer mailing lists).

PeterLawrence · September 14, 2015, 6:24pm

Hi, just letting you know I’ve had no replies from the mailing list on this issue.
It’s a tricky one to solve: if a disambiguation rule removes all postags from a word, it limits the scope for developing other rules.
I imagine the intention of this disambiguation rule was not to remove all postags. So is it possible to catch the point at which all postags are removed and then re-assign the word a postag selected from its original list?
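
As an interim workaround, the pattern could also accept a token that has lost all of its tags. The sketch below assumes that a token stripped of every reading by the disambiguator ends up with no POS tag, and that the special value postag="UNKNOWN" in a pattern rule matches such untagged tokens. Neither assumption has been verified against DT_plural_VBNNN_VBN. The rule id and name are made up for illustration.

```xml
<!-- Hypothetical sketch: id and name are illustrative, not part of the rulebase -->
<rule id="THESE-NN-UNTAGGED" name="These singular (untagged token sketch)">
  <pattern>
    <marker>
      <token>these</token>
    </marker>
    <!-- Assumption: UNKNOWN matches a token that carries no POS tag,
         e.g. "author" after the disambiguator has removed its readings -->
    <token postag="UNKNOWN"></token>
    <token postag="MD"></token>
  </pattern>
  <message>Did you mean <suggestion>this</suggestion>?</message>
  <example type="incorrect" correction="This"><marker>These</marker> author should rewrite this point.</example>
</rule>
```

Any other untagged word, such as a misspelling or an unfamiliar proper noun, would match as well, so a rule like this would need the same Wikipedia testing before use.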

Thanks

Peter

Mility · September 18, 2015, 2:01am

Hi Peter, do you check your rules after writing them? Have you checked these rules against, e.g., Wikipedia or some other large dataset?

PeterLawrence · September 21, 2015, 3:21pm

Hi Mility, yes I do check my rules using the Wikipedia data.
It’s while running the wiki test that I identify most issues.
I also tend to not submit rules until I’ve tested them in my version for a number of months first, which is how I spotted this disambiguation issue.

Mility · September 23, 2015, 12:29pm

This is about the rules in http://languagetool-user-forum.2306527.n4.nabble.com/add-rules-td4643105.html.
For various reasons, I don’t have that kind of large dataset or the hardware to test them. Could you help me rewrite the rules so that they could become real rules? Thanks in advance.

dnaber · September 23, 2015, 12:41pm

Well, if it was easy to write the ngram rules as XML rules, we’d do that. But it’s not, the ngram rule covers all kinds of cases that cannot simply be listed in an XML rule.

Mility · September 23, 2015, 12:47pm

Sorry, what is ngram rule?

dnaber · September 23, 2015, 12:52pm

Sorry, I misunderstood your question. I thought your question referred to Finding errors using n-gram data - LanguageTool Wiki. Anyway, then I don’t understand your question.

Mility · September 23, 2015, 1:00pm

Oh,
for various reasons, I don’t have a large dataset (e.g. Wikipedia) or the hardware to test the rules in http://languagetool-user-forum.2306527.n4.nabble.com/add-rules-td4643105.html. What I meant is that I want to ask Peter to help me rewrite those rules.