[EN] Word repetition false alarm (May may...)

I found an exception for the Rule “Word repetition (e.g. ‘will will’)”

LanguageTool displays “Possible typo: you repeated a word.”

May may go for a walk.

Could you add it to the exceptions for this rule?

You actually found two.
Will [Smith] will go for a walk, but…
How many scenarios produce this false alarm compared with legitimate detections of ‘may may’?

Can we suppose that an uppercase May which has an NNP tag will separate it from valid error cases?

That is possible. It has been done before, so it may be added. Sooner or later the English language maintainers will look into it.

Thank you.


I can add this override to EnglishWordRepeatRule.java:

if (wordRepetitionOf("May", tokens, position) && posIsIn(tokens, position-1, "NNP")) {
  return true; // "May may go for a walk tomorrow."
}

But I need four overrides: May/Will in declarative sentences and in questions (example: ‘… but will Will go also?’). Is there a shorter alternative, or must I use four separate overrides?

You’d need four ifs. Maybe we can keep it simple and not even care about POS? Case should be enough if it’s taken into account (I haven’t checked now).
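To make the case-only idea concrete, here is a minimal sketch on plain strings. It is not LanguageTool code: the class name, the method name and the word list are assumptions for illustration only, and the real rule works on token objects rather than `String`s.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class CaseOnlyRepeatCheck {

  // Illustrative list of words that are both a modal verb and a given name.
  private static final Set<String> NAME_OR_MODAL =
      new HashSet<>(Arrays.asList("may", "will"));

  // True for pairs such as "May may" or "will Will": same letters, different case,
  // and the word is on the list. Identical pairs such as "may may" return false.
  static boolean isNameModalPair(String previous, String current) {
    boolean sameLetters = previous.equalsIgnoreCase(current);
    boolean differentCase = !previous.equals(current);
    return sameLetters && differentCase
        && NAME_OR_MODAL.contains(current.toLowerCase(Locale.ENGLISH));
  }
}
```

The word list matters: without it, a genuine typo such as ‘The the’ would also be skipped.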

A different but related false alarm (posting here to reduce the risk of fragmentation):
“… (not that that many were needed.)”

Example: “The TV could remember up to 128 channels at a time. (not that that many were needed)”

The Maven tests gave an error message when I added these overrides to EnglishWordRepeatRule.java:

if (wordRepetitionOf("May", tokens, position) && posIsIn(tokens, position-1, "NNP")) {
  return true; // "May may go for a walk tomorrow."
}
if (wordRepetitionOf("May", tokens, position) && posIsIn(tokens, position, "NNP")) {
  return true; // "Sir, may May walk with me tomorrow?"
}
if (wordRepetitionOf("Will", tokens, position) && posIsIn(tokens, position-1, "NNP")) {
  return true; // "... but if Will will go for a walk tomorrow..."
}
if (wordRepetitionOf("Will", tokens, position) && posIsIn(tokens, position, "NNP")) {
  return true; // "... and will Will also go for a walk tomorrow?"
}
if (wordRepetitionOf("will", tokens, position) && posIsIn(tokens, position, "MD")) {
  return true; // "The legal people say that the will will probably cause problems."
}

I cannot see where my error is, so I did not add the rules. Can one of the more experienced maintainers please add the (corrected) overrides?

The overrides in EnglishWordRepeatRule are case-sensitive. Thus, if I write ‘Blah blah’, LT gives me an error message. An option to make an override case-insensitive would be useful.
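As a rough sketch of what such an option could look like, the helper below takes an extra `ignoreCase` flag. It is a hypothetical stand-in that works on a plain `String[]`; it is not the signature of the real `wordRepetitionOf` method in LanguageTool.

```java
final class RepeatMatcher {

  // Hypothetical variant of wordRepetitionOf with a case-sensitivity switch.
  // Returns true when the tokens at position-1 and position are both `word`.
  static boolean wordRepetitionOf(String word, String[] tokens, int position, boolean ignoreCase) {
    if (position < 1 || position >= tokens.length) {
      return false;
    }
    String previous = tokens[position - 1];
    String current = tokens[position];
    if (ignoreCase) {
      return previous.equalsIgnoreCase(word) && current.equalsIgnoreCase(word);
    }
    return previous.equals(word) && current.equals(word);
  }
}
```

With the flag set, one override for ‘blah’ would cover ‘blah blah’, ‘Blah blah’ and ‘Blah Blah’.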

@SkyCharger001, EnglishWordRepeatRule has an override for ‘that that’. The rule does not override your example because ‘many’ does not have the postag NN. Merriam-Webster’s Collegiate Dictionary, 10th edition, tells me that ‘many’ can be a noun (example: “… a good many of them…”). I will do some tests, and if nothing breaks, I will add the postag.

Update: Webster’s and the Shorter Oxford English Dictionary show that ‘many’ is used in a plural sense, thus the postag is NNS. Adding the postag NNS to ‘many’ will not help, because the rule uses NN. Also, when I add NNS to ‘many’, the postag is removed by the disambiguation rule PDT_DT. So, I will not add the postag at this stage.

@dnaber, I think that the rules should be as rigorous as possible, which is why I included the POS.

The tests fail because “I will will hold the ladder.” will not be detected anymore with the change. I don’t have time to fix that now, though.
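To illustrate why the `"will"` + `MD` override swallows that sentence, here is a small sketch with a toy token model. The POS tags are hand-assigned for the example, not tagger output, and `Tok`, `broadOverride` and `narrowOverride` are hypothetical names. The narrower variant (requiring a determiner before the pair) is only one possible fix and is untested against the real rule.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class WillWillOverrideSketch {

  // Minimal token model: surface form plus hand-assigned POS tags.
  static final class Tok {
    final String word;
    final Set<String> tags;

    Tok(String word, String... tags) {
      this.word = word;
      this.tags = new HashSet<>(Arrays.asList(tags));
    }
  }

  static boolean isWillWill(Tok[] t, int pos) {
    return pos >= 1 && t[pos - 1].word.equals("will") && t[pos].word.equals("will");
  }

  // Override as posted: skip "will will" whenever the second "will" can be a modal.
  static boolean broadOverride(Tok[] t, int pos) {
    return isWillWill(t, pos) && t[pos].tags.contains("MD");
  }

  // Narrower variant: additionally require a determiner before the pair ("the will will").
  static boolean narrowOverride(Tok[] t, int pos) {
    return broadOverride(t, pos) && pos >= 2 && t[pos - 2].tags.contains("DT");
  }
}
```

In both ‘the will will probably…’ and ‘I will will hold…’ the second ‘will’ is a modal, so the broad check cannot tell them apart; only the token before the pair differs.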

Not sure if this would be regarded as simpler/better:

    if (wordRepetitionOf("May", tokens, position) && (posIsIn(tokens, position-1, "NNP") || posIsIn(tokens, position, "NNP"))) {
      return true; // "May may go for a walk tomorrow." / "Sir, may May walk with me tomorrow?"
    }
    if (wordRepetitionOf("Will", tokens, position) && (posIsIn(tokens, position-1, "NNP") || posIsIn(tokens, position, "NNP", "MD"))) {
      return true; // "... but if Will will go for a walk tomorrow..." / "... and will Will also go for a walk tomorrow?" / "The legal people say that the will will probably cause problems."
    }

@Mike_Unwalla You can fix the test failure by removing line 48 from this file.
Also, you might want to add a few more examples there.

@curon, @Jan_Schreiber thanks, but this rule defeats me. I give up. I continue to get a Maven error message with the smallest possible change.

In EnglishWordRepeatRule.java, I added:

if (wordRepetitionOf("May", tokens, position) && posIsIn(tokens, position, "NNP")) {
  return true; // "Sir, may May walk with me tomorrow?"
}

In EnglishWordRepeatRuleTest.java, I added:

assertGood("Sir, may May walk with me tomorrow?");

From the LT GUI, I see that in this test sentence, ‘May’ has the postag NNP, so I do not understand what I am doing wrong.

Maybe a disambiguation conflict, but it looks good.
This works:

.../languagetool/rules/en/EnglishWordRepeatRule.java   | 18 ++++++++++++++++++
 .../rules/en/EnglishWordRepeatRuleTest.java            |  8 ++++++++
 2 files changed, 26 insertions(+)

diff --git a/languagetool-language-modules/en/src/main/java/org/languagetool/rules/en/EnglishWordRepeatRule.java b/languagetool-language-modules/en/src/main/java/org/languagetool/rules/en/EnglishWordRepeatRule.java
index ae2a42f..b908c1c 100644
--- a/languagetool-language-modules/en/src/main/java/org/languagetool/rules/en/EnglishWordRepeatRule.java
+++ b/languagetool-language-modules/en/src/main/java/org/languagetool/rules/en/EnglishWordRepeatRule.java
@@ -76,6 +76,24 @@ public class EnglishWordRepeatRule extends WordRepeatRule {
     if (wordRepetitionOf("Li", tokens, position)) {
       return true;   // "Li Li", Chinese name
     }
+    if (position > 0 && tokens[position - 1].getToken().matches("may") && tokens[position].getToken().matches("May")) {
+      return true;   // "may May"
+    }
+    if (position > 0 && tokens[position - 1].getToken().matches("May") && tokens[position].getToken().matches("may")) {
+      return true;   // "May may"
+    }
+    if (position > 0 && tokens[1].getToken().matches("May") && tokens[2].getToken().matches("May")) {
+      return true;   // "May May" SENT_START
+    }
+    if (position > 0 && tokens[position - 1].getToken().matches("will") && tokens[position].getToken().matches("Will")) {
+      return true;   // "will Will"
+    }
+    if (position > 0 && tokens[position - 1].getToken().matches("Will") && tokens[position].getToken().matches("will")) {
+      return true;   // "Will will"
+    }
+    if (position > 0 && tokens[1].getToken().matches("Will") && tokens[2].getToken().matches("Will")) {
+      return true;   // "Will Will" SENT_START
+    }
     return false;
diff --git a/languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/EnglishWordRepeatRuleTest.java b/languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/EnglishWordRepeatRuleTest.java
index b280f91..9d10518 100644
--- a/languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/EnglishWordRepeatRuleTest.java
+++ b/languagetool-language-modules/en/src/test/java/org/languagetool/rules/en/EnglishWordRepeatRuleTest.java
@@ -45,6 +45,14 @@ public class EnglishWordRepeatRuleTest {
     assertGood("It was said that that lady was an actress.");
     assertGood("Kurosawa's three consecutive movies after Seven Samurai had not managed to capture Japanese audiences in the way that that film had.");
     assertGood("The can can hold the water.");
+    assertGood("May May awake up?");
+    assertGood("May may awake up.");
+    assertBad("I may may awake up.");
+    assertBad("That is May May.");
+    assertGood("Will Will awake up?");
+    assertGood("Will will awake up.");
+    assertBad("I will will awake up.");
+    assertBad("That is Will Will.");
     assertBad("I will will hold the ladder.");
     assertBad("You can feel confident that that this administration will continue to support a free and open Internet.");
     assertBad("This is is a test.");

Try before patching, of course.
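For comparison, the six checks in the patch can be written as one loop over a word list. This is only a sketch on a plain `String[]`, not the committed code: the class and method names are made up, it uses `equals` instead of the regex-based `matches`, and it assumes that `tokens[0]` is the sentence-start marker, so that the first real word sits at index 1. Unlike the patch, the sentence-start case is tied to `position == 2` rather than to fixed indices regardless of `position`.

```java
import java.util.Locale;

final class NameModalPairs {

  // Capitalized forms; the modal is derived by lowercasing.
  private static final String[] NAMES = {"May", "Will"};

  // Returns true when the repetition ending at `position` should be ignored.
  static boolean ignore(String[] tokens, int position) {
    if (position < 1 || position >= tokens.length) {
      return false;
    }
    String previous = tokens[position - 1];
    String current = tokens[position];
    for (String name : NAMES) {
      String modal = name.toLowerCase(Locale.ENGLISH);
      boolean nameThenModal = previous.equals(name) && current.equals(modal); // "May may"
      boolean modalThenName = previous.equals(modal) && current.equals(name); // "may May"
      // "May May ...?" is only accepted as the first two words of the sentence.
      boolean questionStart = position == 2 && previous.equals(name) && current.equals(name);
      if (nameThenModal || modalThenName || questionStart) {
        return true;
      }
    }
    return false;
  }
}
```

Adding another name/modal homograph would then be a one-word change to the list rather than three more `if` blocks.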

@tiagosantos, thanks, but I continue to get a Maven error. If I add only the code to EnglishWordRepeatRule.java, Maven gives a success message. But when I also add the test sentences to EnglishWordRepeatRuleTest.java, Maven gives an error message:

[INFO] -------------------------------------------------------
[INFO] Running org.languagetool.rules.en.EnglishWordRepeatRuleTest
[ERROR] Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 1.93 s <<< FAILURE! - in org.languagetool.rules.
[ERROR] testRepeatRule(org.languagetool.rules.en.EnglishWordRepeatRuleTest)  Time elapsed: 1.78 s  <<< FAILURE!

Expected: is <0>
     but: was <1>
        at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:20)
        at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:8)
        at org.languagetool.rules.en.EnglishWordRepeatRuleTest.assertMatches(EnglishWordRepeatRuleTest.java:71)
        at org.languagetool.rules.en.EnglishWordRepeatRuleTest.assertGood(EnglishWordRepeatRuleTest.java:62)
        at org.languagetool.rules.en.EnglishWordRepeatRuleTest.testRepeatRule(EnglishWordRepeatRuleTest.java:47)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
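The ‘Expected: is <0> but: was <1>’ message says that a sentence passed to assertGood produced one match, but not which sentence. One way to make that visible is to put the sentence into the failure message. The sketch below shows the idea with a toy stand-in for the rule (it just counts adjacent identical words); the helper bodies are assumptions and are not copied from EnglishWordRepeatRuleTest.java.

```java
final class RepeatTestHelper {

  // Toy stand-in for the rule: counts adjacent identical words (case-sensitive).
  static int countRepeats(String sentence) {
    String[] words = sentence.split("\\W+");
    int count = 0;
    for (int i = 1; i < words.length; i++) {
      if (!words[i].isEmpty() && words[i].equals(words[i - 1])) {
        count++;
      }
    }
    return count;
  }

  // Fails with a message that names the offending sentence.
  static void assertMatches(String sentence, int expected) {
    int actual = countRepeats(sentence);
    if (actual != expected) {
      throw new AssertionError(
          "Expected " + expected + " match(es) but got " + actual + " for: " + sentence);
    }
  }

  static void assertGood(String sentence) {
    assertMatches(sentence, 0);
  }

  static void assertBad(String sentence) {
    assertMatches(sentence, 1);
  }
}
```

With a message like this, the Maven log would point directly at the failing test sentence instead of only at a line number in the test file.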

Odd. It compiles for me without warnings, and I have just rebuilt the English module again without any errors.
Note that I corrected a typo about 10 minutes after posting, so the error may be due to using the code with the typo.
Anyway, if you prefer, I can commit the changes and you can analyse Travis results and the daily build.
If the change has some problem, it can be reverted before the build takes effect.

@tiagosantos , I don’t use Travis and I don’t have the time to install it and learn how to use it.

But, if you think that the code is good, please commit it. I can look at the regression to see if there are problems. (Maybe best to commit on Thursday, because I cannot spend any time tomorrow on LT.)

Travis is a tool attached to the GitHub repo. You only need to look at it; it shows the Maven build results. Just click on the ‘green tick’ that appears next to the commit on GitHub. It is similar to Jenkins.
For example, the log of the latest commit:

The code is good. I will push on Thursday, then.

@tiagosantos, many thanks.

If you have the time, please show a screenshot, thanks. I looked at ‘[en] EnglishWordRepeatRule improvements · languagetool-org/languagetool@d3cff66 · GitHub’, but I cannot see a green tick.

Look on the commits page, not inside the commit.

I made two git commits and one git push. In that situation Travis only tests the final result.
Check the tick next to:
[gl] general gender agreement rules added (TiagoSantos81 committed a day ago, db33b69)
[en] EnglishWordRepeatRule improvements … (TiagoSantos81 committed a day ago)

Or the latest one. I haven’t touched the file since.