Here are some ideas for reducing the number of false alarms in ENGLISH_WORD_REPEAT_RULE:
If a pair of identical words is in quotes (single or double), ignore the pair. A writer is unlikely to put the same word in quotes twice when only one word was intended.
Example: … crowns and an orange base of bill. Its call is a ‘‘mjerp mjerp’’.
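The quote check could look roughly like the sketch below. It is not the actual rule code: the class name, the method names, and the assumption that the sentence arrives as a plain list of string tokens with quote marks as separate tokens are all made up for illustration.

```java
import java.util.List;

// Hypothetical helper, not part of LanguageTool.
class QuotedRepeatSketch {
    // Straight and curly single/double quotes (written as escapes to avoid encoding issues).
    private static final String QUOTE_CHARS = "\"'\u2018\u2019\u201C\u201D";

    // A token counts as a quote if it consists only of quote characters, e.g. "'" or "''".
    static boolean isQuoteToken(String token) {
        if (token.isEmpty()) {
            return false;
        }
        for (char c : token.toCharArray()) {
            if (QUOTE_CHARS.indexOf(c) < 0) {
                return false;
            }
        }
        return true;
    }

    // True if tokens[i] and tokens[i + 1] are the same word and the pair is
    // directly enclosed by quote tokens, so the repeat warning can be skipped.
    static boolean isQuotedRepeat(List<String> tokens, int i) {
        if (i < 1 || i + 2 >= tokens.size()) {
            return false;
        }
        boolean repeated = tokens.get(i).equalsIgnoreCase(tokens.get(i + 1));
        return repeated
                && isQuoteToken(tokens.get(i - 1))
                && isQuoteToken(tokens.get(i + 2));
    }
}
```

With tokens such as `is`, `a`, `'`, `mjerp`, `mjerp`, `'`, `.`, calling `isQuotedRepeat(tokens, 3)` would report the pair as quoted.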
If two words are both capitalized, assume that they are a proper noun or part of a proper noun.
Example: Bye Bye Blackbird was published in 1926 (https://en.wikipedia.org/wiki/Bye_Bye_Blackbird).
Counter-example: The The Problems of Analysis
If two non-English words (postag=UNKNOWN) are capitalized, assume that they are a proper noun, or part of a proper noun.
Example: Former Samoan international Ngapaku Ngapaku (known as Pux) was coach.
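The two capitalization ideas could be combined as in the sketch below. Everything in it is an assumption for illustration: the class and method names are made up, the `bothUnknown` flag stands in for whatever the tagger reports as postag=UNKNOWN, and the small function-word list is only one possible way to keep cases like the "The The" counter-example flagged.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of LanguageTool.
class CapitalizedRepeatSketch {
    // Illustrative guard list: function words that are unlikely to be
    // intentionally doubled inside a name, so they stay flagged.
    private static final Set<String> FUNCTION_WORDS =
            Set.of("the", "a", "an", "of", "and", "in", "to");

    static boolean isCapitalized(String word) {
        return !word.isEmpty() && Character.isUpperCase(word.charAt(0));
    }

    // True if the repeated pair looks like (part of) a proper noun.
    // bothUnknown: neither word received an English part-of-speech tag.
    static boolean looksLikeProperNoun(String first, String second, boolean bothUnknown) {
        if (!first.equals(second) || !isCapitalized(first)) {
            return false;
        }
        if (bothUnknown) {
            return true;
        }
        return !FUNCTION_WORDS.contains(first.toLowerCase(Locale.ROOT));
    }
}
```

Under these assumptions "Bye Bye" and "Ngapaku Ngapaku" would be skipped, while "The The" and a lowercase "the the" would still be reported.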
Genus names are often legitimately repeated as the species name. One option is to get a list of genera, which could be used to prevent false warnings from both ENGLISH_WORD_REPEAT_RULE and the spelling rule.
: [[White stork]] ''(national bird)''||''Ciconia ciconia''
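A genus lookup could be as small as the sketch below. The class name, the method name, and the four sample entries are illustrative assumptions; a real list would have to be loaded from a proper source of genus names.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of LanguageTool.
class GenusRepeatSketch {
    // Tiny sample only; a real list of genera would be far longer.
    static final Set<String> SAMPLE_GENERA = Set.of("Ciconia", "Bison", "Gorilla", "Vulpes");

    // True if "first second" looks like a binomial name whose species
    // epithet repeats the genus: capitalized genus, lowercase repeat.
    static boolean isRepeatedGenus(String first, String second, Set<String> genera) {
        return genera.contains(first)
                && second.equals(first.toLowerCase(Locale.ROOT));
    }
}
```

For example, `isRepeatedGenus("Ciconia", "ciconia", GenusRepeatSketch.SAMPLE_GENERA)` would accept the pair, while an ordinary repeat such as "the the" would not match.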
Allow "that that" as a simple exception when it matches the pattern verb + that + that + modal verb:
…, and I know that that’s not really my heart.
If I think that that’ll be a problem, I will tell you.
If I think that that will be a problem, I will tell you.
He knew that that would be a problem.
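The "that that" pattern could be checked roughly as follows. This is a sketch under assumptions: the class and method names are made up, the sentence is passed in as parallel lists of words and part-of-speech tags, and the tags are assumed to be Penn-Treebank-style (verb tags starting with `VB`, modals tagged `MD`). How contracted forms such as "that'll" or "that's" are split and tagged is not modeled here.

```java
import java.util.List;

// Hypothetical helper, not part of LanguageTool.
class ThatThatSketch {
    // True if words[i] and words[i + 1] are both "that", the word before
    // them is tagged as a verb, and the word after them is tagged as a modal.
    static boolean isAllowedThatThat(List<String> words, List<String> tags, int i) {
        if (tags.size() != words.size() || i < 1 || i + 2 >= words.size()) {
            return false;
        }
        boolean doubledThat = words.get(i).equalsIgnoreCase("that")
                && words.get(i + 1).equalsIgnoreCase("that");
        boolean verbBefore = tags.get(i - 1).startsWith("VB");
        boolean modalAfter = tags.get(i + 2).equals("MD");
        return doubledThat && verbBefore && modalAfter;
    }
}
```

For "He knew that that would be a problem", with "knew" tagged as a verb and "would" as a modal, the check would pass at the index of the first "that".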
ENGLISH_WORD_REPEAT_RULE is a Java rule, so I don’t feel confident changing it myself.