SENT_END and PARA_END

Currently, setting SENT_END on the last token makes some rules a bit flaky.
Consider these two sentences, checked at https://languagetool.org/:

He pointed to it’s reddest area.

and

He pointed to it’s reddest area

The first generates an error, while the second does not.
The reason is that negate_pos="yes" on the last token in the rule needs to take SENT_END into account, and many rules (if not most) do not. To make the rule work for a sentence that ends on the last word, you have to add "|SENT_END" to the postag attribute. That's a bit tricky to remember.

The patch further below adds two examples to grammar.xml that illustrate this.

What's worse, your last token may also get PARA_END (I suspect you can't trigger that in grammar.xml, but it happens on real texts via the command line or the REST API).
So technically, any rule that has negate_pos on the last token needs "|SENT_END|PARA_END" added to its postag.
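
To make the effect concrete, here is a minimal sketch in plain Java. It does not use LanguageTool's matching code: `onlyTagsMatching` and `withSentenceEndTags` are hypothetical helpers, and the "no reading may carry a tag outside the list" check is an assumed simplification of how the negated POS check behaves on the last token.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class SentEndPostagSketch {

    // Hypothetical helper: true when no reading carries a tag outside postagRegex.
    // A simplified model of the negated check, not LanguageTool's actual code.
    static boolean onlyTagsMatching(List<String> tags, String postagRegex) {
        Pattern pattern = Pattern.compile(postagRegex);
        for (String tag : tags) {
            if (!pattern.matcher(tag).matches()) {
                return false;
            }
        }
        return true;
    }

    // Hypothetical helper: extend a postag regex with the two end markers.
    static String withSentenceEndTags(String postagRegex) {
        return postagRegex + "|SENT_END|PARA_END";
    }

    public static void main(String[] args) {
        // "area" followed by "." carries only its own tag.
        List<String> midSentence = Arrays.asList("NN");
        // "area" as the very last token also carries SENT_END.
        List<String> lastToken = Arrays.asList("NN", "SENT_END");

        System.out.println(onlyTagsMatching(midSentence, "NN"));                    // true
        System.out.println(onlyTagsMatching(lastToken, "NN"));                      // false
        System.out.println(onlyTagsMatching(lastToken, withSentenceEndTags("NN"))); // true
    }
}
```

The second call is the flaky case: the word itself has not changed, but the extra SENT_END tag makes the check fail until the end markers are listed explicitly.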

This may also apply to some Java rules (I noticed this issue with SENT_END while writing some of the Ukrainian Java rules, but I don't think I even accounted for PARA_END).

diff --git a/languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/grammar.xml b/languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/grammar.xml
index d4e8d21..68603d2 100644
--- a/languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/grammar.xml
+++ b/languagetool-language-modules/en/src/main/resources/org/languagetool/rules/en/grammar.xml
@@ -2990,6 +2990,8 @@
                 </pattern>
                 <message>Did you mean <suggestion>its <match no="4"/> <match no="5"/></suggestion>?</message>
                 <example correction="its reddest area">For the painting, <marker>it's reddest area</marker> was in the upper left.</example>
+                <example correction="its reddest area">He pointed to <marker>it's reddest area</marker>.</example>
+                <example correction="its reddest area">He pointed to <marker>it's reddest area</marker></example>
             </rule>
             <!-- for it's .*/JJ|NN|NNS::word=for its::pivots=\1,its -->
             <rule id="FOR_ITS_NN" name="for its NN (possessive)">

issue #1205

I think that SENT_START and SENT_END should be handled similarly.

Here's another interesting point. Sometimes the sentence gets \n and PARA_END after SENT_END; here's the AnalyzedSentence (tagged as part of a bigger text; the Ukrainian reads "Pseudo-service, please"):

[<S> Псевдосервіс[Псевдосервіс/null],[,/null] будь[бути/verb:imperf:impr:s:2] ласка[ласка/noun:anim:f:v_naz:xp1,ласка/noun:inanim:f:v_naz:xp2,</S>] <P/> ]

Here "ласка" gets SENT_END, but then the sentence has one more token, "\n", marked as PARA_END. Interestingly, even though this last token is \n and isWhitespace() returns true, it's returned as part of sentence.getTokensWithoutWhitespace(). So rules will get \n as a regular token.
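
A rule that wants the last real word therefore cannot just take the final element of the token array. The sketch below shows one way to skip such a trailing token. It is a self-contained model: `Tok` is a hypothetical stand-in for a tagged token rather than LanguageTool's own class, `lastContentIndex` is an illustrative helper, and the words are transliterated to keep the source ASCII.

```java
import java.util.Arrays;
import java.util.List;

public class TrailingTokenSketch {

    // Hypothetical stand-in for a tagged token, not a LanguageTool class.
    static final class Tok {
        final String text;
        final List<String> tags;

        Tok(String text, String... tags) {
            this.text = text;
            this.tags = Arrays.asList(tags);
        }

        // Empty or whitespace-only text counts as whitespace in this model.
        boolean isWhitespace() {
            return text.trim().isEmpty();
        }
    }

    // Index of the last token that is not whitespace, or -1 if there is none.
    static int lastContentIndex(List<Tok> tokens) {
        for (int i = tokens.size() - 1; i >= 0; i--) {
            if (!tokens.get(i).isWhitespace()) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Models the sentence above: the final "\n" token carries PARA_END.
        List<Tok> tokens = Arrays.asList(
            new Tok("", "SENT_START"),
            new Tok("Psevdoservis"),
            new Tok(","),
            new Tok("bud"),
            new Tok("laska", "SENT_END"),
            new Tok("\n", "PARA_END"));

        int last = lastContentIndex(tokens);
        System.out.println(last + " " + tokens.get(last).text); // 4 laska
    }
}
```

Note that in this shape SENT_END and PARA_END sit on two different tokens, so a check on the last word alone sees only SENT_END.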

To continue this topic: I have a branch with a flag for the Language class that makes SENT_END/PARA_END a separate token at the end, so it's symmetrical to SENT_START.
I've tested it on Ukrainian, and the rules do get cleaner.
After we release 5.3, I'll rebase it on master and push it to a remote branch so those interested can try it.
It would be great if we only had to support one way of handling SENT_END, but I suspect that with so many languages and so many rules we'll have to support both for a long time.
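
As a rough illustration of the idea (this is not the code from the branch), the sketch below moves the end markers from whichever tokens carry them onto one empty trailing token, mirroring how SENT_START sits on its own token. `Tok` and `splitOffEndTags` are hypothetical names, and the empty-text trailing token is an assumption about what such a token might look like.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SeparateEndTokenSketch {

    // Hypothetical stand-in for a tagged token, not a LanguageTool class.
    static final class Tok {
        final String text;
        final List<String> tags;

        Tok(String text, List<String> tags) {
            this.text = text;
            this.tags = tags;
        }
    }

    // Strip SENT_END/PARA_END from every token and collect them
    // on a single empty token appended at the end.
    static List<Tok> splitOffEndTags(List<Tok> tokens) {
        List<Tok> result = new ArrayList<>();
        List<String> endTags = new ArrayList<>();
        for (Tok token : tokens) {
            List<String> kept = new ArrayList<>();
            for (String tag : token.tags) {
                if (tag.equals("SENT_END") || tag.equals("PARA_END")) {
                    if (!endTags.contains(tag)) {
                        endTags.add(tag);
                    }
                } else {
                    kept.add(tag);
                }
            }
            result.add(new Tok(token.text, kept));
        }
        if (!endTags.isEmpty()) {
            result.add(new Tok("", endTags));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Tok> tokens = Arrays.asList(
            new Tok("", Arrays.asList("SENT_START")),
            new Tok("area", Arrays.asList("NN", "SENT_END", "PARA_END")));

        for (Tok token : splitOffEndTags(tokens)) {
            System.out.println("'" + token.text + "' " + token.tags);
        }
    }
}
```

With the markers on their own token, a check on the last word sees only its real tags, so the postag no longer needs the "|SENT_END|PARA_END" suffix.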

Hi all,

I've pushed my changes that create SENT_END/PARA_END as a separate token at the end of the sentence to the "sent_end" branch.
With this change (if your language sets the appropriate flag), SENT_END/PARA_END is treated similarly to SENT_START.
There's one (smaller) commit to adjust the core and another (bigger) one to adjust Ukrainian to use it.

The regression tests I ran on my corpus look pretty good.
Note: I haven't done any optimization yet, e.g. shortening loops that don't care about the sentence end.

If anybody interested could take a look and see whether it's useful/acceptable, that would be great. If it is, I'd like to merge it into master and use this mode for Ukrainian to simplify the logic.

Thanks,
Andriy