Tokenizing specifics via REST API

I was looking into the problem that “12,75 грн.” (correctly) does not generate an error in rule tests or on the command line, but does produce a false positive via the REST API.
If I add text after this sentence, the error goes away.

It seems like there are some differences in tokenizing/tagging, e.g. command line:
[<S> 12,75[12,75/number] грн[грн/noun:inanim:f:v_dav:nv:abbr,грн/noun:inanim:f:v_mis:nv:abbr,грн/noun:inanim:f:v_naz:nv:abbr,грн/noun:inanim:f:v_oru:nv:abbr,грн/noun:inanim:f:v_rod:nv:abbr,грн/noun:inanim:f:v_zna:nv:abbr,грн/noun:inanim:p:v_dav:nv:abbr,грн/noun:inanim:p:v_mis:nv:abbr,грн/noun:inanim:p:v_naz:nv:abbr,грн/noun:inanim:p:v_oru:nv:abbr,грн/noun:inanim:p:v_rod:nv:abbr,грн/noun:inanim:p:v_zna:nv:abbr].[</S>]<P/> ]

If I add debug output to JLanguageTool running on the server, I get this:
[<S> </S><P/> , <S> 12,75[12,75/number] грн[грн/noun:inanim:f:v_dav:nv:abbr,грн/noun:inanim:f:v_mis:nv:abbr,грн/noun:inanim:f:v_naz:nv:abbr,грн/noun:inanim:f:v_oru:nv:abbr,грн/noun:inanim:f:v_rod:nv:abbr,грн/noun:inanim:f:v_zna:nv:abbr,грн/noun:inanim:p:v_dav:nv:abbr,грн/noun:inanim:p:v_mis:nv:abbr,грн/noun:inanim:p:v_naz:nv:abbr,грн/noun:inanim:p:v_oru:nv:abbr,грн/noun:inanim:p:v_rod:nv:abbr,грн/noun:inanim:p:v_zna:nv:abbr].[</S><P/>]]

So there are two things here:

  1. An empty sentence is added at the front when tokenizing via the REST API
  2. The final period gets a “concatenated”(?) tag [</S><P/>], instead of [</S>]<P/> as on the command line. I suspect that, because of the [</S><P/>] tag, my <token negate_pos="yes" postag="SENT_END">.</token> does not work.

How exactly do you test this? Can you send the curl calls, for example?

I just typed 12,75 грн. into
In the browser network tool (F12) I see the request is correct (matching the text entered).
So I’ve added output for the original, tokenized, and tagged sentences in JLanguageTool and submitted this request to a local server.

curl '' -H 'User-Agent: Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:64.0) Gecko/20100101 Firefox/64.0' -H 'Accept: */*' -H 'Accept-Language: en-US,en;q=0.5' --compressed -H 'Referer:' -H 'Content-Type: application/x-www-form-urlencoded; charset=UTF-8' -H 'X-Requested-With: XMLHttpRequest' -H 'DNT: 1' -H 'Connection: keep-alive' --data 'disabledRules=WHITESPACE_RULE&allowIncompleteResults=true&enableHiddenRules=true&useragent=ltorg&text=12,75+%D0%B3%D1%80%D0%BD.&language=uk&altLanguages=en-US'

diff --git a/languagetool-core/src/main/java/org/languagetool/ b/languagetool-core/src/main/java/org/languagetool/
index 600d60726e..79d9d93d30 100644
--- a/languagetool-core/src/main/java/org/languagetool/
+++ b/languagetool-core/src/main/java/org/languagetool/
@@ -697,7 +697,10 @@ public class JLanguageTool {
     unknownWords = new HashSet<>();
     List<AnalyzedSentence> analyzedSentences = analyzeSentences(sentences);
+    System.out.println("== " + annotatedText.getPlainText());
+    System.out.println("-- " + sentences);
+    System.out.println(":: " + analyzedSentences);
     List<RuleMatch> ruleMatches = performCheck(analyzedSentences, sentences, allRules, paraMode, annotatedText, listener, mode);
     ruleMatches = new SameRuleGroupFilter().filter(ruleMatches);
     // no sorting: SameRuleGroupFilter sorts rule matches already

I get an error when I give 12,75 грн. as input to the command-line version, but not with 12,75 грн.\n. So I don’t think this is a difference between command-line and REST, but rather subtle differences in the input?
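
One way to make such invisible input differences visible is to decode the form-encoded request body and print the text parameter with escapes. A minimal sketch, assuming Node.js; showTextParam is a hypothetical helper name, and the second payload is built by hand for illustration rather than captured from a real request:

```javascript
// Hypothetical helper: decode a form-encoded body and return the "text"
// parameter as a JSON string literal, so that newlines show up as \n.
function showTextParam(formBody) {
  const text = new URLSearchParams(formBody).get("text");
  return JSON.stringify(text);
}

// Payload as in the curl call above (other parameters omitted).
console.log(showTextParam("text=12,75+%D0%B3%D1%80%D0%BD.&language=uk"));

// Illustrative payload with two leading newlines (%0A%0A), built by hand.
console.log(showTextParam("text=%0A%0A12,75+%D0%B3%D1%80%D0%BD.&language=uk"));
```

Comparing the two printed literals shows whether the text that reaches the server really is identical to what was typed.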

Ok, it looks like the newline at the beginning is inserted by editor_plugin2.js (I used a local web page to test this initially); in particular, this code does it:

var plainText = tinyMCE.activeEditor.getContent({ format: 'raw' })
        .replace(/<p>/g, "\n\n")
        .replace(/<br>/g, "\n")
        .replace(/<br\s*\/>/g, "\n")

Unfortunately it’s not visible in the browser debugger, and when I copied the curl data it was from a page which does not seem to have this problem.
We could probably prevent this by removing the leading <p> instead of converting it to \n\n, but I am more interested in fixing the false positive which happens at the end of the text.
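
The idea of dropping the leading paragraph tag could look roughly like this. It is only a sketch: htmlToPlainText is a hypothetical standalone function, not the actual editor_plugin2.js code, and it mirrors only the three replacements quoted above (the rest of the original chain is not shown in this thread and is left out here):

```javascript
// Hypothetical helper, not part of editor_plugin2.js.
function htmlToPlainText(html) {
  return html
    // Assumption: strip the very first <p> so it is not turned into "\n\n".
    .replace(/^\s*<p>/, "")
    // Same conversions as in the snippet quoted above.
    .replace(/<p>/g, "\n\n")
    .replace(/<br>/g, "\n")
    .replace(/<br\s*\/>/g, "\n");
}

// Closing tags are left untouched because this sketch does not handle them.
console.log(JSON.stringify(htmlToPlainText("<p>12,75 грн.</p>")));
```

With this change the text starts directly with the first paragraph, while later paragraphs are still separated by a blank line.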

Ok, I’ve found the problem: I had a negate_pos="yes" postag="SENT_END" condition.
It works if the period only has SENT_END. But if it has both SENT_END and PARA_END, this condition stops working.
This is actually more general: I have another condition in the uk rules (find tokens that are marked as slang and have no non-slang reading):

    <token postag_regexp="yes" postag=".*:slang.*">
      <exception negate_pos="yes" postag_regexp="yes" postag=".*:slang.*"/>
    </token>

and it works properly if the token does not have SENT_END. But if I make this word the last token in the sentence, it gets an additional SENT_END tag and my condition stops working.
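
The behaviour described here can be reproduced with a small model. This is a simplified sketch, not LanguageTool's actual matching code: it assumes a token is just a list of POS tag readings and that a negated postag condition is satisfied as soon as any single reading fails to match. The function name and the slang tag are made up for illustration:

```javascript
// Simplified model (an assumption, not LanguageTool's implementation):
// a negated postag condition holds if ANY reading does not match the postag.
function matchesNegatedPostag(readings, postagRegex) {
  const re = new RegExp("^(?:" + postagRegex + ")$");
  return readings.some((tag) => !re.test(tag));
}

// Period with only SENT_END: the negated condition does not hold.
console.log(matchesNegatedPostag(["SENT_END"], "SENT_END")); // false
// Period with SENT_END and PARA_END: PARA_END is "not SENT_END", so it holds.
console.log(matchesNegatedPostag(["SENT_END", "PARA_END"], "SENT_END")); // true

// Illustrative slang-only token (the tag is hypothetical).
const slang = "noun:inanim:m:v_naz:slang";
console.log(matchesNegatedPostag([slang], ".*:slang.*")); // false
// Same token at the end of a sentence: SENT_END counts as a non-slang reading.
console.log(matchesNegatedPostag([slang, "SENT_END"], ".*:slang.*")); // true
```

Under this model, any extra marker reading flips the result of a negated condition, which matches both symptoms reported above.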

I see 3 ways to fix this:

  1. every time we use negate_pos we need to remember that SENT_END/PARA_END can be added “randomly” (depending on where the token sits in the sentence). Note: in XML rule tests SENT_END is added, so you can test for it, but PARA_END is never added, so it’s easier to miss
  2. we treat SENT_END/PARA_END specially in negate_pos (probably not good, as you may want to have them counted)
  3. allow moving SENT_END/PARA_END after the last token; e.g. I’ve noticed many times that the asymmetry of having SENT_START before the first token but SENT_END on the last token makes some rules more complicated. It’s a big change, so this could be a language-specific setting. This approach would let word tokens carry only POS tags, leaving sentence marker tags outside.
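
To make option 2 concrete, here is a rough sketch of a negated check that skips sentence marker readings. It uses the same simplified assumption as before (a token is a list of POS tag readings, and negation holds if any reading fails to match); the function name, the marker set, and the slang tags are illustrative and not taken from LanguageTool's code:

```javascript
// Assumed set of sentence/paragraph marker tags to ignore during negation.
const MARKER_TAGS = new Set(["SENT_START", "SENT_END", "PARA_END"]);

// Hypothetical variant of a negated postag check: marker readings are
// filtered out first, so only "real" POS readings can satisfy the negation.
function matchesNegatedPostagIgnoringMarkers(readings, postagRegex) {
  const re = new RegExp("^(?:" + postagRegex + ")$");
  return readings
    .filter((tag) => !MARKER_TAGS.has(tag))
    .some((tag) => !re.test(tag));
}

const slang = "noun:inanim:m:v_naz:slang"; // illustrative tag

// Slang-only word at sentence end: SENT_END no longer counts as non-slang.
console.log(matchesNegatedPostagIgnoringMarkers([slang, "SENT_END"], ".*:slang.*")); // false
// A word with a real non-slang reading still satisfies the negation.
console.log(matchesNegatedPostagIgnoringMarkers([slang, "noun:inanim:m:v_naz"], ".*:slang.*")); // true
// Period with SENT_END and PARA_END: no real readings remain, so it never holds.
console.log(matchesNegatedPostagIgnoringMarkers(["SENT_END", "PARA_END"], "SENT_END")); // false
```

The last case also shows the drawback mentioned in option 2: once the markers are filtered out, a rule can no longer use negate_pos to ask about SENT_END or PARA_END themselves.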