Tokenizing specifics via REST API

arysin · January 24, 2019, 5:59pm

I was looking into the problem that “12,75 грн.” (correctly) does not generate an error in rule tests or on the command line, but does produce a false positive via the REST API.
If I add text after this sentence, the error goes away.

It seems like there are some differences in tokenizing/tagging, e.g. command line:
[<S> 12,75[12,75/number] грн[грн/noun:inanim:f:v_dav:nv:abbr,грн/noun:inanim:f:v_mis:nv:abbr,грн/noun:inanim:f:v_naz:nv:abbr,грн/noun:inanim:f:v_oru:nv:abbr,грн/noun:inanim:f:v_rod:nv:abbr,грн/noun:inanim:f:v_zna:nv:abbr,грн/noun:inanim:p:v_dav:nv:abbr,грн/noun:inanim:p:v_mis:nv:abbr,грн/noun:inanim:p:v_naz:nv:abbr,грн/noun:inanim:p:v_oru:nv:abbr,грн/noun:inanim:p:v_rod:nv:abbr,грн/noun:inanim:p:v_zna:nv:abbr].[</S>] ]

If I add output in JLanguageTool running on the server I get this:
[<S> </S> , <S> 12,75[12,75/number] грн[грн/noun:inanim:f:v_dav:nv:abbr,грн/noun:inanim:f:v_mis:nv:abbr,грн/noun:inanim:f:v_naz:nv:abbr,грн/noun:inanim:f:v_oru:nv:abbr,грн/noun:inanim:f:v_rod:nv:abbr,грн/noun:inanim:f:v_zna:nv:abbr,грн/noun:inanim:p:v_dav:nv:abbr,грн/noun:inanim:p:v_mis:nv:abbr,грн/noun:inanim:p:v_naz:nv:abbr,грн/noun:inanim:p:v_oru:nv:abbr,грн/noun:inanim:p:v_rod:nv:abbr,грн/noun:inanim:p:v_zna:nv:abbr].[</S>]]

So there are two things here:

1. An empty sentence is added at the front when tokenizing via the REST API.
2. The ending period gets a “concatenated”(?) tag [</S>], instead of [</S>] on the command line. I suspect that because of this tag my <token negate_pos="yes" postag="SENT_END">.</token> does not work.

dnaber · January 24, 2019, 7:00pm

How exactly do you test this? Can you send the curl calls, for example?

arysin · January 24, 2019, 9:35pm

I just typed 12,75 грн. into https://languagetool.org
In the browser network tool (F12) I see the request is correct (matching the text entered).
So I’ve added output for the original, tokenized, and tagged sentences in JLanguageTool and submitted this request to a local server.

curl 'https://languagetool.org/api/v2/check' -H 'User-Agent: Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:64.0) Gecko/20100101 Firefox/64.0' -H 'Accept: */*' -H 'Accept-Language: en-US,en;q=0.5' --compressed -H 'Referer: https://languagetool.org/' -H 'Content-Type: application/x-www-form-urlencoded; charset=UTF-8' -H 'X-Requested-With: XMLHttpRequest' -H 'DNT: 1' -H 'Connection: keep-alive' --data 'disabledRules=WHITESPACE_RULE&allowIncompleteResults=true&enableHiddenRules=true&useragent=ltorg&text=12,75+%D0%B3%D1%80%D0%BD.&language=uk&altLanguages=en-US'


diff --git a/languagetool-core/src/main/java/org/languagetool/JLanguageTool.java b/languagetool-core/src/main/java/org/languagetool/JLanguageTool.java
index 600d60726e..79d9d93d30 100644
--- a/languagetool-core/src/main/java/org/languagetool/JLanguageTool.java
+++ b/languagetool-core/src/main/java/org/languagetool/JLanguageTool.java
@@ -697,7 +697,10 @@ public class JLanguageTool {
 
     unknownWords = new HashSet<>();
     List<AnalyzedSentence> analyzedSentences = analyzeSentences(sentences);
-    
+    System.out.println("== " + annotatedText.getPlainText());
+    System.out.println("-- " + sentences);
+    System.out.println(":: " + analyzedSentences);
+
     List<RuleMatch> ruleMatches = performCheck(analyzedSentences, sentences, allRules, paraMode, annotatedText, listener, mode);
     ruleMatches = new SameRuleGroupFilter().filter(ruleMatches);
     // no sorting: SameRuleGroupFilter sorts rule matches already

dnaber · January 24, 2019, 10:27pm

I get an error when I give 12,75 грн. as input to the command-line version, but not with 12,75 грн.\n. So I don’t think this is a difference between command-line and REST, but rather subtle differences in the input?

arysin · January 25, 2019, 4:02am

Ok, it looks like the newline at the beginning is inserted by editor_plugin2.js (I used a local web page to test this initially); in particular, this code does that:

var plainText = tinyMCE.activeEditor.getContent({ format: 'raw' })
        .replace(/<p>/g, "\n\n")
        .replace(/<br>/g, "\n")
        .replace(/<br\s*\/>/g, "\n")

Unfortunately it’s not visible in the browser debugger, and when I copied the curl data it was from the https://languagetool.org page, which does not seem to have this problem.
We could probably prevent this by removing the leading <p> instead of converting it to \n\n, but I am more interested in fixing the false positive that happens at the end of the text.
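
To make the effect of those replace() calls concrete, here is a minimal Java sketch (not the plugin's actual code) that applies the same three substitutions to a one-paragraph editor value. The class and method names are hypothetical. Dropping the closing </p> is an assumption, because the quoted snippet does not show that step, and removing only a leading <p> is just one possible way to avoid the leading newlines.

```java
class EditorTextSketch {

    // Same three substitutions as the quoted JavaScript, in the same order.
    // Assumption: closing </p> tags are dropped afterwards; the quoted
    // snippet does not show how the plugin handles them.
    static String toPlainText(String rawHtml) {
        return rawHtml
                .replaceAll("<p>", "\n\n")
                .replaceAll("<br>", "\n")
                .replaceAll("<br\\s*/>", "\n")
                .replaceAll("</p>", "");
    }

    // Hypothetical fix: remove a <p> at the very start before converting,
    // so the text sent to the server does not begin with two newlines.
    static String toPlainTextWithoutLeadingBreak(String rawHtml) {
        return toPlainText(rawHtml.replaceFirst("^<p>", ""));
    }

    public static void main(String[] args) {
        // The sample text from this thread, written with Unicode escapes.
        String raw = "<p>12,75 \u0433\u0440\u043d.</p>";
        System.out.println(toPlainText(raw).startsWith("\n\n"));
        System.out.println(toPlainTextWithoutLeadingBreak(raw).startsWith("\n"));
    }
}
```

Under these assumptions the first variant produces text that starts with two newlines, which would be consistent with the empty sentence at the front of the server output, while the second variant starts directly with the number.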

arysin · January 25, 2019, 3:34pm

Ok, I’ve found the problem: I had a negate_pos="yes" postag="SENT_END" condition.
It works if the period only has SENT_END. But if it has both SENT_END and PARA_END, this condition stops working.
This is actually more general: I have another condition in the uk rules (find tokens that are marked as slang and have no non-slang reading):

    <token postag_regexp="yes" postag=".*:slang.*">
      <exception negate_pos="yes" postag_regexp="yes" postag=".*:slang.*"/>
    </token>

and it works properly if the token does not have SENT_END. But if I make this word the last token in the sentence, it gets an additional SENT_END tag and my condition stops working.
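
Both cases can be reproduced with a small standalone model. This is only a sketch under a stated assumption, not LanguageTool's matcher: it assumes that a negated postag condition is satisfied as soon as any one reading of the token has a tag that does not match, and that an exception excludes the token as soon as any one reading satisfies it. The class name, method names, and the tag "noun:slang" are made up for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class NegatePosSketch {

    // True if at least one reading's tag matches the regex.
    static boolean anyReadingMatches(List<String> tags, String regex) {
        Pattern p = Pattern.compile(regex);
        return tags.stream().anyMatch(t -> p.matcher(t).matches());
    }

    // Assumed semantics of negate_pos="yes": the condition holds as soon as
    // one reading has a tag that does NOT match the regex.
    static boolean anyReadingFailsToMatch(List<String> tags, String regex) {
        Pattern p = Pattern.compile(regex);
        return tags.stream().anyMatch(t -> !p.matcher(t).matches());
    }

    // Model of <token negate_pos="yes" postag="SENT_END">.
    static boolean notSentEndTokenMatches(List<String> tags) {
        return anyReadingFailsToMatch(tags, "SENT_END");
    }

    // Model of the slang token: some reading is slang, and the negated
    // exception (any reading that is not slang) must not fire.
    static boolean slangTokenMatches(List<String> tags) {
        String slang = ".*:slang.*";
        boolean exceptionHit = anyReadingFailsToMatch(tags, slang);
        return anyReadingMatches(tags, slang) && !exceptionHit;
    }

    public static void main(String[] args) {
        System.out.println(notSentEndTokenMatches(Arrays.asList("SENT_END")));
        System.out.println(notSentEndTokenMatches(Arrays.asList("SENT_END", "PARA_END")));
        System.out.println(slangTokenMatches(Arrays.asList("noun:slang")));
        System.out.println(slangTokenMatches(Arrays.asList("noun:slang", "SENT_END")));
    }
}
```

In this model the period with only SENT_END does not satisfy the negated condition, but adding PARA_END makes it match through the extra reading. Likewise the slang-only token matches until SENT_END is added, at which point the exception fires on the marker reading.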

I see 3 ways to fix this:

1. Every time we use negate_pos we need to remember that SENT_END/PARA_END can be added “randomly” (depending on how the sentence is positioned). Note: in XML rule tests SENT_END is added, so you can test for it, but PARA_END is never added, so it’s easier to miss.
2. We treat SENT_END/PARA_END specially in negate_pos (probably not good, as you may want to have them counted in).
3. Allow moving SENT_END/PARA_END after the last token. I’ve noticed many times that the asymmetry of having SENT_START before the first token but SENT_END on the last token makes some rules more complicated. It’s a big change, so this could be a language-specific setting. This approach would let word tokens carry only POS tags, leaving sentence marker tags outside.
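
A rough illustration of the third option, as a toy data model rather than anything from the LanguageTool code base: each token is represented only by its list of tags, and the two methods show the marker attached to the last token versus placed in a separate trailing token. All names here are hypothetical, and the input is assumed to contain at least one token.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SentenceMarkerSketch {

    // Layout described in the thread: SENT_END becomes an extra reading
    // of the last token, next to its POS tags.
    static List<List<String>> markerOnLastToken(List<List<String>> tokens) {
        List<List<String>> result = copy(tokens);
        result.get(result.size() - 1).add("SENT_END");
        return result;
    }

    // Proposed layout: SENT_END gets its own trailing token, so the
    // readings of the word tokens stay untouched.
    static List<List<String>> markerAsOwnToken(List<List<String>> tokens) {
        List<List<String>> result = copy(tokens);
        result.add(new ArrayList<>(Arrays.asList("SENT_END")));
        return result;
    }

    // Deep copy so the input lists are never modified.
    private static List<List<String>> copy(List<List<String>> tokens) {
        List<List<String>> result = new ArrayList<>();
        for (List<String> readings : tokens) {
            result.add(new ArrayList<>(readings));
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> tokens = Arrays.asList(
                Arrays.asList("number"),
                Arrays.asList("noun:slang"));
        System.out.println(markerOnLastToken(tokens));
        System.out.println(markerAsOwnToken(tokens));
    }
}
```

With the marker in its own token, a negated condition on the last word would only ever see that word's POS tags, regardless of where the word sits in the sentence.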