I was looking into a problem where “12,75 грн.” (correctly) does not generate an error in rule tests or on the command line, but does produce a false positive via the REST API.
If I add text after this sentence the error goes away.
It seems like there are some differences in tokenizing/tagging, e.g. command line: [<S> 12,75[12,75/number] грн[грн/noun:inanim:f:v_dav:nv:abbr,грн/noun:inanim:f:v_mis:nv:abbr,грн/noun:inanim:f:v_naz:nv:abbr,грн/noun:inanim:f:v_oru:nv:abbr,грн/noun:inanim:f:v_rod:nv:abbr,грн/noun:inanim:f:v_zna:nv:abbr,грн/noun:inanim:p:v_dav:nv:abbr,грн/noun:inanim:p:v_mis:nv:abbr,грн/noun:inanim:p:v_naz:nv:abbr,грн/noun:inanim:p:v_oru:nv:abbr,грн/noun:inanim:p:v_rod:nv:abbr,грн/noun:inanim:p:v_zna:nv:abbr].[</S>]<P/> ]
If I add debug output to JLanguageTool running on the server, I get this: [<S> </S><P/> , <S> 12,75[12,75/number] грн[грн/noun:inanim:f:v_dav:nv:abbr,грн/noun:inanim:f:v_mis:nv:abbr,грн/noun:inanim:f:v_naz:nv:abbr,грн/noun:inanim:f:v_oru:nv:abbr,грн/noun:inanim:f:v_rod:nv:abbr,грн/noun:inanim:f:v_zna:nv:abbr,грн/noun:inanim:p:v_dav:nv:abbr,грн/noun:inanim:p:v_mis:nv:abbr,грн/noun:inanim:p:v_naz:nv:abbr,грн/noun:inanim:p:v_oru:nv:abbr,грн/noun:inanim:p:v_rod:nv:abbr,грн/noun:inanim:p:v_zna:nv:abbr].[</S><P/>]]
So there are two things here:
An empty sentence is added at the front when tokenizing via the REST API.
The final period gets a “concatenated”(?) tag [</S><P/>], instead of [</S>]<P/> as on the command line. I suspect that because of the [</S><P/>] tag my <token negate_pos="yes" postag="SENT_END">.</token> does not work.
I just typed 12,75 грн. into https://languagetool.org
In the browser's network tool (F12) I see that the request is correct (it matches the text entered).
So I’ve added output for the original, tokenized, and tagged sentences in JLanguageTool and submitted this request to a local server.
I get an error when I give 12,75 грн. as input to the command-line version, but not with 12,75 грн.\n. So I don’t think this is a difference between command-line and REST, but rather subtle differences in the input?
Ok, it looks like the newline at the beginning is inserted by editor_plugin2.js (I used a local web page to test this initially); in particular, this code does that:
Unfortunately it’s not visible in the browser debugger, and when I copied the curl data it was from the https://languagetool.org page, which does not seem to have this problem.
We probably could prevent this from happening by removing leading
instead of converting it to \n\n, but I am more interested in fixing the false positive that happens at the end of the text.
Ok, I’ve found the problem: I had a negate_pos="yes" postag="SENT_END" condition.
It works if the period only has SENT_END. But if it has both SENT_END and PARA_END, this condition stops working.
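To make the failure concrete, here is a small Python model of how a negated POS condition could behave. It is a sketch under an assumption, not LanguageTool's actual code: the function `negated_postag_matches` is hypothetical, and it assumes that a negated condition matches as soon as any one reading of the token carries a tag other than the given one. That assumption is consistent with the behaviour described above.

```python
def negated_postag_matches(reading_tags, postag):
    """Hypothetical model of negate_pos="yes": the token matches if at least
    one of its readings has a tag different from `postag` (assumed semantics)."""
    return any(tag != postag for tag in reading_tags)


# The final period as tagged on the command line: only SENT_END.
period_cli = ["SENT_END"]
# The final period as tagged on the server: SENT_END plus PARA_END.
period_server = ["SENT_END", "PARA_END"]

print(negated_postag_matches(period_cli, "SENT_END"))     # False: period is excluded
print(negated_postag_matches(period_server, "SENT_END"))  # True: the PARA_END reading slips through
```

Under this model the extra PARA_END reading is enough to make the "not SENT_END" condition succeed, which would explain why the rule only misfires at the end of a paragraph.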
This is actually more generic: I have another condition in the uk rules (find tokens that are marked as slang but have no non-slang reading):
and it works properly if the token does not have SENT_END. But if I make this word the last token in the sentence, it gets an additional SENT_END tag and my condition stops working.
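The same effect can be modelled for the slang check. The tag strings below (`noun:slang` and so on) are made up for illustration, and `has_only_slang_readings` is a hypothetical helper; the sketch only assumes that SENT_END shows up as one more reading on the last token of a sentence.

```python
def has_only_slang_readings(reading_tags):
    """True if every reading carries a tag containing ':slang'
    (hypothetical tag format, used here only for illustration)."""
    return all(":slang" in tag for tag in reading_tags)


# Illustrative readings, not real tagger output.
mid_sentence = ["noun:slang"]
end_of_sentence = ["noun:slang", "SENT_END"]

print(has_only_slang_readings(mid_sentence))     # True: flagged as slang-only
print(has_only_slang_readings(end_of_sentence))  # False: SENT_END counts as a non-slang reading
```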
I see 3 ways to fix this:
every time we use negate_pos we need to remember that SENT_END/PARA_END can be added “randomly” (depending on where the token sits in the sentence). Note: in XML rule tests SENT_END is added, so you can test for it, but PARA_END is never added, so it’s easier to miss
we treat SENT_END/PARA_END specially in negate_pos (probably not good, as you may want to have them counted)
allow moving SENT_END/PARA_END after the last token. I’ve noticed many times that the asymmetry of having SENT_START before the first token but SENT_END on the last token makes some rules more complicated. It’s a big change, so this could be a per-language setting. This approach would let word tokens carry only POS tags, leaving sentence marker tags outside.
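As a rough illustration of the third option, the sketch below keeps sentence and paragraph markers apart from the POS tags of a token. It is not how LanguageTool stores readings; `split_markers` and the `noun:slang` tag are hypothetical names used only to show the idea.

```python
SENTENCE_MARKERS = {"SENT_START", "SENT_END", "PARA_END"}


def split_markers(reading_tags):
    """Separate sentence/paragraph marker tags from ordinary POS tags,
    preserving the original order within each group."""
    pos_tags = [t for t in reading_tags if t not in SENTENCE_MARKERS]
    markers = [t for t in reading_tags if t in SENTENCE_MARKERS]
    return pos_tags, markers


# A slang word that happens to be the last token (illustrative tags).
pos_tags, markers = split_markers(["noun:slang", "SENT_END"])

print(pos_tags)               # ['noun:slang']
print(markers)                # ['SENT_END']
print("SENT_END" in markers)  # True: sentence end is still available as a separate check
```

With the markers held separately, a condition on POS tags sees the same readings regardless of where the word sits, while rules that care about sentence boundaries can still query the markers explicitly.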