Sentences not broken when the period is followed by \u202f

arysin · February 11, 2018, 3:57am

Hi all

I just found some false positives for Ukrainian, and the cause was that the sentences were not split properly.
The sequence '.\u202f ' (a period followed by U+202F, NARROW NO-BREAK SPACE) is a bit odd, but it occurs in my media archives much more often than I expected. To be fair, those texts were collected from the web and converted from HTML to plain text, so it may not come up much in your regular texts.
I’ve checked that English behaves the same way (at Text Analysis - LanguageTool). If you enter “You eat. Tomorrow will come.” where the first space after the dot is \u202f, you’ll see that the text is not split into two sentences.
I’ve put a workaround for Ukrainian into segment.srx, but I wanted to let everybody know in case we want the same fix for other languages, or maybe in SRX itself.
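
For anyone who wants to see the effect outside LanguageTool, here is a minimal Python sketch. It is not the actual segment.srx rule and not LanguageTool code: `ASCII_BREAK`, `WIDE_BREAK` and `split_sentences` are made-up names, and the two regexes are only an assumption about what a break rule that knows, or does not know, about \u202f might look like.

```python
import re

NNBSP = "\u202f"  # U+202F NARROW NO-BREAK SPACE

# Hypothetical break rule: sentence-final punctuation followed by
# ASCII whitespace only. This is an illustration, not the real SRX rule.
ASCII_BREAK = re.compile(r"(?<=[.!?])[ \t\n]+")

# The same rule, extended to also accept U+00A0 and U+202F after the punctuation.
WIDE_BREAK = re.compile(r"(?<=[.!?])[ \t\n\u00a0\u202f]+")


def split_sentences(text, rule):
    """Split text wherever the rule matches and drop empty pieces."""
    return [piece for piece in rule.split(text) if piece]


text = "You eat." + NNBSP + "Tomorrow will come."

print(split_sentences(text, ASCII_BREAK))  # stays one sentence
print(split_sentences(text, WIDE_BREAK))   # splits into two sentences
```

With the ASCII-only rule the text comes back as a single piece, which matches what I see in the analysis page. Listing \u202f explicitly among the allowed separators is one way to get the split.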

Andriy