<token postag="SENT_START"/> <token/> <token/> <token postag="SENT_END" regexp="yes">[.!?]</token>
results in
Wake up !
being matched
in
… waar slechts één taal gesproken wordt. Wake up!
So the SENT_START token seems to be a space. Why?
I’m not sure I understand your issue. What do you want to match, and what should not match? You can use the text analysis page (Tekstanalyse – LanguageTool) to see how a sentence is analyzed internally.
I specified 4 tokens. The matched sentence has just 4. That is the issue.
<token postag="SENT_START"/>
does not correspond to anything visible; in particular, it is not the first word. It’s a little confusing.
I can live with that.
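For anyone else who stumbles over this, here is the same pattern with comments, as a sketch of how I understand the token counting. It assumes that SENT_START sits on an empty token at the beginning of each sentence and that SENT_END is a tag on the last real token; the comments are my reading, not official documentation, and the fragment shows only the pattern, not a complete rule.

```xml
<pattern>
  <!-- Token 1: sentence-start marker; assumed to be an empty token with no visible text -->
  <token postag="SENT_START"/>
  <!-- Token 2: first visible word, e.g. "Wake" -->
  <token/>
  <!-- Token 3: second visible word, e.g. "up" -->
  <token/>
  <!-- Token 4: last real token of the sentence, tagged SENT_END, e.g. "!" -->
  <token postag="SENT_END" regexp="yes">[.!?]</token>
</pattern>
```

Read this way, four pattern tokens cover only three visible tokens, which would explain why "Wake up !" is the match.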