Token count mismatch?

    <token postag="SENT_START"/>
    <token/>
    <token/>
    <token postag="SENT_END" regexp="yes">[.!?]</token>

results in

    Wake up !

being matched in

    … waar slechts één taal gesproken wordt. Wake up!

So the SENT_START token is a space. Why?

I’m not sure I understand your issue. What do you want to match, and what should not match? You can use the Tekstanalyse – LanguageTool page to see how a sentence is analyzed internally.

I specified 4 tokens. The matched sentence has just 4. That is the issue.

<token postag="SENT_START"/> does not correspond to anything visible; in particular, it is not the first word. It’s a little confusing.
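
To make the counting concrete, here is an annotated copy of the pattern from the first post. It is only a sketch of how the four tokens probably line up with "Wake up!": the `<pattern>` wrapper and the comments are my additions, and I am assuming that whitespace is not counted as a token.

```xml
<!-- Annotated copy of the pattern above; the comments are assumptions, not analyzer output -->
<pattern>
    <token postag="SENT_START"/>                         <!-- invisible sentence start, has no text -->
    <token/>                                             <!-- Wake -->
    <token/>                                             <!-- up -->
    <token postag="SENT_END" regexp="yes">[.!?]</token>  <!-- ! -->
</pattern>
```

Read this way, four pattern tokens cover only three visible ones, which would explain the apparent mismatch.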

I can live with that.