Token count mismatch?

(Ruud Baars) #1
    <token postag="SENT_START"/>
    <token postag="SENT_END" regexp="yes">[.!?]</token>

results in
Wake up !
being matched in:
… waar slechts één taal gesproken wordt. Wake up!

So the SENT_START token seems to be a space. Why?

(Daniel Naber) #2

I’m not sure I understand your issue. What would you want to match and what should not match? You can use to see how a sentence is analyzed internally.

(Ruud Baars) #3

I specified 4 tokens. The matched text has just 3. That is the issue.

(Jan Schreiber) #4

<token postag="SENT_START"/> does not correspond to anything visible, in particular it is not the first word. It’s a little confusing.
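
To illustrate the point, here is a toy Python model of the analysis. It is only a sketch and not LanguageTool's actual code: the `analyze` function and the dictionary layout are hypothetical, and it assumes that the sentence start is an extra entry with no surface text and that SENT_END is a tag on the last visible token.

```python
import re

def analyze(sentence):
    """Toy analysis (hypothetical): one zero-width SENT_START entry,
    followed by the visible tokens; the last one is tagged SENT_END."""
    words = re.findall(r"\w+|[^\w\s]", sentence)
    tokens = [{"text": "", "postag": "SENT_START"}]  # nothing visible
    for i, word in enumerate(words):
        tag = "SENT_END" if i == len(words) - 1 else None
        tokens.append({"text": word, "postag": tag})
    return tokens

tokens = analyze("Wake up!")
# Only entries with surface text show up in the highlighted match.
visible = [t["text"] for t in tokens if t["text"]]
print(len(tokens), visible)
```

In this model a pattern of 4 tokens lines up with 4 analyzed entries, but only 3 of them have any text, which would explain why the match looks one token short.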

(Ruud Baars) #5

I can live with that.