The tokenizer needs to be extremely fast, and it probably is. But the same applies to the segmentation routine.
The latter seems to be much more flexible by design, and that flexibility would make some alterations to the tokenizer quite easy, such as stitching together numbers, addresses, URLs, and parts containing apostrophes.
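To illustrate the kind of stitching meant here, a minimal sketch (not LT's actual tokenizer; the regex and function name are hypothetical) of a rule-based tokenizer that keeps URLs, numbers with separators, and apostrophe forms as single tokens instead of splitting them apart:

```python
import re

# Hypothetical sketch: order matters, so the more specific patterns
# (URLs, numbers) are tried before the generic word pattern.
TOKEN_RE = re.compile(r"""
    https?://\S+              # URLs stay in one piece
  | \d+(?:[.,]\d+)*           # numbers like 1,234.56
  | \w+(?:'\w+)*              # words, incl. apostrophe forms like don't
  | \S                        # any other single non-space character
""", re.VERBOSE)

def tokenize(text: str) -> list[str]:
    return TOKEN_RE.findall(text)

print(tokenize("Don't wait, visit http://example.com today. It costs 1,234.56."))
```

A segmentation engine that already supports such rules could presumably express the same thing declaratively, which is the appeal.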
Would using the segmenter as a tokenizer really slow down LT?