Changes in the English tokenizer: contractions

jaumeortola · March 2, 2021, 11:32am

Until now, English contractions like “doesn’t” or “Harper’s” were tokenized this way:

<token>doesn</token>
<token>'</token>
<token>t</token>

<token>Harper</token>
<token>'</token>
<token>s</token>

This tokenization makes no linguistic sense (neither “doesn” nor “t” is a word), and it complicates sentence analysis and rule writing.

Starting today, each contraction is split into tokens that carry their own lemma and POS tag. For example:

does[do/VBZ]n’t[not/RB]
Harper[Harper/NNP,harper/NN]'s['s/POS]
It[it/PRP]'s[be/VBZ] good[good/JJ,good/NN:U]

“Can’t” and “won’t” are special cases: the first token is “ca” or “wo”, but it is still tagged with the lemma “can” or “will”:

ca[can/MD]n’t[not/RB]
wo[will/MD]n’t[not/RB]
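To make the new split concrete, here is a minimal Python sketch. It is not LanguageTool's actual tokenizer; the function name `split_contraction` and the two regular expressions are assumptions made for illustration, and they cover only the “n't” and “'s” cases shown above.

```python
import re

# Illustrative sketch only, not LanguageTool's implementation.
# Both the straight (') and the curly (\u2019) apostrophe are accepted.
APOSTROPHES = "'\u2019"

# "doesn't" -> "does" + "n't"; "can't" -> "ca" + "n't"
_NT = re.compile(rf"^(.+)(n[{APOSTROPHES}]t)$", re.IGNORECASE)
# "Harper's" -> "Harper" + "'s"
_S = re.compile(rf"^(.+)([{APOSTROPHES}]s)$", re.IGNORECASE)


def split_contraction(word):
    """Split a word into [base, clitic]; return [word] if nothing matches."""
    m = _NT.match(word) or _S.match(word)
    return [m.group(1), m.group(2)] if m else [word]


for w in ["doesn't", "can't", "won\u2019t", "Harper's", "good"]:
    print(w, "->", split_contraction(w))
```

The sketch only splits the surface form. Assigning lemmas and POS tags (for example “ca” to can/MD) is a separate tagging step that it does not attempt.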

You can now write patterns like this one, which matches both “doesn’t” and “does not”:

<pattern>
    <token>does</token>
    <token regexp="yes">not|n't</token>
</pattern>

Or:

<pattern>
    <token regexp="yes">it|he|she</token>
    <token regexp="yes">is|'s</token>
</pattern>

Write these rules with straight apostrophes only; they match both straight (typewriter) and curly (typographic) apostrophes in the text.
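One way to picture the apostrophe handling is to normalize curly apostrophes to straight ones before comparing a token with the pattern. The Python sketch below does exactly that. It is an assumption about how such matching could work, not a description of LanguageTool's internals, and the helper names (`normalize_apostrophes`, `token_matches`, `pattern_matches`) are hypothetical.

```python
import re


def normalize_apostrophes(text):
    """Replace the curly apostrophe (\u2019) with the straight one (')."""
    return text.replace("\u2019", "'")


def token_matches(pattern, token, regexp=False):
    """Compare one pattern token, written with straight apostrophes, to a text token."""
    token = normalize_apostrophes(token)
    if regexp:
        return re.fullmatch(pattern, token) is not None
    return pattern == token


def pattern_matches(pattern, tokens):
    """pattern is a list of (text, is_regexp) pairs, one per expected token."""
    if len(pattern) != len(tokens):
        return False
    return all(token_matches(p, t, rx) for (p, rx), t in zip(pattern, tokens))


# Mirrors the first <pattern> above: "does" followed by "not" or "n't".
does_not = [("does", False), ("not|n't", True)]

print(pattern_matches(does_not, ["does", "n't"]))       # straight apostrophe
print(pattern_matches(does_not, ["does", "n\u2019t"]))  # curly apostrophe
print(pattern_matches(does_not, ["does", "not"]))       # uncontracted form
```

All three calls succeed with the same pattern, which is why the rule itself never needs to mention the curly apostrophe.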