Tokenizing

Ruud_Baars · November 1, 2018, 2:28pm

At least for Dutch, sentences are split using SRX rules, so that a sentence is not broken after an abbreviation that ends in a full stop.
But the same is not true for tokenization; 'etc.' is split into 2 tokens.
Of course, this could differ per language, but I assume it is commonly done like this.
But why? The period has no use as a token by itself. And the POS tag of the abbreviation (when applicable) has to be based on the word plus the full stop.

Example:
max  : wrong
max. : correct; maximal, adjective
Max  : correct; proper name
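
To make the example above concrete, here is a minimal sketch of a tag lookup. The `LEXICON` dictionary and the `tag` function are hypothetical names for illustration only, not part of any existing tagger. It only shows that a lookup keyed on the word plus the full stop succeeds where the split tokens fail.

```python
# Hypothetical mini-lexicon mirroring the example above; not real tagger data.
LEXICON = {
    "max.": "adjective",    # abbreviation of "maximal"
    "Max": "proper name",
}


def tag(tokens):
    """Look up each token; anything not in the lexicon is tagged 'unknown'."""
    return [(token, LEXICON.get(token, "unknown")) for token in tokens]


# With the period split off, neither token can be tagged.
print(tag(["max", "."]))
# With the period kept on the abbreviation, the lookup succeeds.
print(tag(["max."]))
```

With the split tokens, both "max" and "." come back as unknown; with the single token "max." the adjective reading is found.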

How do other rule developers deal with this?

arysin · November 1, 2018, 4:08pm

For Ukrainian we went through multiple iterations of tokenization improvements. Usually, changes to support abbreviations with a period require synchronized changes to both the SRX rules and the word tokenizer class.
Abbreviations that do not overlap with non-abbreviated words are usually quite easy to deal with, but if a word can be used both with and without a period, it may require some hairy logic.
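
As a rough sketch of the distinction described above (not the actual Ukrainian implementation), the two cases could be kept in separate word lists. The list contents, the names `ALWAYS_ABBREVIATION`, `AMBIGUOUS` and `merge_periods`, and the rule used for the ambiguous case are all assumptions made for illustration.

```python
# Hypothetical word lists; a real tokenizer would load these per language.
ALWAYS_ABBREVIATION = {"jl", "etc"}  # assumed never to occur without the period
AMBIGUOUS = {"max"}                  # assumed to overlap with other words or names


def _is_abbreviation(word, following):
    if word.lower() in ALWAYS_ABBREVIATION:
        return True
    if word in AMBIGUOUS:
        # Assumed rule: treat it as an abbreviation only when more tokens
        # follow, i.e. the period is not the end of the sentence.
        return bool(following)
    return False


def merge_periods(tokens):
    """Re-attach a '.' token to the preceding word when it marks an abbreviation."""
    merged = []
    i = 0
    while i < len(tokens):
        word = tokens[i]
        has_period = i + 1 < len(tokens) and tokens[i + 1] == "."
        if has_period and _is_abbreviation(word, tokens[i + 2:]):
            merged.append(word + ".")
            i += 2  # skip the period that was just attached
        else:
            merged.append(word)
            i += 1
    return merged


print(merge_periods(["Tot", "max", ".", "5", "stuks", "."]))
print(merge_periods(["Hij", "heet", "Max", "."]))
```

The first call yields "max." as one token and leaves the final period alone; the second leaves "Max" and "." separate. The easy list needs no context at all, while the ambiguous list is where the extra logic ends up.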

Ruud_Baars · November 2, 2018, 6:41am

I know; some are quite simple, like ‘o.a.’ and ‘jl.’, but there are some that completely overlap.
I managed to get it working relatively well in PHP.
I might have a look at the Java code to see if it is just as easily adjusted.
It might be as simple as not tokenizing before a '.', unless it is the last character…
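
A minimal sketch of that heuristic, in Python rather than the PHP or Java code mentioned above, and assuming that "last character" means the last character of the sentence. The regular expression and the `tokenize` function are illustrative, not taken from any existing tokenizer.

```python
import re

# A word (optionally with internal periods, as in "o.a.") plus an optional
# trailing period, or else any single non-space character.
_TOKEN = re.compile(r"\w+(?:\.\w+)*\.?|\S")


def tokenize(sentence):
    """Keep a period attached to the preceding word, except at sentence end."""
    text = sentence.rstrip()
    tokens = _TOKEN.findall(text)
    # A period that is the last character of the sentence becomes its own token.
    if tokens and len(tokens[-1]) > 1 and tokens[-1].endswith("."):
        tokens[-1:] = [tokens[-1][:-1], "."]
    return tokens


print(tokenize("Dit kost max. 5 euro."))
print(tokenize("Zie o.a. pagina 3"))
```

This keeps "max." and "o.a." as single tokens and splits the period off "euro.". The weak spot is a sentence that ends in an abbreviation, such as one ending in "etc.", where the period is still split off.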