Identifying tokens ending with 'th'

nancyntse13 · June 21, 2017, 10:11am

How can we identify tokens ending with ‘th’, like fourth, tenth, etc.?

<token spacebefore="no">th</token>

The above rule does not seem to be working.

Jan_Schreiber · June 21, 2017, 10:50am

<token regexp="yes">\w+th</token>

Should work for most use cases.

nancyntse13 · June 21, 2017, 11:41am

Thanks for the help. What is the ‘\w’ used for?

Jan_Schreiber · June 21, 2017, 3:44pm

Any alphanumeric character (letters, digits, and the underscore). The plus sign means “one or more of the preceding expression”. So \w+th means “a string of letters and numbers of any length (at least one character), followed by th.”
A slightly safer way to express \w would be [a-z0-9äöüßéèáà] if you need accented characters, because they are not included in \w afaik.
You might want to check out this interactive regex tutorial. Quite useful.
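
To see how the two patterns above behave, here is a minimal Python sketch. It is only an approximation of what LanguageTool does: it assumes the token regex has to match the whole token and ignores case, and it uses `re.ASCII` to imitate a `\w` that covers only `[a-zA-Z0-9_]`. The helper name `ends_with_th` and the token `zwölfth` are made up for illustration.

```python
import re

# Assumption: re.ASCII imitates a \w limited to [a-zA-Z0-9_];
# re.IGNORECASE assumes token matching is case-insensitive.
TH_TOKEN = re.compile(r"\w+th", re.ASCII | re.IGNORECASE)

# The explicit character class from the post, followed by "+th".
TH_TOKEN_ACCENTED = re.compile(r"[a-z0-9äöüßéèáà]+th", re.IGNORECASE)


def ends_with_th(token, pattern=TH_TOKEN):
    """Return True if the whole token matches the given pattern."""
    return pattern.fullmatch(token) is not None


# "zwölfth" is a made-up token that only exercises the accented class.
for word in ["fourth", "tenth", "6th", "th", "with", "zwölfth"]:
    print(word, ends_with_th(word), ends_with_th(word, TH_TOKEN_ACCENTED))
```

Note that a bare `th` does not match, because `\w+` needs at least one character before the `th`, while `with` does match.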

nancyntse13 · June 28, 2017, 9:54am

This worked fine, but it raised false alarms on words like ‘with’. Is there any way around this?

Jan_Schreiber · June 28, 2017, 10:15am

<token regexp="yes">\w+th<exception regexp="yes">with|smith|width</exception></token>
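
The effect of the exception can be sketched in Python as "match the token pattern, then reject anything on the exception list". This is a rough illustration under the same assumptions as before (whole-token, case-insensitive matching, `re.ASCII` standing in for an ASCII-only `\w`), not LanguageTool's actual implementation. The function name `flagged` and the sample sentence are hypothetical.

```python
import re

# Assumptions: whole-token, case-insensitive matching; re.ASCII imitates
# a \w limited to [a-zA-Z0-9_].
TH_TOKEN = re.compile(r"\w+th", re.ASCII | re.IGNORECASE)
EXCEPTIONS = re.compile(r"with|smith|width", re.IGNORECASE)


def flagged(token):
    """True if the token ends in 'th' and is not a listed exception."""
    matches_token = TH_TOKEN.fullmatch(token) is not None
    is_exception = EXCEPTIONS.fullmatch(token) is not None
    return matches_token and not is_exception


# Sample sentence invented for this sketch.
tokens = "The fourth smith arrived with the tenth cloth".split()
print([t for t in tokens if flagged(t)])
```

Words such as `cloth` are still flagged, because a word list only covers the exceptions someone thought to include.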

jaumeortola · June 28, 2017, 11:05am

It is not that simple. This is the best I have been able to do:

<rule>
    <pattern>
        <token regexp="yes" postag="JJ">\w+th<exception postag="NN.+|V.*" postag_regexp="yes"/><exception regexp="yes">north|south</exception></token>
    </pattern>
    <message>Did you mean <suggestion>aaa</suggestion>?</message>
    <example correction="aaa"><marker>sixth</marker>.</example>
    <example correction="aaa"><marker>6th</marker>.</example>
    <example>north, south, width, with, smooth</example>
    <example>The North Slope is mostly tundra peppered</example>
</rule>

Probably it’s better to do this:

nancyntse13 · June 28, 2017, 12:28pm

Thank you.