Identifying tokens ending with 'th'

(Praneet Khandelwal) #1

How can we identify tokens ending with 'th', like fourth, tenth, etc.?

<token spacebefore="no">th</token>

The above rule does not seem to be working.

(Jan Schreiber) #2

<token regexp="yes">\w+th</token>

Should work for most use cases.

(Praneet Khandelwal) #3

Thanks for the help. What is the '\w' used for?

(Jan Schreiber) #4

Any alphanumeric character (plus the underscore). The plus sign means "one or more of the preceding expression". So \w+th means "one or more letters, digits, or underscores, followed by th."
A slightly safer way to express \w would be [a-z0-9äöüßéèáà] if you need accented characters, because they are not included in \w afaik.
You might want to check out this interactive regex tutorial. Quite useful.
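
To make the pattern's behavior concrete, here is a minimal sketch in Python. It is only a stand-in: LanguageTool evaluates token patterns with Java's regex engine, and the sketch assumes that a token regexp has to match the whole token, ignores case, and treats `\w` as ASCII-only (hence `re.fullmatch`, `re.IGNORECASE`, and `re.ASCII`). The helper name `ends_with_th` is made up for illustration.

```python
import re

# Assumption: mimic whole-token, case-insensitive matching with an ASCII-only \w.
TH_PATTERN = re.compile(r"\w+th", re.ASCII | re.IGNORECASE)


def ends_with_th(token: str) -> bool:
    """Return True if the token is one or more word characters followed by 'th'."""
    return TH_PATTERN.fullmatch(token) is not None


# 'th' alone fails because \w+ needs at least one character before it;
# 'Élisabeth' fails because 'É' is not an ASCII word character.
for token in ["fourth", "tenth", "6th", "th", "with", "Élisabeth"]:
    print(token, ends_with_th(token))
```

Note that 'with' matches as well, since nothing in the pattern distinguishes ordinals from other words ending in th.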

(Praneet Khandelwal) #5

This worked fine but raised false alarms for words like 'with'. Is there any way around this?

(Jan Schreiber) #6

<token regexp="yes">\w+th<exception regexp="yes">with|smith|width</exception></token>
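
The effect of the exception can be sketched in Python as a two-step check: reject the token if it matches the exception list, otherwise test it against `\w+th`. This is an approximation under stated assumptions (whole-token, case-insensitive matching, ASCII-only `\w`), not LanguageTool's actual implementation, and `is_flagged` is a hypothetical helper name.

```python
import re

# Assumption: both the token regexp and the exception must match the whole token.
TH_PATTERN = re.compile(r"\w+th", re.ASCII | re.IGNORECASE)
EXCEPTIONS = re.compile(r"with|smith|width", re.IGNORECASE)


def is_flagged(token: str) -> bool:
    """Return True if the token ends in 'th' and is not on the exception list."""
    if EXCEPTIONS.fullmatch(token):
        return False
    return TH_PATTERN.fullmatch(token) is not None


tokens = "He came fourth with Smith".split()
print([t for t in tokens if is_flagged(t)])  # prints ['fourth']
```

Words that are not on the list, such as 'north' or 'smooth', are still flagged, so the exception list has to grow with every false alarm found.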

(jaumeortola) #7

It is not that simple. This is the best I have been able to do:

<token regexp="yes" postag="JJ">\w+th<exception postag="NN.+|V.*" postag_regexp="yes"/><exception regexp="yes">north|south</exception></token>
<message>Did you mean <suggestion>aaa</suggestion>?</message>
<example correction="aaa"><marker>sixth</marker>.</example>
<example correction="aaa"><marker>6th</marker>.</example>
<example>north, south, width, with, smooth</example>
<example>The North Slope is mostly tundra peppered</example>

Probably it's better to do this:

<token regexp="yes">\d+th|eighteenth|eighth|eightieth|eleventh|fifteenth|fifth|fiftieth|fortieth|fourteenth|fourth|nineteenth|ninetieth|ninth|hundredth|millionth|thousandth|seventeenth|seventh|seventieth|sixteenth|sixth|sixtieth|tenth|thirteenth|thirtieth|twelfth|twentieth|.*-(eighth|fifth|fourth|ninth|seventh|sixth)</token>
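
As a quick sanity check, the explicit list can be tried against a few tokens in Python. This is a rough sketch rather than a test of LanguageTool itself: it assumes whole-token, case-insensitive matching, it says nothing about how LanguageTool tokenizes hyphenated words, and the names `WORDS`, `ORDINAL`, and `is_ordinal` are invented for the example.

```python
import re

# The spelled-out ordinals from the token above, split over lines for readability.
WORDS = (
    "eighteenth|eighth|eightieth|eleventh|fifteenth|fifth|fiftieth|fortieth|"
    "fourteenth|fourth|nineteenth|ninetieth|ninth|hundredth|millionth|"
    "thousandth|seventeenth|seventh|seventieth|sixteenth|sixth|sixtieth|"
    "tenth|thirteenth|thirtieth|twelfth|twentieth"
)

# Digits + 'th', a spelled-out ordinal, or a hyphenated compound such as 'twenty-fifth'.
ORDINAL = re.compile(
    rf"\d+th|{WORDS}|.*-(eighth|fifth|fourth|ninth|seventh|sixth)",
    re.ASCII | re.IGNORECASE,
)


def is_ordinal(token: str) -> bool:
    """Return True if the whole token matches the explicit ordinal pattern."""
    return ORDINAL.fullmatch(token) is not None


for token in ["sixth", "6th", "twenty-fifth", "with", "north", "smooth", "width"]:
    print(token, is_ordinal(token))
```

Because only known ordinals are listed, words such as 'with', 'north', and 'smooth' no longer need exceptions.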

(Praneet Khandelwal) #8

Thank you.