Declaration of long regexp in Grammar.xml

Ruud_Baars · February 24, 2018, 1:57pm

Since ‘spelled’ abbreviations like cd and dvd are treated in a special way, which has an impact on multiple rules, I would like to declare them in one spot, to be reused in grammar.xml in multiple rules.
Is that possible?
Can I use the same construction as in ‘weekdays’?
I just did, and it worked.

Yakov · February 24, 2018, 6:49pm

Yes, I use the same construction for the Russian rules in grammar.xml.

tiagosantos · February 24, 2018, 7:27pm

From English grammar.xml:

<!DOCTYPE rules [
    <!ENTITY weekdays "Monday|Wednesday|T(ue|hur)sday|Friday|S(atur|un)day">
    <!ENTITY abbrevWeekdays "Mon?|Tue?|Wed?|Thu?|Fri?|Sat?|Sun?">
    <!ENTITY months "January|February|March|April|May|Ju(ne|ly)|August|September|October|November|December">
    <!ENTITY abbrevMonths "Jan|Feb|Mar|Apr|Ju[ln]|Aug|Sept?|Oct|Nov|Dec">
    <!ENTITY languages "Akan|Amharic|Arabic|Assamese|Awadhi|Azerbaijani|Balochi|Bangla|Belarusian|Bengali|Bhojpuri|Burmese|Cantonese|Cebuano|Chewa|Chhattisgarhi|Chittagonian|Czech|Deccan|Dhundhari|Dutch|English|Filipino|French|Fula|Gaelic|German|Greek|Gujarati|Hakka|Haryanvi|Hausa|Hiligaynon|Hindi|Hmong|Hunanese|Hungarian|Igbo|Ilocano|Ilonggo|Indonesian|Italian|Ja[pv]anese|Jin|Kannada|Kazakh|Khmer|Kinyarwanda|Kirundi|Konkani|Korean|Kurdish|Madurese|Magahi|Maithili|Malagasy|Malay(alam)?|Malaysian|Mandarin|Marathi|Marwari|Mossi|Nepali|Odia|Oriya|Oromo|Pashto|Persian|Polish|Portuguese|Punjabi|Quechua|Romanian|Russian|Saraiki|Serbo-Croatian|Shona|Sindhi|Sinhalese|Somali|Spanish|Sundanese|Swedish|Sylheti|Tagalog|Tamil|Telugu|Thai|Turk(ish|men)|Ukrainian|Urdu|Uyghur|Uzbek|Vietnamese|Visayan|Wu|Xhosa|Xiang|Yoruba|Yue|Zhuang|Zulu"><!-- Most are from https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers -->
]>

To use an entity's regular expression in a rule, write &entity_name; where the regexp would normally go.

E.g.:

<rule id="MISSING_COMMA_AFTER_WEEKDAY" name="Missing comma after weekday">
    <pattern>
        <marker>
            <token regexp="yes">&weekdays;</token>
        </marker>
        <token regexp="yes">&months;</token>
        <token regexp="yes">[0123]?[0-9]</token>
    </pattern>
    <message>Commas set off the month in a weekday-month-day date: <suggestion>\1,</suggestion>.</message>
    <url>http://www.thepunctuationguide.com/comma.html#dates</url>
    <short>Missing comma</short>
    <example correction="Friday,">We will meet <marker>Friday</marker> July 15.</example>
    <example>He was born on Friday, August 12, 2016.</example>
</rule>
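
Applied to the original question, a minimal sketch could look like this. The entity name spelledAbbrevs, its word list, and the rule itself are made up for illustration and are not taken from any existing grammar.xml. The entity line goes into the existing DOCTYPE block at the top of the file. Entities are expanded as plain text before the regexp is compiled, so the entity needs parentheses whenever something is appended to it; otherwise the suffix binds only to the last alternative.

```xml
<!-- Inside the existing <!DOCTYPE rules [ ... ]> block (hypothetical entity): -->
<!ENTITY spelledAbbrevs "cd|dvd|tv">

<!-- Hypothetical rule reusing the entity -->
<rule id="SPELLED_ABBREV_LOWERCASE" name="Spelled abbreviation in lowercase">
    <pattern>
        <!-- Expands to cd|dvd|tv; case_sensitive limits the match to lowercase -->
        <token regexp="yes" case_sensitive="yes">&spelledAbbrevs;</token>
    </pattern>
    <message>Spelled abbreviations are usually capitalized: <suggestion><match no="1" case_conversion="allupper"/></suggestion>.</message>
    <short>Capitalization</short>
    <example correction="DVD">We bought a <marker>dvd</marker> yesterday.</example>
    <example>We bought a DVD yesterday.</example>
</rule>

<!-- In another rule: wrap the entity in parentheses before adding a suffix -->
<token regexp="yes">(&spelledAbbrevs;)s</token>
```

Changing the word list in the entity then updates every rule that references it.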

arysin · February 25, 2018, 8:44pm

Long regular expressions may be quite slow. An alternative is to add an abbreviation tag to your POS tag dictionary and then use this postag in the rules. It would be much faster, but you need to update the dictionary.
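
As a rough sketch of what that could look like in a rule: the tag name ABBR_SPELLED is hypothetical, and the fragment assumes the tagger dictionary has already been updated to assign that tag to words such as cd and dvd.

```xml
<pattern>
    <!-- Hypothetical tag: matches any word the dictionary marks as ABBR_SPELLED -->
    <token postag="ABBR_SPELLED"/>
</pattern>
```

The token is then matched by a tag lookup instead of by evaluating a long alternation for every word.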

Ruud_Baars · February 26, 2018, 6:41am

It is an option, but it misuses the postag, since this is not a part-of-speech tag. It would be a kind of word flag, an option I discussed with Daniel, but so far I am the only one wishing for it.