
Declaration of long regexp in Grammar.xml


(Ruud Baars) #1

Since ‘spelled’ abbreviations like cd and dvd are treated in a special way that affects multiple rules, I would like to declare them in one place so that several rules in grammar.xml can reuse them.
Is that possible?
Can I use the same construction as in ‘weekdays’?
I just did, and it worked.


(Yakov) #2

Yes, I use the same construction for the Russian rules in grammar.xml.


(Tiago F. Santos) #3

From English grammar.xml:

<!DOCTYPE rules [
    <!ENTITY weekdays "Monday|Wednesday|T(ue|hur)sday|Friday|S(atur|un)day">
    <!ENTITY abbrevWeekdays "Mon?|Tue?|Wed?|Thu?|Fri?|Sat?|Sun?">
    <!ENTITY months "January|February|March|April|May|Ju(ne|ly)|August|September|October|November|December">
    <!ENTITY abbrevMonths "Jan|Feb|Mar|Apr|Ju[ln]|Aug|Sept?|Oct|Nov|Dec">
    <!ENTITY languages "Akan|Amharic|Arabic|Assamese|Awadhi|Azerbaijani|Balochi|Bangla|Belarusian|Bengali|Bhojpuri|Burmese|Cantonese|Cebuano|Chewa|Chhattisgarhi|Chittagonian|Czech|Deccan|Dhundhari|Dutch|English|Filipino|French|Fula|Gaelic|German|Greek|Gujarati|Hakka|Haryanvi|Hausa|Hiligaynon|Hindi|Hmong|Hunanese|Hungarian|Igbo|Ilocano|Ilonggo|Indonesian|Italian|Ja[pv]anese|Jin|Kannada|Kazakh|Khmer|Kinyarwanda|Kirundi|Konkani|Korean|Kurdish|Madurese|Magahi|Maithili|Malagasy|Malay(alam)?|Malaysian|Mandarin|Marathi|Marwari|Mossi|Nepali|Odia|Oriya|Oromo|Pashto|Persian|Polish|Portuguese|Punjabi|Quechua|Romanian|Russian|Saraiki|Serbo-Croatian|Shona|Sindhi|Sinhalese|Somali|Spanish|Sundanese|Swedish|Sylheti|Tagalog|Tamil|Telugu|Thai|Turk(ish|men)|Ukrainian|Urdu|Uyghur|Uzbek|Vietnamese|Visayan|Wu|Xhosa|Xiang|Yoruba|Yue|Zhuang|Zulu"><!-- Most are from https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers -->
]>

To use an entity's regular expression in a rule, reference it as &entity_name; inside a token that has regexp="yes".

E.g.:

<rule id="MISSING_COMMA_AFTER_WEEKDAY" name="Missing comma after weekday">
    <pattern>
        <marker>
            <token regexp="yes">&weekdays;</token>
        </marker>
        <token regexp="yes">&months;</token>
        <token regexp="yes">[0123]?[0-9]</token>
    </pattern>
    <message>Commas set off the month in a weekday-month-day date: <suggestion>\1,</suggestion>.</message>
    <url>http://www.thepunctuationguide.com/comma.html#dates</url>
    <short>Missing comma</short>
    <example correction="Friday,">We will meet <marker>Friday</marker> July 15.</example>
    <example>He was born on Friday, August 12, 2016.</example>
</rule>
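
Applied to the original question, the same construction could look roughly like the sketch below. The entity name spelledAbbrevs, its word list, the category, and both rule ids are made up for illustration and are not taken from any real grammar.xml. Note the parentheses in the second rule: the entity expands to a bare alternation, so it should be wrapped before anything is appended to it.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE rules [
    <!-- Hypothetical entity: name and word list are illustrative only -->
    <!ENTITY spelledAbbrevs "cd|dvd">
]>
<rules lang="en">
    <category id="ENTITY_EXAMPLES" name="Entity examples (hypothetical)">
        <!-- First rule reusing the entity as a whole token -->
        <rule id="AN_BEFORE_SPELLED_ABBREVIATION" name="'an' before a spelled abbreviation">
            <pattern>
                <marker>
                    <token>an</token>
                </marker>
                <token regexp="yes">&spelledAbbrevs;</token>
            </pattern>
            <message>Use <suggestion>a</suggestion> before this abbreviation.</message>
            <short>Wrong article</short>
            <example correction="a">I bought <marker>an</marker> dvd.</example>
            <example>I bought a dvd.</example>
        </rule>
        <!-- Second rule reusing the same entity; the parentheses keep the
             trailing "s" from binding only to the last alternative -->
        <rule id="MUCH_BEFORE_PLURAL_ABBREVIATION" name="'much' before a plural abbreviation">
            <pattern>
                <marker>
                    <token>much</token>
                </marker>
                <token regexp="yes">(&spelledAbbrevs;)s</token>
            </pattern>
            <message>Use <suggestion>many</suggestion> with a countable plural.</message>
            <short>Wrong quantifier</short>
            <example correction="many">There are too <marker>much</marker> dvds here.</example>
            <example>There are too many dvds here.</example>
        </rule>
    </category>
</rules>
```

With this layout, adding another abbreviation means editing only the ENTITY line, and every rule that references &spelledAbbrevs; picks up the change.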

(Andriy) #4

Long regular expressions may be quite slow. An alternative is to add an abbreviation tag to your POS tag dictionary and then use this postag in the rules. That would be much faster, but you need to update the dictionary.
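
A rule using that approach might look like the sketch below. The tag name ABBR_SPELLED and the rule id are hypothetical, and the sketch assumes the tagger dictionary has already been extended so that words like cd and dvd carry that tag.

```xml
<!-- Hypothetical: ABBR_SPELLED is not an existing tag; it would have to be
     added to the POS tag dictionary before this rule could match anything -->
<rule id="AN_BEFORE_TAGGED_ABBREVIATION" name="'an' before a tagged abbreviation">
    <pattern>
        <marker>
            <token>an</token>
        </marker>
        <!-- Matches on the tag instead of a long regular expression -->
        <token postag="ABBR_SPELLED"/>
    </pattern>
    <message>Use <suggestion>a</suggestion> before this abbreviation.</message>
    <short>Wrong article</short>
    <example correction="a">I bought <marker>an</marker> dvd.</example>
    <example>I bought a dvd.</example>
</rule>
```

The trade-off is that the word list then lives in the dictionary rather than in grammar.xml, so changing it means rebuilding the dictionary.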


(Ruud Baars) #5

It is an option, but it is a misuse of the postag, since it is not really a postag. It would be a kind of word flag, an option I discussed with Daniel, but so far I am the only one wishing for it.