Case sensitive postag check

(Ruud Baars) #1

Can I do this?
For example, after a comma there is no need for an uppercase letter, except for words that only exist capitalized, like proper names and their derivatives.

For this rule, I need ‘, French’ not to be matched, but ‘, Good’ should be, as well as ‘, The’, but not ‘, the’.

Any ideas?

(Mike Unwalla) #2

Try this basic rule.

<rule id="UPPERCASE_AFER_COMMA" name="Uppercase after comma">
    <pattern>
        <token>,</token>
        <marker>
            <token spacebefore="yes" case_sensitive="yes" regexp="yes">[A-Z].*<exception postag_regexp="yes" postag="NNPS?"/></token>
        </marker>
    </pattern>
    <message>Use initial uppercase only for proper nouns. Do you mean <suggestion><match no="2" case_conversion="alllower"/></suggestion>?</message>
    <example correction="tea">I like coffee, <marker>Tea</marker>, and gin.</example>
    <example>I like coffee, <marker>tea</marker>, and gin.</example>
    <example>I speak English, <marker>French</marker>, and Russian.</example>
</rule>

You will need to write another rule to deal with text that is all capitalized or has initial capitals.

(Ruud Baars) #3

Yes, that is roughly what I have. But the number of false positives is much too large, since words like ‘Spanish, German, European, Hendrikx’ should be excluded. I can exclude proper names by postag, but not the adjectives.

An alternative is to tag them by adding an extra property in the ‘postag’, but that requires changing all rules.

In fact, I think postag retrieval should work the same way as spellcheck: look up the word exactly as written first; if it is not found, then check the version with the first letter lowercased. I will open a GitHub issue for that.
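A rough sketch of that lookup order, in Python for illustration only. The tiny dictionary `POS_DICT` and the function names `lookup_tags` and `needless_capital` are made up for this example and are not part of LanguageTool's tagger API; the tags shown are assumptions as well.

```python
# Hypothetical case-sensitive POS dictionary; entries and tags are illustrative only.
POS_DICT = {
    "French": ["NNP", "JJ"],
    "good": ["JJ"],
    "the": ["DT"],
    "tea": ["NN"],
}


def lookup_tags(word):
    """Return (tags, matched_form): try the exact form, then the first-lowered form."""
    if word in POS_DICT:
        return POS_DICT[word], word
    lowered = word[:1].lower() + word[1:]
    if lowered != word and lowered in POS_DICT:
        return POS_DICT[lowered], lowered
    return [], None


def needless_capital(word):
    """True if the word is capitalized but only known in its lowercase form."""
    tags, matched = lookup_tags(word)
    return bool(tags) and word[:1].isupper() and matched != word


for w in ["French", "Good", "The", "the"]:
    print(w, needless_capital(w))
```

With these toy entries, ‘French’ and ‘the’ are left alone, while ‘Good’ and ‘The’ are flagged, which is the behavior asked for in the first post. Unknown words such as ‘Hendrikx’ are not flagged, because nothing is found under either form.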

(Tiago F. Santos) #4

@Mike_Unwalla This rule seems useful. Have you already added it to the English version?

(Mike Unwalla) #5

@tiago, no, I haven’t added it. (I never thought to add it!) I’ve done no testing on it. It needs more work to prevent false positives on text that contains a series of initial capitals. Feel free to modify, test, and add to LT.

(Lodewijk Arie van Brienen) #6

Question: What about acronym-elaboration?
E.g.: the ‘STER’, STichting Ether Reclame, is responsible for the broadcasting of commercials on the Dutch Public Broadcasting Channels.

(Tiago F. Santos) #7

@Mike_Unwalla I understand. Yesterday I added a little-tested rule, which produced many false positives. It is already fixed, but today's regression output will be long. I will add that rule tomorrow and try to fix it accordingly, but I guess that it will face the same issues that MISSING_GENITIVE had. Many words are not marked as proper names, and it would be impossible to cover all situations.
Depending on the number of false positives, we’ll see if it can be kept as default='on'.

(Mike Unwalla) #8

That capitalization is not standard English. The term is a proper noun and should have postag NNP (but refer to the comment by @tiagosantos – it’s impossible to know all the proper nouns.)

(Ruud Baars) #9

What the rule really needs is a way to determine if a capital is required or not. But unlike spellcheck, postagging is case-unaware. I added an issue on GitHub to make it work as spellchecking does (more or less).

Another option is to add a ‘needs upper case’ flag to the words that require it, but that would mean adjusting all rules. There is also an item on GitHub to add ‘flags’ to words, much like postags, but in a separate attribute rather than in the postag field. The problem is that there is no Java programmer for Dutch.

(Tiago F. Santos) #10

@Ruud_Baars POS tagging is case-aware in most languages I have looked into. Portuguese and English had case-aware rules added yesterday, and although they are still being worked on, you can see the results in the regression tests.
And looking at languagetool/languagetool-language-modules/nl/src/main/java/org/languagetool/tagging/nl/, it seems to be using the vanilla tagger, i.e. it is case-aware. Probably it is only the POS dictionary that is missing the capitalized proper names, or you need to disambiguate. Look at git log for examples.

(Ruud Baars) #11

I will have to check it out a bit more. One of the things that will help a bit is making an exception for “UNKNOWN”. The problem remains that it is not possible to detect whether a word is valid only when capitalized. Those are the words one will want to discard.
But making

<token postag="UNKNOWN"><match no="2" case_conversion="startlower"/></token>

is not allowed, neither is

<antipattern><token>,</token><token postag="UNKNOWN" case_conversion="startlower"/></antipattern>

Maybe a property ‘case’ with the values “asindict, firstuppered, fulluppered” would be a possible addition to the available options?
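To make the three proposed values concrete, here is a minimal Python sketch of how such a classification could work. This is only an illustration of the idea: no such property exists in LanguageTool, and the set `DICT_FORMS` and the function `case_form` are hypothetical names with made-up entries.

```python
# Hypothetical set of spellings exactly as stored in the dictionary.
DICT_FORMS = {"the", "good", "French", "Holland"}


def case_form(word):
    """Classify a word as 'asindict', 'firstuppered', 'fulluppered', or None."""
    # Written exactly as the dictionary has it.
    if word in DICT_FORMS:
        return "asindict"
    # Dictionary has the word only with a lowercase first letter.
    first_lowered = word[:1].lower() + word[1:]
    if word[:1].isupper() and first_lowered in DICT_FORMS:
        return "firstuppered"
    # Word is in all capitals and matches some entry when that entry is uppercased.
    if word.isupper():
        for entry in DICT_FORMS:
            if entry.upper() == word:
                return "fulluppered"
    return None
```

Under these assumptions, a rule could skip tokens classified as ‘asindict’ (such as ‘French’) and only report those classified as ‘firstuppered’ (such as ‘The’).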

(Tiago F. Santos) #12

Try this for ideas:

(Ruud Baars) #13

Brute-force elimination of all those capitalized words would be madness for Dutch. There are just too many (66,000!), since we capitalize lots of words that are related to a proper name even when they are not proper names themselves. Holland is the informal proper name of our country; Hollands, Hollandse, Hollandser (etc.) are adjective forms, and Hollander(s) are the inhabitants. And this applies to every place, country, continent, and province.