What would be a good strategy for detecting these? In essence (from the 2nd link), nominalizations contain up to three elements. Sometimes you see only one or two of them; other times, all three appear.
A word such as a, an, the, his, her, these, or several.
A noun such as utilization, sadness or taking. This is the only element that always appears in nominalizations.
The word of.
Another way to detect nominalizations is to check for nouns that end in -tion or -ing.
Is the solution to check only for -tion and -ing, and then add antipatterns and exclusions over time to weed out false positives? There seems to be some work on this in the German variant, but unfortunately I could not follow the conversation.
I’m not sure what feature exactly you refer to when you mention German in this context. Anyway, I guess it makes sense to have a list of nouns where a nominalization is typically replaceable. If you search for all nouns then you’d need to add an almost unlimited number of antipatterns.
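To make the trade-off concrete, here is a small Python sketch (plain Python, not LanguageTool code) that compares a bare suffix check with a curated word list. The word list, the helper names, and the sample sentence are all made up for illustration.

```python
import re

# Bare suffix heuristic: any word ending in -tion(s) or -ing.
SUFFIX_RE = re.compile(r"\b[a-z]+(?:tions?|ing)\b", re.IGNORECASE)

# Hypothetical curated list: nominalization -> verb that could replace it.
REPLACEABLE = {
    "utilization": "utilize",
    "evaluation": "evaluate",
    "implementation": "implement",
}

def suffix_hits(text):
    """Return every word that the suffix heuristic flags."""
    return [m.group(0).lower() for m in SUFFIX_RE.finditer(text)]

def list_hits(text):
    """Return (noun, suggested verb) pairs for words found in the curated list."""
    words = re.findall(r"[a-z]+", text.lower())
    return [(w, REPLACEABLE[w]) for w in words if w in REPLACEABLE]

sample = "The king asked the nation for an evaluation of the station during the morning."
print(suffix_hits(sample))  # flags king, nation, station, during, morning as well
print(list_hits(sample))    # flags only evaluation
```

In this sample the suffix check flags six words, and five of them would each need an antipattern, whereas the list flags only the one word that has an obvious verb replacement.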
Tat wrote: Another way to detect nominalizations is to check for nouns that end in -tion or -ing.
Daniel wrote: If you search for all nouns then you’d need to add an almost unlimited number of antipatterns.
You can partly solve the problem that Daniel mentioned by finding verb+tion, verb+ity, verb+able and so on, rather than all nouns. Use EnglishPartialPosTagFilter. Possibly, you could do more complex processing to find nouns in which the base form of the verb is changed, for example, quantify > quantifiable.
Use something like the following. I have not tested the code. I copied/pasted and simplified from a rule that I use:
<pattern>
<token regexp="yes">([a-z]+)(?:ability|abilities|able|ably|ation|ations|ible|ibly|ment|ments)<exception postag="NNP"/></token>
</pattern>
<filter class="org.languagetool.rules.en.EnglishPartialPosTagFilter"
args="no:1 regexp:([a-zA-Z]+)(?:ability|abilities|able|ably|ation|ations|ible|ibly|ment|ments)\b postag_regexp:VB"/>
<message>As an alternative to the noun '<match no="1"/>', use the verb '<match no="1" regexp_match="([a-zA-Z]+)(?:ability|abilities|able|ably|ation|ations|ible|ibly|ment|ments)\b" regexp_replace="$1" case_conversion="allupper"/>'.</message>
Thank you so much for the example. I was not able to get it to work. Below is what I tried and was hoping you could look it over.
<rule default="off" id="NOMINALIZATION" name="Nominalization">
<pattern>
<token regexp="yes">
([a-zA-Z]+)(?:ability|abilities|able|ably|ation|ations|ible|ibly|ment|ments)
<exception postag="NNP"/>
</token>
</pattern>
<!--
<filter
class="org.languagetool.rules.en.EnglishPartialPosTagFilter"
args="no:1 regexp:([a-zA-Z]+)(?:ability|abilities|able|ably|ation|ations|ible|ibly|ment|ments)\b postag_regexp:VB."
/>
-->
<message>
This is a nominalization.
</message>
</rule>
First I took out the message and the filter. The idea was to understand what you are saying.
This works as expected as the word ‘evaluation’ is [evaluation/NN:UN,E-NP-singular].
Now, if I understand Class PartialPosTagFilter, it needs:
no: token position
the regexp in question
and the postag_regexp
So the idea is to filter the matches from the original pattern so that only tokens whose captured part has the required tag are shown. Note that I changed the postag_regexp you had from VB to VB., but I don’t think that makes a huge difference. This did not work.
How do I tell why it did not work? That is, how do I see the partial POS that the filter is evaluating? I think it did not work because nothing tells the rule to look at the inflected version (going by dictionary.dump). I tried adding inflected="yes" to the token tag, but no luck.
The other thing I don’t quite understand is the argument to the EnglishPartialPosTagFilter. How does no:1 split out the relevant portion of the regexp? That is, given that the documentation says the partial POS tagger looks at the first (.*) group, why bother telling it the position of the token, unless it is for patterns with multiple tokens?
regexp: the regular expression to specify the part of the token to be considered. For example, (?:in|un)(.*) will consider the part of the token that comes after 'in' or 'un'. Note that always the first group is considered, so if you need more parenthesis you need to use non-capturing groups (?:...), as in the example.
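To see which part of a token the first group captures, the regexp can be tried outside LanguageTool. The sketch below uses Python's `re` module with the same expression. It assumes the filter matches the whole token, which is why it uses `fullmatch`; that is an assumption, not something confirmed in this thread. The function name `first_group` is made up.

```python
import re

# Same expression as the filter's regexp argument in the rules in this thread.
FILTER_RE = re.compile(
    r"([a-zA-Z]+)(?:ability|abilities|able|ably|ation|ations|ible|ibly|ment|ments)\b"
)

def first_group(token):
    """Return the part captured by group 1, or None if the token does not match."""
    m = FILTER_RE.fullmatch(token)  # assumption: the filter matches the whole token
    return m.group(1) if m else None

for token in ["testability", "evaluation", "rations", "casement", "CaseMent", "COUNTABLE"]:
    print(token, "->", first_group(token))
```

Under that assumption, 'testability' yields 'test' and 'casement' yields 'case', but 'evaluation' yields 'evalu' and 'rations' yields 'r', neither of which is a verb. 'CaseMent' and 'COUNTABLE' do not match at all, because the suffixes are lowercase only.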
So I tried the rule below, but this did not work either. I’m pretty sure it is just a matter of setting up the regexp correctly, but I am not having a lot of luck.
<rule default="off" id="NOMINALIZATION" name="Nominalization">
<pattern>
<token regexp="yes">([a-z]+)(?:ability|abilities|able|ably|ation|ations|ible|ibly|ment|ments)<exception postag="NNP"/></token>
</pattern>
<filter class="org.languagetool.rules.en.EnglishPartialPosTagFilter"
args="no:1 regexp:([a-zA-Z]+)(?:ability|abilities|able|ably|ation|ations|ible|ibly|ment|ments)\b postag_regexp:VB"/>
<message>As an alternative to the noun '<match no="1"/>', use the verb '<match no="1" regexp_match="([a-zA-Z]+)(?:ability|abilities|able|ably|ation|ations|ible|ibly|ment|ments)\b" regexp_replace="$1" case_conversion="allupper"/>'.</message>
<example correction="">The <marker>testability</marker> of this rule is important.</example>
<example correction="">The <marker>testabilities</marker> of rules are important.</example>
<example correction="">This rule finds nonsense words: If the <marker>evaluateability</marker> of rules is important...</example>
<example correction=""><marker>Countable</marker> nouns are ...</example>
<example correction=""><marker>Findabilities</marker> of the words are important.</example>
<example correction=""><marker>Countabilities</marker> of the numbers are important.</example>
<!-- <example correction=""><marker>Testabilities</marker> of rules are important.</example>-->
<example correction=""><marker>Countabilities</marker> of the numbers are important.</example>
<example correction="">If the noun is <marker>countable</marker>, stop the test.</example>
<example correction="">... but if the <marker>countments</marker> are not correct...</example>
<example correction="">A '<marker>casement</marker>' is a type of window.</example>
<example correction="">Use the <marker>CASEments</marker> software to ...</example>
<!-- <example correction="">Use the <marker>CaSements</marker> software to ...</example>-->
<example>Make sure that the <marker>rations</marker> are sufficient.</example>
<example>This is an <marker>evaluation</marker>. (Verb base form is 'evaluate', not 'evaluat', thus no match.)</example>
<example>Dr. <marker>Countable</marker> is friendly.</example>
<example>The uppercase word <marker>COUNTABLE</marker> is out of scope of this rule. (Uppercase ABLE does not match the filter.)</example>
<example>Camel-case word <marker>CaseMent</marker> is out of scope of this rule. (Uppercase M does not match the filter.)</example>
<example>The <marker>correctsability</marker> of errors is important. ('Corrects' is not the base form. But, with postag_regexp:VB. the rule finds 'correctsability'.)</example>
</rule>
I think that the examples and comments answer your questions.
Your comment 6: yes, the ‘no:1’ is the token position.
The rule finds nonsense words. For the STE checker, that is fine. For standard English (and VOA English), you probably do not want that. So, include a postag on the token to make sure that the rule finds only standard nouns.
Two examples are in comments. I expect the rule to find the marked text, but it does not (and testrules gives an error message). The rule can give unexpected (but correct) results (refer to https://sourceforge.net/p/languagetool/mailman/message/34818821/). But, I cannot see why the rule does not find the words ‘Testabilities’ and ‘CaSements’.
With the filter, it does not pick up evaluation at all. How can I tell if the filter is playing nicely? I have no way of seeing what the filter is doing so it is trial and error mostly.
With the filter, the rule does not find ‘evaluation’, because ‘evalu’ does not have postag VB.
The rule is only a partial solution to your problem. Refer to my first reply. If you remove the filter and find all nouns that end with (ability|abilities|able…), you will get false positives, as the ‘ration’ example shows.
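The filter's decision can be approximated offline with a toy model. The sketch below is plain Python, not the LanguageTool implementation. The mini-lexicon `TOY_TAGS` is a hypothetical stand-in for the real POS tagger, and it assumes that postag_regexp must match the whole tag. That second assumption is inferred from the 'correctsability' example in the rule.

```python
import re

FILTER_RE = re.compile(
    r"([a-zA-Z]+)(?:ability|abilities|able|ably|ation|ations|ible|ibly|ment|ments)\b"
)

# Hypothetical mini-lexicon standing in for the real POS tagger.
TOY_TAGS = {
    "test": {"VB", "NN"},
    "count": {"VB", "NN"},
    "case": {"VB", "NN"},
    "corrects": {"VBZ"},
    "evaluate": {"VB"},
}

def filter_accepts(token, postag_regexp):
    """Toy model: keep the match only if group 1 has a tag matching postag_regexp."""
    m = FILTER_RE.fullmatch(token)
    if not m:
        return False
    tags = TOY_TAGS.get(m.group(1), set())
    # Assumption: the tag must match the expression as a whole.
    return any(re.fullmatch(postag_regexp, tag) for tag in tags)

for token in ["testability", "evaluation", "rations", "correctsability"]:
    print(token, filter_accepts(token, "VB"), filter_accepts(token, "VB."))
```

In this model, 'evaluation' and 'rations' are rejected with either setting, because 'evalu' and 'r' have no verb tag. With the full-match assumption, VB. accepts 'correctsability' (VBZ) but no longer accepts 'testability', because the bare tag VB is only two characters long.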
I don’t know how to ‘look’ inside the filter to know what it finds or does not find. @danielnaber, is it possible to see the analysis of the filter? If no, is it technically possible to add a debug feature to the filter (or to testrules) that shows how the filter analyses a token?
@dnaber - I much appreciate your input. I need to start playing with the Java code base to get a better feel for this going forward.
@Mike_Unwalla - it should be fine. I’m using it for a personal project, so it meets the use case of flagging the user as opposed to flagging with examples and alternatives. The user in this case is just me (I’m using it to help out on a blogging project), to assist with my writing.