[Solved] How to indicate JJ minus NN?

Kumara · November 23, 2016, 7:58am

I want to create a rule that detects the misspelling of “quite” as “quiet”. To do that, I want the following token to match adjectives that aren’t also nouns or prepositions. This doesn’t seem to work:

 <token postag='JJ'><exception postag_regexp="yes" postag='NN|RP'/></token>

“Winter” and “animal” still get included. Am I missing something?

Here’s the whole rule (in case you want to know):

<!-- English rule, 2016-11-23 -->
    <rule id="MISSPELLING_QUIET_QUITE" name="Misspelling: quiet (quite)">
     <pattern>
      <marker><token>quiet</token></marker>
      <token postag="JJ"><exception postag_regexp="yes" postag="NN|RP"/></token>
     </pattern>
     <message>Do you mean <suggestion>quite</suggestion>?</message>
     <short>Possible misspelling</short>
     <example correction='quite'>It has become <marker>quiet</marker> troublesome.</example>
     <example>It has become quite troublesome.</example>
     <example>It has become quiet.</example>
    </rule>

Kumara · November 23, 2016, 8:02am

For now, I’m settling for
<token postag='JJ'><exception postag="RP"/><exception regexp="yes">today|all|animal|winter|now</exception></token>

It’s not very elegant, and it will still trigger false positives for words that aren’t among my test sentences.

dnaber · November 23, 2016, 8:13am

You can use the Text Analysis page on LanguageTool; it will show that “winter” and “animal” are tagged as NN:UN.
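
A minimal sketch of why the `NN|RP` exception does not fire for a token tagged NN:UN. It assumes the POS-tag regex has to match the whole tag, and that an exception excludes the token if any of its readings matches. The function `token_matches` and the list of readings are hypothetical, and the sketch ignores disambiguation entirely:

```python
import re

def token_matches(readings, postag, exception_regex):
    """True if some reading equals `postag` and no reading fully matches the exception."""
    has_postag = postag in readings
    # Assumption: the exception regex must match the entire tag, not just a prefix.
    excluded = any(re.fullmatch(exception_regex, tag) for tag in readings)
    return has_postag and not excluded

# Hypothetical readings for "animal"; NN:UN is the tag reported by Text Analysis.
animal = ["JJ", "NN:UN"]

print(token_matches(animal, "JJ", r"NN|RP"))    # True: "NN" alone does not match "NN:UN"
print(token_matches(animal, "JJ", r"NN.*|RP"))  # False: "NN.*" covers "NN:UN"
```

Under these assumptions, an exception written as `NN.*` would also cover subtypes such as NN:UN, while plain `NN` would not.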

dnaber · November 23, 2016, 8:23am

Feel free to write those rules, of course, but please be aware that we already have an ngram-based rule for that:

github.com

languagetool-org/languagetool/blob/master/languagetool-language-modules/en/src/main/resources/org/languagetool/resource/en/confusion_sets.txt#L558


      
          won -> own; 1000000000;                              # p=0.999, r=0.933, f0.5=0.986, s=0.999, 1986+1981, 3grams, 2022-06-23, fp=1, fn=132, tp=1849, tn=1985, {null=2002}, {null=2002}
          #paced;paste;1000                                     # p=1.000, r=0.953, 40+194, 3grams, 2015-12-10
          packed;pact;1000                                     # p=1.000, r=0.878, 249+228, 3grams, 2015-12-10
          pail;pale;10000000                                   # p=0.991, r=0.265, 749+1000, 3grams, 2016-02-27
          pails;pales;100000                                   # p=0.991, r=0.468, 225+473, 3grams, 2016-02-27
          pairs;pares;10000000                                 # p=0.997, r=0.405, 998+669, 3grams, 2016-02-27
          palate; palette; 100000;                             # p=1.000, r=0.395, 34+95, 3grams, 2016-10-25
          pare;pear;10000000                                   # p=0.991, r=0.216, 1000+1000, 3grams, 2016-02-27
          pares;pears;10000000                                 # p=0.996, r=0.325, 669+999, 3grams, 2016-02-27
          passed; past; 100000                                 # p=0.999, r=0.858, 1938, 3grams, 2015-08-12
          patients -> patience; 10000;                         # p=1.000, r=0.352, f0.5=0.731, s=1.000, 1873+1343, 3grams, 2022-06-23, fp=0, fn=870, tp=473, tn=1873, {null=2002}, {null=2002}
          patience -> patients; 1000000;                       # p=0.993, r=0.760, f0.5=0.936, s=0.993, 1343+1873, 3grams, 2022-06-23, fp=10, fn=450, tp=1423, tn=1333, {null=2002}, {null=2002}
          peace; piece; 1000000                                # p=0.998, r=0.634, 1966, 3grams, 2015-08-12
          peaked;peeked;10000000                               # p=0.997, r=0.522, 1000+137, 3grams, 2016-02-27
          peak; peek; 100000                                   # p=0.999, r=0.797, 999, 3grams, 2015-08-12
          pealed;peeled;10000                                  # p=0.991, r=0.753, 39+1000, 3grams, 2016-02-27
          peal;peel;10                                         # p=1.000, r=0.619, 3+123, 3grams, 2015-12-10
          peals; peels; 10000000;                              # p=0.996, r=0.322, 157+639, 3grams, 2016-11-10
          pea;pee;100000                                       # p=1.000, r=0.167, 1000+1000, 3grams, 2016-02-27
          peeked;piqued;10000000                               # p=0.995, r=0.486, 137+700, 3grams, 2016-02-27
          peek;pique;10000000                                  # p=0.993, r=0.209, 1000+926, 3grams, 2016-03-08

r=0.808 in that line means that we can detect about 80% of wrongly used quite/quiet pairs, and the ngram “rule” works in both directions. Technical details are documented in “Finding errors using Big Data” on the LanguageTool Wiki.
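
As a rough illustration of what the p=, r= and f0.5= values in those comments mean, here is a minimal sketch that recomputes them from the tp/fp/fn counts of the `patients -> patience` line in the excerpt above. It uses the standard definitions of precision, recall and F-beta; the function names are hypothetical, and this is not necessarily how LanguageTool's own evaluation code is written:

```python
def precision(tp, fp):
    """Share of flagged cases that were real errors."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Share of real errors that were flagged."""
    return tp / (tp + fn)

def f_beta(p, r, beta=0.5):
    """F-beta score; beta < 1 weights precision more heavily than recall."""
    b2 = beta * beta
    return (1 + b2) * p * r / (b2 * p + r)

# Counts copied from the "patients -> patience" line: fp=0, fn=870, tp=473
p = precision(473, 0)
r = recall(473, 870)

print(round(p, 3))             # 1.0
print(round(r, 3))             # 0.352
print(round(f_beta(p, r), 3))  # 0.731
```

Read this way, r=0.808 would mean that roughly 81 out of every 100 wrongly used words in the evaluation data were caught.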

Kumara · November 23, 2016, 11:02pm

Oh, now I get it. I should specify NN.* instead. I now have

 <token postag="JJ"><exception postag_regexp="yes" postag="NN.*|RP|DT|PDT|VB.*"/></token>

That took away a lot of false positives, but “winter” remains!

I understand this is probably due to the disambiguation rules (which I don’t understand). Still I want to make this work. So what do I do? Use chunks instead?

Kumara · November 23, 2016, 11:05pm

I believe that “ngram-based rule” means using statistics. That’s not good enough for me. It doesn’t detect “quiet problematic”. (Somehow LT regards “problematic” as a noun.)

Kumara · November 23, 2016, 11:13pm

Nope. “Attribute ‘chunk’ is not allowed to appear in element ‘exception’.”

dnaber · November 24, 2016, 8:17am

“This attitude is quiet problematic.” is detected on languagetool.org. Whether something is considered a noun doesn’t matter for the ngram-based approach.

The Text Analysis page will show you a “Disambiguator log”, so you can get the ID of the disambiguation rule that causes this issue. You’ll then need to see whether you can improve that rule in disambiguation.xml. The way it works is documented in the wiki.

Kumara · November 24, 2016, 8:34am

Aha! That’s another one that gets flagged there (and in languagetool.jar too), but not in my LibreOffice. Is it a bug in the OXT?

dnaber · November 24, 2016, 8:54am

Do these errors get detected in your LibreOffice?

    I can't remember how to go their.
    I didn't now where it came from.
    Alabama has for of the world's largest stadiums.

If not, you might not have configured the ngram data directory in the LT settings inside LibreOffice.

Kumara · November 24, 2016, 9:12am

Right. And to do that I need to follow this?
http://wiki.languagetool.org/finding-errors-using-n-gram-data

dnaber · November 24, 2016, 9:26am

Yes, exactly.

Kumara · November 24, 2016, 10:04am

Not an option for me. Don’t have the luxury of an SSD.
Thanks, anyway.

Mike_Unwalla · November 24, 2016, 3:54pm

@Kumara, an SSD is not necessary. I use the n-gram data, and I don’t have an SSD.

Kumara · November 25, 2016, 2:08am

I’m sure it’s possible. I just don’t want a slower computer.