Common vs uncommon

Most of the rules we write catch erroneous, more or less uncommon sequences of words and tags.

I would like to identify all common, correct word sequences instead. My idea for doing so:

  • collect n-grams (n = 2 up to at least 5)
  • process the n-grams, replacing each word with its POS tag where the word has only one possible POS tag; otherwise keep the word
  • count how often each resulting pattern occurs (the most frequent ones are probably the most reliable)
  • create complete rules from the lowest-scoring part of the list
  • test these rules on a corpus.
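
The first three steps above could be sketched roughly as follows. This is only an illustration: the lexicon, the tag names (DET, NOUN, VERB) and the sample sentences are made up, and a real run would take the tags from the tagger dictionary and the sentences from a corpus.

```python
from collections import Counter

# Hypothetical toy lexicon: word -> set of possible POS tags (an assumption, not real data).
LEXICON = {
    "the": {"DET"},
    "a": {"DET"},
    "cat": {"NOUN"},
    "dog": {"NOUN"},
    "walks": {"VERB", "NOUN"},  # ambiguous, so the word itself is kept
    "sleeps": {"VERB"},
}

def abstract(token, lexicon=LEXICON):
    """Return the POS tag if the word has exactly one tag; otherwise the lowercased word."""
    tags = lexicon.get(token.lower(), set())
    if len(tags) == 1:
        return next(iter(tags))
    return token.lower()

def ngrams(items, n):
    """Return all contiguous n-grams of a sequence as tuples."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]

def count_patterns(sentences, n_min=2, n_max=5):
    """Count the abstracted n-grams (n_min..n_max) over tokenized sentences."""
    counts = Counter()
    for sentence in sentences:
        abstracted = [abstract(t) for t in sentence]
        for n in range(n_min, n_max + 1):
            counts.update(ngrams(abstracted, n))
    return counts

# Made-up, pre-tokenized sample sentences.
SENTENCES = [
    ["the", "cat", "sleeps"],
    ["a", "dog", "sleeps"],
    ["the", "dog", "walks"],
]

counts = count_patterns(SENTENCES)
ranked = counts.most_common()  # most frequent patterns first; the tail holds the rare ones
```

Here ("DET", "NOUN") ends up at the top of the ranking, while patterns containing the ambiguous word "walks" stay at the bottom. The low-frequency tail is where candidates for the fourth step would come from.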

Or make a single rule:

  • make one pattern that catches any 5-gram
  • make antipatterns for all very common 5-grams
    (the challenge is finding a fitting example…)
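
The logic of this second approach can be mimicked outside the rule format with a small sketch: every 5-token window matches unless it appears in a set of very common 5-grams, which plays the role of the antipatterns. The function name and the set of common 5-grams are hypothetical and only serve the illustration.

```python
def flag_uncommon(tokens, common, n=5):
    """Yield (start_index, ngram) for every n-token window that is not in `common`."""
    for i in range(len(tokens) - n + 1):
        window = tuple(tokens[i:i + n])
        if window not in common:  # a common n-gram acts like an antipattern
            yield i, window

# Hypothetical list of very common 5-grams (would come from corpus counts).
COMMON_5GRAMS = {
    ("at", "the", "end", "of", "the"),
    ("the", "end", "of", "the", "day"),
}

# Both windows of this sentence are known, so nothing is flagged.
hits_ok = list(flag_uncommon("at the end of the day".split(), COMMON_5GRAMS))

# Scrambled word order: neither window is known, so both are flagged.
hits_bad = list(flag_uncommon("at the end of day the".split(), COMMON_5GRAMS))
```

A catch-all like this flags everything that is merely rare, so how useful it is depends entirely on how complete the list of common 5-grams is.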