Common vs uncommon

Ruud_Baars · July 23, 2019, 9:36am

Most of the rules we write catch erroneous, more or less uncommon sequences of words and tags.

I would like to go the other way and identify all common, correct word sequences. My idea for doing so is:

collect n-grams (n = 2 up to at least 5)
process the n-grams, replacing each word with its POS tag where the word has only one possible tag; otherwise keep the word
count how often each resulting n-gram occurs (the most frequent ones are probably the best)
create complete rules from the lowest-scoring part
test these on a corpus.
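The first three steps could be sketched roughly as follows. This is only a minimal illustration, not a finished tool: the LEXICON dictionary, the tag names and the function names are all made up for the example, whitespace splitting stands in for a real tokenizer, and words missing from the lexicon are simply kept as words.

```python
from collections import Counter

# Hypothetical toy lexicon: word -> set of possible POS tags.
# A real run would take this from a tagger dictionary.
LEXICON = {
    "the": {"DT"},
    "cat": {"NN"},
    "dog": {"NN"},
    "sat": {"VBD"},
    "on": {"IN"},
    "mat": {"NN"},
    "run": {"NN", "VB"},  # ambiguous, so the word itself is kept
}


def abstract(token):
    """Return the POS tag if the word has exactly one tag, else the word."""
    word = token.lower()
    tags = LEXICON.get(word, set())
    if len(tags) == 1:
        return next(iter(tags))
    return word


def ngrams(seq, n):
    """Return all contiguous n-grams of seq as tuples."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


def count_patterns(sentences, n_min=2, n_max=5):
    """Count abstracted n-grams for n_min <= n <= n_max over all sentences."""
    counts = Counter()
    for sentence in sentences:
        abstracted = [abstract(tok) for tok in sentence.split()]
        for n in range(n_min, n_max + 1):
            counts.update(ngrams(abstracted, n))
    return counts


if __name__ == "__main__":
    corpus = ["the cat sat on the mat", "the dog sat on the mat", "the run"]
    for pattern, freq in count_patterns(corpus).most_common(5):
        print(freq, " ".join(pattern))
```

Sorting the counter from the other end (the least frequent patterns) would give the low-scoring candidates mentioned in the fourth step.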

Or make a single rule:

make one pattern that catches any 5-gram
make antipatterns for all very common 5-grams
(the challenge is finding a fitting example…)
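
The logic of that second idea, shown in plain Python rather than in actual rule syntax: every 5-gram is a match unless it is on a whitelist of common 5-grams, which plays the role of the antipatterns. The function name and the tiny whitelist are invented for the illustration; in practice the whitelist would come from corpus counts.

```python
def uncommon_fivegrams(tokens, common):
    """Return (start index, 5-gram) for every 5-gram not in the whitelist.

    tokens: list of words (or word/tag mix) for one sentence
    common: set of 5-tuples that act as antipatterns
    """
    hits = []
    for i in range(len(tokens) - 4):
        gram = tuple(tokens[i:i + 5])
        if gram not in common:
            hits.append((i, gram))
    return hits


if __name__ == "__main__":
    # Hypothetical whitelist of "very common" 5-grams.
    common = {
        ("the", "cat", "sat", "on", "the"),
        ("cat", "sat", "on", "the", "mat"),
    }
    print(uncommon_fivegrams("the cat sat on the mat".split(), common))
    print(uncommon_fivegrams("the cat sat in the mat".split(), common))
```

A sentence shorter than five tokens never matches, so such a rule would say nothing about short sentences.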