Most of the rules we write catch erroneous, more or less uncommon sequences of words and tags.
I would like to go the other way and identify all common, correct word sequences. My idea for doing so:
- collect n-grams (n = 2 up to at least 5)
- process the n-grams, replacing each word with its POS tag when the word has only one possible tag; otherwise keep the word itself
- count how often each resulting n-gram occurs (the most frequent ones are probably the most reliable)
- create complete rules from the lowest-scoring n-grams (the least frequent ones)
- test these on a corpus.
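
The collecting, tag-replacing and counting steps above could be sketched roughly as follows. This is only a minimal illustration, not an existing tool: the `LEXICON` dictionary, its tags, the sample sentences and the function names `abstract`, `ngrams` and `count_patterns` are all made up for the example, and a real run would need a proper tagger dictionary and a large corpus.

```python
from collections import Counter

# Hypothetical lexicon (made-up sample data): word -> set of possible POS tags.
LEXICON = {
    "the": {"DT"},
    "cat": {"NN"},
    "dog": {"NN"},
    "sat": {"VBD"},
    "on": {"IN", "RP"},  # ambiguous: more than one tag
    "mat": {"NN"},
}


def abstract(token):
    """Return the POS tag if the word has exactly one tag, else the lowercased word."""
    word = token.lower()
    tags = LEXICON.get(word, set())
    if len(tags) == 1:
        return next(iter(tags))
    return word


def ngrams(items, n):
    """Yield all contiguous n-grams of a sequence as tuples."""
    for i in range(len(items) - n + 1):
        yield tuple(items[i:i + n])


def count_patterns(sentences, n_min=2, n_max=5):
    """Count abstracted n-grams (n_min..n_max) over whitespace-tokenized sentences."""
    counts = Counter()
    for sentence in sentences:
        abstracted = [abstract(tok) for tok in sentence.split()]
        for n in range(n_min, n_max + 1):
            counts.update(ngrams(abstracted, n))
    return counts


sentences = ["the cat sat on the mat", "the dog sat on the mat"]
counts = count_patterns(sentences)

# Most frequent patterns first; the tail of this list holds the lowest-scoring ones.
for pattern, freq in counts.most_common(5):
    print(freq, " ".join(pattern))
```

Because "on" has two tags in the sample lexicon, it stays as a word, so both sentences collapse into the same mixed pattern `DT NN VBD on DT NN`.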
Or:
make a single rule:
- make one pattern that matches any 5-gram
- add antipatterns for all very common 5-grams
(The challenge is finding a fitting example…)
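
The logic of this second approach can be imitated outside the rule engine with a set lookup. The sketch below is an assumption-laden stand-in, not real rule syntax: the `COMMON_5GRAMS` set plays the role of the antipatterns, the catch-all pattern is simply "every 5-gram", and all names and data are hypothetical.

```python
# Hypothetical set of very common 5-grams (made-up sample data); these act
# like the antipatterns: a 5-gram found here is never flagged.
COMMON_5GRAMS = {
    ("the", "cat", "sat", "on", "the"),
    ("cat", "sat", "on", "the", "mat"),
}


def five_grams(tokens):
    """Return all contiguous 5-grams of a token list as tuples."""
    return [tuple(tokens[i:i + 5]) for i in range(len(tokens) - 4)]


def flag_uncommon(tokens, common=COMMON_5GRAMS):
    """Return every 5-gram that is not covered by the set of common 5-grams."""
    return [gram for gram in five_grams(tokens) if gram not in common]


good = "the cat sat on the mat".split()
bad = "the cat on sat the mat".split()

print(flag_uncommon(good))  # nothing flagged: both 5-grams are known
print(flag_uncommon(bad))   # both 5-grams are unknown and get flagged
```

With a realistic corpus the set of common 5-grams would be very large, so this only shows the matching logic, not whether the approach scales.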