Confusion rule question

Is the process of getting the best parameters for the confusion rule handicapped by Google leaving out ngrams with low counts?

It would be worth catching relatively scarce confusions, would it not?

The Google ngram data has a lower limit of 40 occurrences, i.e. no ngram can appear with a count below that (actually it can, as we filter out old documents, but …). The data we evaluate on is not the ngram data set itself but Wikipedia etc. Having no limit here might give us more confusion pairs, but we could be less sure about their quality.
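
For illustration, a minimal sketch of the kind of count threshold described above, assuming a tab-separated "ngram<TAB>count" input format (the file layout, class name, and command-line handling are hypothetical, not the actual LanguageTool ingestion code):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class NgramThresholdFilter {
    // Minimum occurrence count; ngrams below this are dropped,
    // mirroring the 40-occurrence limit in the Google ngram data.
    private static final long MIN_COUNT = 40;

    public static void main(String[] args) throws IOException {
        Path input = Path.of(args[0]);  // assumed format: "ngram<TAB>count"
        try (Stream<String> lines = Files.lines(input)) {
            lines.filter(line -> {
                     String[] parts = line.split("\t");
                     return parts.length == 2
                             && Long.parseLong(parts[1]) >= MIN_COUNT;
                 })
                 .forEach(System.out::println);
        }
    }
}
```

Lowering or removing MIN_COUNT in a sketch like this would keep rarer ngrams, which is exactly the trade-off above: more candidate confusion pairs, but each one backed by less evidence.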