You do not need the original training-corpus.txt to determine the thresholds in confusion_sets.txt (in fact, in machine learning you should not use the same corpus for both training a model and tuning its parameters). Out-of-vocabulary words (i.e. words that are not in the dictionary) are not a problem, since they are mapped to a special “UNKNOWN” token.
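As a minimal sketch of the out-of-vocabulary handling described above: the function name, the example dictionary, and the exact token spelling are illustrative assumptions, not the tool's actual code.

```python
# Hypothetical illustration only; the names and the token spelling are assumptions.
UNKNOWN_TOKEN = "UNKNOWN"

def map_to_vocabulary(tokens, dictionary):
    """Replace every token that is not in the dictionary with UNKNOWN_TOKEN."""
    return [t if t in dictionary else UNKNOWN_TOKEN for t in tokens]

# A tiny example dictionary; a real one would be built from the training corpus.
dictionary = {"their", "there", "is", "a", "house"}

# "zeppelin" is not in the dictionary, so it is replaced by the special token.
print(map_to_vocabulary(["there", "is", "a", "zeppelin"], dictionary))
```

Because unseen words collapse to a single token, a tuning corpus with different vocabulary than the training corpus can still be processed.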
Thanks for the great work. It seems that the command-line version with word2vec runs faster than with n-grams, but for me the server version is about 6 times slower.