Machine Learning – How to Handle Distribution Differences Between Test and Training Sets

classification, machine-learning, multi-class, skewness, unbalanced-classes

I think one basic assumption of machine learning or parameter estimation is that unseen data come from the same distribution as the training set. In many practical cases, however, the distribution of the test set will almost certainly differ from that of the training set.

Consider, say, a large-scale multi-class classification problem that tries to classify product descriptions into about 17,000 classes. The training set has highly skewed class priors: some classes have many training examples, while others have only a few. Suppose a client gives us a test set with unknown class labels, and we classify each product in the test set into one of the 17,000 classes using the classifier trained on the training set. The test set will probably also have a skewed class distribution, but likely one very different from the training set's, since the two may come from different business areas. If the two class distributions are very different, the trained classifier might not work well on the test set. This seems especially obvious with the Naive Bayes classifier.
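
(To make the Naive Bayes point concrete: the classifier predicts $\hat{y} = \arg\max_y \, p(y) \prod_j p(x_j \mid y)$, where the prior $p(y)$ is estimated from the training-set class frequencies, so if the test set's class frequencies are very different, that prior term pulls the decisions toward the wrong classes.)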

Is there any principled way to handle the difference between the training set and a particular given test set for probabilistic classifiers? I have heard that the "transductive SVM" does something similar for SVMs. Are there similar techniques for learning a classifier that performs best on a particular given test set? If so, we could retrain the classifier for each given test set, which is allowed in this practical scenario.

Best Answer

If the difference lies only in the relative class frequencies in the training and test sets, then I would recommend the EM procedure introduced in this paper:

Marco Saerens, Patrice Latinne, Christine Decaestecker: Adjusting the Outputs of a Classifier to New a Priori Probabilities: A Simple Procedure. Neural Computation 14(1): 21-41 (2002)

I've used it myself and found it worked very well (you need a classifier that outputs a probability of class membership, though).
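
Roughly, the procedure alternates between (i) rescaling the classifier's predicted posteriors on the unlabeled test set by the ratio of the current estimate of the test-set priors to the training-set priors, and (ii) re-estimating the test-set priors as the average of those adjusted posteriors. A minimal NumPy sketch of that idea (my own naming and defaults, not code from the paper):

    import numpy as np

    def adjust_to_new_priors(test_posteriors, train_priors, n_iter=100, tol=1e-6):
        """EM re-estimation of class priors on an unlabeled test set,
        in the spirit of Saerens et al. (2002)."""
        test_posteriors = np.asarray(test_posteriors, dtype=float)  # (N, K) p_train(y | x)
        train_priors = np.asarray(train_priors, dtype=float)        # (K,) training class frequencies
        test_priors = train_priors.copy()
        for _ in range(n_iter):
            # E-step: rescale the original posteriors by the prior ratio, renormalise per example
            scaled = test_posteriors * (test_priors / train_priors)
            adjusted = scaled / scaled.sum(axis=1, keepdims=True)
            # M-step: new prior estimate = average adjusted posterior over the test set
            new_priors = adjusted.mean(axis=0)
            if np.max(np.abs(new_priors - test_priors)) < tol:
                test_priors = new_priors
                break
            test_priors = new_priors
        return test_priors, adjusted

The adjusted posteriors are then used for the final classification; even with 17,000 classes the updates are just element-wise operations, so this scales reasonably.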

If the distribution of patterns within each class changes, then the problem is known as "covariate shift" and there is an excellent book by Sugiyama and Kawanabe. Many of the papers by this group are available on-line, but I would strongly recommend reading the book as well if you can get hold of a copy. The basic idea is to weight the training data according to the difference in density between the training set and the test set (for which labels are not required). A simple way to get the weighting is by using logistic regression to predict whether a pattern is drawn from the training set or the test set. The difficult part is in choosing how much weighting to apply.
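
As a rough illustration of that last idea, here is a scikit-learn sketch (the function name is my own, and the clipping constant is just one crude way of limiting how aggressive the weighting gets, not something from the book):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def covariate_shift_weights(X_train, X_test, clip=10.0):
        """Estimate w(x) ~ p_test(x) / p_train(x) for each training point
        by training a domain classifier to separate test from training data."""
        X = np.vstack([X_train, X_test])
        domain = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_test))])
        domain_clf = LogisticRegression(max_iter=1000).fit(X, domain)
        p_test = domain_clf.predict_proba(X_train)[:, 1]
        # The odds ratio gives the density ratio, up to the train/test size imbalance
        w = (p_test / (1.0 - p_test)) * (len(X_train) / len(X_test))
        return np.clip(w, 0.0, clip)  # cap extreme weights to control variance

    # The weights can be passed to any learner that accepts sample weights, e.g.
    # clf.fit(X_train, y_train, sample_weight=covariate_shift_weights(X_train, X_test))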

See also the nice blog post by Alex Smola here.
