"Why exactly does a classifier need the same prevalence in the train and test sets?"
Doesn't over(/under)sampling an imbalanced dataset cause issues?
Yes, the classifier will expect the relative class frequencies in
operation to be the same as those in the training set. This means
that if you over-sample the minority class in the training set, the
classifier is likely to over-predict that class in operational use.
To see why, it is best to consider probabilistic classifiers, where the
decision is based on the posterior probability of class membership
$p(C_i|x)$, which can be written using Bayes' rule as
$$p(C_i|x) = \frac{p(x|C_i)p(C_i)}{p(x)} \qquad \text{where} \qquad p(x) = \sum_j p(x|C_j)p(C_j),$$
so the decision depends on the prior probabilities of the classes,
$p(C_i)$. If the prior probabilities in the training set differ from
those in operation, the operational performance of the classifier will
be suboptimal, even if it is optimal under training set conditions.
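As a concrete illustration (with made-up numbers), here is Bayes' rule evaluated for the same likelihoods under two different priors; the decision flips when the priors change:

```python
# Made-up numbers: the same class-conditional likelihoods p(x|C_i)
# lead to different decisions under different priors p(C_i).
p_x_given_c = [0.3, 0.6]  # p(x|C_0), p(x|C_1) for some fixed input x

for priors in ([0.5, 0.5], [0.9, 0.1]):  # balanced vs. imbalanced priors
    joint = [lik * pri for lik, pri in zip(p_x_given_c, priors)]
    posterior = [j / sum(joint) for j in joint]  # Bayes' rule
    print(priors, "->", [round(p, 3) for p in posterior],
          "decide C_%d" % posterior.index(max(posterior)))
# [0.5, 0.5] -> [0.333, 0.667] decide C_1
# [0.9, 0.1] -> [0.818, 0.182] decide C_0
```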
Some classifiers have a problem learning from imbalanced datasets, so
one solution is to oversample the minority class to ameliorate this
bias in the classifier. There are two approaches. The first is to
oversample by just the right amount to overcome this (usually unknown)
bias and no more, but that is really difficult. The other is to
balance the training set and then post-process the output to
compensate for the difference between training set and operational priors.
We take the output of the classifier trained on an oversampled dataset
and multiply by the ratio of operational and training set prior
probabilities,
$$q_o(C_i|x) \propto p_t(x|C_i)p_t(C_i) \times \frac{p_o(C_i)}{p_t(C_i)} = p_t(x|C_i)p_o(C_i).$$
Quantities with the $o$ subscript relate to operational conditions and
those with the $t$ subscript relate to training set conditions. I have
written this as $q_o(C_i|x)$ as it is an un-normalised probability,
but it is straightforward to renormalise by dividing by the sum of
$q_o(C_i|x)$ over all classes. For some problems it may be better
to use cross-validation to choose the correction factor, rather than
the theoretical value used here, as it depends on the bias in the
classifier due to the imbalance.
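A minimal sketch of this correction (the function name and array shapes are my own, assuming a classifier that returns a matrix of class posteriors):

```python
import numpy as np

def correct_priors(proba_t, priors_t, priors_o):
    """Adjust posteriors from training-set priors to operational priors.

    proba_t  : (n_samples, n_classes) posteriors p_t(C_i|x) from a model
               trained under priors_t (e.g. a balanced, oversampled set).
    priors_t : (n_classes,) training set priors p_t(C_i).
    priors_o : (n_classes,) operational priors p_o(C_i).
    """
    q = np.asarray(proba_t) * (np.asarray(priors_o) / np.asarray(priors_t))
    return q / q.sum(axis=1, keepdims=True)  # renormalise q_o(C_i|x) over classes

# Trained on a balanced 50/50 set, deployed where the prevalence is 90/10:
proba_t = np.array([[0.4, 0.6]])
print(correct_priors(proba_t, [0.5, 0.5], [0.9, 0.1]))  # -> [[0.857 0.143]]
```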
So in short: for imbalanced datasets, use a probabilistic classifier
and oversample (or reweight) to get a balanced training set, in order
to overcome any bias the classifier may have against the minority
class. Then post-process the output of the classifier so that it
doesn't over-predict the minority class in operation.
It doesn't present a problem, provided you post-process the output of the model to compensate for the difference in training set and operational class frequencies. If you don't perform that adjustment (or you use a discrete yes-no classifier) you will over-predict the minority class for the reason given above.
I don't think this accurately represents the situation. The reason for balancing is actually that the minority class is "more important" in some sense than the majority class, and the rebalancing is an implicit attempt to include misclassification costs so that the classifier works better in operational conditions. However, a lot of blogs don't explain that properly, so a lot of practitioners are rather misinformed about it.
Best Answer
If the difference lies only in the relative class frequencies in the training and test sets, then I would recommend the EM procedure introduced in this paper:
Marco Saerens, Patrice Latinne, Christine Decaestecker: Adjusting the Outputs of a Classifier to New a Priori Probabilities: A Simple Procedure. Neural Computation 14(1): 21-41 (2002) (www)
I've used it myself and found it worked very well (you need a classifier that outputs a probability of class membership though).
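For reference, here is a compact sketch of that EM procedure as I read it from the paper (the function name and interface are my own): starting from the training set priors, it alternates between rescaling the posteriors on unlabelled operational data to the current prior estimate, and re-estimating the priors as the average rescaled posterior.

```python
import numpy as np

def saerens_em(proba_t, priors_t, n_iter=100, tol=1e-6):
    """Estimate operational priors from unlabelled data and adjust the
    posteriors accordingly (Saerens, Latinne & Decaestecker, 2002).

    proba_t : (n_samples, n_classes) posteriors on unlabelled operational
              data, from a classifier trained under priors priors_t.
    """
    priors_t = np.asarray(priors_t, dtype=float)
    priors = priors_t.copy()
    for _ in range(n_iter):
        q = proba_t * (priors / priors_t)   # E-step: rescale posteriors
        q /= q.sum(axis=1, keepdims=True)
        new_priors = q.mean(axis=0)         # M-step: re-estimate priors
        done = np.abs(new_priors - priors).max() < tol
        priors = new_priors
        if done:
            break
    q = proba_t * (priors / priors_t)       # final adjusted posteriors
    return priors, q / q.sum(axis=1, keepdims=True)
```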
If the distribution of patterns within each class changes, then the problem is known as "covariate shift" and there is an excellent book by Sugiyama and Kawanabe. Many of the papers by this group are available on-line, but I would strongly recommend reading the book as well if you can get hold of a copy. The basic idea is to weight the training data according to the difference in density between the training set and the test set (for which labels are not required). A simple way to get the weighting is by using logistic regression to predict whether a pattern is drawn from the training set or the test set. The difficult part is in choosing how much weighting to apply.
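A simple sketch of that weighting scheme, using scikit-learn's logistic regression as the train-vs-test classifier (the clipping at the end is just one crude way of limiting how much weighting is applied, which is the difficult part mentioned above):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def covariate_shift_weights(X_train, X_test, clip=10.0):
    """Importance weights approximating p_test(x)/p_train(x).

    Fit a classifier to distinguish training inputs from test inputs
    (no test labels needed); the odds of 'test', corrected for the
    relative sample sizes, estimate the density ratio.
    """
    X = np.vstack([X_train, X_test])
    domain = np.r_[np.zeros(len(X_train)), np.ones(len(X_test))]
    clf = LogisticRegression(max_iter=1000).fit(X, domain)
    p_test = clf.predict_proba(X_train)[:, 1]
    w = (p_test / (1.0 - p_test)) * (len(X_train) / len(X_test))
    return np.clip(w, 0.0, clip)  # crude guard against extreme weights

# The weights are then passed as sample weights when training, e.g.
# model.fit(X_train, y_train,
#           sample_weight=covariate_shift_weights(X_train, X_test))
```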
See also the nice blog post by Alex Smola here.