Solved – ML: sampling imbalanced dataset leads to selection bias

Tags: machine-learning, sampling, unbalanced-classes

By sampling we make the algorithm think that the prior probabilities of the classes are the same. This seems to affect the predictions as well, and therefore the output scores can no longer be interpreted as probabilities and have to be recalibrated.

I am a bit confused about why equalizing the prior class distribution affects the predictions.

Let's assume we have a two-class classification problem with an imbalanced dataset that we oversample to balance the class distribution. We run decision trees on it. The test set is imbalanced, but does it really matter? Each sample of the test set just goes through the nodes of the decision trees, and it never checks whether the sample belongs to the majority or minority class.

So, why does the prior probability of classes affect the prediction of a sample?

Best Answer

By sampling we make the algorithm think that the prior probabilities of the classes are the same. This seems to affect the predictions as well, and therefore the output scores can no longer be interpreted as probabilities and have to be recalibrated.

You are mostly correct; the phenomenon is called prior probability shift. It is the case where $P_{train}(x|y) = P_{test}(x|y)$ but $P_{train}(y) \neq P_{test}(y)$. I'm not sure exactly what you mean by the scores 'not being interpretable as probabilities anymore and having to be recalibrated', though.
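If the recalibration point is about undoing the effect of resampling on predicted probabilities: under the assumption that resampling changes only the class prior and leaves $P(x|y)$ intact, Bayes' rule gives a simple correction that maps the scores of a model trained on balanced data back to the original prior. A minimal sketch (the function name and the prior values are illustrative, not from the question):

```python
import numpy as np

def correct_for_prior_shift(p_balanced, train_prior=0.5, true_prior=0.1):
    """Map P(y=1|x) estimated on resampled (balanced) data back to the
    original class prior, assuming P(x|y) is unchanged by resampling.

    p_balanced : predicted probabilities for the positive class
    train_prior: positive-class prior the model actually saw (after resampling)
    true_prior : positive-class prior in the original / test distribution
    """
    p = np.asarray(p_balanced, dtype=float)
    # Reweight the positive and negative parts of the posterior by the
    # ratio of the true prior to the training prior (Bayes' rule).
    num = p * (true_prior / train_prior)
    den = num + (1.0 - p) * ((1.0 - true_prior) / (1.0 - train_prior))
    return num / den

# A score of 0.5 from a model trained on 50/50 data corresponds to 0.1
# once the true 10% prior is restored.
print(correct_for_prior_shift([0.5, 0.9]))
```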

Let's assume we have a two-class classification problem with an imbalanced dataset that we oversample to balance the class distribution. We run decision trees on it. The test set is imbalanced, but does it really matter?

Yes, it matters and that is the cause of the problem.

Each sample of the test set just goes through the nodes of the decision trees, and it never checks whether the sample belongs to the majority or minority class.

Correct. The problem is not with how a decision tree predicts a given sample point. The issue lies with how it was trained and with how the feature space is characterized for each class.

Oversampling the minority class is a way to deal with the imbalanced class problem, but it is not ideal. When the minority class is oversampled by increasing amounts, the effect is to identify similar but more specific regions in the feature space as the decision region for the minority class. The decision tree would predict a given point in the way you describe, but if the decision regions it learned during training are not accurate, it won't predict well.
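To see this training-side effect, here is a rough sketch with scikit-learn (the dataset and tree settings are made up for illustration): a tree fit on naively duplicated minority samples learns different decision regions, and hence assigns different probabilities to the very same test points, than a tree fit on the original imbalanced data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Imbalanced two-class problem: roughly 5% minority class.
X, y = make_classification(n_samples=5000, n_features=10,
                           weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Naive oversampling: duplicate minority rows until the classes are balanced.
minority = np.where(y_tr == 1)[0]
extra = np.random.RandomState(0).choice(
    minority, size=(y_tr == 0).sum() - minority.size, replace=True)
X_bal = np.vstack([X_tr, X_tr[extra]])
y_bal = np.concatenate([y_tr, y_tr[extra]])

tree_orig = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_tr, y_tr)
tree_bal = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_bal, y_bal)

# Same test points, same tree mechanics, but the learned decision regions
# (and hence the predicted probabilities) differ because the training priors differ.
print("mean P(y=1|x), original prior:", tree_orig.predict_proba(X_te)[:, 1].mean())
print("mean P(y=1|x), balanced prior:", tree_bal.predict_proba(X_te)[:, 1].mean())
```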

So, why does the prior probability of classes affect the prediction of a sample?

Prior probability shift is a particular type of dataset shift. There is a fair amount of work on it in the literature, and both generative and discriminative models suffer from the problem. The general idea is that whether you train a discriminative model $P(y|x) = \frac{P(x|y)P(y)}{P(x)}$ or a generative model $P(x,y) = P(x|y)P(y)$, a change in $P(y)$ changes $P(y|x)$ and $P(x,y)$, even when $P(x|y)$ stays the same. If instead $P(x)$ changes between the training and test sets while $P(y|x)$ stays the same, the phenomenon is called covariate shift. You can learn more about dataset shift here; it was probably one of the first compilations of the work done on dataset shift.
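To make this concrete with made-up numbers: suppose for some test point $x$ the class-conditional likelihoods are fixed at $P(x|y=1) = 0.8$ and $P(x|y=0) = 0.2$. Under a balanced prior $P(y=1) = 0.5$, Bayes' rule gives $P(y=1|x) = \frac{0.8 \cdot 0.5}{0.8 \cdot 0.5 + 0.2 \cdot 0.5} = 0.8$, but under the original prior $P(y=1) = 0.1$ the same likelihoods give $P(y=1|x) = \frac{0.8 \cdot 0.1}{0.8 \cdot 0.1 + 0.2 \cdot 0.9} \approx 0.31$. Nothing about $P(x|y)$ changed, yet the posterior did.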

On a side note, you can refer to this paper on SMOTE. It addresses the oversampling issue with decision trees and provides a better way to rebalance the dataset by creating synthetic points of the minority class. It is widely used, and various implementations of this method already exist.
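If you want to try it, a minimal sketch using the `imbalanced-learn` package (one such implementation; this assumes the package is installed and reasonably recent, and the dataset is made up for illustration):

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Made-up imbalanced dataset for illustration.
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
print("before:", Counter(y))

# SMOTE creates synthetic minority-class points by interpolating between
# a minority sample and its nearest minority-class neighbours.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after: ", Counter(y_res))
```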