Solved – Strangely imbalanced dataset

dataset, sampling

I'm new to Machine Learning and this forum. I have a beginner's question about imbalanced datasets. Here it goes:
I have a binary classification task where I'm more interested in accurately classifying the positive class (which is in the minority in the target population). Unlike the common problem of not having enough positive (minority) class instances in the training set, my training dataset contains the positive class in the majority.

Here's the target population composition (which I'd expect to find in the environment where my classifier/model would be deployed):

  • Positive Class: ~35%
  • Negative Class: ~65%

Here's my Training Set Composition:

  • Positive Class: ~95%
  • Negative Class: ~5%

As my training set composition drastically differs from the target population composition, will the classification algorithm fail to generalize when classifying instances from the target population? As I mentioned earlier, I'm more interested in accurately classifying the positive class instances, which my training set has in abundance.

I read the following description in a publication on imbalanced datasets:
"The purpose of machine learning is for the classifier to estimate the probability distribution of the target population. Since that distribution is unknown we try to estimate the population distribution using a sample distribution. Statistics tells us that as long as the sample is drawn randomly, the sample distribution can be used to estimate the population distribution from where it was drawn. Hence, by learning the sample distribution we can learn to approximate the target distribution."

Since my training dataset cannot be considered a random sample of the target distribution, will this affect the generalization power of my classifier? If so, what should be done to avoid this? Over/under-sampling? Cost matrices?

PS: I searched previous posts for issues similar to mine, but all of them deal with the problem of not having sufficient examples of the minority class (the exact opposite of my scenario).

Thanks in advance
-S

Best Answer

There are several components to your question, but first I would ask: why is your sample so skewed? You have an under-sampled negative class, which, as you point out, is odd. Can you assume that the two classes were sampled randomly from the population? If not, that is your most serious problem, and potentially not something you can recover from. The best you can do is build a model, calibrate it, and then test it in a pilot on the population.
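For the pilot step, even a small scored sample tells you a lot. A minimal sketch, assuming scikit-learn, a fitted `model`, and hypothetical arrays `pilot_X`, `pilot_y` drawn from the deployment population:

```python
# Sketch: sanity-check the model's probabilities against a pilot sample
# drawn from the deployment population. `model`, `pilot_X`, `pilot_y`
# are hypothetical placeholders.
from sklearn.calibration import calibration_curve

pilot_probs = model.predict_proba(pilot_X)[:, 1]  # P(positive)

# Bin the predictions; a well-calibrated model's observed positive rate
# tracks its mean predicted probability within each bin.
frac_pos, mean_pred = calibration_curve(pilot_y, pilot_probs, n_bins=10)
for fp, mp in zip(frac_pos, mean_pred):
    print(f"predicted ~{mp:.2f}  ->  observed positive rate {fp:.2f}")
```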

Assuming representative samples, the issues are:

1) Will this imbalance keep the classifier from properly discriminating between classes? Maybe. You must cross-validate any resulting model, so this should be testable, and you may find you need to over-sample the negative cases to bring the data set into balance (see the first sketch at the end of this answer). It depends on the type of classifier being used and the data. If you are using random forests or GBM, I might not be concerned; if you are using a single decision tree, I would be.

2) Will the predicted probabilities from the model align with the population? The answer is no. If this matters for your application (i.e. the model must be well calibrated, not merely good at ranking or separating the classes), it is a problem, but one that can be overcome. Any time the training set's class density does not match the population's, the resulting probabilities of class membership will be biased. Here is a general-purpose way to re-calibrate them:

LINK
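To make point 1 concrete, here is a minimal over-sampling sketch, assuming scikit-learn and NumPy arrays `X`, `y` for the training data (all names are placeholders):

```python
# Sketch: over-sample the scarce negative class to parity, then compare
# a single tree against a random forest on a held-out split.
# X, y are assumed to be NumPy arrays (y: 1 = positive, 0 = negative).
import numpy as np
from sklearn.utils import resample
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Hold out a validation set BEFORE over-sampling, so that duplicated
# negatives never leak across the split.
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

neg = y_tr == 0
# Sample negatives with replacement up to the number of positives.
X_neg_up, y_neg_up = resample(
    X_tr[neg], y_tr[neg], replace=True,
    n_samples=int((~neg).sum()), random_state=0)

X_bal = np.vstack([X_tr[~neg], X_neg_up])
y_bal = np.concatenate([y_tr[~neg], y_neg_up])

# Per point 1: an ensemble should be less sensitive to the imbalance
# than a single tree, but verify rather than assume.
for clf in (DecisionTreeClassifier(random_state=0),
            RandomForestClassifier(random_state=0)):
    clf.fit(X_bal, y_bal)
    print(type(clf).__name__, "held-out accuracy:", clf.score(X_val, y_val))
```

The split comes before the over-sampling on purpose: resampling first would scatter duplicates of the same negative cases across both sides and inflate the validation score.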
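As for point 2, whatever the linked method does in detail, a standard general-purpose correction follows from Bayes' rule: reweight the model's predicted odds by the ratio of population to training priors. A minimal sketch using the class proportions from the question (35% positive in the population, 95% in the training set):

```python
# Sketch: re-calibrate predicted probabilities for a prior shift.
# p is the model's predicted P(positive) learned under the training
# priors; the adjustment reweights the odds by the ratio of population
# to training priors (a standard Bayes-rule prior correction).
import numpy as np

PI_TRAIN = 0.95   # positive-class share in the training set
PI_POP   = 0.35   # positive-class share in the target population

def correct_prior_shift(p, pi_train=PI_TRAIN, pi_pop=PI_POP):
    """Map training-prior probabilities to population-prior ones."""
    w_pos = pi_pop / pi_train              # reweight the positives
    w_neg = (1 - pi_pop) / (1 - pi_train)  # reweight the negatives
    num = p * w_pos
    return num / (num + (1 - p) * w_neg)

# Example: a raw score of 0.9 under the skewed training priors is far
# less convincing once the true 35/65 prior is restored.
print(correct_prior_shift(np.array([0.5, 0.9, 0.99])))
```

Note how aggressive the correction is here: negatives are 13 times more common in deployment than in training, so a raw 0.9 shrinks to roughly 0.2.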