Downsampling – Reasons and Advantages in Machine Learning

classification, machine-learning

Suppose I want to learn a classifier that predicts if an email is spam. And suppose only 1% of emails are spam.

The easiest thing to do would be to learn the trivial classifier that says none of the emails are spam. This classifier would give us 99% accuracy, but it wouldn't learn anything interesting, and would have a 100% rate of false negatives.

To solve this problem, people have told me to "downsample", or learn on a subset of the data where 50% of the examples are spam and 50% are not spam.
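For concreteness, here is a minimal sketch of that kind of downsampling, assuming the features and labels are NumPy arrays X and y (with y == 1 marking spam); the names are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def downsample_to_balance(X, y):
    """Keep every positive example and an equally sized random subset of negatives."""
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    neg_keep = rng.choice(neg_idx, size=len(pos_idx), replace=False)
    keep = np.concatenate([pos_idx, neg_keep])
    rng.shuffle(keep)
    return X[keep], y[keep]
```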

But I'm worried about this approach: once we build this classifier and start using it on a real corpus of emails (as opposed to a 50/50 test set), it may flag a lot of emails as spam when they're really not, simply because it was trained to expect far more spam than actually occurs in practice.

So how do we fix this problem?

("Upsampling," or repeating the positive training examples multiple times so 50% of the data is positive training examples, seems to suffer from similar problems.)

Best Answer

Most classification models in fact don't yield a binary decision, but rather a continuous decision value (for instance, logistic regression models output a probability, SVMs output a signed distance to the hyperplane, ...). Using the decision values we can rank test samples, from 'almost certainly positive' to 'almost certainly negative'.
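To make this concrete, here is a small illustration with scikit-learn; the synthetic data from make_classification, the ~1% positive rate, and the variable names are stand-ins for a real spam corpus, not anything from the question:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for an email dataset with ~1% spam (class 1).
X, y = make_classification(n_samples=5000, weights=[0.99], random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X, y)
proba = clf.predict_proba(X)[:, 1]      # continuous decision value per sample

# Rank samples from 'almost certainly positive' to 'almost certainly negative'.
ranked = np.argsort(-proba)
```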

Based on the decision value, you can always choose a cutoff that configures the classifier so that a certain fraction of the data is labeled as positive. An appropriate threshold can be determined via the model's ROC or PR curves. You can play with the decision threshold regardless of the balance used in the training set; in other words, techniques like up- or downsampling are orthogonal to this.
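Continuing the sketch above, one illustrative way to pick a cutoff from the precision-recall curve; the 90% precision target is an arbitrary choice for the example:

```python
from sklearn.metrics import precision_recall_curve

# proba and y come from the previous snippet.
precision, recall, thresholds = precision_recall_curve(y, proba)

# Illustrative rule: the lowest cutoff that still reaches 90% precision.
ok = precision[:-1] >= 0.90            # precision[i] corresponds to thresholds[i]
cutoff = thresholds[ok].min() if ok.any() else 0.5
y_pred = (proba >= cutoff).astype(int)
```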

Assuming the model is better than random, you can intuitively see that increasing the threshold for positive classification (which leads to fewer positive predictions) increases the model's precision at the cost of lower recall, and vice versa.
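A quick numeric check of that trade-off, evaluating a few arbitrary cutoffs on the decision values from the sketch above:

```python
from sklearn.metrics import precision_score, recall_score

# proba and y as above; three arbitrary cutoffs just to see the trend.
for cutoff in (0.1, 0.5, 0.9):
    y_hat = (proba >= cutoff).astype(int)
    print(f"cutoff={cutoff:.1f}  "
          f"precision={precision_score(y, y_hat, zero_division=0):.2f}  "
          f"recall={recall_score(y, y_hat):.2f}")
```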

Consider SVM as an intuitive example: the main challenge is to learn the orientation of the separating hyperplane. Up- or downsampling can help with this (I recommend preferring upsampling over downsampling). Once the orientation of the hyperplane is good, we can play with the decision threshold on the signed distance to the hyperplane to get a desired fraction of positive predictions.
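As a rough sketch of that workflow, reusing the synthetic X, y and the upsample_to_balance helper from the earlier snippets, with an arbitrary 1% target fraction:

```python
import numpy as np
from sklearn.svm import LinearSVC

# X, y and upsample_to_balance() come from the earlier snippets.
X_bal, y_bal = upsample_to_balance(X, y)          # learn the orientation on balanced data
svm = LinearSVC(dual=False).fit(X_bal, y_bal)

margin = svm.decision_function(X)                 # signed distance on the real, imbalanced data
target_fraction = 0.01                            # illustrative: flag ~1% of emails as spam
cutoff = np.quantile(margin, 1 - target_fraction)
y_pred = (margin >= cutoff).astype(int)
```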