Regression – Why Calibrate Output Probabilities When Undersampling or Oversampling?

machine-learning, regression, sample-size, sampling, unbalanced-classes

There are a number of StackExchange posts saying that you have to calibrate your probabilities if you oversample or undersample, e.g. 1, 2. But my question is: why?

Let's use the example of detecting spam emails. Because actual spam emails are rare, we have to either oversample them, undersample the non-spam emails, change the class weights, etc., so that our classifier actually learns to separate the two classes instead of predicting everything as non-spam.

So now our classifier has learnt P(spam|input). What I don't understand is why this probability is now biased. If the probability is incorrect, e.g. assigning a high probability of being spam to an input that isn't spam, doesn't that just mean the classifier has learnt the wrong things during training, and not that this is due to the "bias" introduced by oversampling/undersampling?

Best Answer

I'm taking a lot of my answer from Agresti's "Categorical Data Analysis". For some context, let's say we want to predict whether an email is spam or ham using a single binary variable, namely whether the subject line contains the word "Viagra". Let's call this predictor $x$. This is a simplification of our more general problem, but it will suffice.

Under/Oversampling can be thought of as a case-control study. Viewing the spam/ham problem as a case-control design, we fix the marginal ratio of ham to spam (via oversampling in this case), and the outcome of the study is whether the email subject line contains the word "Viagra". Before oversampling, our estimate of $Pr(\mbox{spam})$ is $\delta$ (it's just the proportion of our sample which is spam). After oversampling, our estimate is $Pr(\mbox{spam})=\delta^\star >\delta$. This will be important later.
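To see this concretely, here is a minimal sketch in R (the 5% spam rate and sample size are made up for illustration): oversampling the spam class until the classes are balanced pushes the estimated prevalence from roughly $\delta \approx 0.05$ up to $\delta^\star \approx 0.5$.

```
set.seed(1)

# hypothetical spam indicator with a true prevalence of about 5%
y <- rbinom(10000, 1, 0.05)
mean(y)       # estimate of delta, close to 0.05

# oversample the spam class (with replacement) until the classes are balanced
y_over <- c(y, sample(y[y == 1], sum(y == 0) - sum(y == 1), replace = TRUE))
mean(y_over)  # estimate of delta-star, now close to 0.5
```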

In most other studies, we would want to know $Pr(y=\mbox{spam} \vert x)$. This is referred to as the conditional distribution of spam (conditional on $x$ in this case). However, because the sampling design fixes the marginal distribution of ham/spam, we can't estimate the conditional distribution of spam, but we can estimate the conditional distribution of $x$, namely $Pr(x \vert y=\mbox{spam})$.

In order to get the conditional distribution of spam, we would need to account for the prevalence of spam. Following Lachin from chapter 5 of his book Biostatistical Methods: The Assessment of Relative Risks, 2nd Edition and an application of Bayes' Rule, the conditional distribution of spam would be calculated as

$$Pr(\mbox{spam} \vert x) = \dfrac{Pr(x\vert \mbox{spam})\cdot \delta}{Pr(x \vert \mbox{spam}) \cdot \delta + Pr(x \vert \mbox{ham}) \cdot (1-\delta)}$$
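To make the role of the prevalence explicit, here is a small sketch of the formula above in R; the class-conditional probabilities are made-up numbers rather than estimates from real data. Plugging in the oversampled prevalence $\delta^\star$ instead of the true $\delta$ inflates the resulting probability of spam.

```
# made-up class-conditional probabilities of x ("Viagra" in the subject line)
p_x_given_spam <- 0.40   # Pr(x = 1 | spam)
p_x_given_ham  <- 0.01   # Pr(x = 1 | ham)

# Bayes' rule: Pr(spam | x = 1) as a function of the prevalence we plug in
pr_spam_given_x <- function(delta) {
  p_x_given_spam * delta /
    (p_x_given_spam * delta + p_x_given_ham * (1 - delta))
}

pr_spam_given_x(0.05)   # with the true prevalence delta:             about 0.68
pr_spam_given_x(0.50)   # with the oversampled prevalence delta-star: about 0.98
```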

Can you spot the problem now?

Here is the problem: the sampling design fixes the prevalence in our sample to be something other than $\delta$. In essence, we have forced the prevalence to be $\delta^\star > \delta$ via oversampling. Hence, any estimate of the risk of spam from the oversampled data is biased, precisely because the prevalence is biased by design.

"doesn't that just mean it has learnt the wrong things during training"

Some of what you have learned would be wrong, but surprisingly not everything. The prevalence would certainly be wrong, hence the estimated risk would be wrong, but the relationship between $x$ and the risk of spam is unaffected. From Agresti (edited to align with our example),

We can find the odds ratio, however, because it treats the variables symmetrically, taking the same value using the conditional distribution of [$x$] given [spam] as it does using the conditional distribution of [spam] given [$x$].

So our model would learn the correct relationships between inputs and outputs, but the probabilities would be biased.
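Here is a small numeric illustration of that symmetry, using a made-up 2×2 table of counts. Replicating every spam row (a deterministic stand-in for oversampling) multiplies both spam cells by the same factor, which cancels in the odds ratio.

```
# made-up 2x2 table of counts: rows = spam/ham, columns = x = 1 / x = 0
tab <- matrix(c(40,  60,    # spam
                10, 890),   # ham
              nrow = 2, byrow = TRUE)

odds_ratio <- function(m) (m[1, 1] * m[2, 2]) / (m[1, 2] * m[2, 1])
odds_ratio(tab)       # about 59.3

# "oversample" by replicating every spam row 5 times: both spam cells
# are scaled by 5, so the odds ratio is unchanged
tab_over <- tab
tab_over[1, ] <- 5 * tab_over[1, ]
odds_ratio(tab_over)  # still about 59.3
```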

Let's make this more concrete by modelling $Pr(\mbox{spam} \vert x)$ with a logistic regression. Our model would be

$$ \operatorname{logit}(Pr(\mbox{spam} \vert x)) = \beta_0 + \beta_1 x$$

If we were to run a case-control study on the spam/ham problem, $\beta_0$ would be biased, but not $\beta_1$. It's easy to demonstrate this via simulation, too. I will simulate data from the model

$$ \operatorname{logit}(p) = -2.2 + 0.46x $$

and upsample the minority class. Then, I will compute the difference between the estimated and true coefficients. I'll do this 1000 times and plot a histogram of the differences. We will see that $\hat\beta_1 - 0.46$ is centred around 0 (hence unbiased), whereas $\hat\beta_0 - (-2.2)$ is not centred around 0 (hence biased) due to the upsampling. I've added a red line at the point of 0 difference for reference.

[Figure: histograms of the intercept and slope differences when fitting on the upsampled data; the slope differences are centred on the red line at 0, the intercept differences are not.]

Because the intercept is biased, the entire risk estimate is biased. Not performing the upsampling and fitting the model on the raw data fixes this bias (shown below, though it should be noted that the estimates are asymptotically unbiased, so we would need enough data for this to work).

[Figure: the same histograms when fitting on the raw data; both the intercept and slope differences are centred on the red line at 0.]

Code to reproduce the plots:

```
library(tidyverse)

z = rerun(1000, {
  # sample data to fit the model to
  n = 1000
  x = rbinom(n, 1, 0.5)
  b0 = qlogis(0.1)                   # true intercept, approx -2.2
  b1 = (qlogis(0.15) - qlogis(0.1))  # true slope, approx 0.46
  b = c(b0, b1)
  p = plogis(b0 + b1*x)
  y = rbinom(n, 1, p)
  
  d = tibble(x, y)
  
  # upsample the positive (spam) class until the classes are balanced
  nsamp = length(y[y==0]) - length(y[y==1])
  yd = filter(d, y==1) %>% 
       sample_n(size=nsamp, replace = T)
  
  newd = bind_rows(yd, d)
  
  # fit on the raw data d; switch data to newd to get the biased estimates
  model = glm(y~x, data=d, family=binomial())
  estbeta = coef(model)
  # difference = estimated coefficient minus true coefficient
  tibble(coef = c('Intercept','slope'), difference = estbeta - b)
})


bind_rows(z) %>% 
  ggplot(aes(difference))+
  geom_histogram(color = 'black', fill = 'darkgray')+
  facet_wrap(~coef, ncol = 1)+
  geom_vline(aes(xintercept=0), color = 'red')+
  theme_light()+
  theme(aspect.ratio = 1/1.61)
```
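This is also where calibration comes in: when the minority class is upsampled so that the sample prevalence moves from $\delta$ to $\delta^\star$, the fitted intercept is shifted by approximately $\operatorname{logit}(\delta^\star) - \operatorname{logit}(\delta)$ (the usual case-control offset), while the slope is untouched. As a rough sketch (assuming exact class balancing as in the simulation above, with an arbitrarily chosen seed and sample size), subtracting that offset approximately recovers the true intercept:

```
library(tidyverse)

set.seed(2023)  # seed chosen arbitrarily for this sketch

# one large dataset from the same model as above
n <- 1e5
x <- rbinom(n, 1, 0.5)
p <- plogis(qlogis(0.1) + (qlogis(0.15) - qlogis(0.1)) * x)
y <- rbinom(n, 1, p)
d <- tibble(x, y)

delta <- mean(d$y)  # prevalence before upsampling, about 0.125

# balance the classes by upsampling the minority class, as in the simulation
yd   <- filter(d, y == 1) %>%
        sample_n(size = sum(d$y == 0) - sum(d$y == 1), replace = TRUE)
newd <- bind_rows(yd, d)
delta_star <- mean(newd$y)  # about 0.5 after balancing

biased <- glm(y ~ x, data = newd, family = binomial())
coef(biased)  # slope still near 0.46, intercept far from -2.2

# subtract the offset logit(delta_star) - logit(delta) from the intercept
coef(biased)["(Intercept)"] - (qlogis(delta_star) - qlogis(delta))  # near -2.2 again
```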