Solved – Convert predicted probabilities after downsampling to actual probabilities in classification

classification, down-sample

If I undersample to train a model on an unbalanced binary target variable, the predicted probabilities reflect the class balance of the resampled training set rather than that of the original data. I found two formulas to convert these probabilities to actual probabilities for the unbalanced data:

p = beta * p_s / ((beta-1) * p_s + 1) from https://www3.nd.edu/~rjohns15/content/papers/ssci2015_calibrating.pdf

and

1/(1+(1/original fraction-1)/(1/oversampled fraction-1)*(1/scoring result-1))
which is described in http://www.data-mining-blog.com/tips-and-tutorials/overrepresentation-oversampling/.

In an example I tried, they yielded the same result; however, the first one doesn't use the oversampled fraction of the target variable's classes. Does anyone know whether they are interchangeable, or whether one of them is better in certain situations?

Best Answer

The two formulas are equivalent (the first is rather more elegant, IMO).

Let $\alpha$ denote the "original fraction" from the second link, the fraction of the positive class in the population, and let $\alpha'$ denote the (re/over/under)sampled fraction. Keeping $p_s$ as the model's output "probability" score and $p$ the calibrated score as in the first link, the second formula is given in symbols as

$$ p = \frac{1}{1+\frac{\left(\frac{1}{\alpha}-1\right)}{\left(\frac{1}{\alpha'}-1\right)} \cdot \left(\frac{1}{p_s}-1\right)}.$$

That's a terrible mess, but it does have the advantage that each variable appears only once (maybe that's why the post gives it that way?).

The first formula can be rewritten similarly, by dividing numerator and denominator by $\beta p_s$:

$$p = \frac{\beta p_s}{(\beta-1)p_s+1} = \frac{1}{\left(1-\frac{1}{\beta}\right) + \frac{1}{\beta p_s}} = \frac{1}{1+\frac{1}{\beta}\left(-1 + \frac{1}{p_s}\right)}.$$

So now it's clear that these two are equivalent, provided that

$$\beta = \left(\frac{1}{\alpha'}-1\right) / \left(\frac{1}{\alpha}-1\right),$$

which, it is worth pointing out, is just the ratio (population to resampled data) of the odds of a positive sample, $\frac{\alpha}{1-\alpha} \big/ \frac{\alpha'}{1-\alpha'}$. And indeed, the two formulas for adjusting probabilities have a simpler explanation in terms of the odds: the adjusted odds are $\beta$ times the raw model "odds," i.e. $\frac{p}{1-p} = \beta \cdot \frac{p_s}{1-p_s}$.
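As a quick numerical sanity check of the equivalence, here is a minimal Python sketch. The function names and the example fractions (5% positives in the population, 50% after resampling) are illustrative assumptions, not taken from either linked source.

```python
import math


def beta_from_fractions(alpha, alpha_resampled):
    """beta = (1/alpha' - 1) / (1/alpha - 1)."""
    return (1.0 / alpha_resampled - 1.0) / (1.0 / alpha - 1.0)


def adjust_with_beta(p_s, beta):
    """First formula: p = beta * p_s / ((beta - 1) * p_s + 1)."""
    return beta * p_s / ((beta - 1.0) * p_s + 1.0)


def adjust_with_fractions(p_s, alpha, alpha_resampled):
    """Second formula, written directly in terms of the class fractions."""
    ratio = (1.0 / alpha - 1.0) / (1.0 / alpha_resampled - 1.0)
    return 1.0 / (1.0 + ratio * (1.0 / p_s - 1.0))


def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1.0 - p)


# Illustrative numbers (assumptions): 5% positives in the population,
# 50% positives in the resampled training data.
alpha, alpha_resampled = 0.05, 0.5
beta = beta_from_fractions(alpha, alpha_resampled)

for p_s in (0.1, 0.5, 0.9):
    p1 = adjust_with_beta(p_s, beta)
    p2 = adjust_with_fractions(p_s, alpha, alpha_resampled)
    assert math.isclose(p1, p2)                      # the two formulas agree
    assert math.isclose(odds(p1), beta * odds(p_s))  # adjusted odds = beta * raw odds
    print(f"p_s={p_s:.2f} -> p={p1:.4f}")
```

With these numbers $\beta = 1/19$, and a raw score of 0.5 maps back to 0.05, the population prevalence, as one would hope.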

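Now, the context of the first link is that only the negative majority class is undersampled, and $\beta$ is defined there as the probability that a negative sample is selected. That does use the resampled prevalence, just not as explicitly: if each negative is kept with probability $\beta$ and every positive is kept, the negative-to-positive odds $\frac{1}{\alpha'}-1$ in the resampled data are (in expectation) $\beta$ times the population value $\frac{1}{\alpha}-1$, which is exactly the expression for $\beta$ above.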
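To see why the selection probability and the prevalence-based expression for $\beta$ coincide in the undersampling setting, here is a small sketch using exact fractions and expected counts. The class counts and the keep-probability are hypothetical, chosen only for illustration.

```python
from fractions import Fraction

# Hypothetical class counts and keep-probability (assumptions for illustration).
n_pos, n_neg = 1_000, 19_000
beta = Fraction(1, 19)  # probability that a negative sample is kept

# Expected number of negatives remaining after undersampling;
# all positives are kept.
n_neg_kept = beta * n_neg

alpha = Fraction(n_pos, n_pos + n_neg)          # population prevalence
alpha_resampled = n_pos / (n_pos + n_neg_kept)  # prevalence after undersampling

# Recover beta from the two prevalences alone.
beta_from_odds = (1 / alpha_resampled - 1) / (1 / alpha - 1)

assert beta_from_odds == beta
print(alpha, alpha_resampled, beta_from_odds)
```

In a real sample the kept-negative count is random, so the recovered value would match $\beta$ only approximately; the expected counts used here make the identity exact.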

See also https://datascience.stackexchange.com/q/58631/55122
