Solved – How to get exact predictions from oversampled data (undoing oversampling)


We are oversampling the data for use in a logistic regression. The aim is to predict CTR (click probability), which is a rare-event scenario.
I have predicted the click probabilities, but the CTR results are inflated because we oversampled the positive class.

library(DMwR)  # assumption: SMOTE() with a `learner` argument is the DMwR implementation
model2 <- SMOTE(V61 ~ ., z2, perc.over = 600, perc.under = 100, learner = 'glm', family = binomial())

Is there any way to undo the oversampling so that I can get exact probabilities? Based on my research so far, the easiest way is to divide the output probability by the multiplier we used in oversampling. I don't think that would be exact, as I have used the synthetic minority over-sampling technique (SMOTE) in R.

Best Answer

It doesn't work just to divide the probabilities. Basically you have to adjust the odds, not the probabilities.
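
As a rough illustration of what "adjusting the odds" means, here is a minimal R sketch. It assumes you know the fraction of positives in the original data and in the resampled training data, and it rescales the predicted odds by the ratio of the original class odds to the resampled class odds. The function name `undo_oversampling`, its argument names, and the 1% / 50% rates in the usage line are hypothetical and chosen only for illustration. This is the standard prior correction for plain over- or undersampling; whether it is exact for SMOTE's synthetic cases is not established here.

```r
# Map probabilities from a model fit on resampled data back to the
# original class balance (hypothetical helper, not part of any package).
#   p_sample:      predicted probabilities from the resampled-data model
#   rate_original: fraction of positives (clicks) in the original data
#   rate_sample:   fraction of positives in the resampled training data
undo_oversampling <- function(p_sample, rate_original, rate_sample) {
  stopifnot(all(p_sample >= 0 & p_sample <= 1),
            rate_original > 0, rate_original < 1,
            rate_sample > 0, rate_sample < 1)
  # Reweight each class by (original share / resampled share).
  w_pos <- rate_original / rate_sample
  w_neg <- (1 - rate_original) / (1 - rate_sample)
  # Algebraically the same as multiplying the predicted odds by
  # [rate_original / (1 - rate_original)] / [rate_sample / (1 - rate_sample)],
  # written this way so that p_sample = 0 or 1 does not produce NaN.
  (p_sample * w_pos) / (p_sample * w_pos + (1 - p_sample) * w_neg)
}

# Illustrative numbers only: 1% clicks originally, 50% after resampling.
undo_oversampling(c(0.5, 0.8), rate_original = 0.01, rate_sample = 0.5)
```

With these illustrative rates, a resampled-model prediction of 0.5 maps back to 0.01, the original base rate, which is the sanity check you would expect from a correct adjustment.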

There's a nice description and some sample calculations here:

(added in edit) There's a different derivation that gives the same results here:

That blog post is a bit simpler to understand.

I'm not a SMOTE user, and can't comment on the particular applicability to SMOTE.
