Solved – How to get exact predictions from oversampled data (undoing oversampling)

classification, oversampling, r, sampling

We are oversampling the data for use in a logistic regression. The aim is to predict CTR (click probability), which is a rare-event scenario.
I have predicted the click probabilities, but the CTR estimates are inflated because we oversampled the positive class.

library(DMwR)  # assumed source of SMOTE(); the learner argument matches DMwR's interface
model2 <- SMOTE(V61 ~ ., z2, perc.over = 600, perc.under = 100, learner = 'glm', family = binomial())

Is there any way to undo the effect of oversampling so that I can get exact probabilities? Based on my research so far, the easiest way is to divide the output probability by the multiplier used in oversampling. I don't think that would be exact, since I used the synthetic minority oversampling technique (SMOTE) in R.

Best Answer

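Simply dividing the probabilities doesn't work. You have to adjust the odds, not the probabilities: resampling the classes rescales the odds of a positive by a constant factor, and that does not correspond to a constant factor on the probability scale.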
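As a rough illustration of the odds adjustment, here is a minimal R sketch. The function name undo_oversampling and the arguments pi_true (fraction of positives in the original data) and rho (fraction of positives in the resampled training data) are hypothetical names chosen for this example, and the numbers are made up rather than taken from the question. It assumes plain resampling of the classes; with SMOTE's synthetic cases the correction is probably best treated as an approximation.

```r
# Sketch of an odds-based correction for predictions from a model
# trained on class-resampled data.
# p_sampled : predicted probabilities from the model (assumed < 1)
# pi_true   : fraction of positives in the original data (assumption: known)
# rho       : fraction of positives in the resampled training data
undo_oversampling <- function(p_sampled, pi_true, rho) {
  # Convert predicted probabilities to odds
  odds_sampled <- p_sampled / (1 - p_sampled)
  # Ratio of the original positive odds to the training-sample positive odds
  correction <- (pi_true / (1 - pi_true)) / (rho / (1 - rho))
  # Rescale the odds, then convert back to probabilities
  odds_true <- odds_sampled * correction
  odds_true / (1 + odds_true)
}

# Illustrative numbers only: 1% positives originally, 50% after resampling
p_sampled <- c(0.10, 0.50, 0.90)
p_adjusted <- undo_oversampling(p_sampled, pi_true = 0.01, rho = 0.50)
print(p_adjusted)
```

With these made-up inputs, a prediction of 0.50 on the balanced sample maps back to the original 1% base rate, whereas dividing 0.50 by a fixed multiplier would not generally do so. In practice rho is best measured directly from the resampled training set rather than inferred from the perc.over and perc.under settings.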

There's a nice description and some sample calculations here: https://yiminwu.wordpress.com/2013/12/03/how-to-undo-oversampling-explained/

(added in edit) There's a different derivation that gives the same results here:

http://blog.data-miners.com/2009/09/adjusting-for-oversampling.html

That blog post is a bit simpler to understand.

I'm not a SMOTE user, and can't comment on the particular applicability to SMOTE.
