Solved – Rare event logistic regression bias correction

bias-correction, logistic, rare-events, regression

In King and Zeng's paper:
http://gking.harvard.edu/files/gking/files/0s.pdf

They define $\tau$ and $\bar{y}$. I have data with 90,000 0's and 450 1's. I have already fitted a logistic regression to the whole data set and want to apply the prior correction to the intercept.

Or should I instead take about 3,000 0's together with the 450 1's, run the logistic regression on that sample, and then apply the prior correction to the intercept? Would then $\tau$ = 450/90450 and $\bar{y}$ = 450/3450?

Edit (based on the answer from Scortchi):

I am trying to predict the probability of a match happening. A match might occur between a buyer and a seller, two individuals on a dating site, or a job seeker and a prospective employer. A 1 is recorded when a match happens, and a 0 for every other pair-wise interaction that has been recorded. I have real-life data from one of these use cases. As said before, the rate of 1's in the data is very small (= 450/(450 + 90000)). I want to build a logistic regression model with the correction from King et al.

The data I have can be presumed to be all possible data, i.e. it is the whole universe. I would therefore presume the rate of 1's in the universe to be 450/(450 + 90000).

I want to keep all the 1's (450 of them) and a random 3,000 0's from this data universe; this is sampling based on the 1's. Once the logistic regression is built on this sample, I want to make the bias correction.

Is it right to presume here that $\tau$ = 450/(450 + 90000) and $\bar{y}$ = 450/(450+3000)?
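Plugging the proposed sampling scheme into King & Zeng's notation, the values of $\tau$ and $\bar{y}$ and the resulting intercept adjustment can be checked with a few lines of arithmetic (a sketch of my own, not code from the paper):

```python
from math import log

# Proposed down-sampling: keep all 450 ones plus a random 3000 zeros,
# with the full data set treated as the population.
tau = 450 / (450 + 90_000)    # population fraction of 1's
ybar = 450 / (450 + 3_000)    # sample fraction of 1's after down-sampling

# King & Zeng's prior correction subtracts this quantity from the
# intercept fitted on the down-sampled data:
correction = log(((1 - tau) / tau) * (ybar / (1 - ybar)))
# (1 - tau)/tau = 90000/450 = 200; ybar/(1 - ybar) = 450/3000 = 0.15
# so correction = log(30) ≈ 3.40
```

So under this scheme the fitted intercept would be reduced by about 3.40.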

I am arguing that $\tau$ is indeed the universe rate, because for my use case I have essentially all of the target-population data. My question is: with the current setup of the problem, how should $\tau$ and $\bar{y}$ be defined? Running time is not the issue; how to make the bias correction for a rare event is.

Best Answer

They define $\tau$ & $\bar{y}$ too: $\tau$ is the fraction of 1's in the population (known or estimated from prior information); $\bar{y}$ is the observed fraction of 1's in the sample.
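For reference, the prior correction King & Zeng give for the intercept (Section 4.1 of the paper) replaces the fitted $\hat{\beta}_0$ with

$$\hat{\beta}_0 - \ln\!\left[\left(\frac{1-\tau}{\tau}\right)\left(\frac{\bar{y}}{1-\bar{y}}\right)\right],$$

leaving the slope coefficients unchanged.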

You'd typically use prior correction when you've sampled based on the outcome, which I'd guess you haven't here. But if you have, then $\bar{y}=\frac{450}{90450}$ & you need to know or estimate $\tau$ in some other way.

Down-sampling, as described (quite correctly) in your last paragraph, can help if the full sample is too large for your computer's memory to hold or for its processor to deal with quickly, by sacrificing a little precision. But in this case you've fit the model on all the data already (I doubt it took very long).

[What you describe in your edit is what I called down-sampling, & you're applying the prior correction correctly. In medical statistics it's called a case–control design. You might want to do it when you have the response but not yet the predictors, & there's an extra cost to measuring those. I don't know why you're calling it "bias correction for a rare event", though: it's a correction of the intercept for the deliberately introduced sampling bias. Section 5 of the paper deals with correcting the bias of maximum-likelihood estimates of log odds ratios & predicted probabilities.]