Intercept bias in case-control logistic regression: what is the reason?

bias, case-control-study, intuition, logistic, maximum-likelihood

I don't understand why the intercept is biased when I use case-control sampling in a logistic regression. Agresti (2007), An Introduction to Categorical Data Analysis, says: "... the intercept term in the model is not meaningful, because it relates to the relative numbers of outcomes of $y=1$ and $y=0$. We don't estimate this, because the sample frequencies for $y=1$ and $y=0$ are fixed by the nature of case-control study".

What I don't understand is how fixing the sample frequencies for $y=1$ and $y=0$ in a case-control study leads to the bias in the intercept.

Could anyone explain this connection?

Best Answer

No estimate is biased per se; an estimate can only be a biased estimate of something, and specifying that something is crucial. In this case, the constant is a biased estimate of the log odds of y=1 in the population when all the explanatory variables are 0. In a case-control study you have lost that information through the way the study was designed. Since the constant does not measure what you want it to measure, we call it biased.

Don't be alarmed, though. The word bias sounds bad, but the purpose of a case-control study is not to estimate the odds of y=1 in the population, so this bias is irrelevant.
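
To make the connection explicit, here is a short derivation sketch. The notation is an assumption introduced for illustration and is not from the question or from Agresti: $s=1$ means an individual was included in the study, $\pi_1 = P(s=1 \mid y=1)$ and $\pi_0 = P(s=1 \mid y=0)$ are the sampling fractions for cases and controls, and sampling is assumed to depend on $x$ only through $y$. Suppose the population model is $\operatorname{logit} P(y=1 \mid x) = \alpha + \beta x$. By Bayes' rule,

$$
\begin{aligned}
\frac{P(y=1 \mid x, s=1)}{P(y=0 \mid x, s=1)}
  &= \frac{\pi_1 \, P(y=1 \mid x)}{\pi_0 \, P(y=0 \mid x)}, \\
\operatorname{logit} P(y=1 \mid x, s=1)
  &= \log\frac{\pi_1}{\pi_0} + \alpha + \beta x .
\end{aligned}
$$

So the model fitted to the sampled data has the same slope $\beta$, but its constant is $\alpha + \log(\pi_1/\pi_0)$ rather than $\alpha$. The offset $\log(\pi_1/\pi_0)$ is set by how many cases and controls the design collected, and the data alone do not reveal $\pi_1$ or $\pi_0$.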


Response to comment:

The rare events logit method proposed by Gary King won't help in your case, as it solves a different problem. With a case-control study you cannot estimate the probability of an event, as you did not collect that information in the first place. No method can extract information that is not present in the data.

Consider a simpler problem where you have no explanatory variables. What would you need to estimate the proportion of y=1? You would draw a random sample from your population and compute the proportion of those observations with y=1. In a case-control study you instead start with a number of observations with y=1 and find one or more matches with y=0 for each of them. So the proportion of observations with y=1 in your data only tells you something about your design, and nothing about your population. If you collected one control per case, the proportion of cases in your data will be 0.5; if you collected two controls per case, the proportion will be 0.333, etc. This proportion in your data says nothing about how common cases are in the population. This information was never collected, and there is thus no way to recover it from a case-control study.
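
The no-covariate case can be checked with a small simulation. This is only a sketch under stated assumptions: the function names, the population size, the number of cases, and the prevalences are hypothetical choices made for illustration. It relies on the fact that, with no explanatory variables, the maximum likelihood estimate of the logistic intercept is the logit of the sample proportion of y=1.

```python
import math
import random


def case_control_sample(prevalence, controls_per_case, n_cases=200,
                        pop_size=100_000, seed=0):
    """Draw a case-control sample of outcomes from a simulated population.

    All argument names and defaults are illustrative assumptions.
    """
    rng = random.Random(seed)
    # Simulated population: y=1 with probability `prevalence`.
    population = [1 if rng.random() < prevalence else 0
                  for _ in range(pop_size)]
    cases = [y for y in population if y == 1]
    controls = [y for y in population if y == 0]
    # The design fixes the counts: n_cases cases and
    # controls_per_case controls for each of them.
    return (rng.sample(cases, n_cases)
            + rng.sample(controls, controls_per_case * n_cases))


def intercept_only_logit(sample):
    """Intercept MLE of a logistic model with no covariates:
    the logit of the sample proportion of y=1."""
    p = sum(sample) / len(sample)
    return math.log(p / (1 - p))


if __name__ == "__main__":
    for prevalence in (0.01, 0.2):
        for k in (1, 2):
            sample = case_control_sample(prevalence, k)
            proportion = sum(sample) / len(sample)
            print(f"prevalence={prevalence}, controls per case={k}: "
                  f"sample proportion={proportion:.3f}, "
                  f"intercept={intercept_only_logit(sample):.3f}")
```

Whatever prevalence is used to build the population, the sample proportion depends only on the number of controls per case, and so does the fitted intercept.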