Solved – Why is relative risk not valid in case control studies

case-control-study, odds, relative-risk

I'm getting my information from:
http://sphweb.bumc.bu.edu/otlt/mph-modules/ep/ep713_analyticoverview/EP713_AnalyticOverview5.html
It says that in a case-control study you cannot compute the probability of disease in each exposure group because you don't know the total number of people in the population. But you have just created a new population for the case-control study, so why not use its total as the population total? Of course, it is only a sample of the population you are really interested in, which introduces sampling bias and error, but that is true for the odds ratio as well. So you would simply say it is an approximation of the true relative risk.

Another argument I've read is that in a case-control study you haven't really taken a cross-section of the population, because you started with a fixed number of people with the same outcome (diseased). But the number of individuals in this group (disregarding noise) seems irrelevant: if the group is bigger, the number of exposed people grows by the same multiplier as the number of unexposed people. So when calculating the relative risk (risk in the exposed group / risk in the unexposed group), both the numerator and the denominator grow by the same factor, which makes no difference to the value of the fraction.

Can anyone explain what I'm missing? All thoughts appreciated.

Best Answer

I'll try to explain this more intuitively and with an illustration.

The risk ratio and the odds ratio can both be written in terms of probabilities, and which probabilities you can estimate depends on the study design. Before I start writing formulas, let me define some symbols.

$X$ = outcome

$Y$ = exposure

$\neg{}X$ = no outcome (likewise, $\neg{}Y$ = no exposure)

$P(X|Y)$ = conditional probability of X happening, given that Y happened

Risk

For example, if you have complete information about a population and want to compute the risk (probability) of an outcome given an exposure, you would write: $$Risk_{pop} = P(X|Y)$$ And the risk ratio between having the outcome given exposure and having the outcome given no exposure would be: $$RR_{pop} = \frac{P(X|Y)}{P(X|\neg{Y})}$$

Now, if you are sampling from a population, things get a little different, depending on the sampling design. That's because when you sample, you're drawing from a population with a specific probability. If you sample people based on their exposure status (cohort design), and then wait until you see the outcome, you would have $$Risk_{cohort}=\frac{P(X|Y)}{P(X|Y)+P(\neg{X}|Y)}=\frac{P(X|Y)}{1}=Risk_{pop}$$

This is precisely the same as the population risk. Since $Risk_{cohort} = Risk_{pop}$, it follows that $RR_{cohort} = RR_{pop}$. So a cohort study has the perfect design for calculating the population risk.

However, if you sampled people based on their outcome status (case-control design), and then checked whether they were exposed or not, you would get a very different probability, that is the probability of finding an exposure, given that you know the outcome: $$Risk_{case-control}=\frac{P(Y|X)}{P(Y|X)+P(Y|\neg{X})}\ne{}P(X|Y), Risk_{case-control}\ne{}Risk_{pop}$$

and

$$RR_{case-control} = \frac{\frac{P(Y|X)}{P(Y|X)+P(Y|\neg{X})}}{\frac{P(\neg{Y}|X)}{P(\neg{Y}|X)+P(\neg{Y}|\neg{X})}}\ne{}\frac{P(X|Y)}{P(X|\neg{Y})}, RR_{case-control}\ne{}RR_{pop}$$

Therefore, you are not calculating the risk in a case-control study, because the probabilities are not the same.
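To make the mismatch concrete, here is a minimal numeric sketch. It assumes the same illustrative values as the simulation described further down (20% exposed, 2% risk if exposed, 1% risk if unexposed) and a case-control sample with equal numbers of cases and controls. The variable names are mine, chosen only for this illustration.

```python
# Assumed illustrative population values (same as the simulation described later).
p_exposed = 0.20        # P(Y)
risk_exposed = 0.02     # P(X|Y)
risk_unexposed = 0.01   # P(X|not Y)

# True population risk ratio.
rr_pop = risk_exposed / risk_unexposed

# Exposure probabilities among cases and controls, obtained with Bayes' theorem.
p_outcome = p_exposed * risk_exposed + (1 - p_exposed) * risk_unexposed    # P(X)
p_exp_given_case = p_exposed * risk_exposed / p_outcome                    # P(Y|X)
p_exp_given_control = p_exposed * (1 - risk_exposed) / (1 - p_outcome)     # P(Y|not X)

# "Risk" computed naively inside a 1:1 case-control sample, as in the formulas above.
risk_cc_exposed = p_exp_given_case / (p_exp_given_case + p_exp_given_control)
risk_cc_unexposed = (1 - p_exp_given_case) / (
    (1 - p_exp_given_case) + (1 - p_exp_given_control)
)
rr_cc = risk_cc_exposed / risk_cc_unexposed

print(f"population RR:         {rr_pop:.3f}")
print(f"naive case-control RR: {rr_cc:.3f}")
```

Under these assumptions the naive case-control value comes out well below the population risk ratio, even though no sampling noise is involved at all.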

Odds

The odds of something happening is the probability of it happening divided by the probability of it not happening. For example, if the probability of winning were 80%, the odds of winning would be 4 to 1, because you divide 80% by 20%. So the odds of an outcome, given exposure, would be:

$$Odds_{pop} = \frac{P(X|Y)}{P(\neg{X}|Y)}$$ And the odds ratio would be the ratio between the odds of the outcome given exposure (for example, cancer if you smoked) and the odds of the outcome given no exposure (cancer if you didn't smoke).

$$OR_{pop} = \frac{\frac{P(X|Y)}{P(\neg{X}|Y)}}{\frac{P(X|\neg{}Y)}{P(\neg{X}|\neg{}Y)}}$$
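As a worked illustration, assume $P(X|Y) = 0.02$ and $P(X|\neg{}Y) = 0.01$ (the values used in the simulation described further down; they are only an example). Then:

$$Odds_{exposed} = \frac{0.02}{0.98} \approx 0.0204, \qquad Odds_{unexposed} = \frac{0.01}{0.99} \approx 0.0101$$

$$OR_{pop} = \frac{0.02 / 0.98}{0.01 / 0.99} = \frac{0.02 \times 0.99}{0.98 \times 0.01} \approx 2.02$$

With these numbers the risk ratio is exactly 2, so the odds ratio is close to it. That closeness is expected here because the outcome is rare in both exposure groups.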

Sample odds

If you were doing a case-control study, in which the Odds Ratio would be the choice for measuring the effect size, you would be calculating this:

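$$OR_{case-control} = \frac{\frac{P(Y|X)}{P(Y|\neg{}X)}}{\frac{P(\neg{}Y|X)}{P(\neg{}Y|\neg{}X)}} = \frac{\frac{\frac{P(Y)P(X|Y)}{P(X)}}{\frac{P(Y)P(\neg{}X|Y)}{P(\neg{}X)}}}{\frac{\frac{P(\neg{}Y)P(X|\neg{}Y)}{P(X)}}{\frac{P(\neg{}Y)P(\neg{}X|\neg{}Y)}{P(\neg{}X)}}} = \frac{\frac{P(X|Y)}{P(\neg{}X|Y)}\cdot\frac{P(\neg{}X)}{P(X)}}{\frac{P(X|\neg{}Y)}{P(\neg{}X|\neg{}Y)}\cdot\frac{P(\neg{}X)}{P(X)}} = \frac{\frac{P(X|Y)}{P(\neg{}X|Y)}}{\frac{P(X|\neg{}Y)}{P(\neg{}X|\neg{}Y)}} = OR_{pop}$$ The second step rewrites each term with Bayes' theorem. The marginal probabilities $P(Y)$ and $P(\neg{}Y)$ and the common factor $P(\neg{}X)/P(X)$ then cancel, leaving the population odds ratio.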

I won't write out the equation for the odds ratio in a cohort study, because it is exactly the same as the population odds ratio. The odds ratio is therefore an effect size measure that is adequate for both case-control and cohort designs, because both estimate the same quantity.
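The same point can be checked numerically. The sketch below is only an illustration under assumed values (20% exposed, 2% and 1% risk): it builds the expected 2x2 table for several control-to-case ratios and computes both the odds ratio and a naively computed "RR". The helper `expected_table` and all variable names are hypothetical, introduced just for this example.

```python
# Assumed illustrative population values.
p_exposed = 0.20        # P(Y)
risk_exposed = 0.02     # P(X|Y)
risk_unexposed = 0.01   # P(X|not Y)

# Population odds ratio, computed directly from the risks.
or_pop = (risk_exposed / (1 - risk_exposed)) / (risk_unexposed / (1 - risk_unexposed))

# Exposure probabilities among cases and controls (Bayes' theorem).
p_outcome = p_exposed * risk_exposed + (1 - p_exposed) * risk_unexposed
p_exp_given_case = p_exposed * risk_exposed / p_outcome
p_exp_given_control = p_exposed * (1 - risk_exposed) / (1 - p_outcome)


def expected_table(n_cases, n_controls):
    """Expected cell counts: exposed cases, exposed controls,
    unexposed cases, unexposed controls."""
    a = n_cases * p_exp_given_case
    b = n_controls * p_exp_given_control
    return a, b, n_cases - a, n_controls - b


results = []  # (controls per case, odds ratio, naive RR)
for controls_per_case in (1, 2, 4, 10):
    a, b, c, d = expected_table(1000, 1000 * controls_per_case)
    odds_ratio = (a * d) / (b * c)
    naive_rr = (a / (a + b)) / (c / (c + d))
    results.append((controls_per_case, odds_ratio, naive_rr))
    print(f"{controls_per_case:>2} controls per case: "
          f"OR = {odds_ratio:.3f}, naive RR = {naive_rr:.3f}")

print(f"population OR = {or_pop:.3f}")
```

The odds ratio stays the same whatever the ratio of controls to cases, while the naive RR changes with it. That is also why the size of the case group is not irrelevant for the RR: it depends on a sampling ratio the investigator chose arbitrarily.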

Simulation example

Now, what would happen if you did try to calculate an RR from a case-control design?

[Figure: Distribution of effect sizes in different study designs]

This figure is the result of a simulation of a population of 2 million people, in which 20% smoked, 2% of the smokers had cancer, and 1% of the non-smokers had cancer. I simulated a cohort and a case-control design, with adequate sample sizes, and repeated the estimates 40 times in each case, for each effect size calculation. The code can be found here.

You can see that the distributions of the effect sizes are similar across both study designs when you compute the appropriate measure for each design. However, when the RR is computed in a case-control study, its distribution is very different from the others and never gets close to the true risk ratio.
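For readers who want to try something similar, here is a rough sketch of such a simulation. It is not the original code: it assumes the same population parameters, draws directly from the conditional probabilities instead of building a 2-million-person population, and uses sample sizes I picked arbitrarily (10,000 per exposure arm for the cohort, 2,000 cases and 2,000 controls). All function names are hypothetical.

```python
import random
import statistics

# Assumed population parameters (from the simulation description above).
P_EXPOSED = 0.20        # P(Y): proportion of smokers
RISK_EXPOSED = 0.02     # P(X|Y)
RISK_UNEXPOSED = 0.01   # P(X|not Y)

# Exposure prevalence among cases and controls, via Bayes' theorem.
P_OUTCOME = P_EXPOSED * RISK_EXPOSED + (1 - P_EXPOSED) * RISK_UNEXPOSED
P_EXP_GIVEN_CASE = P_EXPOSED * RISK_EXPOSED / P_OUTCOME
P_EXP_GIVEN_CONTROL = P_EXPOSED * (1 - RISK_EXPOSED) / (1 - P_OUTCOME)


def count_successes(rng, n, p):
    """Number of successes in n independent Bernoulli(p) draws."""
    return sum(rng.random() < p for _ in range(n))


def ratios(a, b, c, d):
    """a: exposed with outcome, b: exposed without outcome,
    c: unexposed with outcome, d: unexposed without outcome.
    Returns (RR computed from the table, odds ratio)."""
    return (a / (a + b)) / (c / (c + d)), (a * d) / (b * c)


def simulate_cohort(rng, n_per_arm):
    # Sample on exposure, then observe the outcome.
    a = count_successes(rng, n_per_arm, RISK_EXPOSED)
    c = count_successes(rng, n_per_arm, RISK_UNEXPOSED)
    return ratios(a, n_per_arm - a, c, n_per_arm - c)


def simulate_case_control(rng, n_cases, n_controls):
    # Sample on outcome, then look up the exposure.
    a = count_successes(rng, n_cases, P_EXP_GIVEN_CASE)
    b = count_successes(rng, n_controls, P_EXP_GIVEN_CONTROL)
    return ratios(a, b, n_cases - a, n_controls - b)


rng = random.Random(42)  # fixed seed so the run is repeatable
REPEATS = 40
cohort = [simulate_cohort(rng, 10_000) for _ in range(REPEATS)]
case_control = [simulate_case_control(rng, 2_000, 2_000) for _ in range(REPEATS)]

mean_cohort_rr = statistics.mean(rr for rr, _ in cohort)
mean_cohort_or = statistics.mean(o for _, o in cohort)
mean_cc_rr = statistics.mean(rr for rr, _ in case_control)
mean_cc_or = statistics.mean(o for _, o in case_control)

print(f"cohort:       mean RR = {mean_cohort_rr:.2f}, mean OR = {mean_cohort_or:.2f}")
print(f"case-control: mean RR = {mean_cc_rr:.2f}, mean OR = {mean_cc_or:.2f}")
```

Plotting the 40 values per design and measure (for instance as box plots) should give a picture along the lines of the figure: the cohort RR, cohort OR and case-control OR cluster around the same value, while the case-control "RR" sits apart.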
