Logistic Regression – Insights on Case-Control Study and Logistic Regression

Tags: case-control-study, experiment-design, gwas, logistic

Suppose we have case-control data, where cases have some disease ($Y$) and controls don't, and we are interested in the association between the disease and some other variable(s) ($X$). I know that in this scenario we cannot use the disease as the response variable because of the experimental design (the marginal distribution of disease is fixed by sampling).

I also know that the odds ratio can nevertheless be calculated in such designs, because it takes the same value whether it is computed from the conditional distribution of $X \mid Y$ or of $Y \mid X$.
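This invariance can be checked on a small 2x2 table. The following is a minimal sketch: the counts are made up for illustration, and `odds_ratios` is a hypothetical helper name, not a function from any package.

```python
from fractions import Fraction

def odds_ratios(a, b, c, d):
    """Return the odds ratio computed two ways from a 2x2 table.

    Layout (hypothetical counts):
                 X=1   X=0
        Y=1       a     b
        Y=0       c     d
    """
    # Odds of exposure (X=1) within each disease group: uses P(X | Y).
    or_x_given_y = Fraction(a, b) / Fraction(c, d)
    # Odds of disease (Y=1) within each exposure group: uses P(Y | X).
    or_y_given_x = Fraction(a, c) / Fraction(b, d)
    return or_x_given_y, or_y_given_x

# Illustrative counts only.
retro, prosp = odds_ratios(a=40, b=60, c=20, d=80)
print(retro, prosp)
```

Both expressions reduce to the cross-product ratio $ad/bc$, which is why the direction of conditioning does not matter.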

My question is: is it appropriate to use logistic regression in this case to model the odds of disease, i.e. $\text{logit}\{P(Y=1 \mid X)\} = \log\left\{\dfrac{P(Y=1 \mid X)}{1-P(Y=1 \mid X)}\right\} = \beta_0 + \mathbf{\beta} X$?

Context: GWAS (genome-wide association studies) are typically case-control studies, where one wants to assess the association between disease and the number of minor alleles at a particular SNP. $P$-values are typically obtained from a chi-squared test of independence. However, this doesn't allow you to add in other covariates. Many of the packages that offer GWAS analysis also allow you to do logistic regression. I just wanted to verify that it is valid to do such an analysis.

Best Answer

Logistic regression is a valid inferential method here because, as you've noted, you're modeling the odds, and the odds ratio is the same whether you condition on $X$ or on $Y$. The coefficients on the explanatory variables $X$ will therefore be valid. However, the intercept term $\beta_0$ will not be: the numbers of positive and negative outcomes are fixed by the case-control design, so the fitted intercept reflects the ratio in which cases and controls were sampled rather than the baseline odds of disease in the population. So the intercept term will be meaningless, but your other estimates are fine. More information is in Agresti, An Introduction to Categorical Data Analysis (second edition; 2007), p. 105.
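The point about the intercept can be illustrated with a small calculation. The sketch below is not taken from Agresti. It assumes a single binary covariate, a hypothetical population model with made-up values `beta0 = -3.0` and `beta1 = 0.7`, and hypothetical sampling fractions for cases and controls. It works with expected cell proportions rather than random draws, and it relies on the fact that with one 0/1 covariate the logistic model is saturated, so the maximum likelihood fit is available in closed form.

```python
import math

def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_binary_logistic(cells):
    """Closed-form logistic MLE for a single 0/1 covariate.

    cells[(y, x)] holds the count (or proportion) with outcome y and
    covariate x. With one binary covariate the model is saturated, so the
    MLE equals the observed log odds (intercept) and log odds ratio (slope).
    """
    odds_x0 = cells[(1, 0)] / cells[(0, 0)]
    odds_x1 = cells[(1, 1)] / cells[(0, 1)]
    return math.log(odds_x0), math.log(odds_x1 / odds_x0)

# Hypothetical population model: logit P(Y=1 | X) = beta0 + beta1 * X.
beta0, beta1 = -3.0, 0.7
p_x1 = 0.3  # assumed prevalence of X=1 in the population

# Expected joint proportions P(Y=y, X=x) in the population.
population = {}
for x, p_x in ((0, 1.0 - p_x1), (1, p_x1)):
    p_y1 = expit(beta0 + beta1 * x)
    population[(1, x)] = p_x * p_y1
    population[(0, x)] = p_x * (1.0 - p_y1)

# Case-control sampling: keep every case but only 5% of controls
# (assumed fractions, chosen for illustration).
f_case, f_control = 1.0, 0.05
keep = {1: f_case, 0: f_control}
sample = {(y, x): p * keep[y] for (y, x), p in population.items()}

pop_intercept, pop_slope = fit_binary_logistic(population)
cc_intercept, cc_slope = fit_binary_logistic(sample)
shift = math.log(f_case / f_control)

print("slope:     population", pop_slope, "case-control", cc_slope)
print("intercept: population", pop_intercept, "case-control", cc_intercept)
print("log(f_case / f_control):", shift)
```

Under these assumptions the slope is the same in both fits, while the case-control intercept equals the population intercept plus `log(f_case / f_control)`. Since the sampling fractions are usually unknown in practice, the population intercept cannot be recovered from the case-control fit alone.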
