Solved – Estimating logistic regression coefficients in a case-control design when the outcome variable is not case/control status

case-control-study, logistic

Consider sampling data from a population of size $N$ in the following way: For $k=1, \ldots, N$

  1. Observe individual $k$'s "disease" status

  2. If they have the disease, include them in the sample with probability $p_{k1}$

  3. If they do not have the disease, include them with probability $p_{k0}$.

Suppose you observed a binary outcome variable $Y_i$ and predictor vector ${\bf X}_i$, for $i=1, \ldots, n$ subjects sampled this way. The outcome variable is not the "disease" status. I want to estimate the parameters of the logistic regression model:

$$ \log \left( \frac{ P(Y_i = 1 | {\bf X}_i) }{ P(Y_i = 0 | {\bf X}_i) } \right) = \alpha + {\bf X}_i {\boldsymbol \beta} $$

All I care about are the (log) odds ratios, ${\boldsymbol \beta}$. The intercept is irrelevant to me.

My question is: Can I get sensible estimates of ${\boldsymbol \beta}$ by ignoring the sampling probabilities $\{ p_{i1}, p_{i0} \}$, $i=1, \ldots, n$ and fitting the model as if it were an ordinary random sample?


I am pretty much certain the answer to this question is "yes". What I'm looking for is a reference that validates this.

There are two main reasons I'm confident about the answer:

  1. I've done many simulation studies and none of them contradict this, and

  2. It is straightforward to show that, if the population is governed by the model above, then the model governing the sampled data is

$$ \log \left( \frac{ P(Y_i = 1 | {\bf X}_i) }{ P(Y_i = 0 | {\bf X}_i) } \right) = \log(p_{i1}) - \log(p_{i0}) + \alpha + {\bf X}_i {\boldsymbol \beta} $$

If the sampling probabilities did not depend on $i$, then this would represent a simple shift to the intercept and the point estimate of ${\boldsymbol \beta}$ would clearly be unaffected. But if the offsets differ from person to person, this logic does not quite apply, since you will certainly get a different point estimate, although I suspect a similar result still holds.
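The intercept-shift argument can be checked numerically. The sketch below is only an illustration under a strong assumption: inclusion depends on the modelled outcome $Y$ itself (the classic case-control setting), with constant inclusion probabilities. The parameter values and the helper `sampled_prob` are made up for the example.

```python
import math

def expit(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    return math.log(p / (1.0 - p))

def sampled_prob(x, alpha, beta, p1, p0):
    """P(Y=1 | x, sampled) by Bayes' rule, ASSUMING inclusion depends on Y itself."""
    pop = expit(alpha + beta * x)
    return p1 * pop / (p1 * pop + p0 * (1.0 - pop))

alpha, beta = -2.0, 0.7   # illustrative population parameters (assumed)
p1, p0 = 0.9, 0.05        # constant inclusion probabilities (assumed)

log_odds = [logit(sampled_prob(x, alpha, beta, p1, p0)) for x in (0.0, 1.0, 2.0)]
intercept_shift = log_odds[0] - alpha   # should match log(p1) - log(p0)
slope = log_odds[1] - log_odds[0]       # should match beta

print(intercept_shift, math.log(p1) - math.log(p0))
print(slope, beta)
```

Under these assumptions the sampled log odds are linear in $x$ with the same slope and an intercept moved by $\log(p_1) - \log(p_0)$. The sketch says nothing about the case where the offsets vary with $i$.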

Related: The classic paper by Prentice and Pyke (1979) says that logistic regression coefficients estimated from a case-control study (with disease status as the outcome) have the same distribution as those obtained from a prospective study. I suspect the same result would apply here, but I must confess I don't fully understand every bit of the paper.

Thanks in advance for any comments/references.

Best Answer

This is a variation of the selection model in econometrics. The validity of the estimates using only the selected sample here depends on the condition that $\Pr\left(Y_{i}=1\mid X_{i},D_{i}=1\right)=\Pr\left(Y_{i}=1\mid X_{i},D_{i}=0\right)$. Here $D_i$ is $i$'s disease status.

To give more details, define the following notations: $\pi_{1}=\Pr\left(D_{i}=1\right)$ and $\pi_{0}=\Pr\left(D_{i}=0\right)$; $S_{i}=1$ refers to the event that $i$ is in the sample. Moreover, assume $D_{i}$ is independent of $X_{i}$ for simplicity.

By the law of iterated expectations, the probability of $Y_{i}=1$ for a unit $i$ in the sample is

\begin{eqnarray*} \Pr\left(Y_{i}=1\mid X_{i},S_{i}=1\right) & = & \mathrm{E}\left(Y_{i}\mid X_{i},S_{i}=1\right)\\ & = & \mathrm{E}\left\{ \mathrm{E}\left(Y_{i}\mid X_{i},D_{i},S_{i}=1\right)\mid X_{i},S_{i}=1\right\} \\ & = & \Pr\left(D_{i}=1\mid S_{i}=1\right)\Pr\left(Y_{i}=1\mid X_{i},D_{i}=1,S_{i}=1\right)+\\ & & \Pr\left(D_{i}=0\mid S_{i}=1\right)\Pr\left(Y_{i}=1\mid X_{i},D_{i}=0,S_{i}=1\right), \end{eqnarray*}

where the weights do not involve $X_{i}$ because $D_{i}$ is assumed independent of $X_{i}$.

Suppose that, conditional on the disease status $D_{i}$ and the other covariates $X_{i}$, the outcome $Y_{i}$ is independent of $S_{i}$. As a result, we have

\begin{eqnarray*} \Pr\left(Y_{i}=1\mid X_{i},S_{i}=1\right) & = & \Pr\left(D_{i}=1\mid S_{i}=1\right)\Pr\left(Y_{i}=1\mid X_{i},D_{i}=1\right)+\\ & & \Pr\left(D_{i}=0\mid S_{i}=1\right)\Pr\left(Y_{i}=1\mid X_{i},D_{i}=0\right). \end{eqnarray*}

By Bayes' rule,

$$ \Pr\left(D_{i}=1\mid S_{i}=1\right)=\frac{\pi_{1}p_{i1}}{\pi_{1}p_{i1}+\pi_{0}p_{i0}}\mbox{ and }\Pr\left(D_{i}=0\mid S_{i}=1\right)=\frac{\pi_{0}p_{i0}}{\pi_{1}p_{i1}+\pi_{0}p_{i0}}. $$

Here $p_{i1}$ and $p_{i0}$ are as defined in your sampling scheme. Thus,

$$ \Pr\left(Y_{i}=1\mid X_{i},S_{i}=1\right)=\frac{\pi_{1}p_{i1}}{\pi_{1}p_{i1}+\pi_{0}p_{i0}}\Pr\left(Y_{i}=1\mid X_{i},D_{i}=1\right)+\frac{\pi_{0}p_{i0}}{\pi_{1}p_{i1}+\pi_{0}p_{i0}}\Pr\left(Y_{i}=1\mid X_{i},D_{i}=0\right). $$

If $\Pr\left(Y_{i}=1\mid X_{i},D_{i}=1\right)=\Pr\left(Y_{i}=1\mid X_{i},D_{i}=0\right)$, we have

$$ \Pr\left(Y_{i}=1\mid X_{i},S_{i}=1\right)=\Pr\left(Y_{i}=1\mid X_{i}\right), $$

and you can ignore the sample selection problem. On the other hand, if $\Pr\left(Y_{i}=1\mid X_{i},D_{i}=1\right)\neq\Pr\left(Y_{i}=1\mid X_{i},D_{i}=0\right)$, then

$$ \Pr\left(Y_{i}=1\mid X_{i},S_{i}=1\right)\neq\Pr\left(Y_{i}=1\mid X_{i}\right) $$

in general.
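To make the mixture formula concrete, here is a small numerical sketch. It assumes a scalar covariate, constant inclusion probabilities, and a hypothetical outcome model $\Pr(Y=1\mid x, D=d)=\mathrm{expit}(a+bx+gd)$, so that $g=0$ corresponds to the condition under which selection can be ignored. All names and values are illustrative.

```python
import math

def expit(z):
    return 1.0 / (1.0 + math.exp(-z))

def population_prob(x, a, b, g, pi1):
    """P(Y=1 | x) in the population, mixing over disease status D."""
    return pi1 * expit(a + b * x + g) + (1.0 - pi1) * expit(a + b * x)

def sample_prob(x, a, b, g, pi1, p1, p0):
    """P(Y=1 | x, S=1): the same mixture, reweighted by P(D=d | S=1)."""
    w1 = pi1 * p1 / (pi1 * p1 + (1.0 - pi1) * p0)
    return w1 * expit(a + b * x + g) + (1.0 - w1) * expit(a + b * x)

a, b = -1.0, 0.5          # assumed outcome-model parameters
pi1 = 0.1                 # assumed disease prevalence
p1, p0 = 1.0, 0.1         # assumed constant inclusion probabilities
x = 1.0

# g = 0: Y does not depend on D given x, so selection is ignorable.
gap_no_effect = abs(sample_prob(x, a, b, 0.0, pi1, p1, p0)
                    - population_prob(x, a, b, 0.0, pi1))

# g = 2: Y depends on D given x, so the sampled probability is distorted.
gap_with_effect = abs(sample_prob(x, a, b, 2.0, pi1, p1, p0)
                      - population_prob(x, a, b, 2.0, pi1))

print(gap_no_effect, gap_with_effect)
```

With these values the two probabilities agree (up to rounding) when $g=0$ and differ noticeably when $g=2$, because oversampling the diseased shifts the mixture weights.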
As a particular case, consider the logit model, $$ \Pr\left(Y_{i}=1\mid X_{i},D_{i}=1\right)=\frac{e^{X_{i}'\alpha}}{1+e^{X_{i}'\alpha}}\mbox{ and }\Pr\left(Y_{i}=1\mid X_{i},D_{i}=0\right)=\frac{e^{X_{i}'\beta}}{1+e^{X_{i}'\beta}}. $$ Even when $p_{i1}$ and $p_{i0}$ are constant across $i$, the resulting distribution does not retain the logit form. More importantly, the interpretation of the parameters would be totally different. Hopefully the above arguments help to clarify your problem a little.
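The loss of the logit form can be seen numerically. The sketch below assumes a scalar $X_i$, a constant mixture weight `w1` standing in for $\Pr(D_i=1\mid S_i=1)$, and made-up coefficients. If the sampled model were logistic, the log odds would be linear in $x$ and their second difference over equally spaced points would be zero.

```python
import math

def expit(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    return math.log(p / (1.0 - p))

def mixture_log_odds(x, coef_d1, coef_d0, w1):
    """Log odds of Y=1 in the sample when each disease group follows its own logit."""
    p = w1 * expit(x * coef_d1) + (1.0 - w1) * expit(x * coef_d0)
    return logit(p)

def second_difference(coef_d1, coef_d0, w1):
    """Curvature of the log odds over x = 0, 1, 2 (zero for a linear log-odds model)."""
    lo = [mixture_log_odds(x, coef_d1, coef_d0, w1) for x in (0.0, 1.0, 2.0)]
    return lo[2] - 2.0 * lo[1] + lo[0]

w1 = 0.5                                         # assumed P(D=1 | S=1)
curv_different = second_difference(2.0, -1.0, w1)  # alpha != beta
curv_equal = second_difference(0.8, 0.8, w1)       # alpha == beta

print(curv_different, curv_equal)
```

With different coefficients in the two disease groups the log odds are visibly curved, so no single pair of logistic parameters describes the sampled data; with equal coefficients the curvature vanishes.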

It is tempting to include $D_{i}$ as an additional explanatory variable and estimate the model based on $\Pr\left(Y_{i}\mid X_{i},D_{i}\right)$. To justify using $\Pr\left(Y_{i}\mid X_{i},D_{i}\right)$, we need to prove that $\Pr\left(Y_{i}\mid X_{i},D_{i},S_{i}=1\right)=\Pr\left(Y_{i}\mid X_{i},D_{i}\right)$, which is equivalent to the condition that $D_{i}$ is a sufficient statistic of $S_{i}$. Without further information about your sampling process, I am not sure whether this is true.

Let's use an abstract notation. The observability variable $S_{i}$ can be viewed as a random function of $D_{i}$ and some other random variables, say $\mathbf{Z}_{i}$. Denote $S_{i}=S\left(D_{i},\mathbf{Z}_{i}\right)$. If $\mathbf{Z}_{i}$ is independent of $Y_{i}$ conditional on $X_{i}$ and $D_{i}$, we have $\Pr\left(Y_{i}\mid X_{i},D_{i},S\left(D_{i},\mathbf{Z}_{i}\right)\right)=\Pr\left(Y_{i}\mid X_{i},D_{i}\right)$ by the definition of independence. However, if $\mathbf{Z}_{i}$ is not independent of $Y_{i}$ after conditioning on $X_{i}$ and $D_{i}$, then $\mathbf{Z}_{i}$ intuitively contains some relevant information about $Y_{i}$, and in general we should not expect $\Pr\left(Y_{i}\mid X_{i},D_{i},S\left(D_{i},\mathbf{Z}_{i}\right)\right)=\Pr\left(Y_{i}\mid X_{i},D_{i}\right)$. Thus, in the 'however' case, ignoring sample selection could be misleading for inference.

I am not very familiar with the sample selection literature in econometrics. I would recommend Chapter 16 of 'Microeconometrics: Methods and Applications' by Cameron and Trivedi (especially the Roy model in that chapter). G. S. Maddala's classic book 'Limited-Dependent and Qualitative Variables in Econometrics' is also a systematic treatment of the issues of sample selection and discrete outcomes.
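The argument about $\mathbf{Z}_{i}$ can be illustrated by exact enumeration. This is a toy sketch under assumptions that are not in the original: $X_i$ is held fixed and dropped, $Z$ is a fair coin independent of $D$, the outcome model is $\Pr(Y=1\mid d,z)=\mathrm{expit}(-0.5+d+hz)$, and `incl[(d, z)]` is a hypothetical inclusion probability $\Pr(S=1\mid d,z)$.

```python
import math

def expit(z):
    return 1.0 / (1.0 + math.exp(-z))

def prob_y_given_d(d, h, incl=None):
    """P(Y=1 | D=d) if incl is None, else P(Y=1 | D=d, S=1), summing over Z."""
    num = den = 0.0
    for z in (0, 1):
        pz = 0.5                                # assumed P(Z=z), independent of D
        py = expit(-0.5 + 1.0 * d + h * z)      # assumed P(Y=1 | d, z)
        ps = 1.0 if incl is None else incl[(d, z)]
        num += pz * ps * py
        den += pz * ps
    return num / den

# Inclusion depends on D only (the scheme described in the question).
incl_d_only = {(0, 0): 0.1, (0, 1): 0.1, (1, 0): 0.9, (1, 1): 0.9}
# Inclusion also depends on Z.
incl_with_z = {(0, 0): 0.1, (0, 1): 0.6, (1, 0): 0.5, (1, 1): 0.9}

d = 0
gap_d_only = abs(prob_y_given_d(d, 1.5, incl_d_only) - prob_y_given_d(d, 1.5))
gap_z_irrelevant = abs(prob_y_given_d(d, 0.0, incl_with_z) - prob_y_given_d(d, 0.0))
gap_z_informative = abs(prob_y_given_d(d, 1.5, incl_with_z) - prob_y_given_d(d, 1.5))

print(gap_d_only, gap_z_irrelevant, gap_z_informative)
```

In this toy setup, conditioning on $D$ removes the selection effect when inclusion depends on $D$ alone, or when $Z$ carries no information about $Y$ ($h=0$). Only the last case, where selection depends on a $Z$ that also predicts $Y$, leaves a gap.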
