Consider sampling data from a population of size $N$ in the following way: for $k=1, \ldots, N$,

- Observe individual $k$'s "disease" status.
- If they have the disease, include them in the sample with probability $p_{k1}$.
- If they do not have the disease, include them with probability $p_{k0}$.
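
For concreteness, the sampling scheme described above can be sketched in Python as follows. This is only an illustration: the function name `sample_by_disease_status`, the 20% prevalence, and the ranges used for $p_{k1}$ and $p_{k0}$ are all hypothetical choices, not part of the original problem.

```python
import numpy as np

def sample_by_disease_status(disease, p1, p0, rng):
    """Return the indices of the individuals included in the sample.

    disease : 0/1 array of length N (the "disease" status of individual k)
    p1, p0  : arrays of length N (or scalars) holding the inclusion
              probabilities p_{k1} (diseased) and p_{k0} (not diseased)
    rng     : a numpy.random.Generator
    """
    disease = np.asarray(disease)
    # Pick the inclusion probability that matches each individual's status.
    incl_prob = np.where(disease == 1, p1, p0)
    # One independent Bernoulli draw per individual.
    selected = rng.random(disease.shape[0]) < incl_prob
    return np.flatnonzero(selected)

# Hypothetical population, for illustration only.
rng = np.random.default_rng(0)
N = 10_000
disease = rng.binomial(1, 0.2, size=N)   # assumed 20% prevalence
p1 = rng.uniform(0.5, 0.9, size=N)       # assumed p_{k1}, varying by individual
p0 = rng.uniform(0.05, 0.3, size=N)      # assumed p_{k0}, varying by individual
idx = sample_by_disease_status(disease, p1, p0, rng)

# Diseased individuals are over-represented in the sample.
print(disease.mean(), disease[idx].mean())
```
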
Suppose you observed a binary outcome variable $Y_i$ and predictor vector ${\bf X}_i$, for $i=1, \ldots, n$ subjects sampled this way. The outcome variable is not the "disease" status. I want to estimate the parameters of the logistic regression model:
$$ \log \left( \frac{ P(Y_i = 1 | {\bf X}_i) }{ P(Y_i = 0 | {\bf X}_i) } \right) = \alpha + {\bf X}_i {\boldsymbol \beta} $$
All I care about are the (log) odds ratios, ${\boldsymbol \beta}$. The intercept is irrelevant to me.
My question is: Can I get sensible estimates of ${\boldsymbol \beta}$ by ignoring the sampling probabilities $\{ p_{i1}, p_{i0} \}$, $i=1, \ldots, n$, and fitting the model as if it were an ordinary random sample?
I am pretty much certain the answer to this question is "yes". What I'm looking for is a reference that validates this.
There are two main reasons I'm confident about the answer:

- I've done many simulation studies and none of them contradict this, and
- It is straightforward to show that, if the population is governed by the model above, then the model governing the sampled data is

$$ \log \left( \frac{ P(Y_i = 1 | {\bf X}_i) }{ P(Y_i = 0 | {\bf X}_i) } \right) = \log(p_{i1}) - \log(p_{i0}) + \alpha + {\bf X}_i {\boldsymbol \beta} $$

If the sampling probabilities did not depend on $i$, then this would represent a simple shift to the intercept and the point estimate of ${\boldsymbol \beta}$ would clearly be unaffected. But if the offsets differ from person to person, this logic does not quite apply, since you will certainly get a different point estimate, although I suspect a similar argument still holds.
Related: The classic paper by Prentice and Pyke (1979) says that logistic regression coefficients from a case-control study (with disease status as the outcome) have the same distribution as those from a prospective study. I suspect the same result would apply here, but I must confess I don't fully understand every bit of the paper.
Thanks in advance for any comments/references.
Best Answer
This is a variation of the selection model in econometrics. The validity of the estimates using only the selected sample here depends on the condition that $\Pr\left(Y_{i}=1\mid X_{i},D_{i}=1\right)=\Pr\left(Y_{i}=1\mid X_{i},D_{i}=0\right)$. Here $D_i$ is $i$'s disease status.
To give more details, define the following notation: $\pi_{1}=\Pr\left(D_{i}=1\right)$ and $\pi_{0}=\Pr\left(D_{i}=0\right)$; $S_{i}=1$ denotes the event that $i$ is in the sample. Moreover, assume for simplicity that $D_{i}$ is independent of $X_{i}$.
The probability that $Y_{i}=1$ for a unit $i$ in the sample is

\begin{eqnarray*} \Pr\left(Y_{i}=1\mid X_{i},S_{i}=1\right) & = & \mathrm{E}\left(Y_{i}\mid X_{i},S_{i}=1\right)\\ & = & \mathrm{E}\left\{ \mathrm{E}\left(Y_{i}\mid X_{i},D_{i},S_{i}=1\right)\mid X_{i},S_{i}=1\right\} \\ & = & \Pr\left(D_{i}=1\mid S_{i}=1\right)\Pr\left(Y_{i}=1\mid X_{i},D_{i}=1,S_{i}=1\right)+\\ & & \Pr\left(D_{i}=0\mid S_{i}=1\right)\Pr\left(Y_{i}=1\mid X_{i},D_{i}=0,S_{i}=1\right), \end{eqnarray*}

by the law of iterated expectations. Suppose that, conditional on the disease status $D_{i}$ and the other covariates $X_{i}$, the outcome $Y_{i}$ is independent of $S_{i}$. As a result, we have

\begin{eqnarray*} \Pr\left(Y_{i}=1\mid X_{i},S_{i}=1\right) & = & \Pr\left(D_{i}=1\mid S_{i}=1\right)\Pr\left(Y_{i}=1\mid X_{i},D_{i}=1\right)+\\ & & \Pr\left(D_{i}=0\mid S_{i}=1\right)\Pr\left(Y_{i}=1\mid X_{i},D_{i}=0\right). \end{eqnarray*}

By Bayes' rule,

$$ \Pr\left(D_{i}=1\mid S_{i}=1\right)=\frac{\pi_{1}p_{i1}}{\pi_{1}p_{i1}+\pi_{0}p_{i0}}\mbox{ and }\Pr\left(D_{i}=0\mid S_{i}=1\right)=\frac{\pi_{0}p_{i0}}{\pi_{1}p_{i1}+\pi_{0}p_{i0}}. $$

Here $p_{i1}$ and $p_{i0}$ are as defined in your sampling scheme. Thus,

$$ \Pr\left(Y_{i}=1\mid X_{i},S_{i}=1\right)=\frac{\pi_{1}p_{i1}}{\pi_{1}p_{i1}+\pi_{0}p_{i0}}\Pr\left(Y_{i}=1\mid X_{i},D_{i}=1\right)+\frac{\pi_{0}p_{i0}}{\pi_{1}p_{i1}+\pi_{0}p_{i0}}\Pr\left(Y_{i}=1\mid X_{i},D_{i}=0\right). $$

If $\Pr\left(Y_{i}=1\mid X_{i},D_{i}=1\right)=\Pr\left(Y_{i}=1\mid X_{i},D_{i}=0\right)$, the two weights sum to one and we have

$$ \Pr\left(Y_{i}=1\mid X_{i},S_{i}=1\right)=\Pr\left(Y_{i}=1\mid X_{i}\right), $$

so you can ignore the sample selection problem. On the other hand, if $\Pr\left(Y_{i}=1\mid X_{i},D_{i}=1\right)\neq\Pr\left(Y_{i}=1\mid X_{i},D_{i}=0\right)$, then

$$ \Pr\left(Y_{i}=1\mid X_{i},S_{i}=1\right)\neq\Pr\left(Y_{i}=1\mid X_{i}\right) $$

in general.
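
To make the mixture formula above concrete, here is a small Python sketch that evaluates it for one unit. The function names and all numerical values are hypothetical, chosen only for illustration; the formulas are the ones derived above, under the same assumptions ($D_i$ independent of $X_i$, and $Y_i$ independent of $S_i$ given $X_i$ and $D_i$).

```python
def selection_weights(pi1, p_i1, p_i0):
    """Return (Pr(D=1 | S=1), Pr(D=0 | S=1)) for one unit."""
    pi0 = 1.0 - pi1
    denom = pi1 * p_i1 + pi0 * p_i0
    return pi1 * p_i1 / denom, pi0 * p_i0 / denom

def prob_y_in_sample(pi1, p_i1, p_i0, pr_y_d1, pr_y_d0):
    """Pr(Y=1 | X, S=1): mixture of the two disease-specific probabilities."""
    w1, w0 = selection_weights(pi1, p_i1, p_i0)
    return w1 * pr_y_d1 + w0 * pr_y_d0

def prob_y_in_population(pi1, pr_y_d1, pr_y_d0):
    """Pr(Y=1 | X): the same mixture, weighted by the population prevalence."""
    return pi1 * pr_y_d1 + (1.0 - pi1) * pr_y_d0

# Hypothetical values: 20% prevalence, diseased units sampled 4x as often.
pi1, p_i1, p_i0 = 0.2, 0.8, 0.2

# Outcome depends on disease status: sample and population disagree.
print(prob_y_in_sample(pi1, p_i1, p_i0, 0.6, 0.3),
      prob_y_in_population(pi1, 0.6, 0.3))

# Outcome does not depend on disease status: they agree.
print(prob_y_in_sample(pi1, p_i1, p_i0, 0.4, 0.4),
      prob_y_in_population(pi1, 0.4, 0.4))
```

With these numbers the selection weights are both $0.5$, so in the first case the sample probability is $0.45$ while the population probability is $0.36$; in the second case both equal $0.4$.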
As a particular case, consider the logit model,

$$ \Pr\left(Y_{i}=1\mid X_{i},D_{i}=1\right)=\frac{e^{X_{i}'\alpha}}{1+e^{X_{i}'\alpha}}\mbox{ and }\Pr\left(Y_{i}=1\mid X_{i},D_{i}=0\right)=\frac{e^{X_{i}'\beta}}{1+e^{X_{i}'\beta}}. $$

Even when $p_{i1}$ and $p_{i0}$ are constant across $i$, the resulting distribution in the sample will not retain the logit form, because a weighted average of two different logistic functions is in general not itself logistic. More importantly, the interpretation of the parameters would be totally different. Hopefully, the above arguments help to clarify your problem a little bit.
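
The claim that the mixture loses the logit form can be checked numerically. The sketch below assumes a single scalar covariate $x$ with no intercept, hypothetical slopes $a$ and $b$ for the two disease groups, and a constant weight $w_1=\Pr(D_i=1\mid S_i=1)$. Under a logit model the log-odds are linear in $x$, so their second difference over equally spaced points should be zero.

```python
import math

def expit(t):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-t))

def logit(p):
    """Log-odds."""
    return math.log(p / (1.0 - p))

def sample_log_odds(x, a, b, w1):
    """Log-odds of Y=1 in the selected sample for a scalar covariate x.

    a, b : slopes of the logit models for D=1 and D=0 (hypothetical values)
    w1   : Pr(D=1 | S=1), assumed constant across units
    """
    p = w1 * expit(a * x) + (1.0 - w1) * expit(b * x)
    return logit(p)

def second_difference(a, b, w1):
    """Second difference of the sample log-odds at x = 0, 1, 2.

    It is zero when the log-odds are linear in x (a logit model).
    """
    lo = [sample_log_odds(x, a, b, w1) for x in (0.0, 1.0, 2.0)]
    return lo[2] - 2.0 * lo[1] + lo[0]

# Different slopes in the two groups: the log-odds are curved in x.
curved = second_difference(a=2.0, b=0.5, w1=0.5)
# Same slope in both groups: the log-odds stay linear in x.
linear = second_difference(a=1.0, b=1.0, w1=0.5)
print(curved, linear)
```

With different slopes the second difference is clearly nonzero, whereas with equal slopes it vanishes up to floating-point error, which matches the argument above.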
It is tempting to include $D_{i}$ as an additional explanatory variable and estimate the model based on $\Pr\left(Y_{i}\mid X_{i},D_{i}\right)$. To justify using $\Pr\left(Y_{i}\mid X_{i},D_{i}\right)$, we need to show that $\Pr\left(Y_{i}\mid X_{i},D_{i},S_{i}=1\right)=\Pr\left(Y_{i}\mid X_{i},D_{i}\right)$, which is equivalent to the condition that $D_{i}$ is a sufficient statistic for $S_{i}$. Without further information about your sampling process, I am not sure whether this is true.

Let's use an abstract notation. The observability variable $S_{i}$ can be viewed as a random function of $D_{i}$ and some other random variables, say $\mathbf{Z}_{i}$. Denote $S_{i}=S\left(D_{i},\mathbf{Z}_{i}\right)$. If $\mathbf{Z}_{i}$ is independent of $Y_{i}$ conditional on $X_{i}$ and $D_{i}$, we have $\Pr\left(Y_{i}\mid X_{i},D_{i},S\left(D_{i},\mathbf{Z}_{i}\right)\right)=\Pr\left(Y_{i}\mid X_{i},D_{i}\right)$ by the definition of independence. However, if $\mathbf{Z}_{i}$ is not independent of $Y_{i}$ after conditioning on $X_{i}$ and $D_{i}$, then $\mathbf{Z}_{i}$ intuitively contains some relevant information about $Y_{i}$, and in general we should not expect $\Pr\left(Y_{i}\mid X_{i},D_{i},S\left(D_{i},\mathbf{Z}_{i}\right)\right)=\Pr\left(Y_{i}\mid X_{i},D_{i}\right)$. Thus, in the 'however' case, ignoring sample selection could lead to misleading inference.

I am not very familiar with the sample selection literature in econometrics. I would recommend Chapter 16 of 'Microeconometrics: Methods and Applications' by Cameron and Trivedi (especially the Roy model in that chapter). G. S. Maddala's classic book 'Limited-Dependent and Qualitative Variables in Econometrics' is also a systematic treatment of sample selection and discrete outcomes.