Solved – Missing at Random Data in GEE

generalized-estimating-equations, missing-data

For a continuous outcome analyzed with GEE and an identity link, the point estimates and standard errors are consistent for the first-order (mean) trend regardless of the outcome's distribution, heteroscedasticity, or mild nonlinearity. The point estimates from the GEE are the same as those obtained from maximum likelihood (OLS), but the standard errors are the robust HC sandwich estimates and therefore absorb mild violations of the classical model assumptions.
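A minimal numpy sketch of this equivalence, on simulated data with assumed coefficients: a GEE with identity link and independence working correlation solves the same estimating equation as OLS, so the point estimates coincide, while the cluster-level sandwich covariance remains valid under heteroscedastic, non-normal errors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical longitudinal data: 200 subjects, 3 visits each.
n_subj, n_vis = 200, 3
age = rng.uniform(40, 70, size=n_subj).repeat(n_vis)
visit = np.tile(np.arange(n_vis), n_subj)
# Heteroscedastic, heavy-tailed errors: sandwich SEs remain valid anyway.
y = (1.0 + 0.05 * age + 0.3 * visit
     + rng.standard_t(df=5, size=n_subj * n_vis) * (0.5 + 0.02 * age))

X = np.column_stack([np.ones_like(age), age, visit])

# OLS / ML point estimates: beta = (X'X)^{-1} X'y.  A GEE with identity
# link and independence working correlation solves the same estimating
# equation sum_i X_i'(y_i - X_i beta) = 0, so the estimates coincide.
beta = np.linalg.solve(X.T @ X, X.T @ y)

# Cluster-robust (HC sandwich) covariance: bread^{-1} meat bread^{-1},
# with the "meat" summed over subjects to respect within-subject correlation.
bread_inv = np.linalg.inv(X.T @ X)
resid = y - X @ beta
groups = np.arange(n_subj).repeat(n_vis)
meat = np.zeros((3, 3))
for g in range(n_subj):
    idx = groups == g
    s = X[idx].T @ resid[idx]
    meat += np.outer(s, s)
se_sandwich = np.sqrt(np.diag(bread_inv @ meat @ bread_inv))
print(beta, se_sandwich)
```

The sandwich step is the only place the clustering enters; the point estimates are plain least squares.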

In a longitudinal analysis where attrition depends only on measured variables (e.g. age), the so-called "missing data mechanism" is missing at random (MAR), not missing COMPLETELY at random (MCAR), per Little and Rubin (2002). Further, maximum likelihood estimates "are not biased," because the likelihood factorizes into a term for the missingness indicator and the observed-data likelihood contribution from the measured rows.

My questions are:

  1. For ML estimates, are complete case analyses considered efficient?
  2. For GEE with linear link, are estimates somehow biased even though they're the same as those obtained from ML?
  3. Is the real problem that SEs from GEE with linear link are not guaranteed to be consistent? More so than is attributable to effective sample size loss due to complete case analysis?
  4. Does weighting promise to help remedy the SEs above and beyond the effective sample size loss due to complete case analysis, if there are other reasons why the GEE would be "wrong" in this case?

Best Answer

  1. ML estimation based on complete cases is not considered efficient and can be horribly biased. Likelihood-based complete-case estimation is consistent in general only if the data are MCAR. If the data are MAR, then you can use something like EM or data augmentation to get efficient likelihood-based estimates. The appropriate likelihood for maximum likelihood is the observed-data likelihood, which integrates the missing data out of the joint: $$ \ell(\theta \mid Y_{\text{obs}}, X) = \log \int p(Y \mid X, \theta) \, dY_{\text{mis}}, $$ where $Y$ is the response and $X$ is the relevant covariates.
  2. GEE estimation is biased under MAR, just like complete-case ML estimation is biased.
  3. People don't use the usual GEE estimation for these problems because it is both inconsistent and inefficient. An easy fix-up for the consistency problem, under MAR, is to weight the estimating equations by the inverse probability of being observed to get so-called IPW estimates. That is, solve $$ \sum_{i=1}^N \frac{I(Y_i \mbox{ is complete})\,\varphi(Y_i; X_i, \theta)}{\pi(Y_i; X_i, \theta)} = 0, $$ where $\sum_i \varphi(Y_i; X_i, \theta)=0$ is your usual estimating equation and $\pi(Y; X, \theta)$ is the probability of being completely observed given the covariates and the data. Incidentally, this violates the likelihood principle, requires estimating the dropout mechanism even when the missingness is ignorable, and can greatly inflate the variance of the estimates. It is still not efficient because it ignores observations where we have partial data. The state of the art in estimating equations is doubly robust estimation, which is consistent if either the response model or the dropout model is correctly specified; these are essentially missing-data-appropriate versions of GEEs. Additionally, they may enjoy an efficiency property called local semiparametric efficiency, which means they attain the semiparametric efficiency bound if everything is correctly specified. See, for example, this book.
  4. Estimating equations which are consistent and efficient essentially all require weighting by the inverse probability of being observed. EDIT: I mean this for semiparametric consistency rather than consistency under a parametric model.
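Points 1 and 2 can be illustrated with a small simulation (a hypothetical two-visit setup, all parameters invented for the example): dropout of the second measurement depends on the observed first measurement, so the data are MAR; the complete-case mean of $Y_2$ is biased, while a likelihood-based estimate built from the factorization $E[Y_2] = E[E[Y_2 \mid Y_1]]$ is not.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical two-visit study: Y2 correlated with Y1, true E[Y2] = 0.
n = 50_000
y1 = rng.normal(0.0, 1.0, n)
y2 = 0.8 * y1 + rng.normal(0.0, 0.6, n)

# MAR dropout: probability of observing Y2 depends on the *observed* Y1.
p_obs = 1.0 / (1.0 + np.exp(-(0.5 + 1.5 * y1)))
observed = rng.uniform(size=n) < p_obs

# Complete-case mean of Y2: biased upward, because completers have high Y1.
cc_mean = y2[observed].mean()

# Likelihood-based (factorized) estimate: fit E[Y2 | Y1] on completers,
# then average the fitted values over *all* subjects.  For the bivariate
# normal, observed-data ML reduces to exactly this.
X = np.column_stack([np.ones(observed.sum()), y1[observed]])
coef = np.linalg.lstsq(X, y2[observed], rcond=None)[0]
ml_mean = coef[0] + coef[1] * y1.mean()

print(cc_mean, ml_mean)  # cc_mean is clearly positive, ml_mean is near 0
```

The complete-case estimate is wrong even though every quantity it uses is honestly measured; the factorized estimate recovers the truth because the regression of $Y_2$ on $Y_1$ is the same in completers and dropouts under MAR.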

You should also note that in longitudinal studies with attrition, dropout typically can depend not only on measured covariates but also on the response at times you didn't observe, so you can't just say "I collected everything I think is associated with dropout" and conclude you have MAR. MAR is a genuine assumption about how the world works, and it cannot be checked from the data. If two people with the same response history and the same covariates are on study and one drops out while the other does not, MAR essentially states that you can use the one who stayed on to learn the distribution of the one who dropped out, and this is a very strong assumption. In longitudinal studies, the consensus among experts is that an analysis of sensitivity to the MAR assumption is ideal, but I don't think this has made it into the software world yet.

Unfortunately, I'm not aware of any software for doing doubly robust estimation, but likelihood-based estimation is easy (IMO the easiest thing to do is use Bayesian software for fitting, but there is also lots of software out there). You can also do inverse probability weighting easily, but it has stability issues.
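To make both halves of that last claim concrete, here is a sketch of IPW on simulated MAR dropout (all parameters illustrative): the dropout mechanism is estimated by logistic regression, the weighted estimating equation for the mean is solved, and the weight inspection at the end shows where the stability issues come from.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated MAR dropout: probability of observing Y2 depends on observed Y1.
n = 50_000
y1 = rng.normal(0.0, 1.0, n)
y2 = 0.8 * y1 + rng.normal(0.0, 0.6, n)
p_obs = 1.0 / (1.0 + np.exp(-(0.5 + 1.5 * y1)))
observed = rng.uniform(size=n) < p_obs

# Step 1: estimate the dropout model pi(Y1) by logistic regression
# (a few Newton-Raphson steps).  This is the modeling work IPW forces
# on you even though the missingness is ignorable.
X = np.column_stack([np.ones(n), y1])
gamma = np.zeros(2)
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-X @ gamma))
    W = p * (1 - p)
    gamma += np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (observed - p))
pi_hat = 1.0 / (1.0 + np.exp(-X @ gamma))

# Step 2: solve the weighted estimating equation
# sum_i I_i (y2_i - mu) / pi_i = 0, which gives a weighted IPW mean.
w = observed / pi_hat
ipw_mean = np.sum(w * y2) / np.sum(w)

# The instability: subjects with small pi_hat carry huge weights, so a
# handful of observations can dominate and inflate the variance.
print(ipw_mean, w[observed].max())
```

The estimate is consistent, but the largest weights are an order of magnitude above the typical one, which is exactly the variance-inflation problem mentioned above; doubly robust estimators exist in part to soften this.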
