Solved – the difference between GLM and GEE

generalized-linear-model, generalized-estimating-equations

What's the difference between a GLM (logistic regression) with a binary response variable that includes subject and time as covariates, and the analogous GEE model, which takes into account the correlation between measurements at multiple time points?

My GLM looks like:

Y(binary) ~ A + B1X1(subject id) + B2X2(time) 
              + B3X3(interesting continuous covariate)

with logit link function.

I'm looking for a simple (aimed at the social scientist) explanation of how and why time is treated differently in the two models and what the implications would be for interpretation.

Best Answer

There may be a better and more detailed answer out there, but I can give you some simple, quick thoughts. It appears that you are talking about using a Generalized Linear Model (e.g., a typical logistic regression) to fit data gathered from some subjects at multiple time points. At first blush, I see two glaring problems with this approach.

First, this model assumes that your data are independent given the covariates (that is, after having accounted for a dummy code for each subject, akin to an individual intercept term, and a linear time trend that is equal for everybody). This is wildly unlikely to be true. Instead, there will almost certainly be autocorrelation: for example, two observations of the same individual that are close together in time will tend to be more similar than two observations that are further apart, even after having accounted for time. (They might well be independent if you also included a subject ID x time interaction, i.e., a unique time trend for everybody, but this would exacerbate the next problem.)

Second, you are going to burn up an enormous number of degrees of freedom estimating a parameter for each participant. You are likely to have relatively few degrees of freedom left with which to try to accurately estimate your parameters of interest (of course, this depends on how many measurements you have per person).
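To put rough numbers on this, here is a small sketch of the bookkeeping. It assumes a balanced design with a hypothetical 50 subjects and counts only mean-model parameters (intercept, time slope, X3 slope, plus one dummy per subject beyond the reference subject). The function names and the sample sizes are made up for illustration.

```python
def mean_model_params(n_subjects: int, subject_dummies: bool) -> int:
    """Count mean-model parameters for Y ~ subject + time + X3."""
    n_params = 3  # intercept, time slope, X3 slope
    if subject_dummies:
        # One dummy per subject, minus the reference subject
        # that is absorbed by the intercept.
        n_params += n_subjects - 1
    return n_params


def residual_df(n_subjects: int, n_timepoints: int, subject_dummies: bool) -> int:
    """Observations left over after estimating the mean-model parameters."""
    n_obs = n_subjects * n_timepoints  # balanced design (assumption)
    return n_obs - mean_model_params(n_subjects, subject_dummies)


N_SUBJECTS = 50  # hypothetical sample size

for n_timepoints in (3, 5, 10):
    with_dummies = residual_df(N_SUBJECTS, n_timepoints, subject_dummies=True)
    without_dummies = residual_df(N_SUBJECTS, n_timepoints, subject_dummies=False)
    print(
        f"{n_timepoints:>2} time points: "
        f"residual df with subject dummies = {with_dummies}, "
        f"without = {without_dummies}"
    )
```

The fewer measurements you have per person, the larger the share of your observations that goes into the subject dummies.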

Ironically, the first problem means that your confidence intervals will be too narrow, whereas the second means that they will be much wider than they would have been if you hadn't spent most of your degrees of freedom on subject dummies. However, I wouldn't count on these two balancing each other out. I also wouldn't count on the parameter estimates being unbiased: with a separate dummy for every subject and only a few observations per subject, maximum likelihood logistic regression runs into the incidental parameters problem, which biases the slope estimates.

Using Generalized Estimating Equations is appropriate in this case. When you fit a model using GEE, you specify a working correlation structure (such as AR(1)), and it can be quite reasonable to assume that your data are independent conditional on both your covariates and the correlation matrix you specified. In addition, GEE estimates the population-averaged association, so you needn't burn a degree of freedom for each participant; in essence, you are averaging over them.
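For a sense of what an AR(1) structure implies, here is a minimal sketch that builds the within-subject correlation matrix for equally spaced time points. The value rho = 0.5 and the function name are hypothetical, chosen only for illustration; in a real GEE fit the software estimates rho from the data.

```python
from typing import List


def ar1_correlation(n_timepoints: int, rho: float) -> List[List[float]]:
    """AR(1) correlation matrix: corr(Y_s, Y_t) = rho ** |s - t|."""
    return [
        [rho ** abs(s - t) for t in range(n_timepoints)]
        for s in range(n_timepoints)
    ]


# Hypothetical example: 4 equally spaced measurements per subject.
R = ar1_correlation(4, rho=0.5)
for row in R:
    print("  ".join(f"{value:.3f}" for value in row))
```

The correlation decays as measurements get further apart in time, which is exactly the "closer observations are more similar" pattern described above, and it costs a single parameter (rho) rather than one parameter per subject.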

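As for the interpretation, as far as I am aware, it takes the same form in both cases: given that the other factors remain constant, a one-unit change in X3 is associated with a B3 change in the log odds of 'success'. What differs is what is being held constant: with subject dummies in the GLM, B3 is a within-subject (subject-specific) effect, whereas in the GEE it is a population-averaged effect, and with a logit link the two generally differ in size.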
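Since log odds are not very intuitive, here is a small sketch of what "a B3 change in the log odds" means on the odds and probability scales. The coefficient value 0.4 and the baseline log odds are hypothetical numbers picked for illustration, not estimates from any data.

```python
import math


def logistic(z: float) -> float:
    """Inverse logit: convert log odds to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


b3 = 0.4  # hypothetical coefficient for X3
odds_ratio = math.exp(b3)  # multiplicative change in the odds per unit of X3
print(f"odds ratio per one-unit change in X3: {odds_ratio:.3f}")

# The same change in log odds moves the probability by different
# amounts depending on where you start.
for baseline_log_odds in (-2.0, 0.0, 2.0):
    p_before = logistic(baseline_log_odds)
    p_after = logistic(baseline_log_odds + b3)
    print(
        f"baseline p = {p_before:.3f} -> p after one-unit increase = {p_after:.3f} "
        f"(difference {p_after - p_before:+.3f})"
    )
```

The odds ratio exp(B3) is constant across baselines, but the change in probability is not, so it is usually safer to report odds ratios or predicted probabilities at specific covariate values.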
