Solved – the difference between GLM and GEE

generalized-linear-model, generalized-estimating-equations

What's the difference between a GLM (logistic regression) with a binary response variable that includes subject and time as covariates, and the analogous GEE model, which takes into account correlation between measurements at multiple time points?

My GLM looks like:

Y(binary) ~ A + B1X1(subject id) + B2X2(time) 
              + B3X3(interesting continuous covariate)

with logit link function.

I'm looking for a simple (aimed at the social scientist) explanation of how and why time is treated differently in the two models and what the implications would be for interpretation.

Best Answer

There may be a better and more detailed answer out there, but I can give you some simple, quick thoughts. It appears that you are talking about using a Generalized Linear Model (e.g., a typical logistic regression) to fit data gathered from some subjects at multiple time points. At first blush, I see two glaring problems with this approach.

First, this model assumes that your data are independent given the covariates (that is, after having accounted for a dummy code for each subject, akin to an individual intercept term, and a linear time trend that is equal for everybody). This is wildly unlikely to be true. Instead, there will almost certainly be autocorrelation: for example, two observations of the same individual closer in time will be more similar than two observations further apart in time, even after having accounted for time. (They may well be independent if you also included a subject ID x time interaction, i.e., a unique time trend for everybody, but this would exacerbate the next problem.)

Second, you are going to burn up an enormous number of degrees of freedom estimating a parameter for each participant. You are likely to have relatively few degrees of freedom left with which to try to accurately estimate your parameters of interest (of course, this depends on how many measurements you have per person).

Ironically, the first problem means that your confidence intervals are too narrow, whereas the second means your CIs will be much wider than they would have been if you hadn't wasted most of your degrees of freedom. However, I wouldn't count on these two balancing each other out. For what it's worth, I believe that your parameter estimates would be unbiased (although I may be wrong here).

Using Generalized Estimating Equations (GEE) is appropriate in this case. When you fit a model using GEE, you specify a working correlation structure (such as AR(1)), and it can be quite reasonable that your data are independent conditional on both your covariates and the correlation matrix you specified. In addition, GEE estimates the population-average association, so you needn't burn a degree of freedom for each participant; in essence, you are averaging over them.
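The contrast between the two approaches can be sketched in R on simulated data. This is only an illustration under stated assumptions: the data frame `sim`, the variable names, the sample sizes, and the effect sizes are all made up, and the within-subject correlation is induced by a hypothetical subject-level random effect rather than by anything in the original question's data.

```r
# Minimal sketch on simulated data; every name and number here is illustrative.
library(geepack)

set.seed(1)
n_subj <- 100   # hypothetical number of subjects
n_time <- 5     # hypothetical number of measurements per subject

sim <- data.frame(
  id   = rep(seq_len(n_subj), each = n_time),   # rows are sorted by subject, as geeglm expects
  time = rep(seq_len(n_time), times = n_subj)
)
sim$x3 <- rnorm(nrow(sim))                      # continuous covariate of interest
u      <- rnorm(n_subj)                         # subject-level effect that makes repeated measures correlated
eta    <- -0.5 + 0.2 * sim$time + 0.8 * sim$x3 + u[sim$id]
sim$y  <- rbinom(nrow(sim), size = 1, prob = plogis(eta))

# Approach 1: ordinary logistic regression with one dummy per subject.
# Subjects whose responses are all 0 or all 1 can trigger separation warnings.
glm_fit <- glm(y ~ factor(id) + time + x3, family = binomial, data = sim)

# Approach 2: GEE with an AR(1) working correlation and no subject dummies.
gee_fit <- geeglm(y ~ time + x3, id = id, family = binomial,
                  corstr = "ar1", data = sim)

length(coef(glm_fit))  # intercept + 99 subject dummies + time + x3
length(coef(gee_fit))  # intercept + time + x3
```

Comparing the two coefficient counts shows how many parameters the subject dummies consume, while `summary(gee_fit)` reports robust standard errors for the few population-average coefficients.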

As for the interpretation, as far as I am aware, it would be the same in both cases: given that the other factors remain constant, a one-unit change in X3 is associated with a B3 change in the log odds of 'success'.

Related Solutions

Solved – the main difference between GLM and GEE

Indeed, GLMs do not account for correlations you may have in your outcome data. Hence, they are more suitable for cross-sectional data, because in longitudinal data you expect that measurements over time from the same subject are correlated.

With regard to the interpretation of the coefficients you obtain, GEEs can be seen as the equivalent of GLMs because they also have a marginal interpretation. This is different from generalized linear mixed models (GLMMs), in which the fixed-effects coefficients have an interpretation conditional on the random effects (though recent developments make it possible to obtain coefficients with a marginal interpretation from a GLMM).
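To make the marginal versus conditional distinction concrete, here is a sketch for a binary outcome with a logit link and a single covariate. The symbols ($Y_{ij}$ for the response of subject $i$ at occasion $j$, $x_{ij}$, the random intercept $b_i$ with variance $\sigma^2$, and the superscripts $m$ and $c$) are notation introduced here for illustration, not taken from the answers.

```latex
\begin{align*}
\text{Marginal (GEE):}\quad
  & \operatorname{logit} \Pr(Y_{ij}=1 \mid x_{ij})
    = \beta_0^{m} + \beta_1^{m} x_{ij} \\
\text{Conditional (GLMM):}\quad
  & \operatorname{logit} \Pr(Y_{ij}=1 \mid x_{ij}, b_i)
    = \beta_0^{c} + \beta_1^{c} x_{ij} + b_i,
    \qquad b_i \sim N(0, \sigma^2)
\end{align*}
```

With a non-linear link such as the logit, averaging the conditional model over $b_i$ does not give back the same coefficients, so $\beta_1^{m}$ and $\beta_1^{c}$ generally differ whenever $\sigma^2 > 0$, with the marginal coefficient typically closer to zero. With an identity link and a random intercept, the two coincide.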

With regard to the estimation, as mentioned in one of the comments above, GEEs are not based on a model that has a specific likelihood. On the one hand, this makes them semi-parametric, and you do not need to specify the distribution of your data. On the other hand, (i) you can only use Wald tests and not likelihood ratio tests, (ii) they are less efficient than a likelihood-based model in which you have appropriately specified the correlation structure, and (iii) in their basic form and with regard to missing data, they are only valid under the missing completely at random mechanism, whereas a likelihood-based approach is valid under the missing at random mechanism.

Regression – Do GEE and GLM Estimate the Same Coefficients?

Yes. GEE and GLM will indeed have the same coefficients, but different standard errors. To check, run an example in R. I've taken this example from Chapter 25 of Applied Regression Analysis and Other Multivariable Methods, 5th ed., by Kleinbaum et al. (just because it's on my desk and references GEE and GLM):

library(geepack)
library(lme4)

#Get book data from the URL below
mydf<-read.table("http://www.hmwu.idv.tw/web/bigdata/rstudio-readData/tab/ch25q04.txt", header=TRUE)
mydf<-data.frame(subj=mydf$subj, week=as.factor(mydf$week), fev=mydf$fev)
#Make 5th level the reference level to match book results
mydf$week<-relevel(mydf$week, ref="5")

#Fit Linear Mixed Model (random intercept per subject)
mixed.model<-summary(lme4::lmer(fev~week+(1|subj),data=mydf))
mixed.model$coefficients

            Estimate Std. Error     t value
(Intercept)  6.99850  0.2590243 27.01870247
week1        2.81525  0.2439374 11.54087244
week2       -0.15025  0.2439374 -0.61593680
week3        0.00325  0.2439374  0.01332309
week4       -0.04700  0.2439374 -0.19267241

#Fit a gee model with any correlation structure.  In this case AR1
gee.model<-summary(geeglm(fev~week, id=subj, waves=week, corstr="ar1", data=mydf))
gee.model$coefficients

            Estimate   Std.err         Wald  Pr(>|W|)
(Intercept)  6.99850 0.2418413 8.374312e+02 0.0000000
week1        2.81525 0.2514376 1.253642e+02 0.0000000
week2       -0.15025 0.2051973 5.361492e-01 0.4640330
week3        0.00325 0.2075914 2.451027e-04 0.9875090
week4       -0.04700 0.2388983 3.870522e-02 0.8440338
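As a reading aid for the GEE output, the Wald column is the squared ratio of each estimate to its standard error, referred to a chi-squared distribution with one degree of freedom. The sketch below recomputes it from the numbers printed above; the vector names `est`, `se`, and `reported` are just labels chosen here, and the small discrepancies come from rounding in the printed values.

```r
# Recompute the Wald statistics from the printed estimates and standard errors.
est <- c(6.99850, 2.81525, -0.15025, 0.00325, -0.04700)
se  <- c(0.2418413, 0.2514376, 0.2051973, 0.2075914, 0.2388983)
names(est) <- c("(Intercept)", "week1", "week2", "week3", "week4")

wald <- (est / se)^2                               # Wald chi-squared statistic, 1 df each
pval <- pchisq(wald, df = 1, lower.tail = FALSE)   # corresponding p-values

# Values copied from the Wald column of the geeglm summary
reported <- c(8.374312e+02, 1.253642e+02, 5.361492e-01, 2.451027e-04, 3.870522e-02)

# Relative difference between recomputed and reported statistics (rounding only)
max(abs(wald / reported - 1))

cbind(Estimate = est, Std.err = se, Wald = wald, p = pval)
```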

UPDATE

As Mark White pointed out in his comment, I did indeed previously fit a "single-level" Mixed Effects GLM. Since you didn't specify whether you wanted a "fixed effects" or "random" effects GLM model, I just picked "random" since that's the model fit in the book I selected from. But indeed, Mark is right that the coefficients do not necessarily agree in multilevel models, and someone provided a nice answer about that question previously. For your reference, I've added a "fixed" effects GLM model below using lm.

#Fit Traditional GLM Fixed Effect Model (i.e. not Random effects)
glm.fixed<-summary(lm(fev~week, data=mydf))
glm.fixed$coefficients
            Estimate Std. Error     t value     Pr(>|t|)
(Intercept)  6.99850  0.2590243 27.01870247 7.696137e-68
week1        2.81525  0.3663157  7.68531179 7.287752e-13
week2       -0.15025  0.3663157 -0.41016538 6.821349e-01
week3        0.00325  0.3663157  0.00887213 9.929302e-01
week4       -0.04700  0.3663157 -0.12830465 8.980401e-01

Note the first and second columns of the output in each model. The coefficients are identical, but the standard errors differ.

You also added a comment which asked, "And does this remain the case when we choose a non-linear link function?" Note first that this is a different question, since models with non-linear link functions are not General Linear Models but Generalized Linear Models. In this case, the coefficients do not necessarily match. Here's an example, again in R:

#Fit Generalized Linear Mixed Effects Model with, say, a binomial family (logit link)
nlmixed.model<-summary(lme4::glmer(I(fev>mean(fev))~week+(1|subj), family="binomial", data=mydf))
nlmixed.model$coefficients

#Fit GEE model with, say, a binomial family (logit link); corstr defaults to independence
nlgee.model<-summary(geeglm(I(fev>mean(fev))~week, id=subj, waves=week, family="binomial", data=mydf))
nlgee.model$coefficients
