Solved – Cluster selection and formula in (longitudinal) GEE models

generalized-estimating-equations, r

I have a question regarding the "formula interface" of GEE models, for instance when using the gee function from the R gee package.

Let's say I have measured quality of life (QoL), education and sex for 100 subjects at three different time points (time). If I understand the GEE approach correctly, GEE can be used for longitudinal, clustered data. However, I wonder what my clusters would be. The cluster variable is passed to the id argument of the gee function, so what would be the right syntax if I want to measure change in QoL over time, and how this change differs depending on education and sex?

  1. Is time my cluster variable?

gee(QoL ~ education + sex, id = time)

  2. Is the subject-ID my cluster?

gee(QoL ~ education + sex + time, id = subject_id)

However, this looks like a random slope approach of mixed models to me.

  3. Is education perhaps my cluster?

gee(QoL ~ sex + time, id = education)

  4. Last: maybe I don't have any real clusters. But what would I then choose to analyze the longitudinal data, to account for the correlation of my DV QoL within the same subject across time points?

Maybe I'm confused because I am trying to compare the formula syntax to the one used for longitudinal data analysis with lme4, where the decision about which variables to choose for the random intercept and slope is quite clear (time and subject). However, if I am not modelling individual differences (i.e. I'm interested in the population average), what are the clusters in a GEE model for?

Best Answer

You will want to use subject (or subject ID) as your cluster. GEE takes into account the repeated measurements within clusters; in this case, the repeated measurements are on individuals over time. So, you'd want to use

gee(QoL ~ education + sex + time, id = subject_id)

An easy way to determine what the cluster is, is to ask which object the multiple measurements are being taken on. In this case, the multiple measurements are being made on a subject. You aren't making measurements on "a time" or on "an education."
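To make the "what is measured repeatedly" rule concrete, here is a minimal sketch of how such data might look in long format. Everything in it is an assumption for illustration: the data are simulated, and the names dat, subject_id and subject_shift do not come from the question.

```r
# Simulated long-format data (hypothetical values, for illustration only):
# 100 subjects, each measured at 3 time points, so 300 rows in total.
set.seed(1)
n_subjects <- 100
dat <- data.frame(
  subject_id = rep(seq_len(n_subjects), each = 3),
  time       = rep(1:3, times = n_subjects)
)

# Subject-level covariates are constant within a subject.
dat$sex       <- rep(sample(c("f", "m"), n_subjects, replace = TRUE), each = 3)
dat$education <- rep(sample(c("low", "mid", "high"), n_subjects, replace = TRUE),
                     each = 3)

# A subject-specific shift makes the three measurements of one subject correlated.
subject_shift <- rep(rnorm(n_subjects, sd = 2), each = 3)
dat$QoL <- 50 + 1.5 * dat$time + subject_shift + rnorm(nrow(dat))

# The cluster is the unit that contributes several rows: here, the subject.
rows_per_subject <- table(dat$subject_id)
length(rows_per_subject)             # number of clusters: 100
unique(as.vector(rows_per_subject))  # measurements per cluster: 3
```

Each subject contributes three rows, and it is the correlation among those rows that the id argument tells the model about. Time, sex and education are ordinary columns that go into the formula.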

By the way, I would recommend using geeglm from the geepack package, as you can control the ordering of the measurements within each cluster using the waves argument to geeglm, which I find is usually needed.
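A minimal sketch of such a geeglm call follows, assuming the geepack package is installed. The data are simulated, the names dat and subject_id are hypothetical, and the exchangeable working correlation is chosen only for illustration, not taken from the answer.

```r
library(geepack)

# Simulated long-format data (hypothetical values, for illustration only).
set.seed(1)
n_subjects <- 100
dat <- data.frame(
  subject_id = rep(seq_len(n_subjects), each = 3),
  time       = rep(1:3, times = n_subjects)
)
dat$sex <- factor(rep(sample(c("f", "m"), n_subjects, replace = TRUE), each = 3))
dat$education <- factor(rep(sample(c("low", "mid", "high"), n_subjects,
                                   replace = TRUE), each = 3))
dat$QoL <- 50 + 1.5 * dat$time +
  rep(rnorm(n_subjects, sd = 2), each = 3) + rnorm(nrow(dat))

# Rows belonging to the same cluster should be contiguous, so sort by subject.
dat <- dat[order(dat$subject_id, dat$time), ]

# id    = the cluster (the subject measured repeatedly)
# waves = the position of each measurement within its cluster
fit <- geeglm(QoL ~ education + sex + time,
              id     = subject_id,
              waves  = time,
              data   = dat,
              family = gaussian,
              corstr = "exchangeable")  # assumed working correlation
summary(fit)
```

The coefficients are population-averaged effects. The subject enters only through id, which is why no random intercept or slope appears in the formula.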