How to simulate data for generalized estimating equations (GEE) with a logistic link function

clustered-standard-errors, generalized-estimating-equations, monte-carlo, paired-data, simulation

I am working with a pre/post test structure, measuring dichotomous outcomes. I am using GEE to estimate the coefficient for time (also a binary variable, with 0 representing pre and 1 representing post), since most people participate in only one of the time points but a number of people are represented in both.

I am interested in performing a Monte Carlo simulation as a "sensitivity test" to assess how many dependent data points there need to be before ignoring the dependence in the data causes a serious problem for inference (e.g., when 3% of the data points come from people in both the pre and post data? 10%? 50%?).

Thus, I am interested in simulating clustered data meant to be analyzed with GEE, and then comparing the GEE coefficients and standard errors to the regular maximum likelihood generalized linear model (GLM) estimates.

However, I am unsure how to simulate data for a GEE. I know the data take the same functional form as in a traditional GLM:

$$g(\mu_{ij}) = X'_{ij}\beta$$

where $g()$, in my case, is the logistic link function. I also know that GEE is basically a different estimation method (often contrasted against maximum likelihood) combined with a method of estimating robust variances (described well in this answer: https://stats.stackexchange.com/a/62924/130869).

We also specify a working correlation structure, which I am assuming to be exchangeable. I do not know how to simulate this, since binary data modeled with a logistic link do not have an error term in the equation.

Most examples of how to generate correlated data assume the parametric form of a traditional multilevel model, where I would simulate something like this (note the lack of a level-one error term):

$$\text{logit}(\pi_{ij}) = \beta_{0j}$$ and $$\beta_{0j} = \gamma_{00} + \gamma_{01}X_j + u_{0j}$$

This would involve simulating an average intercept, an average treatment effect, and subject-specific deviations from that intercept (drawn from a normal distribution, an assumption a GEE does not make).
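For concreteness, a minimal sketch of that multilevel-style simulation (with time as the covariate; the intercept, time effect, and random-intercept SD are placeholder values, not estimates from my data) might look like this:

set.seed(1)
n_subj  <- 500
gamma00 <- 0      # average intercept on the logit scale (placeholder)
gamma01 <- 0.12   # average effect of time on the logit scale (placeholder)
sd_u    <- 1      # SD of the subject-specific deviations (placeholder)
u0 <- rnorm(n_subj, 0, sd_u)                         # subject-specific intercepts
dat_ml <- expand.grid(id = 1:n_subj, time = 0:1)
eta <- gamma00 + gamma01 * dat_ml$time + u0[dat_ml$id]   # linear predictor
dat_ml$y <- rbinom(nrow(dat_ml), 1, plogis(eta))         # dichotomous outcomes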

However, is there a way to simulate my data that is tied more explicitly to GEE? I would like to specify a range of values for the true correlation in the exchangeable structure, the proportion of data points that are dependent, and the average effect of time, in line with the assumptions of a generalized estimating equation.
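To make the target design concrete, the skeleton of the data set I have in mind looks something like this, where prop_dep (a placeholder) controls the proportion of people measured at both time points; how to generate correlated outcomes for the paired rows is exactly what I am asking:

n_total  <- 1000    # total number of people (placeholder)
prop_dep <- 0.10    # proportion measured at both pre and post (placeholder)
n_both   <- round(n_total * prop_dep)
n_single <- n_total - n_both
design <- rbind(
  data.frame(id = rep(seq_len(n_both), each = 2),    # measured twice
             time = rep(0:1, times = n_both)),
  data.frame(id = n_both + seq_len(n_single),        # measured once
             time = rbinom(n_single, 1, 0.5))
)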


Update, taking into account @IsabellaGhement's suggestion in the comments below. I tried using the simstudy package, but the observations aren't behaving as if they are dependent when I put them into a regression. Below, I simulate paired data and then run a GEE and a GLM on them. The distributions of the estimates are the same, whereas previous simulations I've seen show that one consequence of non-independence is a wider distribution of parameter estimates. Relatedly, I get better 95% CI coverage with the GLM than with the GEE; again, I've seen simulations where dependence among observations hurts coverage for GLM models, which assume independent observations. What am I missing in trying to simulate dependent observations? The same thing happened when I used mvtnorm::rmvnorm() and dichotomized the variables (a sketch of that approach appears at the end of this update).

library(tidyverse)
library(simstudy)
library(geepack)
set.seed(1839)
logit <- function(p) log(p / (1 - p))
results <- lapply(1:2000, function(zzz) {
  # 1,000 people, two correlated binary measurements each
  # (marginal probabilities .50 and .53, compound-symmetric correlation of .80)
  dat <- genCorGen(n = 1000, nvars = 2, params1 = c(.50, .53), 
                   dist = "binary", rho = .80, corstr = "cs", wide = FALSE)
  b1_pop <- logit(.53) # true log odds ratio for time, since logit(.50) is 0
  # GEE with an exchangeable working correlation structure
  gee_mod <- geeglm(X ~ period, binomial, dat, id = id, corstr = "exchangeable")
  gee_ub <- summary(gee_mod)$coef[2, 1] * 1.96 + summary(gee_mod)$coef[2, 2]
  gee_lb <- summary(gee_mod)$coef[2, 1] * 1.96 - summary(gee_mod)$coef[2, 2]
  gee_cover <- b1_pop < gee_ub & b1_pop > gee_lb
  # ordinary GLM that ignores the pairing
  glm_mod <- glm(X ~ period, binomial, dat)
  glm_ub <- summary(glm_mod)$coef[2, 1] * 1.96 + summary(glm_mod)$coef[2, 2]
  glm_lb <- summary(glm_mod)$coef[2, 1] * 1.96 - summary(glm_mod)$coef[2, 2]
  glm_cover <- b1_pop < glm_ub & b1_pop > glm_lb

  c(b1_gee = coef(gee_mod)[[2]], b1_glm = coef(glm_mod)[[2]],
    gee_cover = gee_cover, glm_cover = glm_cover)
})
results <- as.data.frame(do.call(rbind, results))
colMeans(results)
results %>% 
  gather() %>% 
  ggplot(aes(x = value, fill = key)) +
  geom_density() +
  facet_wrap(~ key)

The call to colMeans returns:

   b1_gee    b1_glm gee_cover glm_cover 
    0.120     0.120     0.243     0.366 

This indicates that the mean parameter estimate for both the GEE and the GLM matched the population parameter (logit(.53) ≈ 0.12), while the GLM had only 37% coverage and the GEE 24%.

The call to ggplot shows that the distributions of the parameter estimates were essentially the same:

[Figure: faceted density plots of b1_gee, b1_glm, gee_cover, and glm_cover across the 2,000 simulations; the b1_gee and b1_glm distributions are nearly identical.]
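For reference, the mvtnorm-based version I mentioned looks roughly like this (a sketch; the thresholds are chosen to give marginal probabilities of .50 and .53, matching the simstudy setup above):

library(mvtnorm)
n <- 1000
sigma <- matrix(c(1, .8, .8, 1), nrow = 2)      # latent correlation of .80
z <- rmvnorm(n, mean = c(0, 0), sigma = sigma)  # correlated latent normals
pre  <- as.integer(z[, 1] > qnorm(1 - .50))     # marginal P(pre = 1)  = .50
post <- as.integer(z[, 2] > qnorm(1 - .53))     # marginal P(post = 1) = .53
dat2 <- data.frame(id = rep(1:n, 2),
                   period = rep(0:1, each = n),
                   X = c(pre, post))
dat2 <- dat2[order(dat2$id, dat2$period), ]     # geeglm expects rows from the same id to be contiguous

Dichotomizing attenuates the correlation between the resulting binary variables well below the latent .80, but they remain strongly dependent within id.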

Best Answer

I just came across this post, and this may be too late, but what you did was completely correct, except that your formulae for calculating the CIs were wrong. The correct formulae should be:

gee_ub <- summary(gee_mod)$coef[2, 1] + 1.96 * summary(gee_mod)$coef[2, 2]

gee_lb <- summary(gee_mod)$coef[2, 1] - 1.96 * summary(gee_mod)$coef[2, 2]

and the same for the GLM. I ran your code with the above correction and got the following:

   b1_gee    b1_glm gee_cover glm_cover 
    0.120     0.120     0.948     0.998 

It shows that the GEE achieves close to the nominal 95% coverage, while the GLM over-covers: because it ignores the positive within-pair correlation, its standard errors for the time effect are too large. That is exactly what should happen, since your simulated data are correlated.
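As a side note, a small helper like the one below (the name wald_ci is just for illustration) avoids retyping the formula; it should work for both geeglm and glm fits, since both store the estimate and standard error in the first two columns of summary()$coef:

# hypothetical helper: Wald confidence interval for one coefficient of a fitted model
wald_ci <- function(mod, row = 2, level = 0.95) {
  est <- summary(mod)$coef[row, 1]    # point estimate
  se  <- summary(mod)$coef[row, 2]    # standard error (robust, for geeglm)
  z   <- qnorm(1 - (1 - level) / 2)   # 1.96 for a 95% interval
  c(lower = est - z * se, upper = est + z * se)
}

# usage inside the simulation loop, e.g.:
# gee_ci <- wald_ci(gee_mod)
# gee_cover <- b1_pop > gee_ci["lower"] & b1_pop < gee_ci["upper"]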
