Generalized Linear Model – Estimating Multilevel Logistic Regression Models

Tags: generalized-linear-model, logistic, multilevel-analysis, r, simulation

Consider the following multilevel logistic model with
one explanatory variable at level 1 (individual level) and
one explanatory variable at level 2 (group level):

$$\text{logit}(p_{ij})=\pi_{0j}+\pi_{1j}x_{ij}\ldots (1)$$
$$\pi_{0j}=\gamma_{00}+\gamma_{01}z_j+u_{0j}\ldots (2)$$
$$\pi_{1j}=\gamma_{10}+\gamma_{11}z_j+u_{1j}\ldots (3)$$

where the group-level residuals $u_{0j}$ and $u_{1j}$ are assumed to follow a multivariate normal distribution with expectation zero. The variance of the residuals $u_{0j}$ is specified as $\sigma^2_0$, and the variance of the residuals $u_{1j}$ is specified as $\sigma^2_1$.
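Written out, with the covariance denoted $\sigma_{01}$ (taken to be zero in the simulation below), this assumption is

$$\begin{pmatrix}u_{0j}\\u_{1j}\end{pmatrix}\sim N\!\left(\begin{pmatrix}0\\0\end{pmatrix},\begin{pmatrix}\sigma^2_0&\sigma_{01}\\\sigma_{01}&\sigma^2_1\end{pmatrix}\right)$$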

I want to estimate the parameters of this model, and I would like to use the R function glmmPQL.

Substituting equations (2) and (3) into equation (1) yields

$$\text{logit}(p_{ij})=\gamma_{00}+\gamma_{10}x_{ij}+\gamma_{01}z_j+\gamma_{11}x_{ij}z_j+u_{0j}+u_{1j}x_{ij}\ldots (4)$$

There are 30 groups $(j=1,\ldots,30)$ and 5 individuals in each group.

R code:

    ## Simulate data from the multilevel logistic model
    library(mvtnorm)
    set.seed(1234)

    J   <- 30            ## number of groups
    n_j <- rep(5, J)     ## number of individuals in the jth group
    N   <- sum(n_j)      ## total number of individuals

    ## fixed effects
    g_00 <- -1
    g_01 <- 0.3
    g_10 <- 0.3
    g_11 <- 0.3

    s2_0 <- 0.13   ## random-intercept variance (corresponding to a specific ICC)
    s2_1 <- 1      ## random-slope variance, standardized to 1
    s01  <- 0      ## covariance, assumed zero

    z <- rnorm(J)  ## group-level covariate
    x <- rnorm(N)  ## individual-level covariate

    ## generate (u_0j, u_1j) from a bivariate normal
    mu  <- c(0, 0)
    sig <- matrix(c(s2_0, s01, s01, s2_1), ncol = 2)
    u   <- rmvnorm(J, mean = mu, sigma = sig, method = "chol")

    ## group-specific intercepts and slopes, expanded to individuals
    pi_0 <- g_00 + g_01 * z + u[, 1]
    pi_1 <- g_10 + g_11 * z + u[, 2]
    eta  <- rep(pi_0, n_j) + rep(pi_1, n_j) * x
    p    <- exp(eta) / (1 + exp(eta))

    y <- rbinom(N, 1, p)
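As a quick side note on the ICC comment above: under the usual latent-variable view of the logistic model, the level-1 residual variance is $\pi^2/3$, so for a model with only a random intercept $\sigma^2_0 = 0.13$ corresponds to an ICC of roughly $0.038$ (the random slope complicates this, so it is only a rough check):

    ## latent-scale ICC implied by s2_0 (random-intercept-only formula)
    s2_0 / (s2_0 + pi^2 / 3)   ## 0.13 / (0.13 + 3.29), about 0.038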

Now the parameter estimation.

    ## estimate the parameters with glmmPQL
    library(MASS)
    library(nlme)

    sim_data_mat <- matrix(c(y, x, rep(z, n_j), rep(1:30, n_j)), ncol = 4)
    sim_data <- data.frame(sim_data_mat)
    colnames(sim_data) <- c("Y", "X", "Z", "cluster")
    summary(glmmPQL(Y ~ X * Z, random = ~1 | cluster, family = binomial,
                    data = sim_data, niter = 200))

Output:

      iteration 1
      Linear mixed-effects model fit by maximum likelihood
      Data: sim_data 

      Random effects:
      Formula: ~1 | cluster
              (Intercept)  Residual
      StdDev: 0.0001541031 0.9982503

      Variance function:
      Structure: fixed weights
      Formula: ~invwt 
      Fixed effects: Y ~ X * Z 
                      Value Std.Error  DF   t-value p-value
      (Intercept) -0.8968692 0.2018882 118 -4.442404  0.0000
      X            0.5803201 0.2216070 118  2.618691  0.0100
      Z            0.2535626 0.2258860  28  1.122525  0.2712
      X:Z          0.3375088 0.2691334 118  1.254057  0.2123
      Correlation: 
           (Intr) X      Z     
      X   -0.072              
      Z    0.315  0.157       
      X:Z  0.095  0.489  0.269

      Number of Observations: 150
      Number of Groups: 30 

  • Why does it take only one iteration when I specified 200 iterations via the argument niter=200 in glmmPQL?

  • Also, the p-values for the group-level variable $(Z)$ and the cross-level interaction $(X:Z)$ show that they are not significant. Why, then, do the authors of this article keep the group-level variable $(Z)$ and the cross-level interaction $(X:Z)$ for further analysis?

  • Also, how are the degrees of freedom (DF) calculated?

  • The relative bias of the various estimates doesn't match the table in the article. I calculated the relative bias as follows (a sketch for extracting these numbers programmatically appears after this list):

     ## estimated fixed-effect parameters
     hat_g_00 <- -0.8968692  ## overall intercept
     hat_g_10 <-  0.5803201  ## X
     hat_g_01 <-  0.2535626  ## Z
     hat_g_11 <-  0.3375088  ## X:Z

     fixed     <- c(g_00, g_10, g_01, g_11)
     hat_fixed <- c(hat_g_00, hat_g_10, hat_g_01, hat_g_11)

     ## estimated random-effect parameters (standard deviations)
     hat_s_0 <- 0.0001541031  ## estimated SD of the random intercept
     hat_s_1 <- 0.9982503     ## estimated residual SD

     std     <- c(sqrt(0.13), 1)
     hat_std <- c(hat_s_0, hat_s_1)

     ## relative bias of the fixed effects (in percent)
     rel_bias_fixed <- ((hat_fixed - fixed) / fixed) * 100
     ## [1] -10.31308  93.44003 -15.47913  12.50293

     ## relative bias of the random effects (in percent)
     rel_bias_random <- ((hat_std - std) / std) * 100
     ## [1] -99.95726  -0.17497
    
  • Why doesn't the relative bias match the table?
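As an aside, the same numbers can be pulled from the fitted object rather than retyped by hand (a sketch, assuming the glmmPQL fit above is saved as fit):

    ## extract estimates directly from the fitted model
    fit <- glmmPQL(Y ~ X * Z, random = ~1 | cluster,
                   family = binomial, data = sim_data, niter = 200)
    hat_fixed <- fixef(fit)                            ## (Intercept), X, Z, X:Z
    hat_std   <- as.numeric(VarCorr(fit)[, "StdDev"])  ## intercept SD, residual SD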

Best Answer

There are perhaps too many questions here. Some comments:

  • you might consider using glmer from the lme4 package (glmer(Y~X*Z+(1|cluster),family=binomial,data=sim_data)); it uses Laplace approximation or Gauss-Hermite quadrature, which are generally more accurate than PQL (although the answers are very similar in this case).
  • The niter argument specifies the maximum number of iterations; only one iteration was actually necessary.
  • I'm not sure what your question is about the interaction term. Whether you should drop non-significant interaction terms is a bit of a can of worms; it depends both on your statistical philosophy and on the goals of your analysis (e.g. see this question).
  • the denominator degrees of freedom are calculated according to a simple 'inner-outer' rule described on page 91 of Pinheiro and Bates (2000), which is available on Google Books. Here Z is constant within clusters, so its DF is 30 − (1 + 1) = 28, while X and X:Z vary within clusters (the intercept is counted at that innermost level too), giving 150 − (30 + 2) = 118. The rule is generally a reasonable approximation, but the computation of degrees of freedom is complex, especially for GLMMs.
  • if you're trying to replicate "A simulation study of sample size for multilevel logistic regression models" by Moineddin et al. (DOI: 10.1186/1471-2288-7-34), you need to run a large number of simulations and compute averages, not just compare a single run; a minimal sketch of such a loop is given after this list. Furthermore, you should probably try to get closer to their methods: coming back to my first point, they state that they use SAS PROC NLMIXED with adaptive Gauss-Hermite quadrature, so you'd be better off with e.g. glmer(...,nAGQ=10). It still won't match exactly, but it'll probably be closer than glmmPQL.
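For concreteness, here is a minimal sketch of such a replication loop. The helper simulate_once and nsim = 1000 are illustrative choices, not from the article; the data-generating steps mirror the question's code, and the fitted model keeps the random intercept only, since nAGQ > 1 in glmer requires a single scalar random effect:

    ## sketch: repeat simulate -> fit -> collect, then average
    library(lme4)
    library(mvtnorm)
    set.seed(1234)

    simulate_once <- function() {
      J <- 30; n_j <- rep(5, J); N <- sum(n_j)
      z <- rnorm(J); x <- rnorm(N)
      u <- rmvnorm(J, mean = c(0, 0),
                   sigma = matrix(c(0.13, 0, 0, 1), ncol = 2))
      pi_0 <- -1  + 0.3 * z + u[, 1]
      pi_1 <- 0.3 + 0.3 * z + u[, 2]
      eta  <- rep(pi_0, n_j) + rep(pi_1, n_j) * x
      data.frame(Y = rbinom(N, 1, plogis(eta)), X = x,
                 Z = rep(z, n_j), cluster = rep(1:J, n_j))
    }

    nsim <- 1000   ## number of replications (illustrative)
    est <- replicate(nsim, {
      fit <- glmer(Y ~ X * Z + (1 | cluster), family = binomial,
                   data = simulate_once(), nAGQ = 10)
      ## fixed effects in the order (Intercept), X, Z, X:Z,
      ## plus the random-intercept SD
      c(fixef(fit), sd0 = as.data.frame(VarCorr(fit))$sdcor[1])
    })

    ## average over replications, then compute relative bias (in percent)
    true <- c(-1, 0.3, 0.3, 0.3, sqrt(0.13))
    rel_bias <- (rowMeans(est) - true) / true * 100

Comparing rel_bias computed this way with the table should be far more meaningful than comparing a single simulated data set.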