GLMM Distribution – Which Distribution Family to Choose for My GLMM

r, regression

I am modelling the effect of individuals' chronotype on their school performance.
So, my data frame consists of the subjective and school-declared performance (dependent variable) of middle school students (random effect), and their chronotype (independent variable).
The students' subjective performance was measured on a scale from 1 to 5, while the school-declared performance was on a scale from 0 to 10. Thus, I Z-standardized these performance values.

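# Loading data
data <- read.table("clipboard", header = TRUE)
# Scale function: intended as Z-score standardization.
# Note: with center = FALSE, scale() does not subtract the mean; it only
# divides by the root mean square of the values.
data.st <- as.data.frame(scale(data$performance, center = FALSE))
data$performance.st <- data.st$V1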
hist(data$performance.st, prob=TRUE, ylim=c(0,1), 
     main = "Histogram", col= "lightblue")

shapiro.test(data$performance.st)
#W = 0.96497, p-value = 6.798e-07

Then I modelled my data considering a Poisson family:

# Poisson distribution
library(lme4)  # provides glmer()
glmer.poisson <- glmer(performance ~ chronotype + (1 | random),
                       family = poisson(link = log),
                       data = data)

Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
 Family: poisson  ( log )
Formula: performance ~ chronotype + (1 | random)
   Data: data
     AIC      BIC   logLik deviance df.resid 
     Inf      Inf     -Inf      Inf      310 
Random effects:
 Groups Name        Std.Dev.
 random (Intercept) 1       
Number of obs: 315, groups:  random, 55
Fixed Effects:
 (Intercept)  chronotypemm  chronotypemv  chronotypeve  
    -0.03279      -0.02495      -0.12318       0.01500  
optimizer (Nelder_Mead) convergence code: 0 (OK) ; 29610 optimizer warnings; 1 lme4 warnings

library(DHARMa)  # provides simulateResiduals()
plot(simulateResiduals(glmer.poisson))

However, the GLMM clearly did not fit.
Now I have some questions regarding my data and GLMMs.

First: I don't really know which family to use to model my data. Although I have non-integer values, I think Poisson is the right family to use (although a lognormal distribution gave nice results). But what about the warnings from the lme4 package saying that my values are non-integer? And the optimizer warnings? Should I change my optimizer? If so, how?
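As a side note on the non-integer warnings: the Poisson distribution only puts probability on the counts 0, 1, 2, ..., so a non-integer response has zero likelihood. A minimal base R sketch (the values 2 and 1.5, the rate and the parameter count are arbitrary, chosen only for illustration) shows how this can lead to an infinite AIC like the one in the output above:

```r
# Log-likelihood contribution of an integer vs. a non-integer value
# under a Poisson distribution with an arbitrary rate of 2.
ll_integer <- dpois(2, lambda = 2, log = TRUE)

# dpois() warns about non-integer x and returns probability 0 (log: -Inf);
# the warning is suppressed here only to keep the output clean.
ll_noninteger <- suppressWarnings(dpois(1.5, lambda = 2, log = TRUE))

# A single -Inf term makes the total log-likelihood -Inf,
# and AIC = -2 * logLik + 2 * k then becomes Inf.
k <- 5  # number of estimated parameters, arbitrary for this illustration
aic <- -2 * (ll_integer + ll_noninteger) + 2 * k

print(c(ll_integer = ll_integer, ll_noninteger = ll_noninteger, aic = aic))
```

This only illustrates the likelihood arithmetic; it does not reproduce the glmer fit itself.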

Second: A colleague suggested that I use gamma or negative binomial instead of Poisson. However, after seeing the histogram I thought a lognormal distribution would be worth a try. So I modelled accordingly:

#Gamma distribution
gamma <- glmer(performance ~ chronotype + (1|random), 
               family = Gamma(link="inverse"),
               data =  data)
qqnorm(resid(gamma), pch=16)
qqline(resid(gamma))
plot(gamma)

# Negative binomial distribution
glmer.nb <- glmer.nb(performance ~ chronotype + (1|random),
               data =  data, family=MASS::negative.binomial(theta=1.75))

plot(simulateResiduals(glmer.nb)) 
# The simulated residuals of the glmer.nb were very similar to the poisson model
plot(glmer.nb)
qqnorm(resid(glmer.nb), pch=16)
qqline(resid(glmer.nb))

# lognormal distribution
lognormal <- glmer(formula = log(performance) ~ chronotype + (1|random),
               data = data, family=gaussian(link = identity))
plot(simulateResiduals(lognormal))
plot(lognormal)

qqnorm(resid(lognormal), pch=16)
qqline(resid(lognormal))

The lognormal seems to be a good distribution for my data, but I am not sure.

Finally: Suppose that I am building a GLMM with a Poisson family. Should I standardize my dependent variable even though Poisson uses a log link function? I think the correct answer is yes, since my data are on different scales.

Best Answer

Plots are good.

We often learn a lot from visualizing the raw data. It's more effective than fitting an inappropriate model and then wondering whether the residual QQ plot looks normal "enough".

So I made a few plots of your data. The plots suggest a substantial revision of your analysis; choosing a different distribution family won't be sufficient to make sense of your data.

First, we look at histograms of performance scores by subject (mat, pt, science) and type (school-declared, subjective).

The transformation

data$performance = scale(data$performance, center = FALSE)

assumes subjective scores and school-declared scores have the same variance. The histograms, which are aligned on the x-axis, show the assumption doesn't hold. So the transformation is misapplied.
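To make concrete what that call computes, here is a small base R sketch on made-up scores (the vectors `subjective` and `school` are hypothetical toy data, not the data from the question). It only illustrates the arithmetic of `scale()`: with `center = FALSE` the pooled values are divided by a single root mean square, whereas a z-score within each measure centers and scales each measure separately.

```r
set.seed(1)

# Hypothetical toy scores on the two scales described in the question
subjective <- sample(1:5, 50, replace = TRUE)   # 1 to 5
school     <- sample(0:10, 50, replace = TRUE)  # 0 to 10
pooled     <- c(subjective, school)

# scale(x, center = FALSE) divides by the root mean square,
# defined in R as sqrt(sum(x^2) / (n - 1)); the mean is not removed.
rms      <- sqrt(sum(pooled^2) / (length(pooled) - 1))
scaled   <- as.numeric(scale(pooled, center = FALSE))
same_rms <- isTRUE(all.equal(scaled, pooled / rms))

# For comparison: z-scores computed separately within each measure
z_subjective <- as.numeric(scale(subjective))
z_school     <- as.numeric(scale(school))

print(same_rms)
print(c(mean_pooled_scaled = mean(scaled),
        mean_z_subjective  = mean(z_subjective),
        sd_z_school        = sd(z_school)))
```

Note that even per-measure z-scores would not address the point made next, namely that the two measures are different quantities.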

Once you standardize the scores, you ignore the fact that you are working with different measures of performance. By plotting subjective against school-declared scores, we see that the two measures are correlated only for math. There is little agreement between students' perception and teachers' evaluation of performance in pt and science.

And finally, let's look at (school-declared) performance as a function of chronotype. There seems to be something going on, though it will be difficult to estimate the effect of chronotype with high precision, as 39 out of 55 students have the "in" chronotype; the other three types are rare. A lot of the difference is in the spread (variability) rather than the mean, except for pt, where mv seems to be associated with lower performance.

Here is the R code to reproduce the figures; I use ggplot2.

library("readxl")
library("tidyverse")

data <-
  read_xlsx(
    "data.xlsx",
    col_types = c("text", "text", "text", "numeric")
  ) %>%
  separate(
    subjects,
    c("type", "subject")
  ) %>%
  mutate(
    type = recode(type, "subj" = "subjective")
  )

data %>%
  ggplot(
    aes(performance)
  ) +
  geom_histogram(
    breaks = 0:10
  ) +
  facet_grid(
    type ~ subject
  )

ggsave("chronotypes1.png", width = 12, height = 8, dpi = 600)

data %>%
  pivot_wider(
    names_from = type,
    values_from = performance
  ) %>%
  ggplot(
    aes(subjective, school)
  ) +
  geom_smooth(
    method = "lm", formula = y ~ x, se = FALSE
  ) +
  geom_jitter(
    width = 0.1,
    height = 0,
    shape = 1,
    size = 2,
    stroke = 1
  ) +
  facet_grid(
    ~subject
  )

ggsave("chronotypes2.png", width = 12, height = 4, dpi = 600)

data %>%
  pivot_wider(
    names_from = type,
    values_from = performance
  ) %>%
  ggplot(
    aes(chronotype, school)
  ) +
  geom_jitter(
    aes(color = chronotype),
    width = 0.1,
    height = 0,
    shape = 1,
    size = 2,
    stroke = 1
  ) +
  facet_grid(
    ~subject
  ) +
  theme(legend.position = "none")

ggsave("chronotypes3.png", width = 12, height = 4, dpi = 600)

Related Solutions

Regression – Choosing the Right Bootstrapped Regression Model

Bootstrapping is a resampling method to estimate the sampling distribution of your regression coefficients and therefore calculate the standard errors/confidence intervals of your regression coefficients. This post has a nice explanation. For a discussion of how many replications you need, see this post.

The nonparametric bootstrap repeatedly draws your observations at random with replacement (i.e. some observations are drawn once, others multiple times and some not at all), then fits the logistic regression and stores the coefficients. This is repeated $n$ times, for example 10'000, so you end up with 10'000 different sets of regression coefficients. These can then be used to calculate confidence intervals. As a pseudo-random number generator is used, you can set the seed to an arbitrary number to ensure that you get exactly the same results each time (see example below). To get stable estimates, I would suggest more than 1000 replications, maybe 10'000. You could run the bootstrap several times and see whether the estimates change much between 1000 and 10'000 replications. In plain English: you should add replications until you reach convergence. If your bootstrap estimates differ substantially from those of the observed, single model, this could indicate that the observed model does not appropriately reflect the structure of your sample. The function boot in R, for example, reports the "bias", which is the difference between the regression coefficients of your single model and the mean of the bootstrap samples.
When performing the bootstrap, you are not interested in a single bootstrap sample, but in the distribution of statistics (e.g. regression coefficients) over the, say, 10'000 bootstrap samples.
I'd say 10'000 is better than 1000. With modern computers, this shouldn't pose a problem. In the example below, it took my PC around 45 seconds to draw 10'000 samples. This varies with your sample size, of course. The bigger your sample size, the higher the number of iterations should be to ensure that every observation is taken into account.
What do you mean "the results vary each time"? Recall that in every bootstrap step, the observations are newly drawn with replacement. Therefore, you're likely to end up with slightly different regression coefficients because your observations differ. But as I've said: you are not really interested in the result of a single bootstrap sample. When your number of replications is high enough, the bootstrap should yield very similar confidence intervals and point estimates every time.
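The resampling loop described above can also be written by hand in base R, which may help to see what a bootstrap routine does internally. This is only a sketch under stated assumptions: the data frame `toy` is simulated (not the admissions data), the number of replications `R` is kept small for speed, and the intervals are simple percentile intervals rather than BCa.

```r
set.seed(42)

# Simulated toy data for a logistic regression (hypothetical, for illustration)
n   <- 200
x   <- rnorm(n)
y   <- rbinom(n, 1, plogis(-0.5 + 1 * x))
toy <- data.frame(x = x, y = y)

# The single model fitted to the observed sample
fit <- glm(y ~ x, data = toy, family = binomial)

# Nonparametric bootstrap: resample rows with replacement, refit, store coefficients
R <- 500
boot_coefs <- t(replicate(R, {
  idx <- sample(n, replace = TRUE)
  coef(glm(y ~ x, data = toy[idx, ], family = binomial))
}))

# Summaries over the R bootstrap samples
bias <- colMeans(boot_coefs) - coef(fit)   # mean of bootstrap estimates minus original
se   <- apply(boot_coefs, 2, sd)           # bootstrap standard errors
ci   <- apply(boot_coefs, 2, quantile, probs = c(0.025, 0.975))  # percentile intervals

print(rbind(original = coef(fit), bias = bias, se = se))
print(ci)
```

Rerunning with a different seed or a larger `R` is one way to check how stable the summaries are.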

Here is an example in R:

library(boot)

mydata <- read.csv("http://www.ats.ucla.edu/stat/data/binary.csv")

head(mydata)

mydata$rank <- factor(mydata$rank)

my.mod <- glm(admit ~ gre + gpa + rank, data = mydata, family = "binomial")

summary(my.mod)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.989979   1.139951  -3.500 0.000465 ***
gre          0.002264   0.001094   2.070 0.038465 *  
gpa          0.804038   0.331819   2.423 0.015388 *  
rank2       -0.675443   0.316490  -2.134 0.032829 *  
rank3       -1.340204   0.345306  -3.881 0.000104 ***
rank4       -1.551464   0.417832  -3.713 0.000205 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

# Set up the non-parametric bootstrap

logit.bootstrap <- function(data, indices) {

  d <- data[indices, ]
  fit <- glm(admit ~ gre + gpa + rank, data = d, family = "binomial")

  return(coef(fit))
}

set.seed(12345) # seed for the RNG to ensure that you get exactly the same results as here

logit.boot <- boot(data=mydata, statistic=logit.bootstrap, R=10000) # 10'000 samples

logit.boot

Bootstrap Statistics :
        original        bias    std. error
t1* -3.989979073 -7.217244e-02 1.165573039
t2*  0.002264426  4.054579e-05 0.001146039
t3*  0.804037549  1.440693e-02 0.354361032
t4* -0.675442928 -8.845389e-03 0.329099277
t5* -1.340203916 -1.977054e-02 0.359502576
t6* -1.551463677 -4.720579e-02 0.444998099

# Calculate confidence intervals (Bias corrected ="bca") for each coefficient

boot.ci(logit.boot, type="bca", index=1) # intercept
95%   (-6.292, -1.738 )  
boot.ci(logit.boot, type="bca", index=2) # gre
95%   ( 0.0000,  0.0045 ) 
boot.ci(logit.boot, type="bca", index=3) # gpa
95%   ( 0.1017,  1.4932 )
boot.ci(logit.boot, type="bca", index=4) # rank2
95%   (-1.3170, -0.0369 )
boot.ci(logit.boot, type="bca", index=5) # rank3
95%   (-2.040, -0.629 )
boot.ci(logit.boot, type="bca", index=6) # rank4
95%   (-2.425, -0.698 )

The bootstrap output displays the original regression coefficients ("original") and their bias, which is the difference between the original coefficients and the bootstrapped ones. It also gives the standard errors. Note that they are a bit larger than the original standard errors.

Of the confidence intervals, the bias-corrected and accelerated ("bca") type is usually preferred. It gives the confidence intervals on the original scale, i.e. the log-odds scale of the coefficients. For confidence intervals for the odds ratios, just exponentiate the confidence limits.
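As a small worked example of that last step, the following base R sketch exponentiates the gpa coefficient and its BCa limits as printed in the output above (the numbers are copied from that output; nothing is re-estimated here):

```r
# gpa coefficient and its 95% BCa limits on the log-odds scale,
# copied from the output shown above
coef_gpa <- 0.804038
ci_gpa   <- c(lower = 0.1017, upper = 1.4932)

# Exponentiating moves estimate and limits to the odds-ratio scale
or_gpa    <- exp(coef_gpa)
or_ci_gpa <- exp(ci_gpa)

print(round(c(odds_ratio = or_gpa, or_ci_gpa), 3))
```

Because exp() is monotone, the lower and upper limits keep their order, and an interval that excludes 0 on the log-odds scale excludes 1 on the odds-ratio scale.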

Mixed Models in R – Dealing with One Observation Per Level

I would strongly disagree with the practice of fitting a mixed model where you have the same number of groups as observations, both on conceptual grounds, as there are no "groups", and on computational grounds, as your model should have identifiability issues, in the case of an LMM at least. (I work with LMMs exclusively, so I might be a bit biased. :) )

The computational part: Assume for example the standard LME model where $y \sim N(X\beta, ZDZ^T + \sigma^2 I)$. Assuming now that you have an equal number of observations and groups (let's say under a "simple" clustering, no crossed or nested effects etc.), then all your sample variance would be moved into the $D$ matrix, and $\sigma^2$ should be zero. (I think you have convinced yourself of this already.) It is almost equivalent to having as many parameters as data points in a linear model. You have an over-parametrized model. Therefore the regression is a bit nonsensical.
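One way to see the identifiability problem is to build the marginal covariance $ZDZ^T + \sigma^2 I$ for the simplest case of one observation per group. The base R sketch below assumes a random intercept only, so that $Z$ is the identity and $D = \tau^2 I$; the helper `marginal_cov` and the symbol `tau2` are names introduced here for illustration.

```r
n <- 5          # number of observations = number of groups
Z <- diag(n)    # one observation per group: each row belongs to its own group

# Marginal covariance Z D Z^T + sigma^2 I with D = tau2 * I
marginal_cov <- function(tau2, sigma2) {
  Z %*% (tau2 * diag(n)) %*% t(Z) + sigma2 * diag(n)
}

# Two different splits of the same total variance ...
V_a <- marginal_cov(tau2 = 1.0, sigma2 = 0.0)
V_b <- marginal_cov(tau2 = 0.3, sigma2 = 0.7)

# ... give exactly the same covariance, so the data cannot tell them apart
same_covariance <- isTRUE(all.equal(V_a, V_b))
print(same_covariance)
```

Only the sum of the two variance components enters the likelihood in this setting, which is why the split between $D$ and $\sigma^2$ is arbitrary.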

(I don't understand what you mean by "reasonable" AIC. AIC should be computable in the sense that despite over-fitting your data you are still "computing something".)

On the other hand, with glmer (let's say you have specified the family to be Poisson) you have a link function that says how your $y$ depends on $X\beta$ (in the case of a Poisson that is simply a log, because the mean of $y$ must be positive). In such cases the scale parameter is fixed, so the observation-level effect can account for over-dispersion and therefore you do have identifiability (and that's why, although glmer complained, it still gave you results); this is how you "get around" the issue of having as many groups as observations.
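The over-dispersion argument can be illustrated with a short simulation in base R. This is a sketch with arbitrary values (intercept 1, random-effect standard deviation 0.8), not a fit of any model from the question: adding a normally distributed effect per observation on the log scale pushes the variance above the mean, which a plain Poisson cannot do.

```r
set.seed(1)
n <- 10000

# Plain Poisson: variance equals the mean
y_plain <- rpois(n, lambda = exp(1))

# Poisson with one normal random effect per observation on the log scale
u      <- rnorm(n, mean = 0, sd = 0.8)
y_olre <- rpois(n, lambda = exp(1 + u))

# Dispersion ratio (variance / mean): about 1 without the effect, larger with it
ratio_plain <- var(y_plain) / mean(y_plain)
ratio_olre  <- var(y_olre) / mean(y_olre)

print(c(ratio_plain = ratio_plain, ratio_olre = ratio_olre))
```

Since the Poisson variance is tied to the mean, the extra variability has to be absorbed by the per-observation effect, which is what makes its variance estimable.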

The conceptual part: I think this is a bit more "subjective" but also a bit more straightforward. You use mixed-effects models because you have essentially recognised that there is some group-related structure in your error. Now if you have as many groups as data points, there is no structure to be seen. Any deviations in your LM error structure that could be attributed to a "grouping" are now attributed to the specific observation point (and as such you end up with an over-fitted model).

In general, single-observation groups tend to be a bit messy; to quote D. Bates from the r-sig-mixed-models mailing list:

I think you will find that there is very little difference in the model fits whether you include or exclude the single-observation groups. Try it and see.
