I'am modelling the effect of individuals' chronotype on their respective school performance.
So, my dataframe consist of the subjective and school declared performance (dependent variable) of middle school students (random variable), and their chronotype (independent variable).
The students' subjective performance was measured on a scale from 1 to 5, while the school declared performance was on a scale from 0 to 10. Thus, I Z-standardized these performance values.
#Loading data
data <- read.table("clipboard", header=T)
#Scale function: Z-Score Standardization
data.st <- as.data.frame(scale(data$performance, center = FALSE))
data$performance.st <- data.st$V1
hist(data$performance.st, prob=TRUE, ylim=c(0,1),
main = "Histogram", col= "lightblue")
shapiro.test(data$performance.st)
#W = 0.96497, p-value = 6.798e-07
Then I modelled my data considering a poisson family:
# Poisson distribution
glmer.poisson <- glmer(performance ~ chronotype + (1|random),
family = poisson(link = log),
data = data)
Generalized linear mixed model fit by maximum likelihood (Laplace Approximation) ['glmerMod']
Family: poisson ( log )
Formula: performance ~ chronotype + (1 | random)
Data: data
AIC BIC logLik deviance df.resid
Inf Inf -Inf Inf 310
Random effects:
Groups Name Std.Dev.
random (Intercept) 1
Number of obs: 315, groups: random, 55
Fixed Effects:
(Intercept) chronotypemm chronotypemv chronotypeve
-0.03279 -0.02495 -0.12318 0.01500
optimizer (Nelder_Mead) convergence code: 0 (OK) ; 29610 optimizer warnings; 1 lme4 warnings
plot(simulateResiduals(glmer.poisson))
However, the GLMM clearly did not fit.
And now I have some questions regarding my data and GLMMs.
First: I don't really know which family to use to model my data. Although I have non-integer values, I think poisson is the right family to use (although lognormal distribution provided nice results). Though, what about the warnings returned in the lm4 package saying that my values are non-integer? and the optimizer warnings? should I change my optimizer? if so, how?
Second: A fellow of mine suggested for me to use gamma or negative binomial instead of poisson. Though, after I see the histogram i thought a lognormal distribution would be a nice try. So I modelled accordingly:
#Gamma distribution
gamma <- glmer(performance ~ chronotype + (1|random),
family = Gamma(link="inverse"),
data = data)
qqnorm(resid(gamma), pch=16)
qqline(resid(gamma))
plot(gamma)
# Negative binomal distribution
glmer.nb <- glmer.nb(performance ~ chronotype + (1|random),
data = data, family=MASS::negative.binomial(theta=1.75))
plot(simulateResiduals(glmer.nb))
# The simulated residuals of the glmer.nb were very similar to the poisson model
plot(glmer.nb)
qqnorm(resid(glmer.nb), pch=16)
qqline(resid(glmer.nb))
# lognormal distribution
lognormal <- glmer(formula = log(performance) ~ chronotype + (1|random),
data = data, family=gaussian(link = identity))
plot(simulateResiduals(lognormal))
plot(lognormal)
qqnorm(resid(lognormal), pch=16)
qqline(resid(lognormal))
The lognormal seems to be a good distribution for my data, but I am not sure.
Finally: Suppose that I'am building a GLMM considering a poisson family. Should I standardize my dependent variable even though poisson use a log link function? I think the correct answer is yes since my data have different scales.
Best Answer
Plots are good.
We often learn a lot from visualizing the raw data. It's more effective than fitting an inappropriate model and then wondering whether the residual QQ plot looks normal "enough".
So I made a few plots of your data. The plots suggest a substantial revision of your analysis; choosing a different distribution family won't be sufficient to make sense of your data.
First, we look at histograms of performance scores by subject (mat, pt, science) and type (school-declared, subjective).
The transformation
assumes subjective scores and school-declared scores have the same variance. The histograms, which are aligned on the x-axis, show the assumption doesn't hold. So the transformation is misapplied.
Once you standardize the scores, you ignore the fact that you are working with different measures of performance. By plotting subjective against school-declared scores, we see that the two measures are correlated only for math. There is little agreement between students' perception and teachers' evaluation of performance in pt and science.
And finally, let's look at (school-declared) performance as a function of choronotype. There seems to be something going on though it will be difficult to estimate the effect of chronotype with high precision as 39 out of 55 students have the
in
chronotype; the other three types are rare. A lot of the difference is in the spread (variability) rather than the mean, except for pt wheremv
seems to be associated with lower performance.Here is the R code to reproduce the figures; I use ggplot2.