Solved – Non-normality in linear mixed models/GLMM

generalized linear model, mixed model, normality-assumption, residuals, weights

I have some data of time-depth profiles of whales.
I want to model how the maximum depth of each dive (the deepest point reached during a dive) changes between two dive types: foraging (the whale feeds) and non-foraging (no feeding activity). I also include as a fixed effect whether the dive was carried out during daytime, twilight or nighttime, and whale ID as a random effect.

  • id = dive ID (each row represents one dive)

Example of data:

 id whale max_depths  dive_type  diel
   1    1         57         NF    Day
   2    1         26         NF    Day
   3    1         18         NF    Day
   4    1         23          F  Night  
   5    1         51          F  Night

I tried first to use a linear mixed model. The following had the lowest AIC:

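library(nlme)  # provides lme(), corARMA() and varIdent()

# Random intercept per whale, ARMA(1, 2) errors within each whale,
# and a separate residual variance for each diel period
mod3_b <- lme(max_depths ~ dive_type * diel_1, random = ~ 1 | whale,
              data = all_dives_data, na.action = na.exclude,
              correlation = corARMA(form = ~ 1 | whale, p = 1, q = 2),
              weights = varIdent(form = ~ 1 | diel_1))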

I had a problem of high autocorrelation, which was solved by using corARMA(), and of heteroskedasticity, which decreased significantly after adding weights (probably because twilight has significantly less data than day and night).
Nevertheless, my residuals are not normal (graphs below):

[image: residual diagnostic plot]

[image: residual diagnostic plot]

Due to this, I tried to use instead a GLMM.

My questions are:

  • 1- The residuals of the GLMM are still not normal. Is that a problem?
  • 2- Should I transform the data instead? (I think this increases the heteroskedasticity problem, though.)
  • 3- Which distribution is better (I tried both Poisson and negative binomial), or how do I compare GLMM models (since they don't have AIC)?
  • 4- Can I add weights to a GLMM? And if yes, how? I tried doing so but with no success (I guess it may be important since the diel categories are not equally represented in the data).

UPDATE

Resulting plot of standardized residuals vs fitted values of the model with max_depths log-transformed:

LMM:

[image: standardized residuals vs fitted values, LMM]

GLMM:

[image: standardized residuals vs fitted values, GLMM]

QQ and hist of log transformed data

[image: QQ plot]

[image: histogram]

Best Answer

There are a few points here:

The research question concerns the association of maximum dive depth by whales with whether the dives are of foraging or non-foraging type.

After fixing problems with autocorrelation, the main problem concerns the normality of residuals. Even with non-normal residuals a model may still be useful, but in order to make valid inferences we would like the residuals to be independent and close to normally distributed.
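As a rough sketch of how both properties can be checked for an lme() fit: the data below are simulated stand-ins (the names sim, n_whales, n_dives and all numeric values are my assumptions, not the real dive data), and corAR1() is swapped in for the corARMA() structure from the question to keep the example small. Normalized residuals are used because they take the fitted correlation structure into account.

```r
library(nlme)

set.seed(1)
# Simulated stand-in for the dive data (hypothetical values, not the real data)
n_whales <- 6
n_dives  <- 40
n        <- n_whales * n_dives
sim <- data.frame(
  whale     = factor(rep(seq_len(n_whales), each = n_dives)),
  dive_type = factor(sample(c("NF", "F"), n, replace = TRUE),
                     levels = c("NF", "F")),
  diel      = factor(sample(c("Day", "Night", "Twilight"), n,
                            replace = TRUE, prob = c(0.45, 0.45, 0.10)))
)
whale_eff <- rnorm(n_whales, sd = 0.3)[as.integer(sim$whale)]
# Right-skewed depths, so the raw-scale model should show non-normal residuals
sim$max_depths <- exp(3.5 + 0.6 * (sim$dive_type == "F") + whale_eff +
                      rnorm(n, sd = 0.5))

fit <- lme(max_depths ~ dive_type * diel, random = ~ 1 | whale, data = sim,
           correlation = corAR1(form = ~ 1 | whale))

# Normalized residuals are adjusted for the fitted correlation structure
res <- as.numeric(residuals(fit, type = "normalized"))

# Independence: within-whale autocorrelation of the normalized residuals
acf_tab <- ACF(fit, resType = "normalized", maxLag = 5)
print(acf_tab)

# Normality: QQ plot, plus a formal test (shapiro.test accepts 3 to 5000 values)
qqnorm(res)
qqline(res)
sw <- shapiro.test(res)
print(sw)
```

With many dives the formal test will flag even small departures, so I would rely mainly on the QQ plot.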

The QQ plot and histogram show clear departures from normality. A GLMM was considered, but given that the outcome variable is dive depth, there is no obvious candidate distribution for a GLMM. One thing to consider is whether the outcome is bounded. Obviously it is bounded below by zero, but presumably it is also bounded above by the ocean floor! For positive continuous data like this, a gamma GLMM could be considered, but based on the plots it would seem appropriate to try to improve the fit of the LMM first. Also, the upper bound set by the sea depth will be variable and possibly unknown, so that's another reason to keep it simple.
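If a gamma GLMM were tried anyway, one possible sketch uses MASS::glmmPQL(), which I pick here only because it accepts the same correlation structures as lme(); it is not what the question used. The data are simulated (sim and all numeric values are assumptions), and corAR1() stands in for corARMA(). PQL is an approximate method based on quasi-likelihood, so it does not give a likelihood-based AIC.

```r
library(MASS)   # glmmPQL()
library(nlme)   # corAR1()

set.seed(2)
# Simulated stand-in for the dive data (hypothetical values, not the real data)
n_whales <- 6
n_dives  <- 40
n        <- n_whales * n_dives
sim <- data.frame(
  whale     = factor(rep(seq_len(n_whales), each = n_dives)),
  dive_type = factor(sample(c("NF", "F"), n, replace = TRUE),
                     levels = c("NF", "F")),
  diel      = factor(sample(c("Day", "Night", "Twilight"), n,
                            replace = TRUE, prob = c(0.45, 0.45, 0.10)))
)
whale_eff <- rnorm(n_whales, sd = 0.3)[as.integer(sim$whale)]
mu <- exp(3.5 + 0.6 * (sim$dive_type == "F") + whale_eff)
# Gamma-distributed depths with mean mu (shape 4 is an arbitrary choice)
sim$max_depths <- rgamma(n, shape = 4, rate = 4 / mu)

# Gamma GLMM with log link, random intercept per whale, AR(1) errors within whale
fit_gamma <- glmmPQL(max_depths ~ dive_type + diel, random = ~ 1 | whale,
                     family = Gamma(link = "log"), data = sim,
                     correlation = corAR1(form = ~ 1 | whale),
                     verbose = FALSE)

# Fixed effects are on the log scale; exp() gives multiplicative effects on the mean
print(summary(fit_gamma)$tTable)
print(exp(fixef(fit_gamma)))
```

The log link keeps fitted depths positive, but nothing in this model enforces an upper bound such as the sea floor.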

Log-transforming the response has improved the QQ plot and histogram of residuals considerably and these can be considered as close to normal.
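A minimal sketch of that comparison, on simulated data (sim, the helper qq_cor() and all numeric values are my assumptions): the same LMM is fitted to the raw and the log-transformed response, and the correlation between the points of the normal QQ plot is used as a crude summary of how straight the plot is.

```r
library(nlme)

set.seed(3)
# Simulated stand-in for the dive data (hypothetical values, not the real data)
n_whales <- 6
n_dives  <- 40
n        <- n_whales * n_dives
sim <- data.frame(
  whale     = factor(rep(seq_len(n_whales), each = n_dives)),
  dive_type = factor(sample(c("NF", "F"), n, replace = TRUE),
                     levels = c("NF", "F")),
  diel      = factor(sample(c("Day", "Night", "Twilight"), n,
                            replace = TRUE, prob = c(0.45, 0.45, 0.10)))
)
whale_eff <- rnorm(n_whales, sd = 0.3)[as.integer(sim$whale)]
# Log-normal depths: normal on the log scale, right-skewed on the raw scale
sim$max_depths <- exp(3.5 + 0.6 * (sim$dive_type == "F") + whale_eff +
                      rnorm(n, sd = 0.5))

fit_raw <- lme(max_depths ~ dive_type * diel, random = ~ 1 | whale, data = sim)
fit_log <- lme(log(max_depths) ~ dive_type * diel, random = ~ 1 | whale,
               data = sim)

# Hypothetical helper: correlation between theoretical and sample quantiles;
# values closer to 1 mean a straighter normal QQ plot
qq_cor <- function(fit) {
  r  <- as.numeric(residuals(fit, type = "normalized"))
  qq <- qqnorm(r, plot.it = FALSE)
  cor(qq$x, qq$y)
}

qq <- c(raw = qq_cor(fit_raw), log = qq_cor(fit_log))
print(qq)
```

Note that the AIC values of fit_raw and fit_log are not comparable, because the two models have responses on different scales.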

The plot of residuals vs fitted values appears a little strange, mostly because there is a small gap in the middle. There are very few fitted values between approximately 4.85 and 5.1 (on the log scale, presumably), so the model is predicting very few dive depths in this range. Could there be any physical or biological reason for this? Another (possibly related) explanation is an omitted binary or categorical variable (something related to dive depth like shallow/deep, but probably not that simple or obvious). Anyway, I don't consider this to be much of a problem.
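One way to look into such a gap, sketched on simulated data (sim, fit_val and all numeric values are assumptions), is to tabulate the fitted values by covariate combination. With only categorical fixed effects, the population-level fitted values can take just one value per dive_type by diel cell, and the whale intercepts spread those out, so some clustering and gaps are to be expected.

```r
library(nlme)

set.seed(4)
# Simulated stand-in for the dive data (hypothetical values, not the real data)
n_whales <- 6
n_dives  <- 40
n        <- n_whales * n_dives
sim <- data.frame(
  whale     = factor(rep(seq_len(n_whales), each = n_dives)),
  dive_type = factor(sample(c("NF", "F"), n, replace = TRUE),
                     levels = c("NF", "F")),
  diel      = factor(sample(c("Day", "Night", "Twilight"), n,
                            replace = TRUE, prob = c(0.45, 0.45, 0.10)))
)
whale_eff <- rnorm(n_whales, sd = 0.3)[as.integer(sim$whale)]
sim$max_depths <- exp(3.5 + 0.6 * (sim$dive_type == "F") + whale_eff +
                      rnorm(n, sd = 0.5))

fit_log <- lme(log(max_depths) ~ dive_type * diel, random = ~ 1 | whale,
               data = sim)

# Fitted values including the whale intercepts (as in a residuals vs fitted plot)
sim$fit_val <- as.numeric(fitted(fit_log))

# Range of fitted values in each covariate cell: shows which cells lie on
# either side of a gap
cell_ranges <- aggregate(fit_val ~ dive_type + diel, data = sim, FUN = range)
print(cell_ranges)

# Population-level fitted values (fixed effects only): one value per cell
pop_fit    <- as.numeric(fitted(fit_log, level = 0))
n_distinct <- length(unique(round(pop_fit, 8)))
print(n_distinct)
```

If the two sides of the gap separate cleanly by dive type or diel period, the gap simply reflects the fixed-effect estimates rather than a problem with the fit.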
