Solved – Non-normality in linear mixed models/GLMM

I have time-depth profile data from whales.
I want to model how the maximum depth of each dive (the deepest point reached during a dive) differs between two dive types: foraging (the whale feeds) and non-foraging (no feeding activity). I also include as a fixed effect whether the dive took place during daytime, twilight or night-time, and whale ID as a random effect.

id = dive ID (each row represents one dive)

Example of data:

 id whale max_depths  dive_type  diel
   1    1         57         NF    Day
   2    1         26         NF    Day
   3    1         18         NF    Day
   4    1         23          F  Night  
   5    1         51          F  Night

I tried first to use a linear mixed model. The following had the lowest AIC:

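library(nlme)

# Random intercept per whale, ARMA(1,2) errors within whale,
# and a separate residual variance for each diel period
mod3_b <- lme(max_depths ~ dive_type * diel_1, random = ~ 1 | whale,
              data = all_dives_data, na.action = na.exclude,
              correlation = corARMA(form = ~ 1 | whale, p = 1, q = 2),
              weights = varIdent(form = ~ 1 | diel_1))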

I had a problem with high autocorrelation, which was solved by using corARMA(), and with heteroskedasticity, which decreased substantially after adding weights (probably because twilight has far fewer observations than day and night).
Nevertheless, my residuals are not normal
(graphs below):

Due to this, I tried to use instead a GLMM.

My questions are:

1- The residuals of the GLMM are still not normal. Is that a problem?
2- Should I transform the data instead? (I think this makes the heteroskedasticity problem worse, though.)
3- Which distribution is better (I tried both Poisson and negative binomial), or how do I compare GLMMs (since they don't have AIC)?
4- Can I add weights to a GLMM? And if yes, how? I tried but with no success (I guess it may be important since the diel categories are not equally represented in the data).

UPDATE

Resulting plots of standardized residuals vs fitted values for the models with max_depths log-transformed:

LMM:

GLMM:

QQ plot and histogram for the log-transformed data

Best Answer

There are a few points here:

The research question concerns the association of maximum dive depth by whales with whether the dives are of foraging or non-foraging type.

After fixing the problems with autocorrelation, the main remaining problem concerns the normality of the residuals. Even with non-normal residuals a model may still be useful, but in order to make valid inferences we would like the residuals to be independent and close to normally distributed.
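Independence can be checked numerically as well as by eye. The following is a minimal sketch, assuming simulated data with AR(1) errors within each whale (the data frame `sim`, its column `log_depth` and all parameter values are made up for illustration, and `corAR1` is used as a simpler stand-in for the `corARMA` structure in the question). It compares the autocorrelation of the normalized residuals with and without a correlation structure:

```r
library(nlme)

set.seed(3)
n_whales <- 5
n_dives  <- 60

# Hypothetical stand-in data: NOT the whale data from the question
sim <- data.frame(
  whale     = factor(rep(seq_len(n_whales), each = n_dives)),
  dive_type = factor(sample(c("F", "NF"), n_whales * n_dives, replace = TRUE))
)

# AR(1) errors simulated separately for each whale
ar_err <- unlist(lapply(seq_len(n_whales), function(i) {
  as.numeric(arima.sim(list(ar = 0.6), n = n_dives, sd = 0.3))
}))
sim$log_depth <- 4.5 + 0.5 * (sim$dive_type == "F") + ar_err

# Same fixed and random effects, without and with an AR(1) structure
fit_plain <- lme(log_depth ~ dive_type, random = ~ 1 | whale, data = sim)
fit_ar1   <- lme(log_depth ~ dive_type, random = ~ 1 | whale, data = sim,
                 correlation = corAR1(form = ~ 1 | whale))

# Autocorrelation of the normalized residuals at lags 0 to 5
acf_plain <- ACF(fit_plain, maxLag = 5, resType = "normalized")
acf_ar1   <- ACF(fit_ar1,   maxLag = 5, resType = "normalized")
print(cbind(lag = acf_plain$lag, plain = acf_plain$ACF, ar1 = acf_ar1$ACF))
```

Normalized residuals are used because they take the fitted correlation structure into account; if that structure is adequate, their autocorrelation at lags 1 and above should be close to zero.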

The QQ plot and histogram show clear departures from normality. A GLMM was considered, but given that the outcome variable is dive depth there is no obvious candidate distribution for a GLMM. One thing to consider is whether the outcome is bounded. Obviously it is bounded below by zero, but presumably it is also bounded above by the ocean floor! With positive, continuous data bounded below by zero, a gamma GLMM could be considered, but based on the plots it seems appropriate to try to improve the fit of the LMM first. Also, the upper bound set by the sea depth will be variable and possibly unknown, which is another reason to keep it simple.

Log-transforming the response has improved the QQ plot and histogram of residuals considerably and these can be considered as close to normal.
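As a rough illustration of this step, here is a minimal sketch that fits the LMM on the log scale and re-checks the residuals. It assumes simulated, right-skewed depths (the data frame `sim` and every parameter value are hypothetical, not the whale data) and swaps in `corAR1` for the `corARMA` structure from the question to keep the example small:

```r
library(nlme)

set.seed(1)
n_whales <- 6
n_dives  <- 40
n        <- n_whales * n_dives

# Hypothetical stand-in data: NOT the whale data from the question
sim <- data.frame(
  whale     = factor(rep(seq_len(n_whales), each = n_dives)),
  dive_type = factor(sample(c("F", "NF"), n, replace = TRUE)),
  diel      = factor(sample(c("Day", "Twilight", "Night"), n, replace = TRUE))
)
whale_eff <- rnorm(n_whales, sd = 0.2)

# Log-normal depths, so the raw response is right-skewed
sim$max_depths <- exp(4 + 0.5 * (sim$dive_type == "F") +
                        whale_eff[as.integer(sim$whale)] +
                        rnorm(n, sd = 0.4))

# Same model structure as in the question, but with a log-transformed response
mod_log <- lme(log(max_depths) ~ dive_type * diel, random = ~ 1 | whale,
               data = sim,
               correlation = corAR1(form = ~ 1 | whale),
               weights = varIdent(form = ~ 1 | diel))

# Normalized residuals account for the correlation and variance structures
res <- resid(mod_log, type = "normalized")
qqnorm(res); qqline(res)
hist(res, main = "Normalized residuals, log-scale model")
```

Note that coefficients from such a model are on the log scale, so exponentiated fixed effects are read as multiplicative changes in depth rather than differences in metres.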

The plot of residuals vs fitted values appears a little strange, mostly because there is a small gap in the middle. There are very few fitted values between approximately 4.85 and 5.1 (presumably on the log scale), so the model is predicting very few dive depths in this range. Could there be a physical or biological reason for this? Another (possibly related) explanation is an omitted binary or categorical variable (something related to dive depth, like shallow/deep, but probably not that simple or obvious). In any case, I don't consider this to be much of a problem.
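One way to look into such a gap is to tabulate the range of fitted values within each combination of the categorical predictors and see which groups sit on either side of it. The sketch below assumes simulated data (the data frame `sim`, the column `log_depth` and the effect sizes are hypothetical) and only illustrates the diagnostic; it makes no claim about what causes the gap in the real data:

```r
library(nlme)

set.seed(2)

# Hypothetical balanced design: 30 dives per whale, dive type and diel period
sim <- expand.grid(rep = 1:30, whale = factor(1:5),
                   dive_type = c("F", "NF"),
                   diel = c("Day", "Twilight", "Night"))
whale_eff <- rnorm(5, sd = 0.1)
sim$log_depth <- 4.5 + 0.6 * (sim$dive_type == "F") +
  whale_eff[as.integer(sim$whale)] + rnorm(nrow(sim), sd = 0.4)

fit <- lme(log_depth ~ dive_type * diel, random = ~ 1 | whale, data = sim)
sim$fitted <- fitted(fit)

# Minimum and maximum fitted value for each dive_type x diel cell
rng <- aggregate(fitted ~ dive_type + diel, data = sim, FUN = range)
print(rng)
```

If the ranges of two sets of cells do not overlap, the gap simply separates those groups in the fitted values.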
