Solved – Linear mixed effects models: what to do when the residual QQ-plot looks non-normal

data transformationhistogrammixed modelregressionresiduals

I have four linear mixed effect models of similar structure:

model1 <- lmer(index1 ~ biophony + anthrophony + (1|Site), data=df, REML=F) 
model2 <- lmer(index2 ~ biophony + anthrophony + (1|Site), data=df, REML=F) 
model3 <- lmer(index3 ~ biophony + anthrophony + (1|Site), data=df, REML=F) 
model4 <- lmer(index4 ~ biophony + anthrophony + (1|Site), data=df, REML=F) 

These models are testing the relationship between biophony (sounds generated by biodiversity) and anthrophony (sounds generated by humans) with four different indices for bioacoustic diversity, with Site as a random effect.

Index 1 is a sum of positive values. Index 2 is a sum of proportions. Index 3 is the area under the curve on a plot of frequency (Hz) against decibels (dB). Index 4 is a ratio of power in two frequency bins. Therefore indices 1-3 can only be positive. Index 4 is bounded by -1 to +1.

Using the following code to generate normal QQ-plots of the residuals:

qqnorm(residuals(model1))
abline(0,1)

I have found that the residuals of the models look very non-normal (see plots below). The first plot appears to be heavy-tailed. The rest I do not know what distribution the plots indicate (I can't find any examples online of similar plots).

Normal Q-Q Plots of Models 1-4

I have already tried log transforming the indices data but this hasn't improved the distribution of the data. I have tried using the boxcox function to find an appropriate power transformation as described here, but again this did not improve the distribution of the residuals.

My questions are:

  1. Should I transform the data? If so can you recommend appropriate transformations?
  2. If a transformation isn't appropriate, should I use generalised linear mixed models to analyse these data? Could you recommend what family and link functions would be appropriate for a glmer analysis of data of these distributions?

I have attached histograms of the marginal distributions of the indices data for more information:

Histograms of indices 1-4

Best Answer

It would seem that the distribution of the response conforms to a Poisson distribution. Though I would say that the model itself should be modeled using a log link I am unsure if the lmer function allows for generalization of the residuals. However, I believe the correct extraction of residuals for count data should be 'Pearson'.One other thing to consider is that your abline() function is set through zero with a slope of one, in other words the reference line will never actually follow the true theoretical quantiles. Use the function qqline()in order to rectify that.