Solved – What GLM family and link function to use with similarity index as response variable

generalized linear model, regression, similarities

I need to model a response variable that is unusual (at least to the best of my knowledge): similarity in species composition between pairs of sites, estimated by the Morisita-Horn index.

My response variable is a (continuous) similarity matrix with around 6000 site pairs. (Obs.: I am aware that similarity measures of paired sites are not independent of each other. So, I will need to perform permutation tests to obtain reliable estimates. This is not a problem at the moment.)

The values range from 0 to 0.99. In my dataset they are strongly right-skewed, with around 1500 site pairs having similarity < 0.001 (e.g. site pairs that share only one species). Even after log-transformation, the distribution is far from normal.

As predictors I have four variables: three continuous and one categorical. The continuous variables are: geographic distance between each site pair (dist), climatic similarity (clim), and a measure of community dispersal ability (disp), also calculated for each site pair. My categorical variable (env_type) describes the forest types (F1 and/or F2) of the two sites in each pair, so it has three levels (F1F1, F2F2 or F1F2). Besides the main effects of the above-mentioned predictor variables, I also need to test some interactions among them.

So, the question is: What is the appropriate GLM family, and the corresponding link function, for this similarity index as response variable?

I found only one paper that applied a Gaussian GLM with log link to similarity data [Gomez-Rodrigues & Baselga (2018) https://onlinelibrary.wiley.com/doi/abs/10.1111/ecog.03693], but in their case the similarity matrix seems to be less right-skewed. In textbooks I couldn't find specific advice on similarities as a response variable. I think it is important to highlight that I'm not dealing with proportions directly linked to discrete count data. If I were, I could use a binomial distribution, but that does not seem suitable for similarity indices, because they are intrinsically continuous. Furthermore, my specific dataset is strongly right-skewed, which poses an additional problem.

Best Answer

There is no generalized linear model (GLM) for continuous proportion data on (0,1), but there are two approximate possibilities.

The first possibility would be a quasi-GLM family with a beta distribution type of variance function: $$V(\mu)=\mu^\alpha(1-\mu)^\beta.$$ You would however have to estimate the variance parameters $\alpha$ and $\beta$ or set them to somewhat arbitrary values, like $\alpha=\beta=1$ or $\alpha=\beta=0.5$.

For example, you might use a quasi-binomial GLM family in R (by setting family=quasibinomial()) and that would be equivalent to the above variance function with $\alpha=\beta=1$. The quasi-family does not make any assumptions about whether the response is discrete or continuous. The logit link would be appropriate, and that's the default for the quasi-binomial family.

Note it is very important that you use a quasi-binomial model rather than an ordinary binomial model because the former allows a dispersion parameter to be estimated from the data.
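As a rough illustration of the quasi-binomial suggestion above, here is a minimal sketch in R. The data are simulated stand-ins, not the asker's data: the data frame `site_pairs`, the response column `sim`, the coefficient values and the particular interaction term are all assumptions made for the example. Only the predictor names `dist`, `clim` and `env_type` come from the question.

```r
# Simulated stand-in for the site-pair data; all values are hypothetical.
set.seed(1)
n <- 500
site_pairs <- data.frame(
  dist     = runif(n, 0, 10),
  clim     = runif(n),
  env_type = factor(sample(c("F1F1", "F2F2", "F1F2"), n, replace = TRUE))
)

# Assumed mean similarity on the logit scale: decays with distance
mu <- plogis(-1 - 0.4 * site_pairs$dist + 1.5 * site_pairs$clim)

# Continuous response on the unit interval, drawn from a beta distribution with mean mu
site_pairs$sim <- rbeta(n, shape1 = mu * 5, shape2 = (1 - mu) * 5)

# Quasi-binomial family: variance proportional to mu * (1 - mu); logit link is the default
fit_qb <- glm(sim ~ dist * env_type + clim,
              family = quasibinomial(), data = site_pairs)

summary(fit_qb)$dispersion  # dispersion parameter estimated from the data
coef(fit_qb)                # coefficients on the logit scale
```

Because the asker notes that site pairs are not independent, the standard errors from `summary(fit_qb)` would presumably still need to be replaced by permutation-based inference.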

The second possibility would be to use a gamma GLM with log-link. Although your response variable is constrained to be $\le 1$, instead of unbounded as the gamma distribution would imply, the gamma model will nevertheless work quite well for your data because the majority of your responses are less than 0.1 with the upper bound not coming much into play.
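A minimal sketch of the gamma alternative, again on simulated data (the variable names, coefficients and shape parameter are assumptions for illustration). One practical caveat: R's `Gamma` family requires strictly positive responses, so any similarities that are exactly 0 would have to be handled before fitting.

```r
# Simulated, strongly right-skewed responses with small means; values are hypothetical.
set.seed(2)
n <- 500
dist <- runif(n, 0, 10)

# Assumed mean similarity: exponential decay with distance (at most about 0.135)
mu <- exp(-2 - 0.3 * dist)

# Gamma responses with mean mu (shape 2, rate 2 / mu)
sim <- rgamma(n, shape = 2, rate = 2 / mu)

# The Gamma family does not accept zeros or negative values
stopifnot(all(sim > 0))

# Gamma GLM with log link
fit_gamma <- glm(sim ~ dist, family = Gamma(link = "log"))

coef(fit_gamma)           # coefficients on the log scale
range(fitted(fit_gamma))  # fitted means are positive but not bounded above by 1
```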

Related Solutions

Solved – I log transformed the dependent variable, can I use GLM normal distribution with LOG link function

Can I use GLM normal distribution with LOG link function on a DV that has already been log transformed?

Yes, if the assumptions are satisfied on that scale.

Is the variance homogeneity test sufficient to justify using normal distribution?

Why would equality of variance imply normality?

Is the residual checking procedure correct to justify choosing the link function model?

You should beware of using both histograms and goodness of fit tests to check the suitability of your assumptions:

1) Beware using the histogram for assessing normality. (Also see here)

In short, depending on something as simple as a small change in your choice of binwidth, or even just the location of the bin boundary, it's possible to get quite different impressions of the shape of the data:

[Figure: two histograms of residuals]

That's two histograms of the same data set. Using several different binwidths can be useful in seeing whether the impression is sensitive to that.
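To see this sensitivity for yourself, something like the following sketch can be used. The data are simulated (standing in for residuals), and the two bin widths and the shifted origin are arbitrary choices for illustration.

```r
# Hypothetical "residuals": 60 draws from a standard normal
set.seed(3)
x <- rnorm(60)

# Build two sets of breaks that both cover the full range of the data
lo <- floor(min(x))
hi <- ceiling(max(x))
breaks_wide    <- seq(lo, hi + 1, by = 0.8)         # bin width 0.8
breaks_shifted <- seq(lo - 0.25, hi + 1, by = 0.5)  # bin width 0.5, shifted origin

# Same data, two histograms side by side
op <- par(mfrow = c(1, 2))
h1 <- hist(x, breaks = breaks_wide, main = "Bin width 0.8")
h2 <- hist(x, breaks = breaks_shifted, main = "Bin width 0.5, shifted")
par(op)
```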

2) Beware using goodness of fit tests for concluding that the assumption of normality is reasonable. Formal hypothesis tests don't really answer the right question.

e.g. see the links under item 2. here

About the variance: some papers using similar datasets state that "because distributions had homogeneous variances a GLM with a Gaussian distribution was used". If this is not correct, how can I justify or decide on the distribution?

In normal circumstances, the question isn't 'are my errors (or conditional distributions) normal?' - they won't be, we don't even need to check. A more relevant question is 'how badly does the degree of non-normality that's present impact my inferences?'

I suggest a kernel density estimate or a normal Q-Q plot (a plot of residuals vs normal scores). If the distribution looks reasonably normal, you have little to worry about. In fact, even when it's clearly non-normal it still may not matter very much, depending on what you want to do (normal prediction intervals really will rely on normality, for example, but many other things will tend to work at large sample sizes).
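Both displays are available in base R. A minimal sketch, assuming a simple simulated regression (the model and data are made up for illustration):

```r
# Hypothetical data and model
set.seed(4)
x <- runif(200)
y <- 1 + 2 * x + rnorm(200, sd = 0.5)
fit <- lm(y ~ x)
r <- residuals(fit)

op <- par(mfrow = c(1, 2))
# Kernel density estimate of the residuals
plot(density(r), main = "Kernel density of residuals")
# Normal Q-Q plot: residuals against normal scores, with a reference line
qqnorm(r)
qqline(r)
par(op)
```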

Funnily enough, at large samples, normality becomes generally less and less crucial (apart from PIs as mentioned above), but your ability to reject normality becomes greater and greater.
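A small simulation sketch of that second point, using a mildly heavy-tailed distribution (a t distribution with 10 degrees of freedom, chosen arbitrarily) and the Shapiro-Wilk test. The exact p-values depend on the random draw, but the larger sample will typically give much stronger evidence against normality even though the departure is the same in both cases.

```r
set.seed(5)

# Same mildly non-normal distribution, two sample sizes
small <- rt(50, df = 10)
large <- rt(5000, df = 10)  # 5000 is the largest sample size shapiro.test() accepts

p_small <- shapiro.test(small)$p.value
p_large <- shapiro.test(large)$p.value

c(n50 = p_small, n5000 = p_large)
```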

Edit: the point about equality of variance is that it really can impact your inferences, even at large sample sizes. But you probably shouldn't assess that by hypothesis tests either. Getting the variance assumption wrong is an issue whatever your assumed distribution.
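One graphical alternative to a hypothesis test is a spread-versus-fitted plot. The sketch below uses simulated data in which the error standard deviation grows with the predictor (an assumption made so the pattern is visible); a rising trend in the plot suggests non-constant variance.

```r
# Hypothetical data with spread that increases with x
set.seed(8)
x <- runif(200, 1, 10)
y <- 1 + 0.5 * x + rnorm(200, sd = 0.2 * x)
fit <- lm(y ~ x)

# Square root of absolute standardized residuals against fitted values
spread <- sqrt(abs(rstandard(fit)))
plot(fitted(fit), spread,
     xlab = "Fitted values", ylab = "sqrt(|standardized residuals|)")
lines(lowess(fitted(fit), spread), col = "red")  # smooth trend line
```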

I read that the scaled deviance should be around N-p for a well-fitting model, right?

When you fit a normal model it has a scale parameter, in which case your scaled deviance will be about N-p even if your distribution isn't normal.
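This is easy to check numerically. In the sketch below (simulated data with deliberately skewed errors; all values are made up), the deviance divided by the estimated scale parameter equals the residual degrees of freedom, because the scale estimate that `summary()` reports for a Gaussian GLM is itself the deviance divided by N-p.

```r
set.seed(6)
n <- 300
x <- runif(n)

# Deliberately non-normal errors: centred exponential (right-skewed)
y <- 2 + 3 * x + (rexp(n) - 1)

fit <- glm(y ~ x, family = gaussian())

phi_hat    <- summary(fit)$dispersion  # estimated scale parameter
scaled_dev <- deviance(fit) / phi_hat  # scaled deviance

c(scaled_deviance = scaled_dev, N_minus_p = df.residual(fit))
```

So the scaled deviance being close to N-p says nothing about whether the normal assumption holds.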

In your opinion, is the normal distribution with log link a good choice?

In the continued absence of knowing what you're measuring or what you're using the inference for, I still can't judge whether to suggest another distribution for the GLM, nor how important normality might be to your inferences.

However, if your other assumptions are also reasonable (linearity and equality of variance should at least be checked and potential sources of dependence considered), then in most circumstances I'd be very comfortable doing things like using CIs and performing tests on coefficients or contrasts - there's only a very slight impression of skewness in those residuals, which, even if it's a real effect, should have no substantive impact on those kinds of inference.

In short, you should be fine.

(While another distribution and link function might do a little better in terms of fit, only in restricted circumstances would they be likely to also make more sense.)

Solved – How to get the residuals for a glm with a binary response variable using R

You can use the DHARMa package, which implements the idea of randomized quantile residuals by Dunn and Smyth (1996).

Essentially, the idea is to simulate new data from the fitted model and compare them to the observed data. For details, see https://cran.r-project.org/web/packages/DHARMa/vignettes/DHARMa.html

Here is an example with a missing quadratic effect in the glm, which shows up in the right plot.

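library(DHARMa)

# Simulate binary data whose true model includes a strong quadratic effect
dat = createData(replicates = 1, sampleSize = 300, intercept = -3,
           fixedEffects = 1, quadraticFixedEffects = 20, 
           randomEffectVariance = 0, family = binomial())

# Fit a GLM that omits the quadratic term (deliberately misspecified)
fit = glm(observedResponse ~ Environment1 , data = dat, family = binomial)

# Simulate from the fitted model to get scaled quantile residuals, then plot them
res = simulateResiduals(fit)
plot(res)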
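To make the simulate-and-compare idea concrete, here is a simplified base-R sketch of randomized quantile residuals for a binary response. It is not DHARMa's actual implementation, only an illustration of the principle; the data-generating values and the number of simulations (`nsim = 250`) are assumptions.

```r
set.seed(7)
n <- 300
x <- runif(n, -1, 1)

# Hypothetical binary data with a quadratic effect that the model below omits
y <- rbinom(n, size = 1, prob = plogis(-3 + x + 20 * x^2))
fit <- glm(y ~ x, family = binomial)

# Simulate nsim new responses per observation from the fitted model.
# The matrix is filled column-wise, so row i always uses fitted probability i.
nsim <- 250
sims <- matrix(rbinom(n * nsim, size = 1, prob = rep(fitted(fit), nsim)),
               nrow = n)

# Randomized quantile residual: position of each observed value within its own
# simulated distribution, with ties broken by a uniform random draw
p_less  <- rowMeans(sims < y)
p_equal <- rowMeans(sims == y)
res_manual <- p_less + runif(n) * p_equal

# Under a correctly specified model these should look uniform on (0, 1)
# and show no pattern against the predictor
op <- par(mfrow = c(1, 2))
hist(res_manual, main = "Quantile residuals")
plot(x, res_manual, ylab = "Quantile residual")
par(op)
```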
