Solved – What GLM family and link function to use with similarity index as response variable

Tags: generalized linear model, regression, similarities

I need to model an unusual response variable (at least to the best of my knowledge) that is: similarity (estimated by Morisita-Horn index) in species composition between pairs of sites.

My response variable is a (continuous) similarity matrix with around 6000 site pairs. (Note: I am aware that similarity measures of paired sites are not independent of each other, so I will need to perform permutation tests to obtain reliable estimates. This is not a problem at the moment.)

The values range from 0 to 0.99. In my dataset, the values are strongly right-skewed, with around 1500 site pairs having similarity < 0.001 (e.g. site pairs that share only one species). Even after log transformation, the distribution is far from normal.

[Figure: histogram of untransformed similarities. Note that the first class comprises the very low similarity values.]
[Figure: histogram of log-transformed similarities.]

As predictors I have four variables: three continuous and one categorical. The continuous variables are the geographic distance between each site pair (dist), climatic similarity (clim), and a measure of community dispersal ability (disp), each calculated per site pair. The categorical variable (env_type) describes the forest types (F1 and/or F2) of the two sites in each pair, so it has three levels (F1F1, F2F2 or F1F2). Besides the main effects of these predictors, I also need to test interactions among some of them.

So, the question is: which GLM family and link function are appropriate for this similarity index as a response variable?

I found only one paper that applied a Gaussian GLM with log link to similarity data [Gomez-Rodrigues & Baselga (2018)], but in their case the similarity matrix does not seem to be so strongly right-skewed. In textbooks I couldn't find specific advice on similarities as a response variable. I think it is important to highlight that I'm not dealing with proportions derived from discrete count data. If I were, I could use a binomial distribution, but that does not seem suitable for similarity indices because they are intrinsically continuous. Furthermore, my specific dataset is strongly right-skewed, which poses an additional problem.

Best Answer

There is no generalized linear model (GLM) for continuous proportion data on (0,1), but there are two approximate possibilities.

The first possibility would be a quasi-GLM family with a beta distribution type of variance function: $$V(\mu)=\mu^\alpha(1-\mu)^\beta.$$ You would however have to estimate the variance parameters $\alpha$ and $\beta$ or set them to somewhat arbitrary values, like $\alpha=\beta=1$ or $\alpha=\beta=0.5$.

For example, you might use a quasi-binomial GLM family in R (by setting family=quasibinomial()) and that would be equivalent to the above variance function with $\alpha=\beta=1$. The quasi-family does not make any assumptions about whether the response is discrete or continuous. The logit link would be appropriate, and that's the default for the quasi-binomial family.

Note it is very important that you use a quasi-binomial model rather than an ordinary binomial model because the former allows a dispersion parameter to be estimated from the data.
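As a rough illustration of the quasi-binomial option, here is a minimal R sketch. It runs on simulated stand-in data, not the real site pairs: the data frame `site_pairs`, the response column `sim`, the coefficients used to generate the data, and the particular interaction `dist * env_type` are all assumptions made for the example. The predictor names follow the question.

```r
# Simulated stand-in data (hypothetical); replace with the observed site pairs.
set.seed(1)
n <- 500
site_pairs <- data.frame(
  dist     = runif(n, 0, 10),
  clim     = runif(n),
  disp     = runif(n),
  env_type = factor(sample(c("F1F1", "F2F2", "F1F2"), n, replace = TRUE))
)

# Arbitrary "true" mean similarity on the logit scale, for illustration only.
mu  <- plogis(-1 - 0.3 * site_pairs$dist + 1.5 * site_pairs$clim)
phi <- 5  # arbitrary precision of the simulated beta response
site_pairs$sim <- rbeta(n, mu * phi, (1 - mu) * phi)

# Quasi-binomial GLM: variance proportional to mu * (1 - mu), logit link.
# The interaction term is only an example of how one could be specified.
fit_qb <- glm(sim ~ dist * env_type + clim + disp,
              family = quasibinomial(link = "logit"),
              data   = site_pairs)

# The dispersion is estimated from the data rather than fixed at 1.
summary(fit_qb)$dispersion
coef(fit_qb)
```

Because the site pairs are not independent, the standard errors and p-values printed by `summary()` should not be taken at face value; the permutation tests mentioned in the question would still be needed for inference.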

The second possibility would be to use a gamma GLM with log-link. Although your response variable is constrained to be $\le 1$, instead of unbounded as the gamma distribution would imply, the gamma model will nevertheless work quite well for your data because the majority of your responses are less than 0.1 with the upper bound not coming much into play.
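The gamma option can be sketched in the same way. Again the data are simulated and the names `site_pairs` and `sim`, the generating coefficients and the chosen interaction are assumptions for illustration. One practical point the sketch relies on: R's `Gamma` family requires strictly positive responses, so the simulated values contain no exact zeros. If the real data include similarities of exactly 0, they would need separate handling, which is not addressed here.

```r
# Simulated stand-in data (hypothetical); replace with the observed site pairs.
set.seed(2)
n <- 500
site_pairs <- data.frame(
  dist     = runif(n, 0, 10),
  clim     = runif(n),
  disp     = runif(n),
  env_type = factor(sample(c("F1F1", "F2F2", "F1F2"), n, replace = TRUE))
)

# Arbitrary "true" mean on the log scale; most simulated values fall below 0.1.
mu    <- exp(-2 - 0.3 * site_pairs$dist + 1.0 * site_pairs$clim)
shape <- 2  # arbitrary gamma shape for the simulation
# Strictly positive response, capped at 0.99 to mimic the bounded index.
site_pairs$sim <- pmin(rgamma(n, shape = shape, rate = shape / mu), 0.99)

# Gamma GLM with log link: variance proportional to mu^2.
fit_gamma <- glm(sim ~ dist * env_type + clim + disp,
                 family = Gamma(link = "log"),
                 data   = site_pairs)

# With a log link, exponentiated coefficients act multiplicatively on the mean.
exp(coef(fit_gamma))
```

With the log link, fitted means are always positive but are not forced to stay below 1, which is why this option relies on most responses being well away from the upper bound.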
