The most appropriate transformation method for performing analyses on species composition data

Tags: biostatistics, data transformation, diversity

I would like to compare differences in fish diets between sampling sites using a Bray-Curtis dissimilarity matrix and non-metric multidimensional scaling techniques. My raw data consists of counts of items in each taxonomic category in each stomach, with 10 stomachs per site and 12 sites. Here's an example of my data:

[Image: example data table]

I am using the vegan package in R to conduct my analyses. Similar studies in the literature have used arcsine transformation on proportion data using the following type of formula:

mutate(arcsin = ((2/pi)*asin(sqrt(relabund)))) 

where 'relabund' is the relative abundance of each taxon in each stomach, expressed as a proportion between 0 and 1 (asin(sqrt()) is undefined for percentages above 1).
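As a minimal sketch of what this formula does, the base R snippet below applies it to a few illustrative proportions. It assumes 'relabund' has been rescaled to the 0 to 1 range; the function name `arcsine_sqrt` and the example values are hypothetical, not taken from any of the cited studies.

```r
# Arcsine square-root transform, scaled by 2/pi so the result lies in [0, 1].
# Assumes `relabund` holds proportions in [0, 1], not percentages.
arcsine_sqrt <- function(relabund) {
  stopifnot(all(relabund >= 0 & relabund <= 1))
  (2 / pi) * asin(sqrt(relabund))
}

# Hypothetical proportions for one stomach (illustrative values only).
p <- c(0, 0.01, 0.25, 0.5, 0.75, 0.99, 1)
print(round(arcsine_sqrt(p), 3))

# Percentages have to be divided by 100 before transforming.
pct <- c(0, 25, 100)
print(round(arcsine_sqrt(pct / 100), 3))
```

The transform maps 0 to 0, 0.5 to 0.5 and 1 to 1, and stretches values near the two ends of the range relative to the middle.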

Others who have used this method reference Zar 1999 (the 4th edition of his book on Biostatistical Analysis). However, in his chapter on data transformations he states that "…it is recommended that data transformation is not warranted for analysis of variance with binomial data unless the largest sample size is more than five times greater than the smallest and the smaller variances are associated with the smaller samples." My sample sizes are equal, so I am not sure how to interpret this. He also goes on to say that "This transformation is not as good at the extreme ends of the range of possible values (i.e. near 0 and 100%)." In community composition data, most values are close to 0% or 100%, so I'm wondering whether this is indeed the best transformation method.

I would love some guidance around using these techniques to make sure I don't make any incorrect assumptions!

I suppose my main questions to the community are:

  1. When comparing species composition data, is it best to use the raw counts or proportions?

  2. When using Bray-Curtis or doing NMDS, do you need to transform your data? I know the metaMDS function has an argument 'autotransform = TRUE/FALSE', which applies Wisconsin double standardization or square root transformation for larger values. These choices seem somewhat arbitrary, as I would think the method of transformation depends on the type of data being used. This is why I was considering using 'autotransform = FALSE' and doing my own transformation.

  3. If you do transform, what type of transformation makes the most sense for species composition data? As you can see there are a lot of zeros, so I would like to somehow transform the data in a way that spreads out the data a bit more, rather than having lots of data close to 0% or close to 100% (if working with proportions).
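To make the counts versus proportions choice in question 1 concrete, here is a small base R sketch. The matrix `counts` is made up for illustration (rows are stomachs, columns are prey taxa) and is not the data from the question.

```r
# Hypothetical counts: rows are stomachs, columns are prey taxa.
counts <- matrix(
  c(10, 0,  5,
    20, 0, 10,
     0, 3,  1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("stomach1", "stomach2", "stomach3"),
                  c("taxonA", "taxonB", "taxonC"))
)

# Divide each row by its total so that every stomach sums to 1.
# Note: an empty stomach (row total 0) would give NaN and needs handling.
props <- prop.table(counts, margin = 1)
print(round(props, 3))
```

Here stomach1 and stomach2 differ in raw counts (stomach2 holds twice as many items) but become identical as proportions, so the choice decides whether differences in stomach fullness count as differences in diet. In vegan, `decostand(counts, "total")` should give the same row standardization.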

Any insight you might have would be greatly appreciated!

Best Answer

I have been searching for answers to this question too. I came across this very useful discussion from years ago:

[ORDNEWS:1593] log, sqrt and other transformation with Bray-Curtis dissimilarity

The purpose of the sqrt transformation seems to be to reduce the relative influence of the most abundant species, which otherwise tend to dominate the dissimilarity matrix and are also often quite variable in number (according to the discussion). Furthermore, we may be somewhat more interested in the rarer species. An even stronger downweighting can be achieved using log(1 + x).
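A quick way to see the downweighting is to compare how much of the total each taxon contributes before and after transforming. This is only a sketch in base R with made-up counts; `share` is a hypothetical helper, not a vegan function.

```r
# Hypothetical counts for one stomach: one dominant taxon and a few rarer ones.
x <- c(dominant = 400, common = 25, rare = 4, absent = 0)

# Fraction of the (transformed) total contributed by each taxon.
share <- function(v) round(v / sum(v), 3)

print(share(x))         # raw counts
print(share(sqrt(x)))   # square root
print(share(log1p(x)))  # log(1 + x), which is defined at zero
```

The dominant taxon's share shrinks under sqrt and shrinks further under log(1 + x), while zeros stay at zero under both.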

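The Wisconsin double standardization first divides each species by its maximum across sites and then divides each site by its total. This removes the effect of absolute species abundance and of differences in total abundance between sites, so everything becomes relative.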
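The two steps can be written out in base R as follows. This is a sketch of the idea rather than vegan's own code, and the matrix `counts` is hypothetical; to my understanding `vegan::wisconsin(counts)` should give the same result for data without all-zero rows or columns.

```r
# Hypothetical community matrix: rows are sites, columns are species.
counts <- matrix(
  c(100, 2, 0,
     50, 4, 1,
      0, 8, 3),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("site1", "site2", "site3"),
                  c("sp1", "sp2", "sp3"))
)

# Step 1: divide each species (column) by its maximum across sites.
by_max <- sweep(counts, 2, apply(counts, 2, max), "/")

# Step 2: divide each site (row) by its total, so every row sums to 1.
wis <- sweep(by_max, 1, rowSums(by_max), "/")
print(round(wis, 3))
```

After step 1 every species peaks at 1, so an abundant species no longer outweighs a scarce one; after step 2 sites with many and few individuals are on the same scale.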

The Bray-Curtis measure is reported to outperform other measures in many cases. It only counts species that are present in at least one of the two sites being compared, which means that double zeros are (correctly) ignored.
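The double-zero property is easy to check by hand. The sketch below uses the usual Bray-Curtis formula, sum(|a - b|) / sum(a + b), on made-up counts; the simple matching coefficient is my own addition for contrast and is not mentioned in the discussion.

```r
# Bray-Curtis dissimilarity between two abundance vectors.
bray_curtis <- function(a, b) sum(abs(a - b)) / sum(a + b)

# Simple matching similarity on presence/absence (counts joint absences as agreement).
simple_matching <- function(a, b) mean((a > 0) == (b > 0))

# Hypothetical counts for two sites over four taxa.
site1 <- c(5, 3, 0, 0)
site2 <- c(0, 6, 2, 0)

# Pad both sites with three taxa that occur in neither.
site1_pad <- c(site1, 0, 0, 0)
site2_pad <- c(site2, 0, 0, 0)

print(c(bray = bray_curtis(site1, site2),
        bray_padded = bray_curtis(site1_pad, site2_pad)))
print(c(matching = simple_matching(site1, site2),
        matching_padded = simple_matching(site1_pad, site2_pad)))
```

Bray-Curtis is unchanged by the extra jointly absent taxa, whereas simple matching makes the two sites look more similar just because more taxa are missing from both.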

I think the default scaling of metaMDS is likely to be well founded; I just wish it were a bit more transparent.
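For what it's worth, my reading of the metaMDS help page is that the heuristic depends only on the largest value in the data: a maximum above 50 triggers sqrt, and a maximum above 9 triggers the Wisconsin double standardization, with the limits described as arbitrary. The base R sketch below mimics that rule under those assumed thresholds. `auto_transform_sketch` is a hypothetical name, not a vegan function, and the thresholds should be checked against the installed version.

```r
# Sketch of the autotransform heuristic (assumed thresholds: 50 and 9).
# Assumes a non-negative matrix with no all-zero rows or columns.
auto_transform_sketch <- function(comm) {
  xam <- max(comm)                      # judged on the raw data maximum
  if (xam > 50) comm <- sqrt(comm)      # very large values: square root first
  if (xam > 9) {                        # larger than abundance-class scales
    comm <- sweep(comm, 2, apply(comm, 2, max), "/")  # species by maxima
    comm <- sweep(comm, 1, rowSums(comm), "/")        # sites by totals
  }
  comm
}

# Hypothetical matrices: one on a small class scale, one with large counts.
small <- matrix(c(1, 2, 3, 4), nrow = 2)
big   <- matrix(c(100, 0, 4, 16), nrow = 2)

print(auto_transform_sketch(small))  # returned unchanged
print(auto_transform_sketch(big))    # sqrt, then Wisconsin
```

To take control of the transformation instead, one option is to transform first and switch the heuristic off, for example `metaMDS(sqrt(counts), distance = "bray", autotransform = FALSE)`.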
