Double zeroes problem with euclidean distance and abundance data – is the problem widely varying abundances or mutually missing taxa?

clustering, data visualization, pca

Background:

I have data from a study of trailside and forest interior transects, where half my transects are on-trail and half in the forest.

The data is in the form of raw species abundances, where columns represent species and rows represent individual transects.

My goal is to compare the transects which are on-trail with those off-trail by doing a PCA analysis to create an ordination plot, which would use euclidean distance. I want to see if on and off trail transects cluster together on the plots. I'm interested in both whether or not transects have different species present, and also what the total abundance of all species is for each transect – they both seem like 'real' and 'valid' differences.

Because my abundances vary wildly, from over 50 individuals at one transect to 9 or so at another, and because many species turn up at only one or two sites, so that many pairs of sites mutually lack a number of species, I expect the 'double-zero' problem outlined below to affect my plot. From my reading of these two sources, it seems that the varied abundance is the source of the problem, and not the mere fact that there are mutually absent species. I'm not sure, and would like clarification on this.

I'm considering using relative abundances to do my PCA instead of raw abundances. In that case, would the outcome of my ordination plot reflect differences in species composition? Since I did t-tests on the abundance data to see if it was different on and off trail, I already have a measure of whether abundance is different. I would be happy using the ordination plots to visualize species composition, sans abundance difference, if that is what such a plot would show.

I am just learning about distance measures and PCA in general. These sources are my first introduction to both ordination methods and the double-zeroes problem.


Question:

"The double-zero problem shows up in ecology because of the special nature of
species descriptors… If a species is present at two sites, this
is an indication of the similarity of these sites; but if a species is absent from two sites,
it may be because the two sites are both above the optimal niche value for that species,
or both are below, or else one site is above and the other is below that value. One
cannot tell which of these circumstances is the correct one.
It is thus preferable to abstain from drawing any ecological conclusion from the
absence of a species at two sites. In numerical terms, this means to skip double zeros
altogether when computing similarity or distance coefficients using species presence-
absence or abundance data. On the other hand, the presence of
a species at one of two sites and its absence at the other are considered as a difference
between these sites."
http://www.ievbras.ru/ecostat/Kiril/R/Biblio/Statistic/Legendre%20P.,%20Legendre%20L.%20Numerical%20ecology.pdf; Section 7.3

  • Questions:
    • Is it true that presence, whatever the recorded abundance, shows similarity while absence does not? A low abundance of, say, 1
      individual poses the same ambiguity in interpretation as an absolute
      zero, and there's no cutoff above which an abundance is high enough to
      mean something. Every abundance score is on a continuum, so why not treat every score, including 0, equally?
    • The absence of species at two sites is a similarity, in that if you are interested just in seeing what species are present or absent
      and in what abundances, two sites will appear similar and be similar
      if they are both missing something. That the similarity in actual
      presence/absence might be driven by different phenomena wouldn't
      negate its presence.

      • But then you might be interested in the phenomena
        more than in just the raw facts about presence/absence at a site.

"In
symmetrical coefficients, the state zero for two objects is treated in exactly the
same way as any other pair of values, when computing a similarity. These coefficients
can be used in cases where the state zero is a valid basis for comparing two objects
and represents the same kind of information as any other value. This obviously excludes
the special case where zero means “lack of information”. For example, finding that two
lakes in winter have 0 mg L⁻¹ of dissolved oxygen in the hypolimnion provides
valuable information concerning their physical and chemical similarity and their
capacity to support species." – I think this is actually the case when comparing sites based on species abundances and presence/absence.

"Including double-zeros in the comparison between sites
would result in high values of similarity for the many pairs of sites holding only a few
species; this would not reflect the situation adequately."

  • If by 'reflect the situation' one means accounting for the large abundance differences as well as the species composition difference, then doesn't including double zeroes actually reflect the situation?

Two sites with low abundances and no species overlap would be closer than one of those plus a third site with some species overlap but wickedly high abundances. But instead of being due inherently to a preponderance of zeroes, isn't it due more to the wildly varying abundances – and so having sites which mutually lack a species or many isn't inherently problematic?

I got this interpretation from the following excerpt, but the book I quoted above seems to say that just having sites mutually lack a species is a problem, is THE problem, and takes steps to correct this.

Excerpt:

"So what’s the deal with Euclidean distance and double zeroes? Obviously the zeroes cancel, just as in other metrics. The issue comes up when you use Euclidean distances on raw abundances and attempt to make inference about species composition, which leads to the so-called paradox of Euclidean distances. Let’s take the example matrix:

\begin{bmatrix} 0 & 4 & 8 \\ 0 & 1 & 1 \\ 1 & 0 & 0 \end{bmatrix}

Sites 1 and 2 share two species in common, while Site 3 is all by its one-sies. If you calculate the Euclidean distances between these sites, you get:

\begin{bmatrix} 0 & 7.62 & 9 \\ 7.62 & 0 & 1.73 \\ 9 & 1.73 & 0 \end{bmatrix}

Sites 2 and 3 are more similar than Sites 1 and 2, even though Site 3 shares no species in common! Let’s try it on the chord distances. Doing that, we get:

\begin{bmatrix} 0 & 0.32 & 1.41 \\ 0.32 & 0 & 1.41 \\ 1.41 & 1.41 & 0 \end{bmatrix}

That’s better. Now Site 3 is equally distant from both Sites 1 and 2 since it shares no species in common with either of them. So what the hell? This is why it’s termed a paradox. Here’s a hint: the answer isn’t that Euclidean distance counts double zeroes while Chord does not, as shown above. Especially since Chord is Euclidean, it uses the exact same equation.

The answer is actually much simpler, and non-mathy. Euclidean distances on raw abundance values place a premium on differences in the number of individuals, not species. So it’s actually getting it right. Sites 2 and 3 have 2 and 1 individuals total, respectively. When you take the difference, you’re basically counting up the number of individuals the sites do not share. In that case, it happens to be that Sites 2 and 3 only have three individuals that differ between them. Sites 1 and 3 have 13 individuals that differ between them, and Sites 1 and 2 have 10 individuals that differ between them. So by this math, Sites 2 and 3 actually should be really similar."
https://climateecology.wordpress.com/2014/12/11/an-intuitive-explanation-for-the-double-zeroes-problem-with-euclidean-distances/
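
The numbers in the excerpt can be reproduced with a few lines of code. This is only a minimal sketch in plain Python (the helper names `euclidean`, `chord_transform` and `distance_matrix` are mine, not from either source); it treats rows as sites and columns as species, and computes the chord distance as the Euclidean distance after scaling each site vector to unit length.

```python
import math

# Rows are sites, columns are species (the example matrix from the excerpt).
sites = [
    [0, 4, 8],  # Site 1
    [0, 1, 1],  # Site 2
    [1, 0, 0],  # Site 3
]

def euclidean(a, b):
    """Plain Euclidean distance between two abundance vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def chord_transform(row):
    """Scale a site vector to unit length (assumes the site is not all-zero)."""
    norm = math.sqrt(sum(x * x for x in row))
    return [x / norm for x in row]

def distance_matrix(rows):
    """All pairwise Euclidean distances, rounded to two decimals."""
    return [[round(euclidean(a, b), 2) for b in rows] for a in rows]

raw_distances = distance_matrix(sites)
chord_distances = distance_matrix([chord_transform(r) for r in sites])

for name, matrix in (("raw", raw_distances), ("chord", chord_distances)):
    print(name)
    for row in matrix:
        print("  ", row)
```

The same `euclidean` function is used in both cases; the only thing that changes is that the chord transform removes each site's total size before the distance is taken, so what is left is the relative composition.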

Granted, the double absences might usually go along with variable abundances but is it the zeroes that are the problem, really?

Best Answer

In a nutshell: the problem is having samples that are all-zero. Widely varying abundances and missing taxa are not so much of a problem. Read on for how I tackled this sort of analysis recently.

First I would highly recommend reading this source:

Clarke, K. R., & Warwick, R. M. (2001). Change in Marine Communities: An Approach to Statistical Analysis and Interpretation (2nd ed.). Plymouth: PRIMER-E. (PDF)

They outline a step-by-step method for analysing the sort of data you have collected (don't worry about the marine examples, it's exactly the same with terrestrial species). Their recommendations:

  • Use MDS over PCA (for 2 reasons outlined in that paper).
  • Use Bray-Curtis for calculating (dis)similarity matrices rather than euclidean distance. They advocate euclidean distance only for constructing matrices with environmental variables, not species abundances (for reasons in that paper).
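
To see how Bray-Curtis behaves on the three-site example quoted in the question, here is a small sketch in plain Python. It is my own illustration, not code from PRIMER-e or from Clarke & Warwick, and the function name `bray_curtis` is just a label I chose. It uses the usual definition of the dissimilarity, the sum of absolute differences divided by the sum of all abundances in the two samples.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity: sum|a_i - b_i| / sum(a_i + b_i).

    Raises ZeroDivisionError when both samples are entirely empty,
    because the coefficient is undefined (0/0) in that case.
    """
    numerator = sum(abs(x - y) for x, y in zip(a, b))
    denominator = sum(x + y for x, y in zip(a, b))
    return numerator / denominator

site1, site2, site3 = [0, 4, 8], [0, 1, 1], [1, 0, 0]

d12 = bray_curtis(site1, site2)  # the two sites share two species
d13 = bray_curtis(site1, site3)  # no species in common
d23 = bray_curtis(site2, site3)  # no species in common

# Appending a species that is absent from both sites (a double zero)
# adds nothing to either sum, so the dissimilarity is unchanged.
d12_padded = bray_curtis(site1 + [0], site2 + [0])

print(round(d12, 3), d13, d23, d12_padded == d12)
```

With this coefficient, two sites with no species in common always come out at the maximum dissimilarity of 1, whatever their total abundances, whereas Sites 1 and 2 come out at 10/14.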

I have been following this process for analysing my own data, which is similar to yours (I struggled at first). I've been using PRIMER-e software but it's possible to do this in R. If you have access to PRIMER-e and the user guide / tutorial then I would highly recommend you read it.

  1. Organise data as species-as-rows, samples-as-columns. Depending on your data and the questions you are asking it may be appropriate to sum replicates together, or to keep them separate, or both.

  2. Standardise your data. If your sampling effort is uneven you will need to standardise your data - this is outlined in the above source. I didn't need to do this because I used the same number of traps, in the same places, left out for the same amount of time during my project.

  3. Transform your species abundance data. Typically, raw abundances are transformed prior to analysis. Usually you will use square root, fourth-root, log(X+1), or presence-absence (square root being least extreme, P/A being most). I would start with square root.

  4. Normalise any environmental data. Normally, for each variable you want to subtract the mean and divide by the standard deviation for that variable. This is so that different variables with wildly different data ranges can be compared on the same scale. For example temperatures of 10-20 degrees can't be compared to numbers of saplings in the hundreds.

  5. Construct species abundance dissimilarity matrices with Bray-Curtis. If your data contains samples that are all-zero you will run into the double zero problem, because Bray-Curtis between two all-zero samples is undefined (0/0). This can be overcome by using a zero-adjusted Bray-Curtis coefficient, which adds what is sometimes referred to as a 'dummy variable' (a dummy species given the same value in every sample) and so damps down the similarity fluctuations between empty or near-empty samples. They explain the zero-adjusted Bray-Curtis method in a paper here.

  6. Construct an environmental dissimilarity matrix using euclidean distance (if you have environmental data like climate, vegetation measurements etc).

  7. Carry out agglomerative, hierarchical clustering (Chapter 3) to determine any natural groupings.

  8. Carry out MDS (Chapter 5). For your purposes, you would add 'trailside' and 'interior' as factors, and then plot the MDS with those factors as different symbols.

  9. Interpret the MDS plots. They suggest overlaying the cluster analysis at different similarity levels to identify clusters. You can also turn the MDS into a 'bubble plot' by overlaying different environmental variables which you can add as factors.

  10. From here you can do things like find out the best combination of environmental variables that explain the abundances (BEST function in PRIMER-e) or compare two different matrices (for example 'species abundance' and 'vegetation presence-absence' - RELATE in PRIMER-e).
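
Steps 3 to 5 above can be sketched in a few lines of plain Python. This is a rough illustration under my own assumptions, not what PRIMER-e does internally: the function names are mine, the transects are invented, and I am assuming the dummy species has value 1 and is added after the transformation. Adding the same dummy value to both samples leaves the numerator of Bray-Curtis unchanged and adds twice that value to the denominator.

```python
import math
import statistics

def sqrt_transform(sample):
    """Step 3: square-root transform to damp down dominant species."""
    return [math.sqrt(x) for x in sample]

def zscore(values):
    """Step 4: subtract the mean, divide by the sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def bray_curtis(a, b, dummy=0.0):
    """Step 5: Bray-Curtis dissimilarity, optionally zero-adjusted.

    With dummy > 0, a dummy species of that value is treated as present
    in both samples, which adds 2 * dummy to the denominator only.
    Returns None when the coefficient is undefined (0/0).
    """
    numerator = sum(abs(x - y) for x, y in zip(a, b))
    denominator = sum(x + y for x, y in zip(a, b)) + 2 * dummy
    if denominator == 0:
        return None  # both samples empty and no dummy species
    return numerator / denominator

# Hypothetical transects (species counts), invented purely for illustration.
empty_a = [0, 0, 0]
empty_b = [0, 0, 0]
sparse = [1, 0, 0]
rich = [0, 16, 9]

plain_empty = bray_curtis(empty_a, empty_b)                # undefined
adjusted_empty = bray_curtis(empty_a, empty_b, dummy=1.0)  # now defined
adjusted_sparse = bray_curtis(
    sqrt_transform(empty_a), sqrt_transform(sparse), dummy=1.0
)
adjusted_rich = bray_curtis(
    sqrt_transform(sparse), sqrt_transform(rich), dummy=1.0
)

# Hypothetical temperature readings for the environmental matrix.
temperatures = zscore([10.0, 15.0, 20.0])

print(plain_empty, adjusted_empty, adjusted_sparse, adjusted_rich, temperatures)
```

In this sketch two empty samples become identical (dissimilarity 0) rather than undefined, and an empty sample is moderately close to a near-empty one, while samples with plenty of individuals are barely affected by the dummy species.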

Good luck! It can be difficult to understand the appropriate method from the outset. I would recommend seeing someone with a statistical background (or even better, ecology and stats). They will most likely be able to point you in the right direction and answer the sorts of questions you are likely to have.
