Solved – Is it possible to use Hellinger distance for environmental variables

Tags: classification, clustering, distance-functions, euclidean

Euclidean distance is not recommended for datasets with many zeros (such as species-by-site matrices), because of the risk of the abundance paradox (Orloci, 1978). Environmental distance (e.g., based on temperature and precipitation variables), on the other hand, is widely calculated with the Euclidean distance. The problem is that the two kinds of distance are then not easily comparable. Is it correct to use the Hellinger distance on environmental variables (normally distributed)?

Best Answer

It is pretty easy to compare two dissimilarity matrices (assuming that is what you mean by "compare").

For example, you could ordinate the two dissimilarity matrices separately and compare the ordinations with a Procrustes rotation. Alternatively, co-inertia analysis extracts axes that maximise the covariance between the two data sets (cf. PCA, which extracts axes of maximal variance in a single data set), subject to the axes being orthogonal. Co-inertia is based on Euclidean distances, so you could apply the Hellinger transformation to the species data and leave the environmental data untransformed, or you might transform some of the environmental variables with, say, a log transformation.
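As an illustration of the Hellinger transformation mentioned above, here is a minimal Python sketch (standard library only). The helper names and the toy abundance table are hypothetical, made up for this example; it is not the implementation used by any ordination package. It shows why Euclidean distance on raw abundances can be misleading, and that Euclidean distance on Hellinger-transformed rows behaves more sensibly.

```python
import math

def hellinger_transform(abundances):
    """Row-wise Hellinger transformation: sqrt of each site's relative abundances."""
    transformed = []
    for row in abundances:
        total = sum(row)
        if total == 0:
            # An empty site has no relative abundances; keep it as all zeros.
            transformed.append([0.0 for _ in row])
        else:
            transformed.append([math.sqrt(value / total) for value in row])
    return transformed

def distance_matrix(rows):
    """Euclidean distances between all pairs of rows (sites)."""
    return [[math.dist(a, b) for b in rows] for a in rows]

# Hypothetical toy data: 3 sites (rows) x 3 species (columns).
# Sites 0 and 1 share two species; site 2 shares no species with either.
species = [
    [0, 4, 8],
    [0, 1, 1],
    [1, 0, 0],
]

raw_dist = distance_matrix(species)
hellinger_dist = distance_matrix(hellinger_transform(species))

# On raw counts, sites 1 and 2 (no shared species) come out closer than
# sites 0 and 1 (two shared species): the abundance paradox.
print(raw_dist[1][2] < raw_dist[0][1])
# After the Hellinger transformation the ordering is reversed.
print(hellinger_dist[0][1] < hellinger_dist[1][2])
```

The environmental table would be passed to `distance_matrix` directly (or after a log transformation of selected columns), as suggested in the answer.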

Mantel's (partial) test can also be used to assess the association between two or more dissimilarity matrices.
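To make the idea concrete, the following is a rough Python sketch of a simple (not partial) Mantel test: it correlates the upper triangles of two distance matrices and builds a null distribution by jointly permuting the rows and columns of one of them. The function names, the default of 999 permutations, and the toy site data are assumptions made for this illustration; in practice you would use an established implementation.

```python
import math
import random

def pairwise_distances(points):
    """Euclidean distance matrix between the rows of `points`."""
    return [[math.dist(p, q) for q in points] for p in points]

def upper_triangle(matrix, order):
    """Flatten the upper triangle after relabelling sites according to `order`."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def mantel_test(dist_a, dist_b, permutations=999, seed=0):
    """Return (Mantel r, one-sided permutation p-value) for two distance matrices."""
    rng = random.Random(seed)
    identity = list(range(len(dist_a)))
    flat_b = upper_triangle(dist_b, identity)
    observed = pearson(upper_triangle(dist_a, identity), flat_b)
    hits = 1  # the observed arrangement counts as one permutation
    for _ in range(permutations):
        order = identity[:]
        rng.shuffle(order)  # permute the site labels of the first matrix
        permuted = pearson(upper_triangle(dist_a, order), flat_b)
        if permuted >= observed - 1e-12:  # small tolerance for rounding
            hits += 1
    return observed, hits / (permutations + 1)

# Hypothetical toy data for six sites (values invented for illustration).
species_scores = [[0.1, 0.9], [0.3, 0.7], [0.8, 0.2], [0.5, 0.5], [0.2, 0.8], [0.9, 0.1]]
env_values = [[12.0, 8.0], [14.5, 6.5], [9.0, 12.0], [11.0, 9.5], [15.0, 6.0], [10.0, 11.0]]

species_dist = pairwise_distances(species_scores)
env_dist = pairwise_distances(env_values)
r_obs, p_value = mantel_test(species_dist, env_dist)
print(r_obs, p_value)
```

Permuting site labels, rather than individual distances, keeps each permuted matrix a valid distance matrix, which is the point of the test.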

Related Solutions

Solved – How to compute a measure of distance between sites with continuous variables

I tried this and got different results (as expected) from the distances of a data frame and its transpose:

library(ade4)
x1 <- rnorm(10, 2, 1)
x2 <- rnorm(10, 1, 1)
dframe <- cbind(x1, x2)
dist1 <- dist.quant(dframe, 1, diag = TRUE, upper = TRUE)
dist1
dist2 <- dist.quant(t(dframe), 1, diag = TRUE, upper = TRUE)
dist2

dist2 contains a single distance (between x1 and x2), whereas dist1 is a $10\times10$ matrix of distances between the rows (printed in full because I set diag = TRUE and upper = TRUE).

Solved – Why does k-means clustering algorithm use only Euclidean distance metric

The K-Means procedure, which is a vector quantization method often used as a clustering method, does not explicitly use pairwise distances between data points at all (in contrast to hierarchical and some other clustering methods, which allow an arbitrary proximity measure). It amounts to repeatedly assigning points to the closest centroid, thereby using the Euclidean distance from data points to a centroid. However, K-Means is implicitly based on pairwise Euclidean distances between data points, because within a cluster the sum of squared deviations from the centroid is equal to the sum of pairwise squared Euclidean distances divided by the number of points. The term "centroid" is itself from Euclidean geometry: it is the multivariate mean in Euclidean space, and Euclidean space is about Euclidean distances. Non-Euclidean distances will generally not span a Euclidean space. That's why K-Means is for Euclidean distances only.
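The identity invoked above can be written out as follows. The notation is an assumption introduced here: $x_1,\dots,x_n$ are the points of one cluster and $\bar{x}$ is their centroid.

$$
\sum_{i=1}^{n}\sum_{j=1}^{n}\lVert x_i-x_j\rVert^2
=\sum_{i,j}\Bigl(\lVert x_i-\bar{x}\rVert^2+\lVert x_j-\bar{x}\rVert^2-2\,(x_i-\bar{x})^{\top}(x_j-\bar{x})\Bigr)
=2n\sum_{i=1}^{n}\lVert x_i-\bar{x}\rVert^2
$$

The cross term vanishes because $\sum_i (x_i-\bar{x})=0$. Counting each unordered pair once therefore gives

$$
\sum_{i=1}^{n}\lVert x_i-\bar{x}\rVert^2=\frac{1}{n}\sum_{i<j}\lVert x_i-x_j\rVert^2 .
$$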

But a Euclidean distance between two data points can be represented in a number of alternative ways. For example, it is closely tied to the cosine or scalar product between the points. If you have cosines, covariances, or correlations, you can always (1) transform them to (squared) Euclidean distances, then (2) create data for that matrix of Euclidean distances (by means of Principal Coordinates or other forms of metric Multidimensional Scaling), and (3) input those data to K-Means clustering. Therefore, it is possible to make K-Means "work with" pairwise cosines or the like; in fact, such implementations of K-Means clustering exist. See also the discussion of a "K-means for distance matrix" implementation.
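Step (1) can be illustrated with a short Python sketch. For the cosine (and likewise for a correlation, where each variable has unit "self-similarity"), the squared Euclidean distance between the unit-length versions of two vectors is $2(1-\cos)$. The function names and the two example vectors below are hypothetical, chosen only for this illustration.

```python
import math

def dot(a, b):
    """Scalar product of two vectors."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))

def squared_euclidean(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_to_squared_distance(cos_value):
    """Squared Euclidean distance between unit vectors with the given cosine."""
    return 2.0 * (1.0 - cos_value)

# Hypothetical example vectors.
a = [3.0, 1.0, 2.0]
b = [1.0, 4.0, 0.5]

from_cosine = cosine_to_squared_distance(cosine(a, b))
direct = squared_euclidean(normalize(a), normalize(b))
print(abs(from_cosine - direct) < 1e-12)
```

Applying `cosine_to_squared_distance` to every entry of a cosine matrix yields the squared distance matrix that step (2) would then turn into coordinates.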

It is possible, of course, to program K-means so that it calculates directly on the square matrix of pairwise Euclidean distances. But it will work slowly, so the more efficient way is to create data for that distance matrix (converting the distances into scalar products and so on, the path outlined in the previous paragraph) and then apply the standard K-means procedure to that dataset.
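The conversion of distances into scalar products is usually done by double centering, as in classical (Torgerson) scaling. The notation is an assumption introduced here: $d_{ij}$ is the Euclidean distance between points $i$ and $j$, and bars denote the row mean, column mean, and grand mean of the squared distances.

$$
b_{ij}=-\tfrac{1}{2}\Bigl(d_{ij}^{2}-\overline{d^{2}}_{i\cdot}-\overline{d^{2}}_{\cdot j}+\overline{d^{2}}_{\cdot\cdot}\Bigr)
$$

If the distances are Euclidean, the matrix $B=(b_{ij})$ equals $XX^{\top}$ for coordinates $X$ centred at the origin, so the eigendecomposition $B=V\Lambda V^{\top}$ gives $X=V\Lambda^{1/2}$, and the rows of $X$ are the data to which standard K-means is then applied.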

Please note that I was discussing whether Euclidean or non-Euclidean dissimilarity between data points is compatible with K-means. That is related to, but not quite the same as, the question of whether non-Euclidean deviations from the centroid (in a wide sense, a centre or quasi-centroid) can be incorporated in K-means or a modified "K-means".
