Solved – Dissimilarity in Clustering

clustering, r, standardization

When clustering data, you normally choose a dissimilarity measure (such as Euclidean distance) and a clustering method that suits your data; each method in turn has several algorithms that can be applied. For example, let's say I want to use hierarchical clustering with the maximum distance measure and the single-linkage algorithm.
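As a minimal sketch of that example (the toy matrix `x` and the choice of two clusters are made up purely for illustration), the three choices map onto base R like this:

```r
# Hypothetical toy data: 6 observations with 2 numeric attributes
set.seed(1)
x <- matrix(rnorm(12), nrow = 6,
            dimnames = list(paste0("obs", 1:6), c("a", "b")))

# Dissimilarity measure: maximum (Chebyshev) distance between rows
d_max <- dist(x, method = "maximum")

# Clustering method: hierarchical; algorithm: single linkage
hc <- hclust(d_max, method = "single")

# Cut the dendrogram into two groups (k = 2 is an arbitrary choice)
groups <- cutree(hc, k = 2)
```

`dist()` fixes the dissimilarity measure and `hclust()` fixes the linkage, so the two can be varied independently.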

This is my understanding so far of how you would cluster data in statistical software such as R.

In my case, I constructed my own dissimilarity matrix using ks.boot in R to calculate p-values for every pair of studies, so 10 studies would generate a 10×10 matrix of p-values. I then subtracted each p-value from 1 and converted the resulting matrix into a distance object. I used hclust to cluster my data with the single, average and complete linkage algorithms.

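library(Matching)  # provides ks.boot()

data <- read.csv(file = "Data.csv", header = TRUE)
n <- nrow(data)  # one row of counts per study

# Expand row i (counts for the values 0..14) into a raw numeric sample
expand_row <- function(i) as.numeric(rep(0:14, times = unlist(data[i, ])))

# Bootstrap K-S p-value for every pair of studies
mat <- outer(seq_len(n), seq_len(n), Vectorize(function(i, j)
  ks.boot(expand_row(i), expand_row(j))$ks.boot.pvalue))

# Convert p-values to dissimilarities: a large p-value means a small distance
d <- as.dist(1 - mat)

# Hierarchical clustering with three linkage criteria
hc1_c <- hclust(d, method = "complete")
hc1_a <- hclust(d, method = "average")
hc1_s <- hclust(d, method = "single")
plot(hc1_c)
plot(hc1_a)
plot(hc1_s)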

I'm slightly confused about:

  1. Standardisation/transformation of data and why it's necessary. Where did I standardise my data? Maybe when I converted the matrix into a distance object?
  2. What's the name of the method I have used as a dissimilarity measure, and how are p-values a viable way of calculating distances between objects? I.e., why would you use a K-S test or Anderson-Darling/Cramér-von Mises p-value as a measure of distance?
  3. How can I visualise the uncertainty in clustering and compare which algorithm (single, average, complete) clusters my data best?

Best Answer

  1. Standardization is not necessary. It is often better than not standardizing when you have attributes on different scales, but it is a heuristic, not a must. You apply it before computing distances, so it does not apply in your use case, where you built the distance matrix directly.
  2. You don't need a name for this; it's just a distance matrix. You would not even need the 1-p transformation if you used an implementation of single-link that works on similarities rather than distances (the same algorithm can be implemented for similarities, too).
  3. Uncertainty here will not be reliable. You'd need much more data to quantify uncertainty.
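
To illustrate point 2: single linkage depends only on the ordering of the dissimilarities, which is why the exact form of the 1-p transformation does not matter for it. Here is a rough sketch with made-up p-values (the matrix `p` is random, not your data, and `-log(p)` is just one arbitrary alternative that also decreases as p grows):

```r
# Hypothetical symmetric matrix of pairwise p-values for 8 studies
set.seed(42)
n <- 8
p <- matrix(0, n, n)
p[upper.tri(p)] <- runif(n * (n - 1) / 2, min = 0.01, max = 0.99)
p <- p + t(p)   # mirror the upper triangle into the lower one
diag(p) <- 1    # a study compared with itself

# Two different order-reversing transformations of the same p-values
d_a <- as.dist(1 - p)
d_b <- as.dist(-log(p))

hc_a <- hclust(d_a, method = "single")
hc_b <- hclust(d_b, method = "single")

# Compare the cluster memberships at every cut level from 2 to n - 1
ks <- 2:(n - 1)
same_single <- identical(cutree(hc_a, k = ks), cutree(hc_b, k = ks))
```

With no tied p-values, `same_single` should come out `TRUE`: only the merge heights differ between the two trees. The same is not guaranteed for average linkage, which averages the actual distance values.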