When clustering data you normally choose a dissimilarity measure (such as Euclidean distance) and a clustering method that best suits your data; each method has several algorithms that can be applied. For example, say I want to use hierarchical clustering with the maximum distance measure and the single-linkage algorithm.
This is my understanding so far of how you would cluster data in statistical software such as R.
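As a concrete illustration of that combination, here is a minimal sketch in base R (toy random data, not any particular study; the object and variable counts are arbitrary):

```r
# Hierarchical clustering with the maximum (Chebyshev) distance
# and single linkage, on toy data.
set.seed(1)
x <- matrix(rnorm(40), nrow = 10)    # 10 objects, 4 variables
d <- dist(x, method = "maximum")     # largest coordinate-wise difference
hc <- hclust(d, method = "single")   # single-linkage agglomeration
plot(hc)                             # dendrogram of the 10 objects
```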
In my case I have constructed my own dissimilarity matrix via ks.boot
in R, calculating p-values for all pairs of my studies, so 10 studies would generate a 10×10 matrix of p-values. I then subtracted each p-value from 1 and converted this matrix into a distance object. I used hclust
to cluster my data with the single, average and complete
linkage algorithms.
library(Matching)  # provides ks.boot

data <- read.csv(file = "Data.csv", header = TRUE)

# Pairwise bootstrap K-S p-values: row i of `data` holds counts for the
# values 0..14, which rep() expands back into raw observations.
n <- nrow(data)
mat <- outer(1:n, 1:n, Vectorize(function(i, j) {
  ks.boot(rep(seq(0, 14, 1), times = as.numeric(data[i, ])),
          rep(seq(0, 14, 1), times = as.numeric(data[j, ])))$ks.boot.pvalue
}))

# Convert p-values to dissimilarities and cluster
d <- as.dist(1 - mat)
hc1_c <- hclust(d, method = "complete")
hc1_a <- hclust(d, method = "average")
hc1_s <- hclust(d, method = "single")
plot(hc1_c)
plot(hc1_a)
plot(hc1_s)
I'm slightly confused about:
- Standardisation/transformation of the data and why it's necessary. Where did I standardise my data? Perhaps when I converted the matrix into a distance object?
- What's the name of the method I have used as a dissimilarity measure, and how are p-values a viable way of calculating distances between objects? I.e. why would you use a K-S, Anderson-Darling or Cramér-von Mises p-value as a measure of distance?
- How can I visualise the uncertainty in the clustering and compare which algorithm (single, average, complete) clusters my data best?
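For the last point, one standard (though not the only) check is the cophenetic correlation: how well each dendrogram's implied distances agree with the original dissimilarities. A hedged sketch, using a toy stand-in for the 1 − p-value distance object so it runs on its own:

```r
# Compare linkage methods by cophenetic correlation.
# `d` stands in for the dist object built from 1 - p-values;
# here it is generated from random data so the snippet is self-contained.
set.seed(1)
d <- dist(matrix(runif(50), nrow = 10))

coph_cor <- function(method) {
  hc <- hclust(d, method = method)
  cor(d, cophenetic(hc))  # closer to 1 = dendrogram preserves d better
}

res <- sapply(c("single", "average", "complete"), coph_cor)
print(res)
```

For visualising uncertainty specifically, the pvclust package attaches bootstrap p-values to each cluster in the dendrogram, which may be closer to what the question is after.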
Best Answer