I am trying to do a multivariate cluster analysis as follows.
I read the data from a file and cluster it using k-means:
data <- read.csv("data_file")
str(data)
'data.frame': 10 obs. of 3 variables:
$ A : num 2.64 2.01 2.02 1.85 1.94 ...
$ B : num 5.45 5.14 5.16 4.82 4.92 ...
$ C : num 7.58 7.66 7.74 7.57 7.52 ...
data2 <- scale(data)
fit1 <- kmeans(data2, 3)
fit1$cluster
[1] 2 2 2 1 1 1 3 3 3 3
fit1$centers
A B C
1 0.1524144 -1.0545162 0.5133913
2 1.0523695 0.8632014 0.9234564
3 -0.9035879 0.1434861 -1.0776358
Now I have the three clusters, and for each cluster I have the centroid coordinates. I would like to pick a representative item for each cluster. It is important that the representative item is part of the data.
So my idea is to calculate the distance from each item in the data to the centroid of each cluster, and to choose as the representative for cluster X the item with the minimum distance to the centroid of cluster X.
I have already read this useful answer, but I am having trouble adapting it to my case.
I was thinking of adding a column to the centroid matrix to give each cluster a name (such as a, b, c, ...) and then proceeding as the other answer suggests, but I am not getting anywhere, just errors.
Best Answer
The obvious choice for a representative from the original data with k-means would of course be the object closest to the cluster center.
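A minimal sketch of that idea in R, assuming Euclidean distance on the scaled data. The random data frame below is a hypothetical stand-in for the asker's file, and `representatives` is a name introduced here for illustration. For each cluster, it looks only at that cluster's members and keeps the row index of the one nearest to the centroid:

```r
# Hypothetical stand-in data; replace with the scaled data from the question.
set.seed(42)
data2 <- scale(data.frame(A = rnorm(10), B = rnorm(10), C = rnorm(10)))
fit1 <- kmeans(data2, centers = 3, nstart = 10)

# For each cluster k, compute the Euclidean distance from every member
# of the cluster to centroid k and keep the index of the closest member.
representatives <- sapply(seq_len(nrow(fit1$centers)), function(k) {
  members <- which(fit1$cluster == k)
  diffs <- sweep(data2[members, , drop = FALSE], 2, fit1$centers[k, ])
  d <- sqrt(rowSums(diffs^2))
  unname(members[which.min(d)])
})

representatives             # row indices into the original data
data2[representatives, ]    # the representative rows (scaled values)
```

Restricting the search to the members of each cluster guarantees that the representative actually belongs to the cluster it represents. Because the indices refer to rows, the same `representatives` vector can be used to look up the unscaled observations in the original data frame.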
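However, if this is your objective, you should probably be using PAM (partitioning around medoids) instead of k-means in the first place, because PAM optimizes the deviation from a data point rather than from a mean. Each PAM cluster center (medoid) is an actual observation, so the representatives it yields are expected to be better than ones picked after the fact from a k-means result.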
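As a rough sketch of the PAM route, assuming the `cluster` package (one of the recommended packages that ships with standard R installations) and the same hypothetical stand-in data in place of the asker's file:

```r
library(cluster)  # provides pam()

# Hypothetical stand-in data; replace with the scaled data from the question.
set.seed(42)
data2 <- scale(data.frame(A = rnorm(10), B = rnorm(10), C = rnorm(10)))

fit2 <- pam(data2, k = 3)

fit2$id.med       # row indices of the medoids, i.e. the representatives
fit2$medoids      # the medoid rows themselves (scaled coordinates)
fit2$clustering   # cluster assignment for each observation
```

Here no extra distance computation is needed: `fit2$id.med` already points at rows of the data, one per cluster.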