I'm running a hierarchical clustering on a sample of data using the steps below:
library(RODBC)
setwd('D:/r/cluster2')
channel <- odbcConnectExcel('cluster.xls')
data <- sqlFetch(channel, 'clust9')
y9 <- data.frame(inf=data$infest, faible=data$faible, moyen=data$moyen, fort=data$fort, lon=data$Lon, lat=data$Lat)
y9 <- na.omit(y9) # listwise deletion of missing
y9.use <- y9
y9 <- scale(y9) # standardize variables
wss <- (nrow(y9)-1)*sum(apply(y9,2,var))
for (i in 2:15) wss[i] <- sum(kmeans(y9, centers=i)$withinss)
plot(1:15, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")
# K-Means Cluster Analysis
fit <- kmeans(y9, 5) # 5 cluster solution
aggregate(y9,by=list(fit$cluster),FUN=mean)
y9 <- data.frame(y9, fit$cluster)
# Ward Hierarchical Clustering
d <- dist(y9, method = "euclidean") # distance matrix (note: y9 now includes the fit$cluster column appended above)
fit <- hclust(d, method = "ward.D2") # "ward" is defunct in current R; use "ward.D2" for Ward's method
plot(fit) # display dendrogram
rect.hclust(fit, k=5, border="red")
and I got these results:
But when I ran the same steps the next day, I got different results:
They are not different in everything, but some individuals now belong to another cluster!
I don't know why this happens. I'm interested in interpreting and explaining the results, so if I get different results each time, my previous interpretation becomes wrong. What can I do now?
Best Answer
You are using the kmeans function, which will not give the exact same results every time you run it. The k-means algorithm starts from randomly chosen centroids, which are generated using R's pseudorandom number generator (PRNG).
The PRNG generates a series of random values that depend on a seed (see ?set.seed). If you want to always obtain the same results, you should set a seed at the start of your script.
For instance:
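A minimal sketch, using the built-in iris data as a stand-in for your own (the seed value 123 is arbitrary):

```r
set.seed(123)                          # fix the PRNG state
fit1 <- kmeans(scale(iris[, 1:4]), centers = 3)
set.seed(123)                          # reset to the same state
fit2 <- kmeans(scale(iris[, 1:4]), centers = 3)
identical(fit1$cluster, fit2$cluster)  # TRUE: same seed, same clustering
```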
Different seeds will give different results, but once you have fixed a seed the results will always be the same.
Now, the fact that the results are not different in everything is a good thing: it means that you can cluster most individuals with good confidence. The ones that change between clusters are probably a bit "borderline".
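One way to see which individuals are "borderline" is to cross-tabulate the assignments from two differently-seeded runs (a sketch on iris; cluster labels are arbitrary between runs, so only the pattern of the table matters):

```r
set.seed(1)
a <- kmeans(scale(iris[, 1:4]), centers = 3)$cluster
set.seed(2)
b <- kmeans(scale(iris[, 1:4]), centers = 3)$cluster
# cells off the dominant correspondence pattern are points that switched clusters
table(a, b)
```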
One thing that you should do, however, is to set the nstart parameter in kmeans. Setting nstart to 10, for instance, will make the algorithm run 10 times with 10 different starting sets of centroids and return the best fit (the one with the minimum within-cluster sum of squares). This helps reduce "bad clustering" due to an "unlucky" choice of starting points.
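A short sketch of the effect, again with iris standing in for the real data (k = 3 and the seed are arbitrary):

```r
x <- scale(iris[, 1:4])
set.seed(1)
fit1  <- kmeans(x, centers = 3, nstart = 1)   # single random start
set.seed(1)
fit10 <- kmeans(x, centers = 3, nstart = 10)  # best of 10 random starts
# the best-of-10 fit is never worse than the single start it includes
fit10$tot.withinss <= fit1$tot.withinss
```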
Finally, I am not completely sure what the point is of running hclust on the kmeans results. Either run hclust directly on the original data, or just show the kmeans results.
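For the first option, a minimal sketch (again with iris as a stand-in for the original, scaled data; note that recent R versions spell the Ward method "ward.D2"):

```r
x <- scale(iris[, 1:4])               # stand-in for the scaled y9 data
d <- dist(x, method = "euclidean")    # pairwise Euclidean distances
hc <- hclust(d, method = "ward.D2")   # Ward clustering on the original data
groups <- cutree(hc, k = 3)           # cut the tree into 3 clusters
table(groups)                         # cluster sizes
```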