Solved – Different hierarchical clustering results

clustering, data-mining, r

I'm running a hierarchical clustering on a sample of data using the steps below:

library(RODBC)

setwd('D:/r/cluster2')
channel <- odbcConnectExcel('cluster.xls')
data <- sqlFetch(channel, 'clust9')

y9 <- data.frame(inf=data$infest, faible=data$faible, moyen=data$moyen, fort=data$fort, lon=data$Lon, lat=data$Lat)

y9 <- na.omit(y9) # listwise deletion of rows with missing values
y9.use <- y9
y9 <- scale(y9) # standardize variables

wss <- (nrow(y9)-1)*sum(apply(y9,2,var))
for (i in 2:15) wss[i] <- sum(kmeans(y9, centers=i)$withinss)

plot(1:15, wss, type="b", xlab="Number of Clusters", ylab="Within groups sum of squares")

# K-Means Cluster Analysis
fit <- kmeans(y9, 5) # 5 cluster solution

aggregate(y9,by=list(fit$cluster),FUN=mean)

y9 <- data.frame(y9, fit$cluster)

# Ward Hierarchical Clustering

d <- dist(y9, method = "euclidean") # distance matrix

fit <- hclust(d, method="ward.D") # "ward" was renamed "ward.D" in R 3.1.0

plot(fit) # display dendrogram

rect.hclust(fit, k=5, border="red")

and I got these results:

[image: dendrogram from the first run]

But when I repeated the same steps the next day I got different results:

[image: dendrogram from the second run]

They do not differ in everything, but some individuals now belong to a different cluster!

I don't understand why this happens. I'm interested in interpreting and explaining the results, so if I get different results each time, my previous interpretation becomes wrong. What can I do?

Best Answer

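You are using the kmeans function, which does not give exactly the same results every time you run it. Because you append fit$cluster to y9 before computing the distance matrix, that variation carries over into the dendrogram, even though hclust itself has no random step.

The k-means algorithm works by using randomly chosen centroids as a starting point. These are generated using R's pseudorandom number generator (PRNG).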

The PRNG generates a series of random values which depend on a seed. From ?set.seed:

Initially, there is no seed; a new one is created from the current time (and since R 2.14.0, the process ID) when one is required. Hence different sessions will give different simulation results, by default. However, the seed might be restored from a previous session if a previously saved workspace is restored.

If you always want to obtain the same results, set a seed at the start of your script.

For instance:

set.seed(12345)

Different seeds will give different results, but once you have fixed the seed the results will always be the same.
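As a minimal sketch of this, the snippet below runs kmeans twice with the same seed and checks that the assignments match. The matrix x is simulated and only stands in for the scaled y9 from the question; its size and the seed values are arbitrary assumptions.

```r
# Simulated stand-in for the scaled data (hypothetical, not the original y9)
set.seed(1)
x <- scale(matrix(rnorm(200 * 4), ncol = 4))

set.seed(12345)
fit_a <- kmeans(x, centers = 5)

set.seed(12345)                 # same seed, so the same starting centroids
fit_b <- kmeans(x, centers = 5)

identical(fit_a$cluster, fit_b$cluster)   # TRUE
```

The seed has to be set before every call you want to reproduce, since each kmeans call advances the PRNG state.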

Now, the fact that:

They do not differ in everything, but some individuals now belong to a different cluster!

is a good thing: it means that you can cluster most individuals with good confidence. The ones that change between clusters are probably a bit "borderline".
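One possible way to see which individuals move is to cross-tabulate the assignments from two runs. This is only a sketch on simulated data: the matrix x, the five overlapping groups, and the seeds are all assumptions made for illustration.

```r
# Simulated data with five overlapping groups (hypothetical stand-in for y9)
set.seed(1)
groups <- rep(1:5, each = 40)
x <- scale(matrix(rnorm(200 * 2, mean = 2 * groups), ncol = 2))

set.seed(101)
run1 <- kmeans(x, centers = 5)$cluster
set.seed(202)
run2 <- kmeans(x, centers = 5)$cluster

# Cluster numbers are arbitrary labels, so compare memberships, not the numbers
agreement <- table(run1 = run1, run2 = run2)
print(agreement)
```

A row whose counts sit almost entirely in one column is a cluster that both runs recovered. Counts spread over several columns point to individuals that changed cluster.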

One thing you should do, however, is set the nstart parameter of kmeans. Setting nstart to 10, for instance, makes the algorithm run 10 times from 10 different random sets of starting points and return the best fit (the one with the minimum total within-cluster sum of squares).

This will help in reducing "bad clustering" due to an "unlucky" choice of starting points.
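A rough sketch of the comparison, again on simulated data (x and the seed are assumptions, not taken from the question):

```r
# Simulated stand-in for the scaled data (hypothetical, not the original y9)
set.seed(1)
x <- scale(matrix(rnorm(200 * 4), ncol = 4))

set.seed(12345)
single <- kmeans(x, centers = 5, nstart = 1)    # one random start

set.seed(12345)
multi <- kmeans(x, centers = 5, nstart = 10)    # best of 10 random starts

# Total within-cluster sum of squares for each fit; lower means a tighter fit
c(single = single$tot.withinss, multi = multi$tot.withinss)
```

The multi-start value is usually no larger than the single-start one. nstart does not replace set.seed: the starts are still random, so a seed is still needed for exact reproducibility.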

Finally, I am not sure what the point is of running hclust on the kmeans results. Either run hclust directly on the original data, or just show the kmeans results.
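For the first option, a sketch along these lines runs hclust on the scaled variables only, with no kmeans column added. The data are simulated as a stand-in for y9, and method = "ward.D" assumes R 3.1.0 or later, where it is the name of the former "ward" method.

```r
# Simulated stand-in for the scaled data (hypothetical, not the original y9)
set.seed(1)
x <- scale(matrix(rnorm(200 * 4), ncol = 4))

d <- dist(x, method = "euclidean")     # distance matrix on the variables only

hc1 <- hclust(d, method = "ward.D")
hc2 <- hclust(d, method = "ward.D")    # repeated to show the result does not vary

groups1 <- cutree(hc1, k = 5)          # cut each tree into 5 clusters
groups2 <- cutree(hc2, k = 5)

identical(groups1, groups2)            # TRUE: no random step is involved
```

With the same input data, this dendrogram should not change from one day to the next.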