Solved – Validity Index Pseudo F for K-Means Clustering

Tags: clustering, k-means, validity

The Validity Index "Pseudo F" is described as:

(between-cluster-sum-of-squares / (c-1)) / (within-cluster-sum-of-squares / (n-c))

with c being the number of clusters and n being the number of observations. It measures the separation between the clusters relative to their compactness and should be high.
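This index is also known as the Calinski–Harabasz index. As a minimal sketch in R (the language used in the answer below), it can be computed directly from a `kmeans` fit, whose `betweenss` and `tot.withinss` components are the two sums of squares in the formula; the toy data here is only a stand-in:

```r
# Pseudo F (Calinski-Harabasz index) from a kmeans fit:
# (between-SS / (c - 1)) / (within-SS / (n - c))
pseudo_f <- function(fit, n) {
  c <- length(fit$size)  # number of clusters
  (fit$betweenss / (c - 1)) / (fit$tot.withinss / (n - c))
}

set.seed(1)
x <- matrix(rnorm(300), ncol = 2)           # toy stand-in for the real data
fit <- kmeans(x, centers = 3, nstart = 25)
pseudo_f(fit, nrow(x))
```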

I want to apply K-Means for a large dataset. (about 3000 observations with about 200 values each).

When I compute the Pseudo F for cluster counts from 2 through 20, it decreases continuously as the number of clusters increases.

How should I interpret these values?
Possibilities might be:

  • The validity index is inappropriate for this type of clustering
  • K-Means is not very applicable to this dataset
  • The clustering solution with 2 clusters is the best

WSS per cluster (plot not shown)

Best Answer

Pseudo F describes the ratio of between-cluster variance to within-cluster variance. If Pseudo F is decreasing, that means either the within-cluster term (the denominator) is increasing or staying static, or the between-cluster term (the numerator) is decreasing.

Within-cluster variance measures how tightly each cluster fits together: the higher the number, the more dispersed the cluster; the lower the number, the more compact the cluster. Between-cluster variance measures how separated the clusters are from each other.

The k-means objective is to minimize within-cluster variance (which, since the total variance is fixed, necessarily maximizes between-cluster variance). Note that adding clusters can only shrink the within-cluster sum of squares, but each added cluster also inflates the (c-1) divisor in the numerator and costs a degree of freedom in the denominator. So a continuously decreasing Pseudo F is best read as: each extra cluster is not improving separation enough to pay for itself.
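In R, `kmeans` reports this decomposition directly: `totss` splits exactly into `betweenss` plus `tot.withinss`, which is why minimizing one term maximizes the other. A quick check on toy data:

```r
set.seed(1)
x <- matrix(rnorm(200), ncol = 2)           # toy data
fit <- kmeans(x, centers = 4, nstart = 10)
# total SS = between-cluster SS + within-cluster SS (up to rounding)
all.equal(fit$totss, fit$betweenss + fit$tot.withinss)
```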

With that said, all of your interpretations are possible. But before writing k-means off, try the elbow method (plot the number of clusters against the within-group sum of squares): if there's no elbow in the plot, it's usually a good sign that k-means won't provide useful results (or at least that's my litmus test).

Here's an example of the elbow method in R (from http://www.statmethods.net/advstats/cluster.html), where "mydata" is your data.

# Determine number of clusters via the elbow method
set.seed(123)                       # reproducible cluster assignments
wss <- (nrow(mydata) - 1) * sum(apply(mydata, 2, var))  # WSS for k = 1
for (i in 2:20) wss[i] <- sum(kmeans(mydata, centers = i, nstart = 25)$withinss)
plot(1:20, wss, type = "b",
     xlab = "Number of Clusters",
     ylab = "Within groups sum of squares")
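The same sweep over k can record Pseudo F alongside the WSS, so both criteria come from one pass. A sketch on toy data (substitute your own matrix for mydata):

```r
set.seed(1)
mydata <- matrix(rnorm(600), ncol = 3)  # toy stand-in for the 3000 x 200 data
n <- nrow(mydata)
ks <- 2:20
pseudoF <- sapply(ks, function(k) {
  fit <- kmeans(mydata, centers = k, nstart = 25, iter.max = 50)
  (fit$betweenss / (k - 1)) / (fit$tot.withinss / (n - k))
})
plot(ks, pseudoF, type = "b",
     xlab = "Number of Clusters", ylab = "Pseudo F")
```

If this curve is monotonically decreasing while the WSS plot shows no elbow, that supports the "2 clusters, or no strong cluster structure at all" reading.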