Solved – TSS returned by k-means clustering is always the same

clustering, k-means, r

I have a high-dimensional ($m \approx 2000$), high-sample ($n = 140{,}000$) dataset in R. I load it into memory, run PCA on it (which returns $m \approx 400$ components to cover 95% of the variance), and then run k-means clustering on the result. However, even with wildly different numbers of clusters (from 1 to 1000), I always get the same total sum of squares. The clusters assigned to the data points I have inspected do seem reasonable at first sight (they change appropriately with the number of clusters).

So codewise:

library(caret)

# BoxCox-transform, center, scale, then project onto principal components
trans = preProcess(train, method = c('BoxCox', 'center', 'scale', 'pca'))
train_pc = predict(trans, train)

kNumbers = c(1, 5, 10, 15, 20, 25, 100, 1000)
for (i in kNumbers) {
    model = kmeans(train_pc, centers = i, nstart = 10)
    cat(model$totss, '\n')   # note: '+' does not concatenate strings in R
}

Things I have tried:

  • Increasing iter.max from 10 to 100 (after which there are usually no more warnings about lack of convergence)
  • Increasing nstart (from 1 to 10)
  • Preprocessing the data (BoxCox, center, scale)

Ultimately, this is just part of a feature-engineering step, so if there is some generic way of giving the subsequent algorithm an idea of "big picture" closeness, I am open to that as well; roughly what I have in mind is sketched below.
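For instance, something along these lines (a rough sketch only; the centroid-distance idea and the choice of 25 clusters are placeholders, and train_pc is the PCA output from the code above):

# Rough sketch: append each point's distance to every cluster centroid as
# extra features for a downstream model. The number of clusters (25) is
# arbitrary; train_pc is the PCA-transformed data from above.
km = kmeans(train_pc, centers = 25, nstart = 10, iter.max = 100)
centroid_dist = apply(km$centers, 1, function(cc) {
    sqrt(rowSums(sweep(as.matrix(train_pc), 2, cc)^2))
})                                    # n x 25 matrix of Euclidean distances
train_features = cbind(train_pc, centroid_dist)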

Any ideas?

Best Answer

The total sum of squares is computed from the data alone: it is the sum of squared distances from every point to the overall centroid. Unless you change the data set, it is not expected to change.
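You can check this directly (a minimal check, using the train_pc and model objects from the question): kmeans computes totss as the sum of squared deviations from the overall centroid, so it depends only on the data, not on the clustering.

# totss depends only on the data, not on the cluster assignment
tss = sum(scale(train_pc, center = TRUE, scale = FALSE)^2)
all.equal(tss, model$totss)           # TRUE for any number of centers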

What you are interested in is the WCSS, the within-cluster sum of squares (reported as tot.withinss in the kmeans output).
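In the loop from the question, that means printing model$tot.withinss rather than model$totss, e.g.:

for (i in kNumbers) {
    model = kmeans(train_pc, centers = i, nstart = 10, iter.max = 100)
    cat(i, model$tot.withinss, '\n')  # this does vary with the clustering
}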

Note that this value is expected to decrease as the number of clusters grows, so don't use it to estimate the number of clusters. Instead, visualize, analyze, and evaluate your data; don't trust any single-number statistic.