Pseudo F describes the ratio of between-cluster variance to within-cluster variance. If pseudo F is decreasing, that means either the within-cluster variance (the denominator) is increasing or staying flat, or the between-cluster variance (the numerator) is decreasing.
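For reference, the usual (Calinski-Harabasz) form of pseudo F is

    pseudo F = [B / (k - 1)] / [W / (n - k)]

where B is the between-cluster sum of squares, W is the total within-cluster sum of squares, n is the number of observations and k is the number of clusters (your software's exact definition may differ slightly).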
Within-cluster variance measures how tightly each cluster fits together: the higher the number, the more dispersed the cluster; the lower the number, the more compact the cluster. Between-cluster variance measures how separated the clusters are from each other.
The k-means objective is to minimize within-cluster variance (and thereby maximize between-cluster variance). So the way you can interpret a decreasing pseudo F is: as the number of clusters increases, the within-cluster variance increases, making the clusters more dispersed/less compact and therefore less effective (and potentially closer to other clusters).
With that said, all of your interpretations are possible. But before going ahead and writing off k-means, you should try the elbow method (plot the number of clusters against the between-group variance divided by the total variance): if there's no elbow in the plot, it's usually a good sign that k-means won't provide useful results (or at least that's my litmus test).
Here's an example of the elbow method using R code (from http://www.statmethods.net/advstats/cluster.html), where "mydata" is your data (this version plots the within-group sum of squares rather than the variance ratio, but the idea is the same):
# Determine number of clusters
# wss[1] is the total sum of squares (within-group SS for a single cluster)
wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))
# within-group sum of squares for k = 2..20 clusters
for (i in 2:20) wss[i] <- sum(kmeans(mydata, centers=i)$withinss)
plot(1:20, wss, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares")
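To tie this back to the statistic you started from, here is a small sketch of my own (not from the linked page) that computes the Calinski-Harabasz pseudo F directly from kmeans() output, again assuming your data frame is called "mydata":

# Pseudo F (Calinski-Harabasz) for k = 2..20 clusters
pseudo_f <- sapply(2:20, function(k) {
  fit <- kmeans(mydata, centers = k, nstart = 25)
  n <- nrow(mydata)
  (fit$betweenss / (k - 1)) / (fit$tot.withinss / (n - k))
})
plot(2:20, pseudo_f, type = "b", xlab = "Number of Clusters", ylab = "Pseudo F")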
K-means starts by allocating all cluster centers randomly and then looks for "better" solutions. K-means++ starts by allocating one cluster center randomly and then searches for the remaining centers given the ones already chosen. So both algorithms use random initialization as a starting point and can give different results on different runs. As an example you can check this lecture: Clustering As An Example Inference Problem; around the 40th minute there are examples of k-means runs, but the whole lecture is interesting.
So, answering your questions:
- No. Because of the random initialization, different runs can give different results (see the examples in the lecture). They should give comparable results, but this is not guaranteed. Also, since all the centers are initialized randomly in k-means, it can give different results than k-means++.
- K-means can give different results on different runs.
- The k-means++ paper provides Monte Carlo simulation results showing that k-means++ is both faster and gives better performance, so there is no guarantee, but it may be better.
As for your problem: what k-means++ does is choose the centers and then start a "classic" k-means. So what you can do is (1) use the part of the algorithm that chooses the centers and then (2) feed those centers into the GPU implementation of k-means. This way at least part of the problem is solved by GPU-based software, so it should be faster.
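If your GPU k-means accepts user-supplied starting centers, the seeding step is easy to do yourself. Here is a minimal base-R sketch of the k-means++ seeding (the function name kmeanspp_centers is my own, not from any package); R's built-in kmeans() also accepts a matrix of starting centers if you want to test the idea on the CPU first:

# k-means++ seeding: first center chosen uniformly at random, each further
# center sampled with probability proportional to its squared distance from
# the nearest center chosen so far
kmeanspp_centers <- function(x, k) {
  x <- as.matrix(x)
  centers <- matrix(NA_real_, k, ncol(x))
  centers[1, ] <- x[sample.int(nrow(x), 1), ]
  for (j in 2:k) {
    d2 <- apply(x, 1, function(p)
      min(colSums((t(centers[1:(j - 1), , drop = FALSE]) - p)^2)))
    centers[j, ] <- x[sample.int(nrow(x), 1, prob = d2), ]
  }
  centers
}

init <- kmeanspp_centers(mydata, 5)      # 5 clusters, as an example
fit  <- kmeans(mydata, centers = init)   # or pass "init" to your GPU k-means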
After your latest comment I would opt for a Monte Carlo estimate. What you would do is randomly generate the noise you want (a large number of times) and then, based on your results, build an estimate of the expected value of the SSE after noise.
Some links that can get you started on Monte-Carlo simulation and estimates:
Edit
In response to the OP's comment.
What you do in order to get an estimate using Monte Carlo is to actually add the amount and type of noise you require and check the change in the SSE.
You repeat this again, and get another value for the change in the SSE.
You keep repeating this several times (perhaps a few thousand, maybe a few hundred thousand or even more) and you will notice that the mean of the change in the SSE starts to converge to some value. This value is the expected mean you are after.
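As a concrete illustration, here is a minimal sketch of my own (not your exact setup): it assumes your data is in a numeric data frame "mydata", that the noise is Gaussian with standard deviation "sigma", and that the SSE comes from k-means with "k" clusters; substitute your own noise model and SSE computation as needed.

# Monte Carlo estimate of the expected change in SSE after adding noise
x        <- as.matrix(mydata)
k        <- 5                      # assumed number of clusters
sigma    <- 0.1                    # assumed noise standard deviation
base_sse <- kmeans(x, centers = k, nstart = 25)$tot.withinss
n_reps   <- 1000                   # increase until the mean stabilizes
delta_sse <- replicate(n_reps, {
  noisy <- x + matrix(rnorm(length(x), sd = sigma), nrow(x), ncol(x))
  kmeans(noisy, centers = k, nstart = 25)$tot.withinss - base_sse
})
mean(delta_sse)   # Monte Carlo estimate of the expected change in SSE
sd(delta_sse)     # more information than just the mean, e.g. its spread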
Of course you can get more information than just the mean.
The advantage of using the Monte Carlo approach is that it is easy to understand (at least to the level you require) and to implement, it gives you good results, and it is flexible enough to let you modify the initial problem and reuse the approach (after performing the test again from scratch!).
The only downside is that you need to calculate the change in SSE many times, but with today's computational power I do not foresee this being a problem.