Solved – Predicting SSE in k-means clustering

clusteringestimationk-meansmonte carlo

Given any number of clusters, is it possible to estimate the Sum of Squares Error (SSE) for the Clusters after adding noise to the clustering?

The type of noise generated will be supplied as a parameter. Any method(s) must be able to cater for Gaussian and Uniform noise.

Best Answer

After your latest comment I would opt for a Monte-Carlo estimate. What you would do is create the noise you want (a large number of times) randomly and then create an estimate of the expected value of the SSE after noise based on your results.

Some links that can get you started on Monte-Carlo simulation and estimates:


Edit

In response to the OP's comment.

What you do in order to get an estimate using the Monte Carlo is to actually add the amount of noise of the type you require an check the change in the SSE.

You repeat this again, and get another value for the change in the SSE.

You keep on repeating several times (perhaps a few thousands, maybe a few hundreds of thousands or even more) and you start to notice that the mean of the change in the SSE will start to converge to some value. This value is your expected mean.

Of course you can get more information than just the mean.

The advantages of using the Monte Carlo approach is that it is easy to understand (at least to the level you require) and implement while giving you good results and it is flexible enough to allow you to modify the initial problem and reuse (after performing the test again from scratch!).

The only downside is that you need to calculate the change in SSE many times - but with today's computational power I do not foresee that this should be a problem.