Solved – Choosing clusters for k-means: the 1 cluster case

clusteringk-meansr

Does anyone know a good method to determine if clustering using kmeans is even appropriate? That is, what if your sample is actually homogenous? I know something like a mixture model (via mclust in R) will provide fit statistics for the 1:k cluster case, but it seems like all of the techniques to evaluate kmeans requires at least 2 clusters.

Does anyone know of a technique to compare the 1 and 2 cluster cases for kmeans?

Best Answer

The gap statistic is a great way of doing this; Tibshirani, Hastie & Walther (2001).

http://stat.ethz.ch/R-manual/R-devel/library/cluster/html/clusGap.html - The relevant R package.

The idea is that it performs a sequential hypothesis test of clustering your data for K=1,2,3,... vs a null hypothesis of random noise, which is equivalent to one cluster. Its particular strength is that it gives you a reliable indication of whether K=1, i.e. whether there are no clusters.

Here's an example, I was inspecting some astronomy data a few days ago as it happens - namely from a transiting exoplanet survey. I wanted to know what evidence there are for (convex) clusters. My data is 'transit'

library(cluster)
cgap <- clusGap(transit, FUN=kmeans, K.max=kmax, B=100)
for(k in 1:(kmax-1)) {
    if(cgap$Tab[k,3]>cgap$Tab[(k+1),3]-cgap$Tab[(k+1),4]) {print(k)}; 
    break;
}

With the gap statistic you're looking for the first value of K where the test 'fails' i.e. the gap statistic significantly dips. The loop above will print such a k, however simply plotting cgap gives you the following figure:
enter image description here See how there's a significant dip in the Gap from k=1 to k=2, that signifies there are in fact no clusters (i.e. 1 cluster).

Related Solutions

Solved – K-means clustering for usage profiling

Have you validated your results in any way?

It seems that you want to do unsupervised classification. That usually doesn't really work too well, in particular for this kind of data and with this method. K-means is more a vector quantization method than meant to find how clusters are separated. I.e. it will - always - discretize your data into $k$ groups, even when there is no separating gap inbetween!

A negative value means that the record is more similar to the records of its neighboring cluster than to other members of its own cluster.

This seems to be what is happening here. K-means breaks apart data that should be in the same cluster.

But my larger concern is that your data may be inappropriate for k-means. K-means minimizes the within-cluster-sum-of-squares (WCSS). But given that your axes are from different domains, they do not necessarily have the same scale. K-means implicitely assumes squared Euclidean distance (which is the sum-of-squares) and this may be an inappropriate measure of similarity for your data, in particular without extensive preprocessing. You could try the following approach:

define an appropriate measure of similarity for your data. Spend a lot of effort here!
use metric learning techniques (e.g. non-metric multidimensional scaling) to obtain a vector space where Euclidean distance is appropriate
run k-means in this projected data
to assign a new observation to the clusters, apply the same preprocessing as in 1), then the same projection as in 2) and then assign it to the nearest mean in 3)

A common failure with k-means is to run it on your data without first checking that this is appropriate; that the dimensions bear the same amount of relevant information on the same scale. The simplest heuristic is to use whitening but more often than not (e.g. when having discrete or binary attributes) this will not be enough.

But even with all these efforts, k-means may still fail badly. Because it assumes clusters have the same "diameter". So if one of your users has a very narrow usage profile (always using the webbrowser only, with a single tab), and the other has a very wide usage (word, browser, email, ...) all open at the same time or not, k-means may just be based on the wrong assumptions: clusters in k-means are expected to have the same diameter.

R Time-Series – Fitting Mixture of Distributions to Time-Series Data

There is a misunderstanding in your question that needs a correction. Time-series model is not univariate since you have two variables: actual values and time. To provide an example let's take a time-series data, say woolyrnq data from forecast R library (plotted below).

Now, if you use univariate Mclust to find clusters it will ignore the time component and find two clusters.

----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm 
----------------------------------------------------

Mclust V (univariate, unequal variance) model with 2 components:

 log.likelihood   n df     BIC       ICL
      -984.6021 119  5 -1993.1 -2002.634

Clustering table:
 1  2 
84 35

We can also plot the density of fitted clusters:

If you look at the x-axis of this plot, you'll learn that it is related to values of your data (y-axis on the first plot), not to time. Now, if we color the point-values of the time series by cluster assignments, it will be more clear:

The method discovered clusters of "high" and "low" values, independent of time. The same applies to the eight clusters discovered by Mclust with your data - they ignore the time, so are unrelated to the peaks marked by you on the second plot in your question.

If you want to use Mclust for such data, you need to use a bivariate model including time. For example, with the woolyrnq data you can obtain two such clusters

fit2 <- Mclust(data.frame(x = woolyrnq, y = time(woolyrnq)))
plot(x, col = fit2$classification)

Or illustrated as 2-dimmensional density plot:

As you can see, now the clusters relate to the "higher" wool production in Australia up to the early 1970' and "lower" production afterwards. Notice that this is a bivariate, rather than univariate, model. The plot from the paper that you refer to is a marginalized version of such multidimensional density plot and can be easily obtained by extracting mean and variance objects from parameters in Mclust object (example below).

# densities are multiplied by arbitrary constants to fit the y-axis
curve(dnorm(x, fit2$parameters$mean[2, 2], fit2$parameters$variance$sigma[2,2,2])*1e5, add = F, col="green", from = 1965, to = 1995, ylim = c(2000, 8000), xlab = "time", ylab = "woolyrnq")
curve(dnorm(x, fit2$parameters$mean[2, 1], fit2$parameters$variance$sigma[2,2,1])*5e5, add = T, col="red", from = 1965, to = 1995)
lines(as.numeric(time(woolyrnq)), as.numeric(woolyrnq))

The plot above, if expanded a little bit, could be also a very good example of why using such method is not really the best way to go with time series, what would get obvious if you look at the plot below.

As you can see, if you made predictions from such mixture model, you'll conclude that there were literally no wool production in Australia before 1850 and there would be no such production in ninety years from now. Time series are not really Gaussian shaped, so such methods should be used with caution.

R note: In the example provided ts object was used, where information about time units was available by the time method. However if you are not using a ts object, than you have to use additional variable that describes the time with appropriate time units.

Best Answer

Related Solutions

Solved – K-means clustering for usage profiling

R Time-Series – Fitting Mixture of Distributions to Time-Series Data

Related Question