Solved – Are there cases where there is no optimal k in k-means

Tags: clustering, k-means, machine learning

This has been on my mind for at least a few hours. I was trying to find an optimal k for the output of the k-means algorithm (with a cosine similarity metric), so I ended up plotting the distortion as a function of the number of clusters. My dataset is a collection of 800 documents in a 600-dimensional space.

From what I understand, finding the knee (or elbow) point on this curve should tell me at least approximately how many clusters I should put my data into. I have put the graph below. The red vertical line marks the point found by the maximum-second-derivative test. After doing all this, I am stuck on something much simpler: what does this graph tell me about the dataset?
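For concreteness, here is a minimal sketch of the maximum-second-derivative test on a distortion curve. It uses sklearn's `KMeans` on toy `make_blobs` data as a stand-in for the 800-document collection (sklearn's k-means uses Euclidean distortion/inertia, not the cosine metric from the question), and a discrete second difference in place of a true second derivative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy stand-in for the document vectors (the question has 800 docs x 600 dims).
X, _ = make_blobs(n_samples=200, n_features=20, centers=4, random_state=0)

ks = range(1, 11)
# "Distortion" here is k-means inertia: the sum of squared distances
# from each point to its assigned centroid.
distortions = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
               for k in ks]

# Discrete second difference: d2[i] = f(k=i+3) - 2*f(k=i+2) + f(k=i+1).
second_diff = np.diff(distortions, n=2)

# The largest second difference marks the sharpest bend; re-align the
# index so it refers to the middle k of each second-difference triple.
elbow_k = ks[int(np.argmax(second_diff)) + 1]
print("estimated elbow at k =", elbow_k)
```

On data with real cluster structure the curve drops steeply up to the true number of clusters and flattens afterwards, which is what gives the second difference a clear maximum.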

Does it tell me that it is not worth clustering and that my documents lack structure, or that I need to set a very high k? One strange thing, though, is that even with low k I am seeing similar documents clustered together, so I am not sure why I am getting this curve. Any thoughts?

[Figure: distortion vs. number of clusters, decreasing smoothly with no clear elbow; the red vertical line marks the point chosen by the maximum-second-derivative test]

Best Answer

In most situations, I would have thought that such a plot basically means there is no cluster structure in the data. However, clustering in very high dimensions such as this is tricky, because under the Euclidean metric all pairwise distances tend toward the same value as the number of dimensions increases, so the contrast between "near" and "far" points vanishes. See this Wikipedia page for references to some papers on this topic. In short, it may just be the high dimensionality of the dataset that is the problem.

This is essentially "the curse of dimensionality", see this Wikipedia page as well.
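The distance-concentration effect is easy to demonstrate. A minimal sketch, using random standard-normal points as an assumed stand-in for real document vectors: the "relative contrast" (max minus min pairwise distance, divided by the min) collapses as the dimension grows:

```python
import numpy as np

rng = np.random.default_rng(0)
ratios = {}

# As dimensionality grows, pairwise Euclidean distances concentrate:
# the spread (max - min) shrinks relative to the smallest distance,
# so nearest and farthest neighbours become nearly indistinguishable.
for d in (2, 20, 600):
    pts = rng.standard_normal((100, d))
    diffs = pts[:, None, :] - pts[None, :, :]   # shape (100, 100, d)
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    dists = dists[np.triu_indices(100, k=1)]    # keep unique pairs only
    ratios[d] = (dists.max() - dists.min()) / dists.min()
    print(f"d={d:4d}  relative contrast={ratios[d]:.2f}")
```

At d=600, the dimensionality of the question's documents, the contrast is small, which is consistent with a distortion curve that shows no pronounced elbow.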

A paper that may be of interest is Sanguinetti, G., "Dimensionality reduction of clustered data sets", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 3, pp. 535-540, March 2008 (www), which is a bit like an unsupervised version of linear discriminant analysis: it seeks a low-dimensional space that emphasises the cluster structure. Perhaps you could use that as a feature extraction method before performing k-means?
