Solved – Cross-validation for k-means clustering in R

cross-validationk-meansmachine learningr

I have a dataset of two columns (we can call them x and y).
I understand that for cross-validation I need to split my data into k partitions, and for that the general consensus is that I use createFolds. However, I am a little confused about the output. It seems like I can't use createFolds on my entire dataset, it seems like I have to specify something like createFolds(data$x, k=5). Then I get an output of 5 folds with a row of numbers, and I haven't been able to find documentation on what the output means exactly.

My second question is how to apply cross-validation in R to my test dataset? As I understand it, in R all I need to do to perform k-means clustering on is to specify kmeans(data, centres,…). So if I apply that command to my held-out dataset, how can I use that for cross validation on my test dataset? Thank you!

Best Answer

I think it would help if you would describe more what your actual goal is.

In supervised learning, there is always a clear definition of error that quantifies how well $f(X)$ is an approximation of $y$. We can then use cross validation to obtain an estimate o the out-of-sample (or generalization) error of a model. This error we can use to do mode selection.

In unsupervised learning, such as clustering, there is usually no clear definition of error. Due to this, also cross-validation cannot be used for this purpose.

However, there are some methods that determine the quality of a clustering via its stability. Here, stability is assessed via cross-validation or bootstrap schemes, see e.g.

Ben-Hur, Asa, Andre Elisseeff, and Isabelle Guyon. "A stability based method for discovering structure in clustered data." Pacific symposium on biocomputing. Vol. 7. 2001.

Tibshirani, Robert, and Guenther Walther. "Cluster validation by prediction strength." Journal of Computational and Graphical Statistics 14.3 (2005): 511-528.

Fang, Yixin, and Junhui Wang. "Selection of the number of clusters via the bootstrap method." Computational Statistics & Data Analysis 56.3 (2012): 468-477.

or

Haslbeck, J., & Wulff, D. U. (2016). Estimating the Number of Clusters via Normalized Cluster Instability. arXiv preprint arXiv:1608.07494.

which contains a summary of these approaches.

Related Question