Solved – Find the optimal number of clusters in a large dataset using R

k-means, r

I have a dataset on which I ran a PCA. I want to run k-means on the individuals' coordinates on the first 5 principal components, so I have a 200000 x 5 matrix of coordinates. I'm looking for a way to determine the optimal number of clusters so that I can run k-means on this data in R. I found many methods for doing this in R (here is a list: https://stackoverflow.com/questions/15376075/cluster-analysis-in-r-determine-the-optimal-number-of-clusters). None of them worked for me because my data is too large; I get an error like: "negative length vectors are not allowed". I really need help with this because I shouldn't pick the number of clusters myself; I have to let the statistics decide. Thank you very much.

Best Answer

I actually solved my issue using the xmeans algorithm from the ‘RWeka’ package. It suited my problem better than plain kmeans: it determines the number of clusters automatically and, on my data, ran much faster than the other methods I tried. Here is a detailed mathematical description of the algorithm: https://www.cs.cmu.edu/~dpelleg/download/xmeans.pdf

And here is the manual of the package that provides the xmeans algorithm: https://cran.r-project.org/web/packages/RWeka/RWeka.pdf
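For anyone who wants to try it, here is a minimal sketch of how the call might look. It is not the exact code I ran: the simulated `coords` matrix only stands in for the PCA coordinates, and the control values (`I`, `L`, `H`, `S`) are illustrative assumptions that you should tune for your own data. XMeans is distributed as an optional Weka package, so it has to be installed through RWeka's package manager once, which needs a working Java setup and network access.

```r
library(RWeka)

# One-time setup: XMeans is an optional Weka package, not bundled with RWeka.
WPM("refresh-cache")
WPM("install-package", "XMeans")

# Stand-in for the 200000 x 5 matrix of PCA coordinates (assumption:
# replace this with your own scores, e.g. the first 5 columns of pca$x).
set.seed(42)
coords <- as.data.frame(matrix(rnorm(2000 * 5), ncol = 5))

# L and H bound the number of clusters the algorithm may choose;
# I is the maximum number of overall iterations, S the random seed.
fit <- XMeans(coords, control = Weka_control(I = 10, L = 2, H = 20, S = 42))

fit                   # prints the Weka summary, including the cluster centers
table(fit$class_ids)  # cluster sizes; the number of rows is the chosen k
```

The chosen number of clusters always falls between `L` and `H`, so set `H` generously if you do not want to constrain the result.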

It took me a while to find such an efficient algorithm for my problem.
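As a side note on the error quoted in the question: my understanding (an assumption, since I did not trace it) is that many of the listed methods first build a full pairwise distance matrix, and for 200000 rows the length of that object overflows R's integer range. The arithmetic below illustrates this. It also shows that `kmeans()` itself never needs that matrix, so its total within-cluster sum of squares can still be computed over a range of k; the small simulated `x` is only a stand-in.

```r
n <- 200000                   # rows in the coordinate matrix from the question
n_pairs <- n * (n - 1) / 2    # pairwise distances a dist() object would store
gigabytes <- n_pairs * 8 / 1e9  # 8 bytes per double

# About 2e10 pairs (roughly 160 GB), far above .Machine$integer.max.
c(pairs = n_pairs, gigabytes = gigabytes, int_max = .Machine$integer.max)

# kmeans() works on the coordinates directly, so no distance matrix is built.
set.seed(1)
x <- matrix(rnorm(5000 * 5), ncol = 5)   # stand-in data
wss <- sapply(1:8, function(k) {
  kmeans(x, centers = k, nstart = 5, iter.max = 50)$tot.withinss
})
wss   # total within-cluster sum of squares for k = 1..8
```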
