I have data on which I ran a PCA, and I want to run k-means on the individuals' coordinates on the first 5 principal components. That gives me a 200000 x 5 matrix of coordinates. I'm looking for a way to determine the optimal number of clusters so I can run k-means on these coordinates in R. I found many methods for doing this in R (here is a list: https://stackoverflow.com/questions/15376075/cluster-analysis-in-r-determine-the-optimal-number-of-clusters). None of them has worked for me because my data is too large; I get an error like "negative length vectors are not allowed". I really need help with this, because I shouldn't be the one deciding the number of clusters: I have to let the statistics decide. Thank you very much.
Solved – Find the optimal number of clusters in large dataset using R
k-means, r
Related Solutions
I did not completely grasp question 1, but I'll attempt an answer. The plot in Q1 shows how the within sum of squares (WSS) changes as the number of clusters changes. In this kind of plot you look for kinks in the graph; a kink at 5 indicates that 5 clusters is a good choice.
WSS is related to your variables in the following sense: the formula for WSS is
$\sum_{j} \sum_{x_i \in C_j} ||x_i - \mu_j||^2$
where $\mu_j$ is the mean point of cluster $C_j$ and $x_i$ is the $i$-th observation. WSS is sometimes interpreted as a measure of how similar the points inside each cluster are, and this similarity is computed over the variables.
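As a sketch of how the WSS-versus-k "elbow" plot from Q1 is typically produced in R (the simulated data and the range of k here are made up for illustration; `tot.withinss` is the WSS defined above):

```r
set.seed(42)
x <- matrix(rnorm(1000 * 5), ncol = 5)  # stand-in for the 200000 x 5 PCA scores

# total within-cluster sum of squares for k = 1..10
wss <- sapply(1:10, function(k) kmeans(x, centers = k, nstart = 10)$tot.withinss)

plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
# look for the kink (elbow) in this curve
```

The `nstart = 10` argument reruns k-means from several random starts per k and keeps the best, which makes the curve less noisy.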
The answer to question 2 is this: what you are actually seeing in clusplot() is a plot of your observations in the principal plane. The function computes the principal component scores of each observation, plots those scores, and colours the points by cluster.
Principal component analysis (PCA) is a dimension-reduction technique; it "summarizes" the information in all variables into a few "new" variables called components. Each component is responsible for explaining a certain percentage of the total variability. In the example you read "These two components explain 73.95% of the total variability".
The function clusplot()
is useful for assessing the effectiveness of a clustering. If the clustering is successful, you will see the clusters clearly separated in the principal plane; if it is unsuccessful, the clusters will appear merged.
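A minimal example of clusplot() from the cluster package (the two-group data here is simulated purely for illustration):

```r
library(cluster)

set.seed(1)
# two well-separated groups of 50 points each
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 4), ncol = 2))
fit <- kmeans(x, centers = 2)

# project the observations onto the first two principal components
# and colour them by cluster; clearly separated ellipses suggest a good fit
clusplot(x, fit$cluster, color = TRUE, shade = TRUE, labels = 0, lines = 0)
```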
For further reference on principal component analysis you may read the Wikipedia article. If you want a book, I suggest Modern Multivariate Statistical Techniques by Izenman, which covers both PCA and k-means.
Hope this helps :)
k-means++ is not meant to improve the accuracy.
What k-means++ is meant to improve is the starting conditions: it makes k-means more likely to converge to a reasonably good local optimum, and to do so faster than with random initialization, largely by ensuring that the initial cluster centers are not too close to each other.
Still, k-means++ is randomized, and you are expected to try multiple runs and keep the best result. So you cannot expect k-means++ to produce better results than k-means; you can only expect to get them faster (in computation) and with fewer tries.
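Base R's kmeans() has no built-in k-means++ option (packages such as LICORS or flexclust provide implementations), but the seeding idea is short enough to sketch by hand; the function name `kmeanspp_centers` is my own:

```r
# k-means++ seeding: pick each new center with probability proportional
# to the squared distance from the nearest center chosen so far
kmeanspp_centers <- function(x, k) {
  n <- nrow(x)
  centers <- x[sample(n, 1), , drop = FALSE]  # first center uniformly at random
  for (i in seq_len(k - 1)) {
    # squared distance of every point to its nearest current center
    d2 <- apply(x, 1, function(p) min(colSums((t(centers) - p)^2)))
    centers <- rbind(centers, x[sample(n, 1, prob = d2), , drop = FALSE])
  }
  centers
}

set.seed(7)
x <- matrix(rnorm(500 * 2), ncol = 2)
init <- kmeanspp_centers(x, 3)
fit <- kmeans(x, centers = init)  # run k-means from the k-means++ seeds
```

Because the seeding is itself random, you would still run this several times and keep the run with the lowest `tot.withinss`.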
Best Answer
I actually solved my issue using the XMeans algorithm from the RWeka package. It is more suitable than plain k-means here: it determines the number of clusters automatically and runs much faster than the other methods. Here is a detailed mathematical description of the algorithm: https://www.cs.cmu.edu/~dpelleg/download/xmeans.pdf
And here is the package where you can find the xmeans algorithm : https://cran.r-project.org/web/packages/RWeka/RWeka.pdf
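A sketch of how this can be wired up (assumptions: a working Java installation, network access for the Weka package manager, and `coords` standing for the 200000 x 5 matrix of PCA scores from the question; XMeans ships as an optional Weka package rather than with Weka itself):

```r
library(RWeka)

# install the XMeans Weka package once via the Weka package manager
WPM("refresh-cache")
WPM("install-package", "XMeans")

# build an R interface to the Weka clusterer
XMeans <- make_Weka_clusterer("weka/clusterers/XMeans")

# L / H bound the minimum / maximum number of clusters XMeans may choose;
# coords is the 200000 x 5 matrix of coordinates on the first 5 components
fit <- XMeans(as.data.frame(coords), control = Weka_control(L = 2, H = 20))
fit$class_ids  # cluster assignment for each observation
```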
It took me a while to find such an efficient algorithm for my problem.