Solved – Cluster Analysis for large data in R

clustering, k-means, large data, r

I am trying to perform a clustering analysis on a CSV file with 50k+ rows and 10 columns. I tried k-means, hierarchical, and model-based clustering methods; only k-means completes on a data set this large. However, k-means does not show obvious differentiation between clusters, so I am wondering whether there is a better way to perform the clustering analysis.

The data looks like this:

Revenue  Employee  Longitude Latitude  LocalEmployee BooleanQuestions ...
1000     100       xxxx      xxxx      10
...                                                                   ...

Here is part of my code:

library(cluster)  # clusplot() lives here

mydata <- scale(mydata)  # standardize columns before computing distances

# Elbow plot: total within-cluster sum of squares for k = 1..15
wss <- (nrow(mydata) - 1) * sum(apply(mydata, 2, var))
for (i in 2:15) wss[i] <- sum(kmeans(mydata, centers = i, iter.max = 15)$withinss)
plot(1:15, wss, type = "b", main = "Elbow plot for up to 15 clusters",
     xlab = "Number of clusters", ylab = "Within-cluster sum of squares")

fit <- kmeans(mydata, 7)
clusplot(mydata, fit$cluster, color = TRUE, shade = TRUE, labels = 2, lines = 0)

Best Answer

Unless you have a good reason to believe that hierarchical (or other) clustering algorithms will work better for your specific application, k-means is probably a good place to start: it has computational advantages, as you have already discovered.

You didn't give much background on your data mining process, so you may have looked into some of these already, but the first things I would try are:

  • Feature selection: use your domain knowledge of the subject at hand to ensure you are including all of the attributes that are likely to be useful for your analysis, and exclude those that only add noise.
  • Dimensionality reduction: you may want to run PCA (or similar) and keep only the top few components, which are uncorrelated by construction. This can help you identify the relevant variables, avoid issues associated with the curse of dimensionality, and reduce the computation.
  • Feature normalization: you are measuring distances, so if the attributes don't share a standardized unit of measure you can get nonsensical results. For example, say revenue and employees are two of your input variables: if transforming revenue from dollars to euros changes your result, you are likely missing a step in your process. The fix is to normalize the data (e.g. z-score or min-max normalization) and cluster the transformed data.
  • Outliers: k-means can be sensitive to outliers, so validate that outliers aren't skewing your results. If they are, you may want to cap the data at minimum/maximum values and/or exclude certain cases from your analysis.
  • Different values of K: one drawback of k-means is that you must predetermine the number of clusters. You almost certainly want to experiment with several different values of K and see what works best for your application.
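The normalization, dimensionality-reduction, and multiple-K points above can be sketched in R. This is a minimal sketch, not the asker's pipeline: the random matrix stands in for the real data, and the 90% variance cutoff and the K range 2..8 are illustrative choices.

```r
set.seed(42)
mydata <- as.data.frame(matrix(rnorm(500 * 4), ncol = 4))  # stand-in for the real data

# z-score normalization so no attribute dominates the distance
scaled <- scale(mydata)

# PCA; keep the leading components explaining ~90% of the variance
pca <- prcomp(scaled)
cumvar <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
k_dims <- which(cumvar >= 0.9)[1]
reduced <- pca$x[, 1:k_dims, drop = FALSE]

# try several values of K, with multiple random starts each
wss <- sapply(2:8, function(k) kmeans(reduced, centers = k, nstart = 25)$tot.withinss)
```

Plotting `wss` against K then gives the usual elbow curve on the reduced data.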

If you are still not getting the results you want, you may also want to consider:

  • Experimenting with different distance measures.
  • Experimenting with different initial cluster centroids: the same data set can yield different results from different starting points.
  • Experimenting with other algorithms (e.g. hierarchical). Start with a random sample, and if the results are promising, work out how to scale the analysis to the full data set on your system.
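The last three points can be sketched together: hierarchical clustering on a random sample with an alternative distance, and k-means with several random starts. The simulated data, sample size, Manhattan distance, Ward linkage, and `nstart = 25` are all illustrative assumptions, not recommendations from the answer above.

```r
set.seed(1)
mydata <- matrix(rnorm(5000 * 5), ncol = 5)   # stand-in for the 50k-row data

# Hierarchical clustering is O(n^2) in memory, so cluster a manageable sample first
samp <- mydata[sample(nrow(mydata), 1000), ]
d <- dist(samp, method = "manhattan")         # a different distance measure
hc <- hclust(d, method = "ward.D2")
groups <- cutree(hc, k = 7)

# k-means depends on the starting centroids; nstart runs several random starts
# and keeps the best solution
fit <- kmeans(scale(mydata), centers = 7, nstart = 25)
```

If the sample's dendrogram shows convincing structure, that is a good sign the full-data analysis is worth scaling up.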