Solved – Cluster Analysis for large data in R

clustering, k-means, large data, r

I am trying to perform a clustering analysis on a CSV file with 50k+ rows and 10 columns. I tried k-means, hierarchical, and model-based clustering methods; only k-means completes on a data set this large. However, k-means does not show obvious differentiation between clusters, so I am wondering whether there is a better way to perform the clustering analysis.

The data looks like this:

Revenue  Employee  Longitude Latitude  LocalEmployee BooleanQuestions ...
1000     100       xxxx      xxxx      10
...                                                                   ...

Here is part of my code:

library(cluster)  # clusplot() lives here

mydata <- scale(mydata)  # standardize columns before computing distances

# Elbow plot: total within-cluster sum of squares for k = 1..15
wss <- (nrow(mydata) - 1) * sum(apply(mydata, 2, var))
for (i in 2:15) wss[i] <- sum(kmeans(mydata, centers = i, iter.max = 15)$withinss)
plot(1:15, wss, type = "b", main = "Elbow plot for up to 15 clusters",
     xlab = "Number of clusters", ylab = "Within-cluster sum of squares")

fit <- kmeans(mydata, 7)
clusplot(mydata, fit$cluster, color = TRUE, shade = TRUE, labels = 2, lines = 0)

Best Answer

Unless you have a good reason to believe that hierarchical (or other) clustering algorithms will work better for your specific application, k-means is probably a good place to start: it has computational advantages, as you have already discovered.

You didn't give much background on your data mining process, so you may have looked into some of these already, but the first things I would try are:

  • Feature selection: use your domain knowledge of the subject at hand to ensure you are including all of the attributes that are likely to be useful for your analysis, and exclude those that only add noise.
  • Dimensionality reduction: you may want to run PCA (or similar) and keep only the top few components, which are uncorrelated by construction. This can help you identify the relevant variables, avoid issues associated with the curse of dimensionality, and reduce the computation.
  • Feature normalization: you are measuring distances, so if the attributes don't share a standardized unit of measure you can get nonsensical results. For example, say revenue and employees are two of your input variables: if transforming revenue from dollars to euros changes your result, you are likely missing a step in your process. The fix is to normalize the data (e.g. z-score or min-max normalization) and cluster the transformed data.
  • Outliers: k-means can be sensitive to outliers, so validate that outliers aren't skewing your results. If they are, you may want to cap the data at minimum/maximum values and/or exclude certain cases from your analysis.
  • Different values of K: one drawback of k-means is that you must predetermine the number of clusters. You almost certainly want to experiment with several different values of K and see what works best for your application.
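The normalization, dimensionality-reduction, and multiple-K points above can be sketched in R. This is a minimal sketch, not the asker's pipeline: the random matrix stands in for the real data, and the 90% variance cutoff and the K range 2..8 are illustrative choices.

```r
set.seed(42)
mydata <- as.data.frame(matrix(rnorm(500 * 4), ncol = 4))  # stand-in for the real data

# z-score normalization so no attribute dominates the distance
scaled <- scale(mydata)

# PCA; keep the leading components explaining ~90% of the variance
pca <- prcomp(scaled)
cumvar <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
k_dims <- which(cumvar >= 0.9)[1]
reduced <- pca$x[, 1:k_dims, drop = FALSE]

# try several values of K, with multiple random starts each
wss <- sapply(2:8, function(k) kmeans(reduced, centers = k, nstart = 25)$tot.withinss)
```

Plotting `wss` against K then gives the usual elbow curve on the reduced data.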

If you are still not getting the results you want, you may also want to consider:

  • Experimenting with different distance measures.
  • Experimenting with different initial cluster centroids: the same data set can yield different results from different starting points.
  • Experimenting with other algorithms (e.g. hierarchical). Start with a random sample, and if the results are promising, work out how to scale the analysis to the full data set on your system.
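The last three points can be sketched together: hierarchical clustering on a random sample with an alternative distance, and k-means with several random starts. The simulated data, sample size, Manhattan distance, Ward linkage, and `nstart = 25` are all illustrative assumptions, not recommendations from the answer above.

```r
set.seed(1)
mydata <- matrix(rnorm(5000 * 5), ncol = 5)   # stand-in for the 50k-row data

# Hierarchical clustering is O(n^2) in memory, so cluster a manageable sample first
samp <- mydata[sample(nrow(mydata), 1000), ]
d <- dist(samp, method = "manhattan")         # a different distance measure
hc <- hclust(d, method = "ward.D2")
groups <- cutree(hc, k = 7)

# k-means depends on the starting centroids; nstart runs several random starts
# and keeps the best solution
fit <- kmeans(scale(mydata), centers = 7, nstart = 25)
```

If the sample's dendrogram shows convincing structure, that is a good sign the full-data analysis is worth scaling up.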