I am trying to perform a clustering analysis on a CSV file with 50k+ rows and 10 columns. I tried k-means, hierarchical, and model-based clustering. Only k-means runs, because the data set is so large. However, k-means does not show obvious differentiation between clusters. Is there a better way to perform the clustering analysis?
The data looks like this:
Revenue Employee Longitude Latitude LocalEmployee BooleanQuestions ...
1000 100 xxxx xxxx 10
... ...
Here is part of my code:
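library(cluster)  # provides clusplot()
mydata <- scale(mydata)  # standardize each column to mean 0, sd 1
# within-cluster sum of squares for k = 1, then for k = 2..15 (elbow plot)
wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))
for (i in 2:15) wss[i] <- sum(kmeans(mydata, centers=i, iter.max=15)$withinss)
plot(1:15, wss, type="b", main="15 clusters", xlab="no. of clusters", ylab="within-cluster sum of squares")
fit <- kmeans(mydata, 7)
clusplot(mydata, fit$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)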
Best Answer
Unless you have a good reason to believe that hierarchical (or other) clustering algorithms will work better for your specific application, k-means is probably a good place to start, as it has computational advantages (as you have already discovered).
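To put a rough number on that computational advantage, here is a back-of-the-envelope sketch. It assumes the hierarchical run went through hclust() on a dist() object, which stores one double per pair of rows, and it takes the row and column counts from the question. The helper dist_gib is a made-up name for illustration only.

```r
# Rough memory needed by dist() for n rows: one double (8 bytes) per pair of rows.
# dist_gib is a hypothetical helper, not part of any package.
dist_gib <- function(n) {
  n_pairs <- n * (n - 1) / 2   # entries in the lower triangle that dist() stores
  n_pairs * 8 / 1024^3         # bytes -> GiB
}

n <- 50000   # row count from the question ("50k+")
p <- 10      # column count from the question

dist_gib(n)          # pairwise distances for hclust(): about 9.3 GiB
n * p * 8 / 1024^2   # the data matrix itself, which is what kmeans() works on: about 3.8 MiB
```

The distance matrix grows with the square of the number of rows, so if you want to try hierarchical clustering anyway, running it on a random sample of rows is probably the only practical way on a typical machine.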
You didn't give much background on what you have done from a data mining process standpoint, so you may have looked into these things already, but the first things I would try are:
If you still are not getting the results you want, then you may also want to consider: