Solved – How to test whether the clustering of binary data is significant

binary data, clustering, statistical significance

I'm doing shopping cart analysis. My dataset is a set of transaction vectors, where the items are the products being bought.

When applying k-means on the transactions, I will always get some result. A random matrix would probably also show some clusters.

Is there a way to test whether the clustering I find is significant, or whether it could just as well be a coincidence? If so, how can I do it?

Best Answer

Regarding shopping cart analysis, I think the main objective is to identify the most frequent combinations of products bought by the customers. Association rules are the most natural methodology here (indeed, they were developed for this purpose). Analysing the combinations of products bought by the customers, and the number of times these combinations are repeated, leads to a rule of the type ‘if condition, then result’ with a corresponding interestingness measure. You may also consider log-linear models in order to investigate the associations between the variables.
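To make the ‘if condition, then result’ idea concrete, here is a minimal sketch that computes three commonly used interestingness measures (support, confidence and lift) for a single rule. The function name `rule_metrics` and the toy baskets are made up for illustration; a real analysis would use a dedicated association-rule implementation.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence and lift for the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n = len(transactions)
    # Count baskets containing the condition, the result, and both together
    n_a = sum(1 for t in transactions if antecedent <= set(t))
    n_c = sum(1 for t in transactions if consequent <= set(t))
    n_ac = sum(1 for t in transactions if (antecedent | consequent) <= set(t))
    support = n_ac / n                              # P(condition and result)
    confidence = n_ac / n_a if n_a else 0.0         # P(result | condition)
    lift = confidence / (n_c / n) if n_c else 0.0   # confidence relative to P(result)
    return support, confidence, lift


# Hypothetical toy data: each basket is the set of products in one transaction
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
]
support, confidence, lift = rule_metrics(baskets, {"bread"}, {"milk"})
print(support, confidence, lift)
```

A lift close to 1 means the two sides of the rule co-occur about as often as they would if they were independent, so a rule is usually only interesting when its lift is clearly above (or below) 1.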

Now, as for clustering, here is some information that may come in handy:

First, consider variable clustering. Variable clustering is used for assessing collinearity and redundancy, and for separating variables into clusters that can each be scored as a single variable, thus resulting in data reduction. Look for the varclus function (package Hmisc in R).

Assessment of the clusterwise stability: function clusterboot {R package fpc}

Distance based statistics for cluster validation: function cluster.stats {R package fpc}

As mbq has mentioned, use the silhouette widths for assessing the best number of clusters. Watch this. Regarding silhouette widths, see also the optsil function.
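For reference, the silhouette width of a point compares its average distance to its own cluster (a) with its average distance to the nearest other cluster (b), as (b - a) / max(a, b). The following is a minimal standard-library sketch, not the R implementation; `silhouette_widths`, `mismatch_fraction` and the toy data are illustrative names and assumptions.

```python
def mismatch_fraction(a, b):
    """Fraction of positions where two binary vectors differ."""
    return sum(1 for x, y in zip(a, b) if x != y) / len(a)


def silhouette_widths(points, labels, dist):
    """Silhouette width of every point for a given cluster labelling."""
    widths = []
    for i in range(len(points)):
        # Distances from point i to every other point, grouped by cluster label
        by_cluster = {}
        for j in range(len(points)):
            if j != i:
                by_cluster.setdefault(labels[j], []).append(dist(points[i], points[j]))
        own = by_cluster.pop(labels[i], [])
        if not own or not by_cluster:
            widths.append(0.0)  # singleton cluster, or only one cluster in total
            continue
        a = sum(own) / len(own)                                   # cohesion
        b = min(sum(d) / len(d) for d in by_cluster.values())     # separation
        widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return widths


# Hypothetical toy transactions: 1 means the product is in the basket
points = [[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1], [0, 0, 0, 1]]
good = silhouette_widths(points, [0, 0, 1, 1], mismatch_fraction)
bad = silhouette_widths(points, [0, 1, 0, 1], mismatch_fraction)
avg_good = sum(good) / len(good)
avg_bad = sum(bad) / len(bad)
print(avg_good, avg_bad)
```

Averaging the widths over all points gives one score per candidate number of clusters; values near 1 suggest compact, well-separated clusters, while values near 0 or below suggest the partition is not supported by the data.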

Estimate the number of clusters in a data set via the gap statistic
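The gap statistic speaks directly to the "would a random matrix cluster just as well?" worry: it compares the log within-cluster dispersion of the real data with its average over reference data sets that have no cluster structure. The sketch below is a simplified variant under stated assumptions. The reference sets are built by shuffling each item column independently (which keeps item frequencies but destroys co-occurrence) rather than by sampling uniformly over the data range, and the small k-means helper is written by hand for self-containment. All function names are hypothetical.

```python
import math
import random


def kmeans_wss(data, k, rng, n_iter=20, n_starts=5):
    """Smallest within-cluster sum of squares over a few plain k-means runs."""
    best = float("inf")
    for _ in range(n_starts):
        centers = [list(row) for row in rng.sample(data, k)]
        for _ in range(n_iter):
            groups = [[] for _ in range(k)]
            for row in data:
                d2 = [sum((x - c) ** 2 for x, c in zip(row, ctr)) for ctr in centers]
                groups[d2.index(min(d2))].append(row)
            for c, members in enumerate(groups):
                if members:  # an empty cluster keeps its previous centre
                    centers[c] = [sum(col) / len(members) for col in zip(*members)]
        wss = sum(
            min(sum((x - c) ** 2 for x, c in zip(row, ctr)) for ctr in centers)
            for row in data
        )
        best = min(best, wss)
    return best


def gap_statistic(data, k, n_refs=20, seed=0):
    """Mean log-dispersion of shuffled reference data minus that of the real data."""
    rng = random.Random(seed)
    # The 1e-12 floor only guards against log(0) for perfectly tight clusters
    log_w = math.log(max(kmeans_wss(data, k, rng), 1e-12))
    columns = [list(col) for col in zip(*data)]
    ref_logs = []
    for _ in range(n_refs):
        shuffled = []
        for col in columns:
            col = col[:]
            rng.shuffle(col)  # break the association between items
            shuffled.append(col)
        ref = [list(row) for row in zip(*shuffled)]
        ref_logs.append(math.log(max(kmeans_wss(ref, k, rng), 1e-12)))
    return sum(ref_logs) / n_refs - log_w


# Hypothetical toy data: two groups of shoppers with slightly noisy baskets
group_a = [[1, 1, 1, 0, 0, 0]] * 8 + [[1, 1, 0, 0, 0, 0], [1, 1, 1, 1, 0, 0]]
group_b = [[0, 0, 0, 1, 1, 1]] * 8 + [[0, 0, 0, 0, 1, 1], [0, 1, 0, 1, 1, 1]]
data = group_a + group_b
gap_2 = gap_statistic(data, k=2)
print(gap_2)
```

A clearly positive gap means the real data are much tighter around their cluster centres than structure-free data with the same item frequencies. Computing the gap for several values of k, together with the spread of the reference values, is how the number of clusters is usually chosen.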

For calculating dissimilarity indices and distance measures, see dsvdis and vegdist.
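For binary transaction vectors, the Jaccard distance is one common choice of dissimilarity, because it ignores products that neither basket contains. As a rough illustration in plain Python (the helper names are hypothetical and this is not the code of the R functions mentioned above):

```python
def jaccard_distance(a, b):
    """1 - |items in both| / |items in either| for two binary vectors."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    # Two empty baskets are treated as identical (an assumption of this sketch)
    return 0.0 if either == 0 else 1.0 - both / either


def distance_matrix(rows):
    """Full symmetric matrix of pairwise Jaccard distances."""
    return [[jaccard_distance(r, s) for s in rows] for r in rows]


# Hypothetical toy transactions: 1 means the product is in the basket
baskets = [[1, 1, 0, 0], [1, 0, 1, 0], [0, 0, 1, 1]]
dm = distance_matrix(baskets)
for row in dm:
    print(row)
```

The resulting matrix can be passed to any distance-based clustering or validation method in place of Euclidean distances, which tend to be dominated by the many shared zeros in sparse baskets.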

The EM clustering algorithm can decide how many clusters to create by cross-validation (if you can't specify a priori how many clusters to generate). Although the EM algorithm is guaranteed to converge to a maximum, this is a local maximum and may not be the same as the global maximum. For a better chance of obtaining the global maximum, the whole procedure should be repeated several times, with different initial guesses for the parameter values. The overall log-likelihood can be used to compare the different final configurations obtained: just choose the largest of the local maxima. You can find an implementation of the EM clusterer in the open-source project WEKA.
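The restart strategy can be sketched as follows for binary data, using a mixture of independent Bernoulli items fitted by EM. This is an illustrative assumption, not WEKA's implementation: the function names are hypothetical, the M-step is lightly smoothed to keep probabilities away from 0 and 1, and the number of clusters is fixed rather than chosen by cross-validation.

```python
import math
import random


def fit_bernoulli_mixture(data, k, rng, n_iter=50):
    """One EM run from a random start; returns (log_lik, weights, probs)."""
    n, d = len(data), len(data[0])
    weights = [1.0 / k] * k
    probs = [[rng.uniform(0.25, 0.75) for _ in range(d)] for _ in range(k)]
    log_lik = float("-inf")
    for _ in range(n_iter):
        # E-step: responsibilities, plus the log-likelihood of the current parameters
        resp, log_lik = [], 0.0
        for x in data:
            logs = []
            for c in range(k):
                lp = math.log(weights[c])
                for xj, p in zip(x, probs[c]):
                    lp += math.log(p if xj else 1.0 - p)
                logs.append(lp)
            m = max(logs)
            total = sum(math.exp(v - m) for v in logs)
            log_lik += m + math.log(total)
            resp.append([math.exp(v - m) / total for v in logs])
        # M-step with light smoothing (an assumption of this sketch)
        for c in range(k):
            nc = sum(r[c] for r in resp)
            weights[c] = (nc + 1e-6) / (n + k * 1e-6)
            for j in range(d):
                num = sum(r[c] * x[j] for r, x in zip(resp, data))
                probs[c][j] = (num + 0.5) / (nc + 1.0)
    # log_lik refers to the parameters used in the final E-step
    return log_lik, weights, probs


def best_of_restarts(data, k, n_restarts=10, seed=0):
    """Repeat EM from different random starts and keep the highest log-likelihood."""
    rng = random.Random(seed)
    runs = [fit_bernoulli_mixture(data, k, rng) for _ in range(n_restarts)]
    return max(runs, key=lambda run: run[0]), [run[0] for run in runs]


# Hypothetical toy transactions: two kinds of baskets plus two mixed ones
data = [[1, 1, 0, 0]] * 5 + [[0, 0, 1, 1]] * 5 + [[1, 0, 1, 0], [0, 1, 0, 1]]
(best_ll, best_weights, best_probs), all_lls = best_of_restarts(data, k=2)
print(best_ll, all_lls)
```

If the log-likelihoods in `all_lls` differ noticeably between restarts, the runs are ending in different local maxima, which is exactly the situation where keeping only a single run would be risky.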

This is also an interesting link.

Also search here for Finding the Right Number of Clusters in k-Means and EM Clustering: v-Fold Cross-Validation

Finally, you may explore clustering results using clusterfly
