Regarding shopping-cart analysis, I think the main objective is to identify the most frequent combinations of products bought by customers. Association rules are the most natural methodology here (indeed, they were developed for exactly this purpose). Analysing the combinations of products bought by customers, and the number of times these combinations are repeated, yields rules of the form 'if condition, then result', each with a corresponding interestingness measure. You may also consider log-linear models to investigate the associations between the variables involved.
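To make the idea concrete, here is a minimal Python sketch (the basket data and thresholds are invented for illustration) that counts item pairs and derives 'if condition, then result' rules with their support and confidence:

```python
from itertools import combinations
from collections import Counter

# Hypothetical transaction data: each basket is a set of product names.
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]
n = len(baskets)

# Count support for single items and for item pairs.
item_counts = Counter(item for b in baskets for item in b)
pair_counts = Counter(pair for b in baskets for pair in combinations(sorted(b), 2))

# Derive rules "if A then B" meeting minimum support and confidence.
min_support, min_confidence = 0.4, 0.6
rules = []
for (a, b), count in pair_counts.items():
    support = count / n
    if support < min_support:
        continue
    for antecedent, consequent in ((a, b), (b, a)):
        confidence = count / item_counts[antecedent]
        if confidence >= min_confidence:
            rules.append((antecedent, consequent, support, confidence))

for ante, cons, sup, conf in sorted(rules, key=lambda r: -r[3]):
    print(f"if {ante} then {cons}: support={sup:.2f}, confidence={conf:.2f}")
```

Real analyses would use a dedicated implementation (e.g. the arules package in R), but the counting logic is essentially this.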
Now, as for clustering, here is some information that may come in handy:
First, consider variable clustering. Variable clustering is used for assessing collinearity and redundancy, and for separating variables into clusters that can be scored as a single variable, thus achieving data reduction. Look at the varclus function (package Hmisc in R).
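As a rough illustration of the idea behind variable clustering (not the actual varclus algorithm, which uses more refined similarity measures), here is a Python sketch that greedily groups variables by pairwise correlation; the data matrix and threshold are hypothetical:

```python
from math import sqrt

# Hypothetical data: rows are observations, columns are variables.
# x2 is roughly 2 * x1 (redundant); x3 is unrelated.
data = [
    [1.0, 2.1, 5.0],
    [2.0, 3.9, 1.0],
    [3.0, 6.2, 4.0],
    [4.0, 8.1, 2.0],
    [5.0, 9.8, 5.0],
]
names = ["x1", "x2", "x3"]

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

cols = list(zip(*data))

# Greedy single-linkage grouping: a variable joins a cluster if its
# absolute correlation with any member exceeds the threshold.
threshold = 0.9
clusters = []
for i, name in enumerate(names):
    for cluster in clusters:
        if any(abs(pearson(cols[i], cols[j])) > threshold for j in cluster):
            cluster.add(i)
            break
    else:
        clusters.append({i})

print([{names[i] for i in c} for c in clusters])  # x1 and x2 group together
```

Each resulting cluster could then be scored as a single variable (e.g. by its first principal component), which is the data-reduction step mentioned above.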
Assessment of clusterwise stability: function clusterboot (R package fpc).
Distance-based statistics for cluster validation: function cluster.stats (R package fpc).
As mbq has mentioned, use silhouette widths to assess the best number of clusters. Watch this. Regarding silhouette widths, see also the optsil function.
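If you want to see what the silhouette width actually computes, here is a pure-Python sketch on invented 1-D data: for each point, a is the mean distance to its own cluster, b the mean distance to the nearest other cluster, and the silhouette is (b - a) / max(a, b):

```python
# Invented 1-D data with two obvious groups.
points = [1.0, 1.2, 0.8, 8.0, 8.3, 7.9]

def mean_dist(p, group):
    return sum(abs(p - q) for q in group) / len(group)

def avg_silhouette(points, labels):
    n = len(points)
    total = 0.0
    for i in range(n):
        same = [points[j] for j in range(n) if j != i and labels[j] == labels[i]]
        if not same:  # singleton cluster: silhouette conventionally 0
            continue
        a = mean_dist(points[i], same)  # mean intra-cluster distance
        b = min(                        # mean distance to nearest other cluster
            mean_dist(points[i], [points[j] for j in range(n) if labels[j] == l])
            for l in set(labels) if l != labels[i]
        )
        total += (b - a) / max(a, b)
    return total / n

good = [0, 0, 0, 1, 1, 1]  # respects the visible gap in the data
bad = [0, 1, 0, 1, 0, 1]   # ignores it
print(round(avg_silhouette(points, good), 2))  # close to 1
print(round(avg_silhouette(points, bad), 2))   # much lower
```

Choosing the number of clusters then amounts to picking the clustering whose average silhouette width is largest.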
Estimate the number of clusters in a data set via the gap statistic.
For calculating dissimilarity indices and distance measures, see dsvdis and vegdist.
The EM clustering algorithm can decide how many clusters to create by cross-validation (if you cannot specify a priori how many clusters to generate). Although the EM algorithm is guaranteed to converge to a maximum, this is a local maximum and may not coincide with the global maximum. For a better chance of obtaining the global maximum, the whole procedure should be repeated several times with different initial guesses for the parameter values. The overall log-likelihood can then be used to compare the different final configurations obtained: just choose the largest of the local maxima.
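The restart strategy can be sketched as follows (Python, invented 1-D data, two Gaussian components, hand-picked initial means; a real analysis would use an established implementation):

```python
import math

# Invented 1-D data with two modes, near 1 and near 5.
data = [1.1, 0.9, 1.3, 0.7, 1.0, 5.0, 5.2, 4.8, 5.1, 4.9]

def norm_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_fit(data, init_mu, n_iter=100):
    mu, var, w = list(init_mu), [1.0, 1.0], [0.5, 0.5]
    loglik = float("-inf")
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        resp, loglik = [], 0.0
        for x in data:
            p = [w[k] * norm_pdf(x, mu[k], var[k]) for k in range(2)]
            total = sum(p)
            loglik += math.log(total)
            resp.append([pk / total for pk in p])
        # M-step: re-estimate weights, means, and variances.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            w[k] = nk / len(data)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var[k] = max(sum(r[k] * (x - mu[k]) ** 2
                             for r, x in zip(resp, data)) / nk, 1e-6)
    return loglik, mu

# Several initial guesses for the two means; some deliberately poor
# (both in the same mode), one with one mean per mode.
inits = [(1.0, 1.2), (5.0, 5.2), (0.7, 5.1)]
runs = [em_fit(data, init) for init in inits]

# Keep the final configuration with the largest log-likelihood.
best_loglik, best_mu = max(runs, key=lambda r: r[0])
print(sorted(round(m, 1) for m in best_mu))  # → [1.0, 5.0]
```

Comparing the final log-likelihoods across restarts and keeping the largest is exactly the selection rule described above.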
You can find an implementation of the EM clusterer in the open-source project WEKA
This is also an interesting link.
Also search here for Finding the Right Number of Clusters in k-Means and EM Clustering: v-Fold Cross-Validation
Finally, you may explore clustering results using clusterfly
It is hard to answer your question without knowledge of how many samples you have and how many features you want, but here is a quick and dirty solution that may work.
Draw random pairs of samples from your set and compute a derived feature vector in {-1, 1}^200, with +1 in positions where the two samples are the same and -1 where the two samples are different. Assign a label +1 if the two samples are from the same cluster and -1 if they are from different clusters. Keep drawing pairs of samples until you have a sizable number. You will now have a labeled data set of training examples.
Now run a feature selection algorithm for classification (of which there are many) on this classification problem. You might start with a simple method like using lars to fit a regression model and using the indices of the non-zero coefficients to pick your features.
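Here is a sketch of the pairwise construction in Python, with an invented toy data set; a simple univariate agreement score stands in for a sparse model such as lars:

```python
from itertools import combinations

# Hypothetical set-up: 6 binary samples with 6 features each; features 0
# and 1 track the cluster exactly, features 2-5 are noise.
samples = [
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0, 1],
    [1, 1, 0, 0, 1, 0],
    [1, 1, 1, 0, 0, 1],
    [1, 1, 0, 1, 1, 1],
]
clusters = [0, 0, 0, 1, 1, 1]

# Derived data set: +1 where a pair of samples agrees on a feature, -1 where
# it differs; label +1 if the pair is from the same cluster, -1 otherwise.
X, y = [], []
for i, j in combinations(range(len(samples)), 2):
    X.append([1 if a == b else -1 for a, b in zip(samples[i], samples[j])])
    y.append(1 if clusters[i] == clusters[j] else -1)

# Stand-in for a sparse model such as lars: score each derived feature by how
# well its agreement pattern matches the labels, and keep the top scorers.
scores = [
    abs(sum(row[f] * label for row, label in zip(X, y))) / len(X)
    for f in range(len(X[0]))
]
selected = sorted(range(len(scores)), key=lambda f: -scores[f])[:2]
print(sorted(selected))  # → [0, 1]: the informative features rank highest
```

With real data you would plug the derived (X, y) into a proper feature selection method; the point here is only how the pairwise labeled data set is built.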
Best Answer
Latent class analysis is one possible approach.
Take the following probability distribution where A, B, and C can take on values of 1 or 0.
$P(A_i, B_j, C_k)$
If these were independent of each other, then we would expect to see:
$P(A_i, B_j, C_k)=P(A_i)P(B_j)P(C_k)$
Once this possibility is eliminated, we might hypothesize that any observed dependency is due to values clustering within otherwise unobserved subgroups. To test this idea, we can estimate the following model:
$P(A_i, B_j, C_k)=P(X_n)P(A_i|X_n)P(B_j|X_n)P(C_k|X_n)$
Where $X$ is a latent categorical variable with $n$ levels. You specify $n$, and the model parameters (marginal probabilities of class membership, and class-specific probabilities for each variable) can be estimated via expectation-maximization.
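As a sketch of how that EM estimation works, here is a pure-Python fit of a two-class latent class model to invented binary data for A, B, and C (a real analysis would use dedicated software such as the R package poLCA):

```python
import math

# Invented binary responses for A, B, C: one latent group mostly answers 1,
# the other mostly answers 0.
data = [
    (1, 1, 1), (1, 1, 0), (1, 0, 1), (1, 1, 1), (0, 1, 1),
    (0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 0, 0), (1, 0, 0),
]

n_classes = 2
weights = [0.5, 0.5]                        # P(X = n), to be estimated
probs = [[0.8, 0.8, 0.8], [0.2, 0.2, 0.2]]  # P(item = 1 | X = n), initial guess

for _ in range(200):
    # E-step: posterior probability of class membership for each observation,
    # using the conditional-independence structure P(A,B,C|X) = P(A|X)P(B|X)P(C|X).
    post = []
    for row in data:
        joint = [
            weights[c] * math.prod(
                probs[c][v] if x == 1 else 1 - probs[c][v]
                for v, x in enumerate(row)
            )
            for c in range(n_classes)
        ]
        total = sum(joint)
        post.append([j / total for j in joint])
    # M-step: re-estimate class weights and class-specific item probabilities.
    for c in range(n_classes):
        nc = sum(p[c] for p in post)
        weights[c] = nc / len(data)
        for v in range(3):
            probs[c][v] = sum(p[c] * row[v] for p, row in zip(post, data)) / nc

print([round(w, 2) for w in weights])
print([[round(p, 2) for p in row] for row in probs])
```

The posterior class memberships computed in the E-step are the same quantities used to assess classification quality once the model is fit.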
In practice, you could estimate several models, with $5 \le n \le 10$, and "choose" the best model based on theory, likelihood based fit indices, and classification quality (which can be assessed by calculating posterior probabilities of class membership for the observations).
However, trying to identify meaningful patterns in 100 variables with 5-10 groups will likely require reducing that list down prior to estimating the model, which is a tricky enough topic in its own right (REF).