Solved – How to select the best number of clusters in cluster analysis in SPSS

clustering, spss

When I used SAS for cluster analysis, I relied on plots of the CCC, pseudo F, and pseudo T^2 indices to help determine the best number of clusters. I'm not sure how to do this in SPSS, which I'm not familiar with.

In SPSS, there is TwoStep clustering, which can help determine the number of clusters (it first performs a hierarchical method to define the number of clusters). Are there other, better methods or indicators for the number of clusters in SPSS?

Best Answer

Bear in mind that CV (Cross Validated) is not intended to be a resource for software-specific questions. That said, I don't know SPSS either, but having done my share of clustering, I may still be able to provide some useful, general guidelines. As with all unsupervised, exploratory methods, there is typically no "ground truth" against which to validate the results. Use statistical metrics and common sense to derive solutions that are actionable.

As I understand it, the two-step process generates seeds in step one for input into the second, k-means step. Does SPSS provide any options for filtering those seeds? For instance, being able to set a minimum seed size would eliminate outlier or splinter seeds and help to stabilize the results.

Next, I've found that playing around with the number of input variables used by the cluster algorithm can be hugely important in generating useful results. Since k-means assumes continuous inputs and minimizes a least-squares criterion (so it is not scale invariant), it is typically a good idea to pass the raw features through a PCA or EFA to reduce redundancy and smooth the information. Then, adjusting the number of components used, by eliminating the factors with the smaller eigenvalues, can sometimes clarify the resulting partitions. For instance, if your EFA returns 8 factors and you don't get useful results running the algorithm on all of them, try dropping the factors with the smallest eigenvalues.
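As a rough illustration of that reduction step outside SPSS, here is a minimal Python sketch using NumPy only. The helper name `pca_scores`, the simulated data, and the choice to keep two components are all assumptions made for the example, not anything SPSS does for you. It standardizes the columns, runs a PCA via the SVD, and returns the component scores you would hand to the clustering step, along with the eigenvalues you would inspect when deciding what to drop.

```python
import numpy as np

def pca_scores(X, n_components):
    """Standardize columns, then project onto the leading principal components.

    Returns (scores, eigenvalues); eigenvalues are those of the correlation
    matrix, sorted from largest to smallest.
    """
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # SVD of the standardized data: rows of Vt are the component directions.
    _, s, Vt = np.linalg.svd(Z, full_matrices=False)
    eigenvalues = s ** 2 / (len(Z) - 1)
    scores = Z @ Vt[:n_components].T
    return scores, eigenvalues

# Simulated data (hypothetical): six redundant features on very different
# scales, driven by only two underlying dimensions plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
scales = np.array([1.0, 10.0, 100.0])
X = np.hstack([latent[:, [0]] * scales, latent[:, [1]] * scales])
X = X + rng.normal(scale=0.1, size=X.shape) * X.std(axis=0)

scores, eigenvalues = pca_scores(X, n_components=2)
print(np.round(eigenvalues, 2))  # inspect before deciding how many to keep
print(scores.shape)              # these scores are the input to clustering
```

Rerunning the clustering on `scores` with different values of `n_components` is the "adjusting the number of components" experiment described above.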

Evaluate the lumpiness of the solution, i.e., how the observations are distributed across the clusters. For instance, a solution in which one cluster contains much more than 40% of your data is probably not giving good results.
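The lumpiness check is easy to script once you have exported the cluster membership column. A small sketch in plain Python follows; the function names and the 2% floor for "splinter" clusters are my own assumptions, and only the 40% ceiling comes from the rule of thumb above.

```python
from collections import Counter

def cluster_shares(labels):
    """Return each cluster's share of the observations, keyed by cluster label."""
    counts = Counter(labels)
    n = len(labels)
    return {c: counts[c] / n for c in sorted(counts)}

def flag_lumpy(labels, max_share=0.40, min_share=0.02):
    """Flag clusters that hold too much or too little of the data.

    max_share follows the 40% rule of thumb; min_share is an assumed
    threshold for splinter clusters and should be tuned to the problem.
    """
    shares = cluster_shares(labels)
    return {
        "too_large": [c for c, s in shares.items() if s > max_share],
        "too_small": [c for c, s in shares.items() if s < min_share],
    }

# Hypothetical membership vector for a 4-cluster solution on 100 cases.
labels = [0] * 55 + [1] * 30 + [2] * 14 + [3] * 1
report = flag_lumpy(labels)
print(cluster_shares(labels))
print(report)
```

Here cluster 0 is flagged as too large and cluster 3 as a splinter, which would be a reason to revisit the inputs or the number of clusters.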

If SPSS provides some sort of summary metric such as a pseudo-R-squared, then run a series of solutions on the same inputs, requesting sequential numbers of clusters, e.g., 3 to 30. Find the inflection point at which the summary metric "rolls over" and stops growing from one solution to the next. Use that as a starting point for a deeper dive.
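If the metrics are not available in SPSS, the same sweep can be sketched in Python with scikit-learn. This is only an illustration on simulated data: the three-blob dataset, the range of k, and the use of k-means are my assumptions. The pseudo-R-squared is computed as one minus the within-cluster sum of squares over the total sum of squares, and the pseudo F is the Calinski-Harabasz index.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

# Simulated data (hypothetical): three well-separated groups in two dimensions.
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [8, 0], [0, 8]],
    cluster_std=1.0,
    random_state=0,
)
total_ss = ((X - X.mean(axis=0)) ** 2).sum()

results = {}
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    pseudo_r2 = 1.0 - km.inertia_ / total_ss  # share of variance explained
    pseudo_f = calinski_harabasz_score(X, km.labels_)
    results[k] = (pseudo_r2, pseudo_f)

for k, (pseudo_r2, pseudo_f) in results.items():
    print(f"k={k}  pseudo-R2={pseudo_r2:.3f}  pseudo-F={pseudo_f:.1f}")

# The pseudo F usually peaks near a sensible k; the pseudo-R2 keeps creeping
# up, so look for where its gains level off rather than for its maximum.
best_k = max(results, key=lambda k: results[k][1])
print("pseudo-F peaks at k =", best_k)
```

On real data the peak is rarely this clean, so treat the candidate k as a starting point, as described above, rather than as the answer.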

At this point, you can triangulate to a solution you like. Once you've got one, try to validate it. There are a number of ways of doing this. For instance, you want solutions that "replicate." One method is to split the data into train and test sets and see whether the clusters are recoverable, judged by the misclassification error rate on the held-out data. There are different rules of thumb about this error rate; for weakly predictive models, some are as lenient as accepting anything slightly better than random assignment (50% error with two equally sized clusters, roughly 1 - 1/k with k of them). Another approach is to judge whether the resulting segments "feel" real, i.e., whether they are representative, modal profiles of the space being clustered. Of course, this is a highly subjective and qualitative decision.
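One way to make the replication check concrete is sketched below in Python with scikit-learn and SciPy. It is one possible reading of the idea, not a standard procedure: cluster the two halves separately, assign the test cases to the nearest training centroid, and count how often the two labelings of the test cases disagree. Because cluster numbers are arbitrary, the labels are first matched with the Hungarian algorithm. The simulated data, the 50/50 split, and k = 3 are assumptions for the example.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Simulated data (hypothetical): three well-separated groups.
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [8, 0], [0, 8]],
    cluster_std=1.0,
    random_state=0,
)
X_train, X_test = train_test_split(X, test_size=0.5, random_state=0)

k = 3
km_train = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_train)
km_test = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X_test)

# Two labelings of the same test cases: one from the training centroids,
# one from clustering the test half on its own.
from_train = km_train.predict(X_test)
own = km_test.labels_

# Match the arbitrary cluster numbers so that agreement is maximized.
cm = confusion_matrix(own, from_train, labels=list(range(k)))
rows, cols = linear_sum_assignment(-cm)
error_rate = 1.0 - cm[rows, cols].sum() / cm.sum()
print(f"misclassification rate after matching: {error_rate:.3f}")
```

An error rate near 0 suggests the structure replicates across halves, while a rate near 1 - 1/k is about what random assignment would give.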