Cluster analysis after factor analysis: do I need to use all factors for cluster analysis?

clustering, factor analysis, hierarchical clustering

I have a 127-question survey with 6-level Likert-type answers. Using EFA, I kept 56 items and obtained 8 factors. Using CFA (on a sample not used in the EFA), I confirmed these factors. So far, so good.

When I tried cluster analysis with all 8 factors, I did not get a clear solution. (I used SAS and judged the number of clusters by the CCC, i.e. the Cubic Clustering Criterion, and by the pseudo F and pseudo T statistics.)

When I used 7 factors, I got a clear 3-cluster solution. All three indicators (CCC, pseudo F and pseudo T statistics) suggested 3 clusters, and further analysis with 3 clusters looks very reasonable to us.

My question is: must I use all 8 factors from the EFA/CFA for the cluster analysis?

If I must use all factors, what can I do if the analysis does not suggest a clear number of clusters? There does not seem to be much to tune.

If I use 7 of the 8 factors from the EFA, will this be a problem, and might reviewers question it?

Best Answer

The quick answer is "no," you do not need to use all of the factors. More specifically, there is no "rule" or law about what you eventually use in creating a cluster solution. Moreover, each of the factors is an expression of all of the input variables, so dropping one factor does not discard any survey item entirely.

Having done many cluster solutions over the years that were based on SAS PROC FASTCLUS, I found three input options to be important:

  • The "drift" option

  • The "delete=" option, which sets a floor on how many observations a cluster seed must attract to be kept

  • Varying the number of input variables

If all the factors don't return a useful solution, then whack a few with the smallest eigenvalues and rerun it. Another useful idea is to put the first pass of FASTCLUS within a macro that loops from 3 to some appropriately large number of clusters, i.e., by varying maxc=. Then, if a promising solution is found by triangulating the CCC, the frequencies of the clusters and the inflection point of the pseudo-R-squared, roll that out based on the original input variables.
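To make the looping idea concrete, here is a minimal sketch in plain Python rather than SAS. It is not PROC FASTCLUS: it uses a basic k-means with a farthest-first choice of seeds, and it scores each run with the pseudo F (Calinski-Harabasz) statistic plus the cluster frequencies, since the CCC is not reproduced here. All function names, the synthetic "factor scores" and the range of k are assumptions made for illustration only.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def farthest_first_seeds(points, k):
    """Deterministic seeding (an assumption, not FASTCLUS's rule):
    start at the first point, then keep adding the point that is
    farthest from its nearest existing seed."""
    seeds = [points[0]]
    while len(seeds) < k:
        seeds.append(max(points, key=lambda p: min(sq_dist(p, s) for s in seeds)))
    return [list(s) for s in seeds]

def kmeans(points, k, max_iter=100):
    """Basic Lloyd's k-means; returns one cluster label per point."""
    centroids = farthest_first_seeds(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = mean(members)
    return labels

def pseudo_f(points, labels, k):
    """Pseudo F: (between SS / (k - 1)) / (within SS / (n - k))."""
    n = len(points)
    grand = mean(points)
    between = within = 0.0
    for c in range(k):
        members = [p for p, lab in zip(points, labels) if lab == c]
        if not members:
            continue
        centre = mean(members)
        between += len(members) * sq_dist(centre, grand)
        within += sum(sq_dist(p, centre) for p in members)
    return (between / (k - 1)) / (within / (n - k))

def scan_cluster_counts(points, min_k=2, max_k=6):
    """Run k-means for each k and collect (pseudo F, cluster sizes)."""
    results = {}
    for k in range(min_k, max_k + 1):
        labels = kmeans(points, k)
        sizes = [labels.count(c) for c in range(k)]
        results[k] = (pseudo_f(points, labels, k), sizes)
    return results

# Synthetic stand-in for factor scores: three well-separated groups in 2-D.
rng = random.Random(0)
centres = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
scores = [(rng.gauss(cx, 0.5), rng.gauss(cy, 0.5))
          for cx, cy in centres for _ in range(40)]

results = scan_cluster_counts(scores)
best_k = max(results, key=lambda k: results[k][0])
for k, (f_stat, sizes) in sorted(results.items()):
    print(k, round(f_stat, 1), sizes)
print("largest pseudo F at k =", best_k)
```

Here `best_k` is simply the k with the largest pseudo F. On real factor scores the picture is rarely this clean, so you would still want to look at the cluster frequencies for each k and, in SAS, at the CCC as well. To try a reduced set of factors, pass points that contain only the columns you want to keep and compare the two scans.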
