Solved – How is Relative Variable Importance computed in TwoStep Clustering in SPSS

clustering, importance, r, spss

In SPSS, the user can check the relative variable importance in a clustering result and produce a graph like the following:

[image: predictor importance chart from SPSS]

link:
http://pic.dhe.ibm.com/infocenter/spssstat/v20r0m0/index.jsp?topic=%2Fcom.ibm.spss.statistics.help%2Fidh_twostep_main.htm

From this we can identify the variables that dominate predictor importance, that is, those with the most impact in determining the clusters. Does anyone know how importance is computed here? I would like to implement this metric in R, if it is not already available there.

Best Answer

I don't know the exact answer, but I can offer a likely one.

In recent versions of SPSS Statistics, the visual cluster descriptions, comparisons, and variable importance assessment are built right into the TWOSTEP CLUSTER command. In earlier releases (ver. 17, for instance), a similar task and output were handled by a separate command, AIM. That command still exists in SPSS (see the Command Syntax Reference), so you could use it.

Here is how the dialog box looked in the older version. Note the Plots button, which I have pressed.

[image: older TwoStep Cluster dialog box with the Plots sub-dialog open]

The "Plots" dialog corresponds to AIM. Here is the syntax for clustering the Iris dataset with default settings, plus the "Plots" specifications I made in the picture above:

TWOSTEP CLUSTER
  /CONTINUOUS VARIABLES=SLength SWidth PLength PWidth
  /DISTANCE LIKELIHOOD
  /NUMCLUSTERS AUTO 15 BIC
  /HANDLENOISE 0
  /MEMALLOCATE 64
  /CRITERIA INITHRESHOLD(0) MXBRANCH(8) MXLEVEL(3)
  /PLOT VARCHART COMPARE BYVAR NONPARAMETRIC 
  /PRINT COUNT SUMMARY
  /SAVE VARIABLE=TSC_7469.
AIM  TSC_7469
  /CONTINUOUS SLength SWidth PLength PWidth
  /PLOT ERRORBAR IMPORTANCE(X=VARIABLE Y=PVALUE)
  /CRITERIA ADJUST=BONFERRONI  SHOWREFLINE=NO HIDENOTSIG=NO.

The syntax specifies the four numeric Iris variables as the basis of clustering, with automatic selection of the number of clusters, and saves the variable TSC_7469 holding cluster membership (cluster labels). The four variables and the cluster membership variable are then passed to AIM, which produces plots; among them are the plots showing variable importance.

Here are the resulting plots of variable importance:

[image: AIM variable importance plots, one per cluster]

There are two plots, one for each cluster produced. The current TWOSTEP CLUSTER analysis (my SPSS is ver. 22), which no longer uses AIM, produced this plot:

[image: variable importance plot from TWOSTEP CLUSTER in SPSS ver. 22]

Notice how closely this picture resembles the preceding graphs; it looks like an average of those two.

If that is indeed true, then we may conclude that the modern TWOSTEP CLUSTER command computes the "variable importance" of the scale variables this way: it performs t-tests or ANOVAs and plots -log10(p-value), rescaled so that the greatest observed value is 1, as the "variable importance" index. This, or something very similar, appears to be the approach.
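To make that guess concrete, here is a minimal sketch of the arithmetic in Python (the same steps carry over to R). It is only my reading of the plots, not SPSS's documented algorithm: I assume a one-way ANOVA of each variable across the clusters, take -log10 of the p-value, and divide by the largest value. The function names (`log10_f_pvalue`, `variable_importance`) are made up for illustration, and the sketch assumes every variable has nonzero within-cluster variance. The p-value is computed on the log scale so that very well separated clusters do not underflow to p = 0.

```python
import math


def _betacf(a, b, x, max_iter=300, eps=1e-14):
    """Continued fraction for the incomplete beta function (modified Lentz)."""
    tiny = 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        # Even step of the recurrence.
        aa = m * (b - m) * x / ((qam + m2) * (a + m2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        h *= d * c
        # Odd step of the recurrence.
        aa = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def log10_f_pvalue(f, d1, d2):
    """log10 of the upper-tail p-value of an F(d1, d2) statistic."""
    a, b = d2 / 2.0, d1 / 2.0
    x = d2 / (d2 + d1 * f)  # p-value = I_x(a, b), regularized incomplete beta
    if x >= 1.0:            # f <= 0 means no separation at all, p = 1
        return 0.0
    ln_front = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                + a * math.log(x) + b * math.log1p(-x))
    if x < (a + 1.0) / (a + b + 2.0):
        # Small p: stay on the log scale to avoid underflow.
        return (ln_front + math.log(_betacf(a, b, x)) - math.log(a)) / math.log(10)
    p = 1.0 - math.exp(ln_front) * _betacf(b, a, 1.0 - x) / b
    return math.log10(min(1.0, max(p, 1e-300)))


def variable_importance(columns, labels):
    """columns: dict of name -> list of numbers; labels: cluster label per case.

    Returns -log10(ANOVA p-value) per variable, rescaled so the maximum is 1.
    """
    groups = sorted(set(labels))
    n, k = len(labels), len(groups)
    d1, d2 = k - 1, n - k
    raw = {}
    for name, values in columns.items():
        grand_mean = sum(values) / n
        ss_between = ss_within = 0.0
        for g in groups:
            xs = [v for v, lab in zip(values, labels) if lab == g]
            mean_g = sum(xs) / len(xs)
            ss_between += len(xs) * (mean_g - grand_mean) ** 2
            ss_within += sum((v - mean_g) ** 2 for v in xs)
        f = (ss_between / d1) / (ss_within / d2)  # assumes ss_within > 0
        raw[name] = -log10_f_pvalue(f, d1, d2)
    top = max(raw.values())
    if top <= 0.0:  # no variable separates the clusters at all
        return {name: 0.0 for name in raw}
    return {name: value / top for name, value in raw.items()}


# Toy data (made up): "a" separates the two clusters, "b" does not.
labels = [1, 1, 1, 1, 2, 2, 2, 2]
columns = {
    "a": [1.0, 1.2, 0.9, 1.1, 5.0, 5.2, 4.9, 5.1],
    "b": [2.0, 3.0, 2.5, 3.5, 2.1, 3.1, 2.4, 3.6],
}
importance = variable_importance(columns, labels)
print(importance)
```

With real data you would pass the saved cluster membership variable (TSC_7469 above) as `labels`. If SPSS in fact averages per-cluster t-tests rather than running one ANOVA, the numbers will differ somewhat, but the -log10(p) and rescale-to-1 steps would stay the same.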

Note that you can change the settings in AIM (see the dialog box above, as well as the Command Syntax Reference).
