Solved – Understanding predictor importance in clustering with SPSS

clusteringpredictorspss

SPSS offers a certain metric to assess predictor or variable importance in clustering. How it is calculated has already been answered in the following thread:

"How is Relative Variable Importance computed in TwoStep Clustering in SPSS?"

The information provided by SPSS does not go beyond this:

Predictor Importance (cluster evaluation algorithms)
enter image description here

So far, I think I have understood that it uses the p-value (of an F-Test) of a given variable and contrasts it to the p-value of another variable used in clustering to determine its importance. I have several questions sorted in the following for clarity:

a) Why does this metric use the decimal logarithm (of p-values) and why is there a negative sign in front of it?

b) "The importance of field i": does field simply mean variable?

c) For the denominator: why does it take the maximum value from j∈Ω (so the smallest p-value of any other variable used except the variable from the nominator)?

d) "where Ω denotes the set of predictor and evaluation fields": What concretely do "set of predictor" and "evaluation fields" mean? Why doesn't it say "set of predictors", in the sense of "number of predictors"?

e) What is the "minimal double value"? Is this an established term?

f) This study uses predictor importance and sets a cutoff level at 0.4 (and then clusters again only with the variables >=0.4). Is there any rule that tells us above which level variables are important? Or is the 0.4 chosen arbitrarily as a question of taste?

Best Answer

a) Many of most importance measures are standardized or normalized so that they range from 0 to 1 or 0 to 100. Some are set so that they sum to 1 or 100, and some so that the largest value is 1 or 100. This measure is set to range from 0 to 1, with the maximum value for any predictor set to 1. The use of p values as a beginning was designed to allow some comparability of categorical and scale or "continuous" predictors. The base 10 logarithmic transformation was chosen for utility in spreading out the p values. The negative is then required to make resulting raw values positive, though if it were neglected it would cancel out in the numerator and denominator of the ratios used to calculate the final values.

b) Yes, field simply means variable.

c) It actually takes the largest value of the transformed p value, even if it's that same one (which results in a ratio of 1 for the largest transformed value or smallest p value). This is to normalize the values to have a maximum value of 1, as noted above.

d) Predictors are the variables actually used in the cluster analysis. In TWOSTEP CLUSTER you can optionally specify additional variables or fields as evaluation fields, and these are included in the computation of importance values if they're specified.

e) Yes, this is the smallest positive normalized double precision floating point number, 2-1022, approximately 2.225 x 10-308. Yes, it's a standard known value, though references to it may have different names. See, for example, https://en.cppreference.com/w/cpp/types/numeric_limits/min, where it's referred to as DOUBLE_MIN.

f) I suspect that's pretty much arbitrary. I don't know of any reason to use that value in general, unless being 40% of the largest is important.