Solved – Using log-likelihood distance on data with only continuous variables

clustering, continuous data, distances, spss

I have to run an SPSS two-step cluster analysis. All 4 of my variables are continuous, standardized scale variables with a normal distribution. The dataset includes 10,000 cases.

SPSS suggests using Euclidean distance with such a dataset, but the results are not meaningful (2 clusters: 99% and 1%), whereas with the log-likelihood distance option the clusters seem much more meaningful (both when I specify a fixed number of clusters and when I do not).

Question:

What could be the reason for such meaningless results with Euclidean distance? Maybe noise handling? And is it incorrect to use the log-likelihood distance when all my variables are continuous?

Best Answer

You can use log-likelihood distance when all variables are continuous; in fact, it is the default.

It is difficult to say without the data why your Euclidean results seem poor. Automatic detection of the number of clusters with the BIC or AIC criterion is probably somewhat more apt with log-likelihood distance, because these criteria rest on the same likelihood-based paradigm. With Euclidean distance, I recommend specifying several fixed numbers of clusters and checking whether the clusters are meaningful to you. Also, check whether your 4 variables are highly correlated (the two-step cluster method assumes no or weak correlation).
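If it helps, the correlation check can also be done outside SPSS. The following is only a rough sketch in plain Python: the data are synthetic stand-ins for the 4 standardized variables (not your dataset), the helper names `pearson` and `high_correlations` are made up for illustration, and the 0.7 cutoff is an arbitrary assumption, not a threshold that SPSS prescribes.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def high_correlations(columns, threshold=0.7):
    """Return (name_i, name_j, r) for every variable pair with |r| >= threshold.

    The default threshold is an illustrative assumption only.
    """
    names = list(columns)
    flagged = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            r = pearson(columns[names[i]], columns[names[j]])
            if abs(r) >= threshold:
                flagged.append((names[i], names[j], r))
    return flagged


# Synthetic stand-in for 4 standardized variables (hypothetical data).
random.seed(0)
n = 1000
v1 = [random.gauss(0, 1) for _ in range(n)]
v2 = [random.gauss(0, 1) for _ in range(n)]
v3 = [random.gauss(0, 1) for _ in range(n)]
# v4 is deliberately built from v1 so that one pair is strongly correlated.
v4 = [a + 0.1 * random.gauss(0, 1) for a in v1]
data = {"v1": v1, "v2": v2, "v3": v3, "v4": v4}

flagged = high_correlations(data)
for a, b, r in flagged:
    print(f"{a} ~ {b}: r = {r:.2f}")
```

Any pair that gets flagged this way is a candidate for dropping one of the two variables, or for combining them, before re-running the cluster analysis.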