Hierarchical vs TwoStep Clustering – Best Methods for Binary Data

binary-data, clustering, spss

(This question is an edited version of one I posted previously, which one user suggested would benefit from more focus.)

I have 2000 questionnaires in which respondents answered 33 different questions about which issues are present in their lives, e.g. alcohol abuse, domestic violence, mental health, child abuse, learning difficulties etc.

Each question can only be answered yes/no (which I've re-coded as 1/0).

I'd like to use this dataset to start creating n profiles of respondents, to define which variables naturally cluster together, e.g. (alcohol abuse and domestic violence), (mental health, child abuse, domestic violence), (alcohol abuse, learning difficulties), across some/all of the 33 different variables.

A note I've read online indicates that hierarchical cluster analysis is not appropriate for a dataset of this scale/type because its results are sensitive to the order in which the data is sorted in the dataset, and it recommends two-step cluster analysis instead.

Consequently, I'd be really interested in your input on whether hierarchical, two-step or other methods are most appropriate for exploring clusters of responses that naturally associate together in a binary dataset.

Best Answer

1) The tech support reply that you link to, which says that hierarchical clustering is less appropriate for binary data than two-step clustering is, is incorrect in my view.

It is true that when many of the distances between objects are not unique ("tied" or "duplicate" distances), which is quite an expected situation with any few-valued discrete data and not only binary data, the results of clustering will depend strongly on the order in which the objects are processed. But this problem accompanies any clustering method that bases itself, directly or indirectly, on some distance/similarity measure. If there are ties in a quantity that determines the clusters, they can show up as unstable solutions. The instability caused by ties is thus natural and cannot be an argument against one or another method that may suffer from it.
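As a small illustration of how ties make the result depend on processing order, here is a minimal sketch in plain Python. The three toy respondents A, B, C, the use of Hamming distance, and the "merge the first closest pair found" rule are all assumptions made for the example; real software may break ties differently, but some tie-breaking rule is always needed.

```python
from itertools import combinations

def hamming(a, b):
    """Number of positions where two binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))

def first_merge(labels, rows):
    """Return the pair a naive agglomerative pass would merge first.

    min() keeps the first minimal pair it meets, so ties are broken
    purely by the order in which the rows are listed.
    """
    pairs = combinations(range(len(rows)), 2)
    i, j = min(pairs, key=lambda p: hamming(rows[p[0]], rows[p[1]]))
    return {labels[i], labels[j]}

# Hypothetical toy data: d(A,B) = 1, d(A,C) = 1 (a tie), d(B,C) = 2.
data = {"A": [1, 1, 0, 0], "B": [1, 0, 0, 0], "C": [1, 1, 1, 0]}

order1 = ["A", "B", "C"]
order2 = ["A", "C", "B"]  # same respondents, different sort order
merge1 = first_merge(order1, [data[k] for k in order1])
merge2 = first_merge(order2, [data[k] for k in order2])

print(merge1 == merge2)  # False: A joins B in one run and C in the other
```

With 33 binary variables a Hamming-type distance can take at most 34 distinct values, while 2000 respondents form 1,999,000 pairs, so ties of this kind are unavoidable in a dataset like the one described.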

In the particular case of the linked note, you can verify for yourself that the two-step cluster method will also, like the hierarchical method, give different results from time to time under different sort orders of the observations in the dataset. So I don't see any advantage of one method over the other in that respect.

2) Hierarchical clustering is well suited to binary data because it lets you choose from the great many distance functions invented for binary data, which are theoretically more sound for them than plain Euclidean distance. However, some methods of agglomeration call for (squared) Euclidean distance only. Here are a few points to remember about hierarchical clustering.

One important issue when selecting a similarity function for binary/dichotomous data is whether you regard your data as ordinal binary (asymmetric categories: present vs absent) or nominal binary (symmetric categories: this vs that). In other words, should a 0-0 match count as grounds for similarity or not? (You may want to read answers like this, this.)
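To make the 0-0 question concrete, here is a minimal sketch comparing two well-known binary similarity measures: the simple matching coefficient, which counts 0-0 matches as agreement, and the Jaccard coefficient, which ignores them. These two are only examples among the many binary measures available; the helper names and the two toy respondents are assumptions made for the illustration.

```python
def match_counts(x, y):
    """Return (a, mismatches, d): 1-1 matches, disagreements, 0-0 matches."""
    a = sum(1 for p, q in zip(x, y) if p == 1 and q == 1)
    d = sum(1 for p, q in zip(x, y) if p == 0 and q == 0)
    return a, len(x) - a - d, d

def simple_matching(x, y):
    """Symmetric view: joint absence (0-0) counts as similarity."""
    a, m, d = match_counts(x, y)
    return (a + d) / (a + m + d)

def jaccard(x, y):
    """Asymmetric view: joint absence (0-0) is ignored."""
    a, m, _ = match_counts(x, y)
    # Convention assumed here: two all-zero vectors get similarity 0.0.
    return a / (a + m) if (a + m) else 0.0

# Two hypothetical respondents who each report one issue, but a different one.
x = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
y = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]

print(simple_matching(x, y))  # 0.8: they "agree" on eight absent issues
print(jaccard(x, y))          # 0.0: they share no issue that is present
```

If sharing the absence of an issue should not make two respondents alike, a measure of the second kind is the more natural choice.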

3) The two-step cluster method of SPSS could be used with binary/dichotomous data as an alternative to hierarchical (and to some other) methods (some related answers: this, this). However, two-step's processing of categorical variables employs log-likelihood distance, which is right for nominal, not "ordinal binary", categories. So if you treat your data as the latter, you have a problem, and treating the variables as quantitative (interval) won't solve it. In some specific cases it is possible to convert a number of binary features into one or more multinomial nominal features quite effectively; in general, it is quite a tricky task to do so without losing information. An experienced analyst may experiment with optimal scaling techniques and multiple correspondence analysis to see whether multiple binary features can be replaced well by a smaller number of equivalent quantitative ones.
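One case where such a conversion loses no information, offered as an illustration and not necessarily the case the answer has in mind, is a group of binary indicators that are mutually exclusive (at most one of them is 1 for any respondent). The sketch below collapses such a group into a single nominal variable; the function name, column names and "none" label are hypothetical.

```python
def collapse_exclusive(rows, names, none_label="none"):
    """Collapse mutually exclusive 0/1 columns into one nominal value per row.

    Raises ValueError if any row has more than one 1, because the
    conversion would then lose information.
    """
    out = []
    for row in rows:
        if sum(row) > 1:
            raise ValueError("columns are not mutually exclusive")
        out.append(names[row.index(1)] if 1 in row else none_label)
    return out

# Hypothetical indicator columns; each respondent has at most one set to 1.
names = ["type_a", "type_b", "type_c"]
rows = [[1, 0, 0], [0, 0, 1], [0, 0, 0], [0, 1, 0]]

categories = collapse_exclusive(rows, names)
print(categories)  # ['type_a', 'type_c', 'none', 'type_b']
```

Issues such as alcohol abuse and domestic violence can co-occur, so most of the 33 variables in the question would fail this check, which is why the general case is harder.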
