Hierarchical Clustering with Mixed Data Types – Best Distance/Similarity Measures


In my dataset we have both continuous and naturally discrete variables. I want to know whether we can do hierarchical clustering using both types of variables. And if yes, what distance measure is appropriate?

Best Answer

One way is to use the Gower similarity coefficient, which is a composite measure$^1$; it takes quantitative (such as rating scale), binary (such as present/absent) and nominal (such as worker/teacher/clerk) variables. Later, Podani$^2$ added an option to take ordinal variables as well.

The coefficient is easily understood even without a formula: you compute the similarity between two individuals on each variable separately, taking the type of the variable into account, and then average across all the variables. Usually, a program calculating Gower will let you weight the variables, that is, set their contributions to the composite formula. However, proper weighting of variables of different types is a problem: no clear-cut guidelines exist, which is the uncomfortable side of Gower and other "composite" indices of proximity.
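
The "similarity per variable, then weighted average" recipe can be sketched in a few lines of plain Python. This is a minimal illustration, not the code of any particular package: the function name `gower_similarity`, the type labels `"quant"`, `"binary_asym"` and `"nominal"`, and the example records are all made up for this sketch, and the ranges are assumed to have been computed from the whole dataset beforehand.

```python
def gower_similarity(x, y, kinds, ranges, weights=None):
    """Gower similarity between two records x and y (illustrative sketch).

    kinds[k]  : "quant", "binary_asym" or "nominal" (hypothetical labels)
    ranges[k] : data-set range of variable k, used only for "quant"
    weights[k]: optional weight of variable k (defaults to 1 for all)
    """
    if weights is None:
        weights = [1.0] * len(kinds)
    numerator = 0.0
    denominator = 0.0
    for xk, yk, kind, rng, w in zip(x, y, kinds, ranges, weights):
        if kind == "quant":
            # range-normalized absolute difference turned into a similarity
            s = 1.0 - abs(xk - yk) / rng if rng else 1.0
        elif kind == "nominal":
            s = 1.0 if xk == yk else 0.0
        elif kind == "binary_asym":
            if not xk and not yk:
                # joint absence: neither a match nor a mismatch, so skip it
                continue
            s = 1.0 if (xk and yk) else 0.0
        else:
            raise ValueError(f"unknown variable type: {kind}")
        numerator += w * s
        denominator += w
    return numerator / denominator if denominator else float("nan")


# Variables: age (quantitative), owns a car (asymmetric binary), job (nominal)
kinds = ["quant", "binary_asym", "nominal"]
ranges = [40.0, None, None]  # assumed: age spans 40 years in the full data

a = (30, 1, "teacher")
b = (40, 0, "teacher")

gs = gower_similarity(a, b, kinds, ranges)
gs_weighted = gower_similarity(a, b, kinds, ranges, weights=[2.0, 1.0, 1.0])
print(gs, gs_weighted)
```

Here the three per-variable similarities are 0.75, 0 and 1, so the unweighted coefficient is their plain mean; doubling the weight of age shifts the result toward the age similarity, which is exactly where the weighting question raised above comes in.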

The facets of Gower similarity ($GS$):

  • When all variables are quantitative (interval), the coefficient is the range-normalized Manhattan distance converted into a similarity. Because of the normalization, variables measured in different units may safely be used together. You should not, however, forget about outliers. (You might also decide to normalize by a measure of spread other than the range.) Because the normalizing statistic, such as the range, depends on which individuals are in the dataset, the Gower similarity between two given individuals may change if you add or remove other individuals.
  • When all variables are ordinal, they are first ranked and then Manhattan distance is computed, as above for quantitative variables, but with a special adjustment for ties.
  • When all variables are binary with asymmetric significance of the categories ("present" vs "absent" attribute), the coefficient is the Jaccard matching coefficient (which treats the case where both individuals lack the attribute as neither a match nor a mismatch).
  • When all variables are nominal (including dichotomous variables with symmetric significance: "this" vs "that"), the coefficient is the Dice matching coefficient that you obtain from your nominal variables if you recode them into dummy variables.

(It is easy to extend the list of types. For example, one could add a summand for count variables, using normalized chi-squared distance converted to similarity.)
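
The binary and nominal facets listed above can be checked numerically. The sketch below uses made-up data and hypothetical helper names (`dummies`, `dice`, `jaccard`, `gower_nominal`, `gower_asymmetric_binary`); it only illustrates the two identities on one example and is not a proof.

```python
def counts(u, v):
    """2x2 counts for two 0/1 vectors: a = both 1, b = only u, c = only v."""
    a = sum(1 for p, q in zip(u, v) if p and q)
    b = sum(1 for p, q in zip(u, v) if p and not q)
    c = sum(1 for p, q in zip(u, v) if not p and q)
    return a, b, c


def dice(u, v):
    a, b, c = counts(u, v)
    return 2 * a / (2 * a + b + c)


def jaccard(u, v):
    a, b, c = counts(u, v)
    return a / (a + b + c)


def dummies(record, categories):
    """One-hot encode a nominal record; categories[k] lists the levels of variable k."""
    return [1 if value == level else 0
            for value, levels in zip(record, categories)
            for level in levels]


def gower_nominal(x, y):
    """Gower with nominal variables only: share of variables on which x and y agree."""
    return sum(1 for p, q in zip(x, y) if p == q) / len(x)


def gower_asymmetric_binary(u, v):
    """Gower with asymmetric binary variables only: joint absences are dropped."""
    used = [(p, q) for p, q in zip(u, v) if p or q]
    return sum(1 for p, q in used if p and q) / len(used)


# Nominal facet: Dice on dummy variables equals the share of matching variables.
categories = [["worker", "teacher", "clerk"], ["red", "green"],
              ["north", "south", "east", "west"]]
x = ["teacher", "red", "north"]
y = ["teacher", "green", "north"]
dice_on_dummies = dice(dummies(x, categories), dummies(y, categories))
share_matching = gower_nominal(x, y)

# Asymmetric binary facet: Gower equals Jaccard.
u = [1, 1, 0, 0, 1]
v = [1, 0, 0, 0, 1]
jaccard_value = jaccard(u, v)
gower_binary = gower_asymmetric_binary(u, v)

print(dice_on_dummies, share_matching, jaccard_value, gower_binary)
```

The nominal identity holds because every record has exactly one "1" per variable after dummy coding, so each mismatching variable contributes one unit to both off-diagonal counts of the 2x2 table.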

The coefficient ranges between 0 and 1.
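
The dependence on the composition of the dataset, noted in the first facet above, is easy to see on a toy example. The following sketch assumes range normalization with ranges taken from whatever rows are currently in the data; the function name `quantitative_gower` and the numbers are invented for illustration.

```python
def quantitative_gower(x, y, data):
    """Gower similarity for all-quantitative records.

    Computed as 1 minus the mean range-normalized absolute difference.
    The ranges come from `data`, so the value depends on which rows it holds.
    """
    n_vars = len(x)
    total = 0.0
    for k in range(n_vars):
        column = [row[k] for row in data]
        rng = max(column) - min(column)
        total += abs(x[k] - y[k]) / rng if rng > 0 else 0.0
    return 1.0 - total / n_vars


data = [(1.0, 10.0), (3.0, 30.0), (5.0, 20.0)]
a, b = data[0], data[1]

# Ranges are 4 and 20, so the normalized differences are 0.5 and 1.0.
gs_before = quantitative_gower(a, b, data)

# One extreme row stretches both ranges to 20 and 100.
data_with_outlier = data + [(21.0, 110.0)]
gs_after = quantitative_gower(a, b, data_with_outlier)

print(gs_before, gs_after)
```

The two records `a` and `b` are unchanged, yet their similarity moves from about 0.25 to about 0.85 once the outlying row is added, which is why outliers deserve attention before computing the coefficient.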

"Gower distance". Without ordinal variables present (i.e. without using Podani's option), $\sqrt{1-GS}$ behaves as a Euclidean distance: it fully supports Euclidean space. But $1-GS$ is only metric (it satisfies the triangle inequality), not Euclidean. With ordinal variables present (using Podani's option), $\sqrt{1-GS}$ is only metric, not Euclidean, and $1-GS$ is not metric at all.

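The Euclidean and metric properties stated above can be probed numerically on a small example. The sketch below assumes NumPy, equal weights, no missing values, and only quantitative and nominal variables (no ordinal ones); the helper names are hypothetical. It uses the standard double-centering check: a distance matrix $D$ is Euclidean exactly when $-\tfrac{1}{2} J D^{2} J$ (with $D^2$ taken elementwise and $J$ the centering matrix) has no negative eigenvalues.

```python
import numpy as np


def gower_similarity_matrix(quant, nominal):
    """GS matrix for quantitative columns (n x p) and nominal columns (n x q)."""
    quant = np.asarray(quant, dtype=float)
    nominal = np.asarray(nominal)
    ranges = quant.max(axis=0) - quant.min(axis=0)  # assumed to be nonzero
    s_quant = 1.0 - np.abs(quant[:, None, :] - quant[None, :, :]) / ranges
    s_nominal = (nominal[:, None, :] == nominal[None, :, :]).astype(float)
    n_vars = quant.shape[1] + nominal.shape[1]
    return (s_quant.sum(axis=2) + s_nominal.sum(axis=2)) / n_vars


def min_gram_eigenvalue(dist):
    """Smallest eigenvalue of -1/2 * J * D^2 * J; negative means not Euclidean."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    return np.linalg.eigvalsh(gram).min()


def satisfies_triangle_inequality(dist, tol=1e-12):
    """Check dist[i, j] <= dist[i, k] + dist[k, j] for every triple (i, k, j)."""
    via_k = dist[:, :, None] + dist[None, :, :]  # indexed [i, k, j]
    return bool(np.all(dist[:, None, :] <= via_k + tol))


rng = np.random.default_rng(0)
quant = rng.normal(size=(8, 2))           # two made-up quantitative variables
nominal = rng.integers(0, 3, size=(8, 2))  # two made-up nominal variables

gs = gower_similarity_matrix(quant, nominal)
d_plain = 1.0 - gs
d_sqrt = np.sqrt(np.clip(d_plain, 0.0, None))

euclid_ok = min_gram_eigenvalue(d_sqrt) >= -1e-10  # tolerance for rounding
metric_ok = satisfies_triangle_inequality(d_plain)
print(euclid_ok, metric_ok)

# For 1 - GS there is no Euclidean guarantee; on a given sample this value
# may or may not come out negative.
print(min_gram_eigenvalue(d_plain))
```

A single sample cannot establish the general statements, of course; it only shows how one might verify, for a concrete distance matrix, which of the two situations applies before choosing a clustering method.
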
With Euclidean distances (distances supporting Euclidean space), virtually any classic clustering technique will do, including K-means (if your K-means program can process distance matrices, of course) and Ward's, centroid and median methods of hierarchical clustering. Using K-means or other methods based on Euclidean distance with a non-Euclidean but still metric distance is perhaps heuristically admissible. With non-metric distances, no such methods may be used.
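
As a rough illustration of feeding Gower distances into hierarchical clustering, the sketch below uses SciPy's `linkage`, `fcluster` and `squareform`. The toy data (an age and a job for six people) are invented, the Gower computation is written inline for one quantitative and one nominal variable with equal weights, and the choice of average linkage for $1-GS$ and Ward's method for $\sqrt{1-GS}$ simply follows the reasoning above; other choices are possible.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Made-up mixed data: one quantitative and one nominal variable.
ages = np.array([20.0, 22.0, 25.0, 60.0, 62.0, 65.0])
jobs = np.array(["worker", "worker", "worker", "teacher", "teacher", "teacher"])

# Per-variable similarities, then the unweighted average (Gower similarity).
s_age = 1.0 - np.abs(ages[:, None] - ages[None, :]) / (ages.max() - ages.min())
s_job = (jobs[:, None] == jobs[None, :]).astype(float)
gs = (s_age + s_job) / 2.0

d_metric = 1.0 - gs            # metric, but not necessarily Euclidean
d_euclid = np.sqrt(d_metric)   # Euclidean when no ordinal variables are used

# linkage() expects the condensed (upper-triangle) form of the distance matrix.
z_average = linkage(squareform(d_metric), method="average")
z_ward = linkage(squareform(d_euclid), method="ward")

# Cut each tree into two clusters.
labels_average = fcluster(z_average, t=2, criterion="maxclust")
labels_ward = fcluster(z_ward, t=2, criterion="maxclust")
print(labels_average, labels_ward)
```

On data this clear-cut both methods separate the three workers from the three teachers; with real data the linkage method and the form of the distance can matter a good deal.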

The previous paragraph is about whether K-means, Ward's or similar clustering is legitimate with Gower distance mathematically (geometrically). From the measurement-scale ("psychometric") point of view, one should not compute a mean, or Euclidean-distance deviations from it, in any categorical (nominal, binary, or ordinal) data; from this stance you simply may not process the Gower coefficient by K-means, Ward's etc. This viewpoint also warns that even if a Euclidean space is present, it may be granulated, not smooth.

If you want all the formulae and additional info on Gower similarity / distance, please read the description of my SPSS macro !gower; it is in the Word document found in the collection "Various proximities" on my web-page.


$^1$ Gower, J. C. A general coefficient of similarity and some of its properties // Biometrics, 1971, 27, 857-872

$^2$ Podani, J. Extending Gower’s general coefficient of similarity to ordinal characters // Taxon, 1999, 48, 331-340