In my dataset we have both continuous and naturally discrete variables. I want to know whether we can do hierarchical clustering using both types of variables. And if yes, which distance measure is appropriate?
Hierarchical Clustering with Mixed Data Types – Best Distance/Similarity Measures
Tags: clustering, distance-functions, gower-similarity, mixed type data, similarities
Best Answer
One way is to use the Gower similarity coefficient, which is a composite measure$^1$; it takes quantitative (such as rating scale), binary (such as present/absent) and nominal (such as worker/teacher/clerk) variables. Later, Podani$^2$ added an option to take ordinal variables as well.
The coefficient is easily understood even without a formula: you compute the similarity between two individuals on each variable, taking the type of the variable into account, and then average across all the variables. Usually, a program calculating Gower will allow you to weight variables, that is, to set their contribution to the composite formula. However, proper weighting of variables of different types is a problem for which no clear-cut guidelines exist, which makes Gower and other "composite" indices of proximity somewhat uncomfortable to use.
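The per-variable averaging described above can be sketched in a few lines of Python. This is a minimal illustration, not the `!gower` macro or any particular package: the function name `gower_similarity`, the type labels and the toy records are assumptions made up for the example. It uses the usual per-variable scores from Gower's paper (range-normalized absolute difference for quantitative variables, match/mismatch for nominal ones, and match/mismatch with joint absences left out for asymmetric binary ones); Podani's ordinal extension is not covered.

```python
def gower_similarity(a, b, kinds, ranges, weights=None):
    """Gower similarity between two records a and b.

    kinds[k]   : "quant", "nominal" or "asym_binary" (labels are illustrative)
    ranges[k]  : max - min of variable k over the data set ("quant" only)
    weights[k] : optional per-variable weight (defaults to 1 for every variable)
    """
    if weights is None:
        weights = [1.0] * len(kinds)
    num = 0.0  # weighted sum of per-variable similarities
    den = 0.0  # sum of the weights of the comparisons that count
    for x, y, kind, rng, w in zip(a, b, kinds, ranges, weights):
        if x is None or y is None:
            continue  # missing value: this variable is not compared
        if kind == "quant":
            # range-normalized absolute difference turned into a similarity
            s = 1.0 - abs(x - y) / rng if rng else 1.0
        elif kind == "nominal":
            s = 1.0 if x == y else 0.0
        elif kind == "asym_binary":
            if x == 0 and y == 0:
                continue  # joint absence carries no information
            s = 1.0 if x == y else 0.0
        else:
            raise ValueError(f"unknown variable type: {kind}")
        num += w * s
        den += w
    return num / den if den else float("nan")


# Toy records: (age, occupation, attribute present/absent)
records = [(25, "teacher", 1), (35, "clerk", 0), (45, "teacher", 0)]
kinds = ["quant", "nominal", "asym_binary"]
ranges = [45 - 25, None, None]  # a range is only needed for the quantitative variable

gs_01 = gower_similarity(records[0], records[1], kinds, ranges)  # (0.5 + 0 + 0) / 3
gs_12 = gower_similarity(records[1], records[2], kinds, ranges)  # (0.5 + 0) / 2
gs_02 = gower_similarity(records[0], records[2], kinds, ranges)  # (0 + 1 + 0) / 3
```

In the second comparison both records lack the binary attribute, so that variable drops out and the average is taken over two variables instead of three.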
The facets of Gower similarity ($GS$):
(It is easy to extend the list of types. For example, one could add a summand for count variables, using normalized chi-squared distance converted to similarity.)
The coefficient ranges between 0 and 1.
"Gower distance". Without ordinal variables present (i.e. without using Podani's option), $\sqrt{1-GS}$ behaves as a Euclidean distance: it fully supports Euclidean space. But $1-GS$ is only metric (it satisfies the triangle inequality), not Euclidean. With ordinal variables present (using Podani's option), $\sqrt{1-GS}$ is only metric, not Euclidean, and $1-GS$ isn't metric at all.
With Euclidean distances (distances supporting Euclidean space), virtually any classic clustering technique will do, including K-means (if your K-means program can process distance matrices, of course) and the Ward's, centroid and median methods of hierarchical clustering. Using K-means or other such methods based on Euclidean distance with a non-Euclidean but still metric distance is perhaps heuristically admissible. With non-metric distances, no such methods may be used.
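Whether a given distance matrix is metric or Euclidean can be checked numerically before picking a method. The sketch below assumes NumPy; the helper names `satisfies_triangle_inequality` and `is_euclidean` are made up for the example. The Euclidean check uses the classical-scaling criterion: a distance matrix is Euclidean exactly when the double-centered matrix of squared distances has no negative eigenvalues. The toy data are four records on two nominal variables, for which $GS$ is simply the share of matching variables.

```python
import numpy as np


def satisfies_triangle_inequality(D, tol=1e-12):
    """True if D[i, j] <= D[i, k] + D[k, j] for every triple (i, j, k)."""
    D = np.asarray(D, dtype=float)
    # via[i, k, j] = D[i, k] + D[k, j]; take the best intermediate point k
    via = D[:, :, None] + D[None, :, :]
    return bool(np.all(D <= via.min(axis=1) + tol))


def is_euclidean(D, tol=1e-9):
    """True if the points can be embedded in a Euclidean space with distances D."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    B = -0.5 * J @ (D ** 2) @ J           # double-centered squared distances
    return bool(np.linalg.eigvalsh(B).min() >= -tol)


# Four records on two nominal variables (illustrative data)
records = [("a", "x"), ("b", "x"), ("b", "y"), ("a", "y")]
GS = np.array([[np.mean([u == v for u, v in zip(r, s)]) for s in records]
               for r in records])

d_plain = 1.0 - GS          # 1 - GS
d_sqrt = np.sqrt(1.0 - GS)  # sqrt(1 - GS)

plain_is_metric = satisfies_triangle_inequality(d_plain)
plain_is_euclidean = is_euclidean(d_plain)
sqrt_is_euclidean = is_euclidean(d_sqrt)
```

On this toy example $1-GS$ passes the triangle-inequality check but fails the Euclidean one, while $\sqrt{1-GS}$ passes both, in line with the properties stated for the case without ordinal variables.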
The point above about K-means, Ward's and similar clustering concerns whether such methods are mathematically (geometrically) legitimate with Gower distance. From the measurement-scale ("psychometric") point of view, one should not compute a mean, or Euclidean-distance deviations from it, for any categorical (nominal, binary, or ordinal) data; from this stance you simply may not process the Gower coefficient with K-means, Ward's, etc. This viewpoint warns that even if a Euclidean space is present, it may be granulated rather than smooth.
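For the hierarchical case asked about in the question, a precomputed Gower distance matrix can be handed directly to a clustering routine. The following is a minimal sketch assuming SciPy; the similarity values are invented for illustration and do not come from real data. It uses average linkage, which only needs pairwise dissimilarities and computes no centroids, so it sidesteps both the geometric and the measurement-scale objections; SciPy's own documentation notes that its `centroid`, `median` and `ward` methods are correctly defined only for Euclidean distances.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Hypothetical Gower similarities for five records (symmetric, 1 on the diagonal)
GS = np.array([
    [1.00, 0.90, 0.85, 0.20, 0.15],
    [0.90, 1.00, 0.80, 0.25, 0.10],
    [0.85, 0.80, 1.00, 0.30, 0.20],
    [0.20, 0.25, 0.30, 1.00, 0.95],
    [0.15, 0.10, 0.20, 0.95, 1.00],
])

D = np.sqrt(1.0 - GS)        # "Gower distance" in its sqrt(1 - GS) form
condensed = squareform(D)    # linkage() expects the condensed (upper-triangle) vector

Z = linkage(condensed, method="average")          # average-linkage dendrogram
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into two clusters
```

Here `labels` assigns the first three records to one cluster and the last two to the other; `Z` can be passed to `scipy.cluster.hierarchy.dendrogram` to draw the tree.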
If you want all the formulae and additional info on Gower similarity / distance, please read the description of my SPSS macro `!gower`; it's in the Word document found in the collection "Various proximities" on my web-page.
$^1$ Gower, J. C. A general coefficient of similarity and some of its properties // Biometrics, 1971, 27, 857-872
$^2$ Podani, J. Extending Gower’s general coefficient of similarity to ordinal characters // Taxon, 1999, 48, 331-340