Solved – How to standardize data for hierarchical clustering

clustering, data transformation, distance, normalization

When running hierarchical clustering analysis of a matrix of individuals x samples (e.g., employee performances across different days), there are several possibilities for normalization. If one is clustering the columns (to see whether on certain days individuals perform similarly), one could

  1. z-score normalize across the rows to make each individual employee's mean and standard deviation comparable across days, or

  2. z-score normalize across the columns to make all employees comparable within a day, or

  3. not normalize at all and cluster the raw values

Could someone explain the relative advantages/disadvantages of each approach here? To clarify, I am using correlation distance.

In practice, methods 1 and 2 give different results, but it's not clear which of them is more appropriate for seeing whether days cluster together, if one chooses to normalize.

Best Answer

If every observation is on the same scale (i.e. the same performance metric was used each day), then in general I would not recommend normalization, since the scores already have comparable location and spread information.

The z-transform of a variable is an affine transformation with a positive slope (subtract the mean, then divide by the standard deviation), and correlation is invariant to such transformations. Z-transforming columns and then clustering on columns with correlation distance will therefore give the same result as using the raw scores (see http://www.math.uah.edu/stat/expect/Covariance.html, just after #8). Z-normalizing rows (within employee) will wipe out each employee's average performance, which may be highly inappropriate, depending on your goals.
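A quick way to check this claim is to compute the correlation between two days before and after each kind of z-scoring. The sketch below uses made-up scores (rows are employees, columns are days) and small hand-written helpers, `pearson` and `zscore`, which are illustrative names and not from any particular library.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def zscore(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]

# Made-up scores for illustration: rows are employees, columns are days.
scores = [
    [11, 2, 12],
    [12, 2, 13],
    [13, 3, 12],
    [14, 4, 13],
    [15, 5, 12],
]

# Transpose so that each entry of `days` holds one day's scores.
days = [list(col) for col in zip(*scores)]
r_raw = pearson(days[0], days[1])

# Option 2: z-score each column (within day), then correlate days 1 and 2.
days_z = [zscore(col) for col in days]
r_col_z = pearson(days_z[0], days_z[1])

# Option 1: z-score each row (within employee), then correlate days 1 and 2.
rows_z = [zscore(row) for row in scores]
days_from_rows_z = [list(col) for col in zip(*rows_z)]
r_row_z = pearson(days_from_rows_z[0], days_from_rows_z[1])

print(f"raw: {r_raw:.3f}  column z-scored: {r_col_z:.3f}  row z-scored: {r_row_z:.3f}")
```

With these numbers the raw and column-z-scored correlations agree up to floating-point error, while the row-z-scored correlation differs, which is why options 1 and 2 can lead to different dendrograms.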

If you are clustering days based on employee performance, presumably you would like a cluster of "high performing days" and a separate cluster of "low performing days." If this is the case, DO NOT use correlation distance, since it ignores mean differences. For example, correlation distance assigns a very low distance (i.e., high similarity) to day 1 with scores (11,12,13,14,15) and day 2 with scores (2,2,3,4,5), whose correlation is about 0.97, and assigns a much higher distance to day 1 and day 3 with scores (12,13,12,13,12), whose correlation is 0. This is probably not the sort of result you want. You probably want something like Euclidean distance.
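The three example days can be run through both distances to see the reversal. This is a minimal sketch in plain Python; `correlation_distance` and `euclidean_distance` are hand-written helpers for illustration, with correlation distance taken as one minus the Pearson correlation.

```python
from math import sqrt

def correlation_distance(x, y):
    """One minus the Pearson correlation of x and y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 1 - sxy / sqrt(sxx * syy)

def euclidean_distance(x, y):
    """Square root of the summed squared differences."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

day1 = [11, 12, 13, 14, 15]
day2 = [2, 2, 3, 4, 5]
day3 = [12, 13, 12, 13, 12]

corr_12 = correlation_distance(day1, day2)  # about 0.03: "very similar"
corr_13 = correlation_distance(day1, day3)  # about 1.0: "unrelated"
euc_12 = euclidean_distance(day1, day2)     # about 21.9: far apart
euc_13 = euclidean_distance(day1, day3)     # about 3.6: close together

print(f"correlation distance: d(1,2)={corr_12:.3f}  d(1,3)={corr_13:.3f}")
print(f"Euclidean distance:   d(1,2)={euc_12:.2f}  d(1,3)={euc_13:.2f}")
```

Correlation distance pairs day 1 with the low-scoring day 2, whereas Euclidean distance pairs it with day 3, whose scores are at a similar level.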

It is imperative that you think carefully about your goals here and select normalization methods and distance metrics (not to mention clustering algorithms) accordingly.
