Solved – In cluster analysis, should I scale (standardize) the data if the variables are in the same units?

Tags: clustering, k-means, multivariate analysis, standardization

I am performing cluster analysis (k-means and hierarchical) based on multiple variables. Each variable is in percentage 0-100% and the sum of all variables is at most 100%.

Many cluster analysis guides, such as this one https://www.statmethods.net/advstats/cluster.html, suggest that you "rescale variables for comparability". I understand that this applies when, for example, some variables are expressed in kilograms and others in meters, so that their values can differ by orders of magnitude. This nice answer on SO https://stackoverflow.com/questions/5648383/how-to-apply-a-hierarchical-or-k-means-cluster-analysis-using-r explains that you should "scale (standardise) the data to allow each variable to be compared on a common scale. With data measured in different "units" or on different scales (as here with different means and variances) this is an important data processing step if the results are to be meaningful or not dominated by the variables that have large variances".

In my case all variables are in the same units, but should I also look at the variance of each variable to determine whether to scale my data? Is it a good idea to standardize the data even if it is in the same units?

Best Answer

Some simple, obvious, and universal considerations for multivariate analysis, including clustering.

Case 1. Incomparable units. Height vs weight. You cannot compare the two, so the default decision is to standardize (equalize variances). It is the "default" on the grounds of parsimony of thought: "every unique aspect of nature is assumed to have the same, unit variability of observations".

Case 2. Same units, unrelated features. Height vs circumference. These are clearly independent phenomena of reality (conceptually, not statistically). That they share a unit seems a coincidence, and it would be silly to compare the two values. The default decision is to standardize the features.

Case 3. Same units, juxtaposed features. Length of the right arm vs length of the left arm. We could naturally compare the two lengths if needed; the two are interchangeable, in a sense. The default decision is to leave the variances as they are (no matter how much they differ), because "leave nature under study be how it is".

Case 4. Undecided between 2 and 3. Length of arm vs length of leg. We could compare these, but we are not interested in doing so; rather, we prefer to see the two lengths as separate dimensions (albeit not unrelated phenomena). A decision on conceptual grounds alone (whether to standardize or leave the variances as they are) is impossible. Other, method-driven or goal-driven or criterial-driven$^1$ considerations would dictate the choice in a concrete situation. There is no default solution, and the decision can be difficult to make. Some considerations might resolve the problem by showing that the case is actually 2 or 3.

$^1$ By criterial-driven considerations I mean those tied to a criterion: a "meta-valuer", a thing or a person, that defines which values are "big" enough to be treated as the opposite of "small" ones. For example, in psychiatry the criterion is clinical populations, and it is quite natural to standardize "psychopathological" features; in psychology the criterion is often a leading feature or a set of them, so standardizing when it is not necessary will just ruin inferences.
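
To make the trade-off concrete, here is a minimal Python sketch (standard library only, Python 3.8+ for `math.dist`). It is an illustration added under stated assumptions, not part of the original answer: all data values are made up, and the helper names `zscore_columns`, `nearest_to_first` and `variance_shares` are hypothetical. The first part shows why Case 1 defaults to standardizing: with raw values, the nearest neighbour under Euclidean distance depends on the arbitrary choice of unit. The second part shows what "leave variances as is" means for same-unit data such as percentages: each variable's weight in the squared Euclidean distances is proportional to its variance, so a high-variance variable dominates unless you standardize.

```python
import math
from statistics import mean, pstdev, pvariance

def zscore_columns(rows):
    """Standardize each column to mean 0 and population standard deviation 1."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) for c in cols]
    return [[(x - m) / s for x, m, s in zip(row, mus, sds)] for row in rows]

def nearest_to_first(rows):
    """Index of the row with the smallest Euclidean distance to row 0."""
    dists = [math.dist(rows[0], r) for r in rows[1:]]
    return 1 + dists.index(min(dists))

def variance_shares(rows):
    """Each column's share of the total variance, which equals its share of
    the summed squared Euclidean distances between all pairs of rows."""
    variances = [pvariance(c) for c in zip(*rows)]
    total = sum(variances)
    return [v / total for v in variances]

# Case 1 (made-up data): columns are height and weight in kg.
metres = [(1.60, 60.0), (1.90, 61.0), (1.61, 70.0)]
centimetres = [(h * 100, w) for h, w in metres]

# Raw data: the neighbour of row 0 changes when height switches unit.
print(nearest_to_first(metres), nearest_to_first(centimetres))
# Standardized data: the neighbour no longer depends on the unit.
print(nearest_to_first(zscore_columns(metres)),
      nearest_to_first(zscore_columns(centimetres)))

# Same units (made-up percentages): column 0 varies widely, column 1 barely.
percent = [(10.0, 4.0), (50.0, 5.0), (30.0, 6.0), (70.0, 5.0)]
print(variance_shares(percent))                  # column 0 carries almost all the weight
print(variance_shares(zscore_columns(percent)))  # both columns weigh about the same
```

Whether the dominance of the high-variance column is a defect or exactly what you want is the conceptual question that Cases 2 to 4 address; the code only shows the consequence of each choice.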