Solved – On feature scaling and weighting for clustering

The issue of feature scaling and weighting for cluster formation has been widely discussed in books and papers, as well as in several questions (e.g. here). To my understanding, a variable's range effectively acts as its weight in cluster formation: variables with large ranges dominate those with smaller ranges, particularly when using Euclidean distances. This is stated in Hastie et al. (2011) and Kaufman & Rousseeuw (1990), among others. To overcome this problem, a form of feature scaling is suggested in order to balance the variables, so that each variable can play an equal role in cluster formation. Min-max normalisation seems to be the most widely used scaling method in the literature.
However, since clustering is problem-dependent, variables considered more relevant for separating groups should be assigned a higher influence factor (Hastie et al. 2011). In other words, more relevant variables should be assigned a higher weight and consequently have a larger range.

Considering the above, and to fit it to my problem, I consider one of my variables to be more relevant, and consequently I want to assign a higher weight to it so that it influences cluster formation more than the other variables. However, scaling all other variables to the same range and leaving the important one as it is results in clusters dominated by the important variable only, since it has a very large range. The most rational solution, then, would be to reduce the range of the "important variable" as well, so that it would still influence cluster formation, but to a degree where the other variables are considered too. That is, for example, bring all variables to the same range (e.g. -1 to 1) and the important variable to a different range (e.g. 0 to 100), or bring all variables to the same range and then add a constant to, or multiply by a constant, the important variable.
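As a side note on the two options above, here is a short worked derivation of how a linear rescaling enters the squared Euclidean distance. The symbols c_j (multiplier) and b_j (additive constant) are notation introduced here for illustration only; they do not come from the cited references.

```latex
% Squared Euclidean distance between observations i and k over p variables
d(x_i, x_k)^2 = \sum_{j=1}^{p} (x_{ij} - x_{kj})^2

% Rescale each variable linearly: x'_{ij} = c_j x_{ij} + b_j
d(x'_i, x'_k)^2
  = \sum_{j=1}^{p} \bigl( (c_j x_{ij} + b_j) - (c_j x_{kj} + b_j) \bigr)^2
  = \sum_{j=1}^{p} c_j^2 \, (x_{ij} - x_{kj})^2
```

Under this reading, multiplying variable j by c_j gives it a weight of c_j^2 in the squared distance, while the additive constant b_j cancels and leaves Euclidean distances unchanged.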

Hence, my question is:
Is it "correct" to scale variables to different ranges before applying clustering? More importantly, if it is correct, what is the proper methodology for doing so? Is it sensible to scale the variables using "random" ranges/weights?

Based on the literature, it is indeed correct to use different ranges/weights for variables, as mentioned. However, I have been unable to find any applications that make this methodology clear. Any references would be much appreciated.

Some references on clustering methodology (but not applications) are listed here:

Hastie, Tibshirani, Friedman. 2011. The Elements of Statistical Learning

Kaufman L, Rousseeuw P. 1990. Finding Groups in Data – An Introduction to Cluster Analysis

Greenacre M, Primicerio R. 2015. Multivariate Analysis of Ecological Data

Best Answer

Min-max scaling and standardization are often not sufficient on their own. Non-linear scaling may be necessary to achieve the desired effect.

There is no "correct" way. Variable importance in an unsupervised context is a parameter that you have to choose.

The easiest way is often (ignoring non-linear transformations for now) to standardize the variables and then increase the weight of the variable you consider more important to 2x, 3x, etc. until the results make the most sense to you.
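The standardize-then-weight recipe can be sketched roughly as follows. This is a minimal illustration using NumPy on synthetic data; the helper names `standardize` and `weight_features`, the three made-up columns, and the weight of 2 on the first column are all assumptions chosen for the example, not part of the original answer.

```python
import numpy as np


def standardize(X):
    """Z-score each column: subtract the column mean, divide by the column std."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns at zero instead of dividing by 0
    return (X - mean) / std


def weight_features(Z, weights):
    """Multiply each (already standardized) column by its chosen weight."""
    return Z * np.asarray(weights, dtype=float)


# Synthetic data: three variables on very different scales (illustrative only).
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(50.0, 10.0, size=200),     # the variable treated as "important"
    rng.normal(0.0, 1.0, size=200),
    rng.normal(1000.0, 300.0, size=200),
])

Z = standardize(X)                         # every column now has unit spread
Zw = weight_features(Z, [2.0, 1.0, 1.0])   # first column counts 2x

print("column std before weighting:", Z.std(axis=0))
print("column std after weighting: ", Zw.std(axis=0))
```

The weighted matrix `Zw` would then be passed to whichever clustering algorithm is in use, and the weight adjusted (2, 3, ...) until the clusters are interpretable for the problem at hand.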

Min-max scaling is usually much worse than standardization, as it depends only on the two most extreme values, which tend to be outliers.
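To illustrate the sensitivity to extreme values, here is a small sketch on synthetic data. It measures how much the bulk of the data gets compressed under each scaling once a single outlier is appended. The sample size, the outlier value of 100, and the helper names are assumptions made for this example; note that in very small samples a single outlier can distort the standard deviation just as badly, so this is a demonstration under these particular conditions, not a general proof.

```python
import numpy as np


def min_max(x):
    """Rescale to [0, 1] using only the smallest and largest value."""
    return (x - x.min()) / (x.max() - x.min())


def z_score(x):
    """Standardize using the mean and standard deviation of all values."""
    return (x - x.mean()) / x.std()


def spread(x):
    """Distance between the largest and smallest value."""
    return x.max() - x.min()


rng = np.random.default_rng(0)
clean = rng.normal(0.0, 1.0, size=1000)   # synthetic, well-behaved variable
with_outlier = np.append(clean, 100.0)    # same data plus one extreme value
n = clean.size

# How many times narrower the original 1000 points become after scaling,
# compared with scaling the clean data alone.
minmax_shrink = spread(min_max(clean)) / spread(min_max(with_outlier)[:n])
zscore_shrink = spread(z_score(clean)) / spread(z_score(with_outlier)[:n])

print("min-max compression factor:", minmax_shrink)
print("z-score compression factor:", zscore_shrink)
```

In this setup the min-max compression factor comes out larger than the z-score one, meaning the non-outlier points are squeezed into a narrower band and contribute less to distances.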
