It's not about not being able to compute something.
Distances must measure something meaningful. This fails much earlier with categorical data, if it ever works with more than one variable at all...
If you have the attributes shoe size and body mass, Euclidean distance doesn't make much sense either. It works well when x, y, and z are spatial distances: then Euclidean distance is the line-of-sight distance between the points.
Now if you dummy-encode variables, what meaning does this yield?
Plus, Euclidean distance doesn't make sense when your data is discrete.
If only integer x and y values exist, Euclidean distance will still yield non-integer distances, which don't map back to the data. Similarly, for dummy-encoded variables, the distance will not map back to a quantity of dummy variables...
When you then plan to use, e.g., k-means clustering, it isn't just about distances but about computing the mean, and there is no reasonable mean of dummy-encoded variables, is there?
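As a quick illustration of this point (with made-up data), here is what a k-means-style mean does to a one-hot-encoded categorical variable: the resulting "centroid" is not a valid category at all.

```python
# Hypothetical example: three objects with a one-hot-encoded "color"
# variable (red, green, blue). Averaging the encodings, as k-means would,
# produces a centroid that corresponds to no actual color.
points = [
    [1, 0, 0],  # red
    [1, 0, 0],  # red
    [0, 0, 1],  # blue
]

n = len(points)
centroid = [sum(coords) / n for coords in zip(*points)]
print(centroid)  # roughly "two thirds red, one third blue" -- not a category
```

The centroid comes out as fractions like 0.67 and 0.33, which cannot be decoded back into any level of the original variable.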
Finally, there is the curse of dimensionality: Euclidean distance is known to degrade as the number of variables increases. Adding dummy-encoded variables means you lose distance contrast quite fast. Everything becomes about as similar as everything else, because a single dummy variable can make all the difference.
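The loss of distance contrast is easy to demonstrate with a small simulation (random binary vectors standing in for dummy variables; the numbers are illustrative, not from the question's data): the spread of distances, relative to their mean, collapses as dimensionality grows.

```python
import random

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def contrast(dim, n_points=200, seed=0):
    # Random binary (dummy-like) vectors. We measure the spread of
    # distances from one reference point, relative to the mean distance;
    # a small value means "everything is about equally far away".
    rng = random.Random(seed)
    pts = [[rng.randint(0, 1) for _ in range(dim)] for _ in range(n_points)]
    ref = pts[0]
    dists = [euclidean(ref, p) for p in pts[1:]]
    return (max(dists) - min(dists)) / (sum(dists) / len(dists))

for d in (5, 50, 500):
    print(d, round(contrast(d), 3))
```

With 5 dummy variables the relative contrast is large; with 500 it is a small fraction of the mean distance, so nearest and farthest neighbors are nearly indistinguishable.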
As with the previous answers, most of my answer below is not specific to SAS, since I use R; there is one exception, though - please see below. There seem to be substantial research efforts toward developing clustering algorithms for mixed data. More specifically, some algorithms have been developed or adapted with a focus on categorical data.
In particular, adaptations of the traditional k-means clustering approach include k-modes, fuzzy k-modes, k-histograms and k-populations (for example, see this paper). Other solutions to the problem include hierarchical clustering algorithms such as ROCK and CACTUS, among others. Probability-based clustering approaches for categorical data include the already mentioned Two-Step cluster analysis procedure (which appears to be SPSS-specific).
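To make the k-modes idea concrete, here is a minimal sketch (my own toy implementation with invented example data, not any particular library's API): it mirrors k-means, but distances are mismatch counts and centroids are per-attribute modes, so the centroids remain valid categories.

```python
from collections import Counter
import random

def mismatch(a, b):
    # Simple matching distance: count of attributes on which two rows differ.
    return sum(x != y for x, y in zip(a, b))

def mode_centroid(rows):
    # Per-attribute mode: the centroid stays a valid categorical record.
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))

def k_modes(data, k, iters=10, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(data, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for row in data:
            idx = min(range(k), key=lambda i: mismatch(row, centroids[i]))
            clusters[idx].append(row)
        centroids = [mode_centroid(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids, clusters

# Hypothetical purely categorical data: (color, size, material)
data = [
    ("red", "small", "metal"),
    ("red", "small", "wood"),
    ("blue", "large", "wood"),
    ("blue", "large", "metal"),
]
centroids, clusters = k_modes(data, k=2)
print(centroids)
```

Note that unlike a mean over dummy encodings, each centroid here is itself a legal combination of category levels.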
Recently, some other streams of research related to the topic have appeared. They include approaches such as neural networks and genetic algorithms (for examples, comparisons and references, see this paper and this paper) and information theory (for example, see this paper and this paper). Interest in model-based clustering, specifically based on latent class analysis, is also growing (for example, see this paper and this paper; latent tree models seem to be a mix of latent-variable and hierarchical approaches).
Speaking of latent class analysis (LCA), finally, I would like to share the promised SAS-specific information. This paper describes an LCA-based approach, called latent class clustering, and its implementation as a free SAS add-in, which is available for download on this page.
Best Answer
Transforming your data by subtracting the minimum from every value and dividing the differences by the range is often called normalizing. The transformed data will lie within the interval $[0, 1]$.
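The transformation described above can be sketched in a few lines (variable names and values are made up for illustration):

```python
# Min-max normalization: subtract each variable's minimum and divide by
# its range. Assumes the variable is not constant (range > 0).
def normalize(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

var1 = [3.1, 234.0, 89.0, 15.5]  # hypothetical example values
result = normalize(var1)
print(result)
```

The minimum maps to 0, the maximum to 1, and everything else falls in between, so all variables end up on the same $[0, 1]$ scale.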
It is common to normalize all your variables before clustering. The fact that you are using complete linkage vs. any other linkage, or hierarchical clustering vs. a different algorithm (e.g., k-means), isn't relevant. The reason is that clustering algorithms all use a distance measure of some sort to determine if object $i$ is more likely to belong to the same cluster as object $j$ than the same cluster as object $k$. These distance measures are affected by the scale of the variables. That is, when computing the distance between two objects, each with a length and a mass, the distance will change dramatically if you change the units from, say, millimeters to kilometers. By putting all variables into the same range, you weight the variables equally.
You don't have to normalize your variables though. It just means that how close objects are will be more reflective of their values on one variable than another. For instance, using your example data, the ranges are:
Thus, without normalizing, almost all of the computed distance between two objects will be due to their values on Var1.