Solved – Are one-hot encoding and standardization of data equivalent to Gower’s distance?

categorical-encoding, clustering, distance, euclidean, gower-similarity

For clustering and other techniques on mixed data (numerical and categorical), Gower's distance is usually preferred over Euclidean distance because it handles numerical and categorical data differently.
For numerical features, Gower's distance uses the range-normalized absolute difference. For categorical features, it only registers whether the categories match (contributing 0 or 1).
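
For concreteness, here is a minimal sketch of that computation as I understand it (the function and toy data are illustrative, not from any particular library):

```python
import numpy as np

def gower_distance(x, y, is_categorical, ranges):
    """Gower distance between two records x, y of mixed features.
    is_categorical: boolean mask per feature; ranges: max - min per
    numeric feature (ignored for categoricals)."""
    d = 0.0
    for xi, yi, cat, r in zip(x, y, is_categorical, ranges):
        if cat:
            d += 0.0 if xi == yi else 1.0  # simple matching: 0 or 1
        else:
            d += abs(xi - yi) / r          # range-normalized difference
    return d / len(x)                      # average over features

# e.g. age (numeric, range 60) and hair colour (categorical, label-coded)
print(gower_distance(np.array([25, 0]), np.array([40, 2]),
                     [False, True], [60, 1]))  # 0.5*(15/60) + 0.5*1 = 0.625
```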

My question is: if we use Euclidean distance on the data after scaling the numerical features (e.g., each column with MinMaxScaler or StandardScaler) and one-hot encoding the categorical features, is that equivalent to Gower's distance? At least qualitatively, up to the choice of scaling?
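
Concretely, I mean a pipeline like this sketch (scikit-learn, with made-up toy columns):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.metrics import pairwise_distances

# Toy mixed data (column names are made up for illustration)
df = pd.DataFrame({"age": [25, 40, 33],
                   "hair": ["blonde", "ginger", "blonde"]})

pre = ColumnTransformer([
    ("num", StandardScaler(), ["age"]),   # scale numeric columns
    ("cat", OneHotEncoder(), ["hair"]),   # one-hot encode categoricals
])
X = pre.fit_transform(df)

# Euclidean distances on the transformed data
print(pairwise_distances(X, metric="euclidean"))
```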

Or am I missing something? Perhaps it would make a difference when a categorical feature has more than two categories, so that only registering a 0/1 difference is not enough?

Thanks.

Best Answer

One-hot encoding followed by standardization puts much more weight on the categorical variables; in particular, rare values will get a big distance. Gower's feels a bit more balanced to me. But in the end, both are just heuristics, and you should carefully craft a distance for your problem rather than blindly use either approach.

Let's take an example with a single categorical attribute.

Let's assume we have 49% blondes, 49% brunettes, and 2% redheads. After one-hot encoding, 49% of the data is (1,0,0), 49% is (0,1,0), and 2% is (0,0,1).

For the first two columns, the mean is close to 0.5 and the standard deviation is about 0.5, but for the ginger column, the mean is 0.02 and the standard deviation is sqrt(0.02 × 0.98) ≈ 0.14. Since centering does not change distances, the standard scaler effectively stretches the blonde and brunette columns by a factor of 2 and the ginger column by a factor of about 7!

The resulting Euclidean distance between blondes and brunettes is sqrt(4 + 4) = sqrt(8) ≈ 2.8, whereas the distance from either group to the gingers is about sqrt(4 + 51) ≈ 7.4. You can see how standard-scaling one-hot encoded variables overemphasizes rare values. No numeric attribute gets anywhere near this weight - each is normalized to a standard deviation of 1 - and even the blonde-brunette distance is much larger than that.
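
You can verify these numbers with a quick sketch (the 49/49/2 toy data just mirrors the proportions above):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# 49 blondes (1,0,0), 49 brunettes (0,1,0), 2 gingers (0,0,1)
X = np.repeat(np.eye(3), [49, 49, 2], axis=0)
Z = StandardScaler().fit_transform(X)

blonde, brunette, ginger = Z[0], Z[49], Z[98]
print(np.linalg.norm(blonde - brunette))  # ≈ 2.83 = sqrt(8)
print(np.linalg.norm(blonde - ginger))    # ≈ 7.42 ≈ sqrt(55)
```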

If you only standardize or normalize the numerical attributes, the effect is much smaller, but you still put extra weight on the categorical attributes, because the Euclidean distance between (1,0,0) and (0,1,0) is still sqrt(2), not 1.

Furthermore, Gower uses Manhattan distance, not Euclidean, for a reason. The intuition is that Euclidean distance suits coupled attributes, where going diagonally is shorter, while Manhattan suits independent attributes. Under Manhattan distance, one-hot encoded variables get twice the weight: a mismatch contributes |1-0| + |0-1| = 2. Scaling them by 0.5 should get you close to Gower's, but that only emphasizes how much heuristic guesswork all of this is...
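
A quick sketch of that last point (the toy records and the hand-done range normalization are illustrative):

```python
import numpy as np
from scipy.spatial.distance import cityblock

# Two toy records: a numeric feature already divided by its range (as Gower
# does), plus a 3-category one-hot block scaled by 0.5.
a = np.concatenate([[0.25], 0.5 * np.array([1, 0, 0])])  # age term, blonde
b = np.concatenate([[0.50], 0.5 * np.array([0, 0, 1])])  # age term, ginger

# Manhattan distance: 0.25 + 0.5 + 0.5 = 1.25, i.e. the numeric Gower term
# plus exactly 1 for the categorical mismatch (Gower would then average
# over the number of features).
print(cityblock(a, b))
```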
