Solved – Inputs to k-means are often normalized per-feature. Why not fully whiten the data instead?

data mining, k-means, machine learning, normalization

We often normalize inputs to the k-means algorithm by 1) subtracting the mean on a per-feature basis and 2) dividing by the standard deviation on a per-feature basis. Some of the rationale behind this is discussed here:

Are mean normalization and feature scaling needed for k-means clustering?

But it seems strange to assume that the features aren't correlated, so my question is, why don't we fully whiten the data instead? In other words, if the data has mean $\mu$ and covariance $\Sigma$, why not preprocess each sample using $\tilde{x} = \Sigma^{-1/2} (x - \mu)$?
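For concreteness, here is a minimal NumPy sketch of that preprocessing step. The function name `whiten`, the eigendecomposition route to $\Sigma^{-1/2}$, and the `(n_samples, n_features)` data layout are my own choices for illustration, not something from the linked question.

```python
import numpy as np

def whiten(X):
    """Map each sample x to Sigma^{-1/2} (x - mu), with mu and Sigma
    estimated from X. Assumes X has shape (n_samples, n_features) and
    that the sample covariance is nonsingular."""
    mu = X.mean(axis=0)
    Xc = X - mu
    Sigma = np.cov(Xc, rowvar=False)           # d x d sample covariance
    eigvals, eigvecs = np.linalg.eigh(Sigma)   # Sigma = V diag(lambda) V^T
    # Sigma^{-1/2} = V diag(lambda^{-1/2}) V^T
    Sigma_inv_sqrt = eigvecs @ np.diag(eigvals ** -0.5) @ eigvecs.T
    return Xc @ Sigma_inv_sqrt

# Quick check on synthetic correlated data: the whitened covariance
# should be close to the identity matrix.
X = np.random.default_rng(0).multivariate_normal(
    mean=[0.0, 0.0], cov=[[2.0, 1.5], [1.5, 2.0]], size=1000)
print(np.cov(whiten(X), rowvar=False))
```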

The only argument I see is that this gets computationally expensive when the dimensionality of $x$ is very large, for example $\gg 100$, since it requires estimating, factorizing, and inverting the full $d \times d$ covariance matrix. But are there other reasons?

Thank you.

Best Answer

That is a good question, and it really does make sense to whiten the data. In particular, there is a paper by Coates and Ng where they discuss the difference between these two preprocessing steps (among other things) and their effect on k-means when it is used to learn a dictionary of image patches.

As you already point out, whitening decorrelates the data (only up to second order, since it uses the covariance matrix). Whitening also presents some numerical problems (as does per-feature normalization), and the standard trick is to add a small constant (regularization) to the variances before performing the scaling.
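As a sketch of that regularization trick, assuming the same NumPy setting as the question's formula: the constant is added to the eigenvalues of the covariance before taking the inverse square root, so near-zero variances do not blow up the scaling. The value of `eps` here is a hypothetical placeholder you would tune for your data.

```python
import numpy as np

def whiten_regularized(X, eps=0.01):
    """Regularized whitening: add a small constant eps to the eigenvalues
    of the covariance before inverting, so the transform stays well
    behaved when the covariance is near-singular."""
    mu = X.mean(axis=0)
    Xc = X - mu
    Sigma = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(Sigma)
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return Xc @ W
```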
