Solved – Why do we divide by the standard deviation and not some other standardizing factor before doing PCA

machine-learning, mathematical-statistics, pca

I was reading the following justification (from the CS229 course notes) of why we divide the raw data by its standard deviation:

[screenshot of the relevant excerpt from the CS229 course notes]

Even though I understand what the explanation is saying, it is not clear to me why dividing by the standard deviation would achieve such a goal. It says this puts everything more on the same "scale", but it's not entirely clear why dividing by the standard deviation achieves that. What's wrong with dividing by the variance? Why not some other quantity, like the sum of absolute values, or some other norm? Is there a mathematical justification for choosing the STD?

Are the claims in this extract a theoretical statement that can be derived/proved through mathematics (and/or statistics), or is it more one of those statements we make because it seems to work in "practice"?

Basically, can one provide a rigorous mathematical explanation of why that intuition is true? Or, if it's just an empirical observation, why do we think it works in general before doing PCA?

Also, in the context of PCA, is this the process of standardizing or normalizing?


Some other thoughts I had that might "explain" why we use the STD:

Since PCA can be derived by maximizing the variance, I guessed that dividing by a related quantity such as the STD might be one reason we divide by it. But then I considered that if we defined a "variance" based on some other norm, $\frac{1}{n} \sum^{n}_{i=1} (x_i -\mu)^p$, then perhaps we would divide by the analogous measure of spread (by taking the $p$th root or something). This was just a guess and I am not 100% sure about it, hence the question; I was wondering if anyone knew anything related to this.


I did see that there was maybe a related question:

PCA on correlation or covariance?

but it seemed to talk more about when to use "correlation" or "covariance", and it lacked the rigorous, convincing, or detailed justification that I am mainly interested in.

Same for:

Why do we need to normalize data before analysis

related:

"Normalizing" variables for SVD / PCA

Best Answer

This is in partial answer to "it is not clear to me why dividing by the standard deviation would achieve such a goal". In particular, why it puts the transformed (standardized) data on the "same scale". The question hints at deeper issues (what else might have "worked", which is linked to what "worked" might even mean, mathematically?), but it seemed sensible to at least address the more straightforward aspects of why this procedure "works" - that is, achieves the claims made for it in the text.

The entry in row $i$ and column $j$ of a covariance matrix is the covariance between the $i^{th}$ and $j^{th}$ variables. Note that on the diagonal, at row $i$ and column $i$, this becomes the covariance between the $i^{th}$ variable and itself - which is just the variance of the $i^{th}$ variable.

Let's call the $i^{th}$ variable $X_i$ and the $j^{th}$ variable $X_j$; I'll assume these are already centered so that they have mean zero. Recall that $$Cov(X_i, X_j) =\sigma_{X_i} \, \sigma_{X_j} \, Cor(X_i, X_j)$$
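If you want to see that identity numerically, here is a quick NumPy sketch (the simulated data are made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two made-up, dependent variables.
n = 10_000
xi = rng.normal(0, 2, n)
xj = 3 * xi + rng.normal(0, 5, n)

# Cov(X_i, X_j) versus sigma_i * sigma_j * Cor(X_i, X_j),
# all computed with the same (sample, ddof=1) convention.
cov_ij = np.cov(xi, xj)[0, 1]
prod = xi.std(ddof=1) * xj.std(ddof=1) * np.corrcoef(xi, xj)[0, 1]

print(np.isclose(cov_ij, prod))   # True
```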

We can standardize the variables so that they have variance one, simply by dividing by their standard deviations. When standardizing we would generally subtract the mean first, but since I already assumed the variables are centered we can skip that step. Let $Z_i = \frac{X_i}{\sigma_{X_i}}$; to see why its variance is one, note that

$$Var(Z_i) = Var\left(\frac{X_i}{\sigma_{X_i}}\right) = \frac{1}{\sigma_{X_i}^2}Var(X_i) = \frac{1}{\sigma_{X_i}^2} \sigma_{X_i}^2 = 1$$
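As a quick numerical check of that calculation, here is a minimal sketch with made-up data: divide a centred variable by its own standard deviation and its variance becomes one.

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up variable with a large spread.
x = rng.normal(loc=0.0, scale=25.0, size=10_000)
x = x - x.mean()          # centre it, as assumed above

z = x / x.std()           # divide by its standard deviation

print(round(x.var(), 1))  # roughly 625 (= 25^2)
print(round(z.var(), 6))  # 1.0 up to floating-point error
```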

Similarly for $Z_j$. If we take the entry in row $i$ and column $j$ of the covariance matrix for the standardized variables, note that since they are standardized:

$$Cov(Z_i, Z_j) =\sigma_{Z_i} \, \sigma_{Z_j} \, Cor(Z_i, Z_j) = Cor(Z_i, Z_j)$$

Moreover, correlation is unchanged when we add (or subtract) a constant, and multiplying (or dividing) by a constant also leaves it unchanged, except that the sign of the correlation is reversed if that constant is negative. In other words, correlation is unaffected by translation or scaling but is reversed by reflection. (Here's a derivation of those correlation properties, as part of an otherwise unrelated answer.) Since we divided by the standard deviations, which are positive, we see that $Cor(Z_i, Z_j)$ must equal $Cor(X_i, X_j)$, i.e. the correlation between the original data.
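Putting those pieces together numerically: the covariance matrix of the standardized variables coincides with the correlation matrix of the original variables. A minimal sketch, again with simulated data (using the ddof=1 sample convention throughout so the NumPy functions agree exactly):

```python
import numpy as np

rng = np.random.default_rng(1)

# Two made-up variables on very different scales, with some dependence.
n = 5_000
x1 = rng.normal(0, 1, n)
x2 = 100 * x1 + rng.normal(0, 300, n)   # much larger spread than x1
X = np.column_stack([x1, x2])

# Centre, then divide each column by its (sample) standard deviation.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

cov_Z = np.cov(Z, rowvar=False)       # covariance of the standardized data
cor_X = np.corrcoef(X, rowvar=False)  # correlation of the original data

print(np.allclose(cov_Z, cor_X))      # True: the two matrices agree
print(np.round(cov_Z, 3))             # ones on the diagonal
```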

Along the diagonal of the new covariance matrix, note that we get $Cov(Z_i, Z_i) = Var(Z_i) = 1$ so the entire diagonal is filled with ones, as we would expect. It's in this sense that the data are now "on the same scale" - their marginal distributions should look very similar, at least if they were roughly normally distributed to start with, with mean zero and with variance (and standard deviation) one. It is no longer the case that one variable's variability swamps the others. You could have divided by a different measure of spread, of course. The variance would have been a particularly bad choice due to dimensional inconsistency (think about what would have happened if you'd changed the units one of your variables was in, e.g. from metres to kilometres). Something like median absolute deviation (or an appropriate multiple of the MAD if you are trying to use it as a kind of robust estimator of the standard deviation) may have been more appropriate. But it still won't turn that diagonal into a diagonal of ones.
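To illustrate that last point with made-up data: dividing by the MAD also brings the columns onto comparable scales, but the diagonal of the resulting covariance matrix is not a row of ones (for roughly normal data it comes out around 2.2, since the MAD is about 0.6745 times the standard deviation):

```python
import numpy as np

rng = np.random.default_rng(2)

# Two made-up variables measured on very different scales.
X = np.column_stack([
    rng.normal(0, 1, 5_000),
    rng.normal(0, 1_000, 5_000),
])
Xc = X - X.mean(axis=0)

# Scale by the median absolute deviation instead of the standard deviation.
mad = np.median(np.abs(Xc - np.median(Xc, axis=0)), axis=0)
W = Xc / mad

print(np.round(np.diag(np.cov(W, rowvar=False)), 2))
# Comparable to each other, but each is about 2.2 rather than 1.0.
```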

The upshot is that a method that works on the covariance matrix of standardized data is essentially using the correlation matrix of the original data. For which of the two you'd prefer to use for PCA, see PCA on correlation or covariance?
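For completeness, here is a minimal sketch (simulated data again) showing that eigendecomposition of the covariance matrix of the standardized data gives the same PCA as eigendecomposition of the correlation matrix of the original data (eigenvectors are only defined up to sign):

```python
import numpy as np

rng = np.random.default_rng(3)

# Three made-up variables on wildly different scales.
n = 2_000
x1 = rng.normal(0, 1, n)
x2 = 0.5 * x1 + rng.normal(0, 50, n)
x3 = rng.normal(0, 0.01, n)
X = np.column_stack([x1, x2, x3])

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# PCA via eigendecomposition of each matrix.
evals_Z, evecs_Z = np.linalg.eigh(np.cov(Z, rowvar=False))
evals_X, evecs_X = np.linalg.eigh(np.corrcoef(X, rowvar=False))

print(np.allclose(evals_Z, evals_X))                  # same eigenvalues
print(np.allclose(np.abs(evecs_Z), np.abs(evecs_X)))  # same axes, up to sign
```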