Help understanding matrix math in whitening transformation proofs

linear-algebra, matrices, statistics

I'm looking at a couple of short articles about whitening transformations:

Background

https://theclevermachine.wordpress.com/2013/03/30/the-statistical-whitening-transform/
and
https://andrewcharlesjones.github.io/posts/2020/05/whitening/

In both articles, there comes a step where given a centered data matrix $X$

we compute its covariance

$$\Sigma = XX^T$$

and come up with a matrix $W$ that satisfies

$$WW^T = \Sigma^{-1}$$

The idea now is that if we transform our data $X$ into $Y = WX$ we can show that

$$cov(Y) = WX (WX)^T$$
$$= WXX^TW^T$$
$$= W\Sigma W^T$$
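For context, I did check the claimed end result numerically. Here is a minimal numpy sketch using one concrete choice of $W$, the symmetric inverse square root $\Sigma^{-1/2}$, which is one matrix satisfying $WW^T = \Sigma^{-1}$ (this particular choice is mine; the articles don't pin $W$ down):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 500                           # d variables, n observations
X = rng.normal(size=(d, n))
X = X - X.mean(axis=1, keepdims=True)   # center each row (variable)

Sigma = X @ X.T                         # unnormalized covariance, as above

# Symmetric inverse square root via eigendecomposition: W = Sigma^{-1/2}
evals, evecs = np.linalg.eigh(Sigma)
W = evecs @ np.diag(evals ** -0.5) @ evecs.T

Y = W @ X
print(np.allclose(W @ W.T, np.linalg.inv(Sigma)))   # True
print(np.allclose(Y @ Y.T, np.eye(d)))              # True
```

So the conclusion does hold numerically for this $W$; my question is only about the algebraic step used to get there.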

Issue

All of this seems reasonable so far, but both authors in the referenced articles make the following leap:
They claim you can reduce the above to $I$. In this article,

https://theclevermachine.wordpress.com/2013/03/30/the-statistical-whitening-transform/,

some of the work is sort of shown:

It is stated that $W\Sigma W^T = WW^T\Sigma$, which would then obviously reduce to $I$.

Why is it OK to swap the order of $W^T$ and $\Sigma$ in the above expression?

Note: some of the matrix math in the article by Andrew Jones has a few matrix-dimension mistakes, which he is going to fix. I believe what I have summarized here makes sense except for the last line of the proof. I am curious whether that line is justified in some way I don't see… and I suspect it's going to be something I'm simply overlooking.

Best Answer

Short answer

If $\Sigma = XX^T$, then $\Sigma$ is $n \times n$ and $X$ is $n \times p$. If $Y = WX$, then $W$ must be $p \times n$, so $Y$ is $p \times p$ and $cov(Y)$ is $p \times p$. The identity $W\Sigma W^T = WW^T\Sigma$ makes no sense dimensionally: the left side multiplies $(p \times n)(n \times n)(n \times p)$, which conforms, while the right side ends with $(p \times p)(n \times n)$, which is undefined unless $p = n$.
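A quick numpy illustration of the mismatch (the sizes $n=5$, $p=2$ are arbitrary picks for the demo, not from the articles):

```python
import numpy as np

n, p = 5, 2                        # arbitrary sizes with n != p
rng = np.random.default_rng(1)
X = rng.normal(size=(n, p))
Sigma = X @ X.T                    # n x n
W = rng.normal(size=(p, n))        # p x n, so that Y = W X is p x p

print((W @ Sigma @ W.T).shape)     # (2, 2): the left-hand side conforms
try:
    W @ W.T @ Sigma                # (p x p) @ (n x n): undefined
except ValueError as e:
    print("shape mismatch:", e)
```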


Longer answer

I will address the introduction of the Andy Jones article.

  1. First, the original data matrix $X$ has to be centered. The column mean has to be subtracted from each element of the column.

  2. Next, the article states that $cov(X)=X^TX$. If $X$ is centered, its sample covariance is actually $cov(X)=\frac{X^TX}{n-1}$, where $n$ is the number of observations (rows).

  3. The dimensions of $X^T\Sigma^{-1} X$ in the introduction make no sense: $X^T$ is $p \times n$, $\Sigma^{-1}$ is $p \times p$, and $X$ is $n \times p$, so even the first product $X^T\Sigma^{-1}$ is undefined.

  4. The author claims $W^TW=\Sigma^{-1}$. For this to hold, $W$ must have dimensions $m \times p$ for some $m$. But $Y=WX$ by definition, and with $W$ of size $m \times p$ and $X$ of size $n \times p$, the product $WX$ is not even defined.

  5. It does not work out, no matter how you juggle the dimensions. It is possible I have gotten it completely wrong, in which case do let me know, but as far as I have been able to work it out, the dimensions do not cohere.
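For completeness, here is one way to set the dimensions up so that everything does conform, assuming the usual convention that rows are observations, and using the symmetric choice $W = \Sigma^{-1/2}$ (my choice for the sketch, not something the article specifies):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 1000, 3
X = rng.normal(size=(n, p))        # rows = observations, columns = variables
X = X - X.mean(axis=0)             # center each column

Sigma = X.T @ X / (n - 1)          # p x p sample covariance

# Symmetric inverse square root: W = Sigma^{-1/2}, so W^T W = Sigma^{-1}
evals, evecs = np.linalg.eigh(Sigma)
W = evecs @ np.diag(evals ** -0.5) @ evecs.T

Y = X @ W                          # whitened data, n x p
print(np.allclose(np.cov(Y, rowvar=False), np.eye(p)))   # True
```

With this orientation every product conforms and the whitened data has identity sample covariance, which appears to be the consistent version of what the article is aiming at.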


Conclusion

This article was not written to make mathematical sense, and the Berkeley article is just as bad. I recommend not reading too much into the math in these two articles and instead trying to work it out on your own.
