Solved – Why do deep learning practitioners forego PCA for ZCA

data-transformation, deep-learning, machine-learning, pca

I have an understanding of PCA and ZCA, and I have read a similar question on the subject which, unfortunately, does not answer my specific question.

I understand the benefits of data whitening: specifically, standardizing the dynamic range of each data feature, which is very important when using stochastic gradient descent. What I fail to understand is why one would opt for ZCA and thereby forego the benefit of having decorrelated features.

I understand that ZCA-whitened data is more appealing to the human eye, but aren't we making it harder for the learning algorithm to generalize from the data?

Best Answer

A big benefit of ZCA is that the whitened data is still a picture in the same space as the original. If you ZCA whiten a photo of a cat, it still looks like a cat. This is helpful for other techniques that search for nonlinear structure: you can take an $n\times n$ patch from a picture and apply a filter to it with the belief that the pixels will exhibit certain useful dependencies by virtue of being neighbours. E.g. is there an eye in this patch? Is there fur in this patch? The same is emphatically not true of PCA, which completely disregards the spatial structure of images.
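For intuition, here is a minimal NumPy sketch (my own illustration, not from the answer) that contrasts the two transforms on synthetic, spatially correlated patches; the helper `make_patches` and the correlation check are illustrative assumptions. The point is only that each ZCA-whitened dimension still tracks "its own" pixel, whereas PCA components do not.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_patches(n_patches=5000, size=8):
    """Synthetic flattened patches whose neighbouring pixels co-vary."""
    raw = rng.normal(size=(n_patches, size, size))
    # Smooth each patch so adjacent pixels are correlated, mimicking natural images.
    smooth = (raw + np.roll(raw, 1, axis=1) + np.roll(raw, 1, axis=2)) / 3.0
    return smooth.reshape(n_patches, -1)

X = make_patches()
X -= X.mean(axis=0)                        # centre each pixel
Sigma = X.T @ X / len(X)                   # covariance matrix
eigvals, U = np.linalg.eigh(Sigma)         # Sigma = U diag(eigvals) U^T
D_inv_sqrt = np.diag(1.0 / np.sqrt(eigvals + 1e-5))

W_pca = D_inv_sqrt @ U.T                   # PCA whitening: rotate into the eigenbasis
W_zca = U @ D_inv_sqrt @ U.T               # ZCA whitening: rotate back to pixel space

X_pca = X @ W_pca.T
X_zca = X @ W_zca.T

def own_pixel_corr(Xw):
    """Mean |correlation| between whitened dimension i and original pixel i."""
    return np.mean([abs(np.corrcoef(X[:, i], Xw[:, i])[0, 1])
                    for i in range(X.shape[1])])

print(f"ZCA: {own_pixel_corr(X_zca):.2f}   PCA: {own_pixel_corr(X_pca):.2f}")
```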

Second, contrary to your statement, ZCA does decorrelate the data. The development in Bell 1997 (equations 5 and 8) makes this a requirement of the technique. Take the covariance matrix $\mathbf{\Sigma}$, eigendecompose it as $\mathbf{\Sigma} = \mathbf{U}\mathbf{D}\mathbf{U}^T$, and form the whitening matrix $\mathbf{W}_z = \mathbf{U}\mathbf{D}^{-1/2}\mathbf{U}^T$. Then for a new $\mathbf{x}$ drawn from the same distribution we have
$$\operatorname{Cov}(\mathbf{W}_z\mathbf{x}, \mathbf{W}_z\mathbf{x}) = \mathbf{W}_z\operatorname{Cov}(\mathbf{x}, \mathbf{x})\mathbf{W}_z^T = \mathbf{W}_z\mathbf{\Sigma}\mathbf{W}_z^T = \mathbf{I}.$$
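
To make the decorrelation claim concrete, here is a short NumPy sketch (my own check, not from Bell 1997) that builds $\mathbf{W}_z$ from the eigendecomposition of the sample covariance and verifies that the whitened covariance comes out as the identity.

```python
import numpy as np

rng = np.random.default_rng(1)

# Correlated Gaussian data: Cov(x) is far from the identity.
A = rng.normal(size=(4, 4))
X = rng.normal(size=(10000, 4)) @ A.T
X -= X.mean(axis=0)

Sigma = np.cov(X, rowvar=False)            # sample covariance Sigma
eigvals, U = np.linalg.eigh(Sigma)         # Sigma = U D U^T
W_z = U @ np.diag(eigvals ** -0.5) @ U.T   # ZCA whitening matrix W_z = U D^{-1/2} U^T

X_white = X @ W_z.T                        # apply W_z to every sample x

# Covariance of the whitened data is the identity (up to floating-point error).
print(np.round(np.cov(X_white, rowvar=False), 3))
```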