Solved – Why do deep learning practitioners forego PCA for ZCA

data-transformation, deep-learning, machine-learning, pca

I have an understanding of PCA and ZCA, and I have read a similar question on the subject which, unfortunately, does not answer my specific question.

I understand the benefits of data whitening: specifically, standardizing the dynamic range of each data feature, which is very important when using stochastic gradient descent. What I fail to understand is why one would opt for ZCA and thereby forego the benefit of having decorrelated features.

I understand that ZCA-whitened data is more appealing to the human eye, but aren't we making it harder for the learning algorithm to generalize from the data?

Best Answer

A big benefit of ZCA is that the whitened data is still a picture in the same space as the original. If you ZCA whiten a photo of a cat, it still looks like a cat. This is helpful for other techniques that search for nonlinear structure: you can take an $n\times n$ patch from a picture and apply a filter to it with the belief that the pixels will exhibit certain useful dependencies by virtue of being neighbours. E.g. is there an eye in this patch? Is there fur in this patch? The same is emphatically not true of PCA, which completely disregards the spatial structure of images.
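For intuition, here is a minimal NumPy sketch (my own illustration, not from the answer) that contrasts the two transforms on synthetic, spatially correlated patches; the helper `make_patches` and the correlation check are illustrative assumptions. The point is only that each ZCA-whitened dimension still tracks "its own" pixel, whereas PCA components do not.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_patches(n_patches=5000, size=8):
    """Synthetic flattened patches whose neighbouring pixels co-vary."""
    raw = rng.normal(size=(n_patches, size, size))
    # Smooth each patch so adjacent pixels are correlated, mimicking natural images.
    smooth = (raw + np.roll(raw, 1, axis=1) + np.roll(raw, 1, axis=2)) / 3.0
    return smooth.reshape(n_patches, -1)

X = make_patches()
X -= X.mean(axis=0)                        # centre each pixel
Sigma = X.T @ X / len(X)                   # covariance matrix
eigvals, U = np.linalg.eigh(Sigma)         # Sigma = U diag(eigvals) U^T
D_inv_sqrt = np.diag(1.0 / np.sqrt(eigvals + 1e-5))

W_pca = D_inv_sqrt @ U.T                   # PCA whitening: rotate into the eigenbasis
W_zca = U @ D_inv_sqrt @ U.T               # ZCA whitening: rotate back to pixel space

X_pca = X @ W_pca.T
X_zca = X @ W_zca.T

def own_pixel_corr(Xw):
    """Mean |correlation| between whitened dimension i and original pixel i."""
    return np.mean([abs(np.corrcoef(X[:, i], Xw[:, i])[0, 1])
                    for i in range(X.shape[1])])

print(f"ZCA: {own_pixel_corr(X_zca):.2f}   PCA: {own_pixel_corr(X_pca):.2f}")
```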

Second, contrary to your statement, ZCA does decorrelate the data. The development in Bell 1997 (equations 5 and 8) makes this a requirement of the technique. Take the covariance matrix $\mathbf{\Sigma}$, eigendecompose it as $\mathbf{\Sigma} = \mathbf{U}\mathbf{D}\mathbf{U}^T$, and form the whitening matrix $\mathbf{W}_z = \mathbf{U}\mathbf{D}^{-1/2}\mathbf{U}^T$. Then for a new $\mathbf{x}$ drawn from the same distribution we have
$$\operatorname{Cov}(\mathbf{W}_z\mathbf{x}, \mathbf{W}_z\mathbf{x}) = \mathbf{W}_z\operatorname{Cov}(\mathbf{x}, \mathbf{x})\mathbf{W}_z^T = \mathbf{W}_z\mathbf{\Sigma}\mathbf{W}_z^T = \mathbf{I}.$$
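
To make the decorrelation claim concrete, here is a short NumPy sketch (my own check, not from Bell 1997) that builds $\mathbf{W}_z$ from the eigendecomposition of the sample covariance and verifies that the whitened covariance comes out as the identity.

```python
import numpy as np

rng = np.random.default_rng(1)

# Correlated Gaussian data: Cov(x) is far from the identity.
A = rng.normal(size=(4, 4))
X = rng.normal(size=(10000, 4)) @ A.T
X -= X.mean(axis=0)

Sigma = np.cov(X, rowvar=False)            # sample covariance Sigma
eigvals, U = np.linalg.eigh(Sigma)         # Sigma = U D U^T
W_z = U @ np.diag(eigvals ** -0.5) @ U.T   # ZCA whitening matrix W_z = U D^{-1/2} U^T

X_white = X @ W_z.T                        # apply W_z to every sample x

# Covariance of the whitened data is the identity (up to floating-point error).
print(np.round(np.cov(X_white, rowvar=False), 3))
```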