The differences are easiest to see by looking at the models themselves. Let's look at sparse coding first.
Sparse coding
Sparse coding minimizes the objective
$$
\mathcal{L}_{\text{sc}} = \underbrace{||WH - X||_2^2}_{\text{reconstruction term}} + \underbrace{\lambda ||H||_1}_{\text{sparsity term}}
$$
where $W$ is a matrix of bases, $H$ is a matrix of codes and $X$ is a matrix of the data we wish to represent. $\lambda$ implements a trade-off between sparsity and reconstruction. Note that if we are given $H$, estimating $W$ is easy via least squares.
In the beginning we do not have $H$, however. Still, many algorithms exist that can solve the objective above with respect to $H$. Actually, this is how we do inference: we need to solve an optimisation problem whenever we want to know the $h$ belonging to an unseen $x$.
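To make the "inference is optimisation" point concrete, here is a minimal numpy sketch (the function names and settings are mine, not from any particular library) that infers $H$ with ISTA, a standard proximal gradient solver for this kind of L1-regularised objective, and re-estimates $W$ by least squares:

```python
import numpy as np

def soft_threshold(Z, t):
    """Elementwise soft-thresholding, the proximal operator of the L1 norm."""
    return np.sign(Z) * np.maximum(np.abs(Z) - t, 0.0)

def ista_codes(W, X, lam, n_steps=200):
    """Infer codes H for data X given bases W by minimising
    ||W H - X||^2 + lam * ||H||_1 with ISTA."""
    L = 2.0 * np.linalg.norm(W, 2) ** 2          # Lipschitz constant of the gradient below
    H = np.zeros((W.shape[1], X.shape[1]))
    for _ in range(n_steps):
        grad = 2.0 * W.T @ (W @ H - X)           # gradient of the reconstruction term
        H = soft_threshold(H - grad / L, lam / L)
    return H

def update_bases(H, X):
    """Given codes H, the bases W follow from ordinary least squares:
    W = X H^T (H H^T)^{-1}."""
    return X @ H.T @ np.linalg.pinv(H @ H.T)
```

Alternating the two functions is one simple way of learning the dictionary; the important bit is that every unseen $x$ requires another run of `ista_codes`, which is exactly the costly inference step mentioned above.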
Auto encoders
Auto encoders are a family of unsupervised neural networks. There are quite a lot of them, e.g. deep auto encoders or those with different regularisation tricks attached (denoising, contractive, sparse). There even exist probabilistic ones, such as generative stochastic networks or the variational auto encoder. Their most abstract form is
$$
D(d(e(x;\theta^r); \theta^d), x)
$$
where $e$ is an encoder (or recognition model) with parameters $\theta^r$, $d$ is a decoder with parameters $\theta^d$, and $D$ is some distance between the reconstruction and the input. But we will go along with a much simpler form for now:
$$
\mathcal{L}_{\text{ae}} = ||W\sigma(W^TX) - X||_2^2
$$
where $\sigma$ is a nonlinear function such as the logistic sigmoid $\sigma(x) = {1 \over 1 + \exp(-x)}$.
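For comparison with the sparse coding sketch above, here is a minimal PyTorch sketch of this tied-weight auto encoder (all sizes and hyperparameters are made up); note that the codes come from a single forward pass instead of an inner optimisation loop:

```python
import torch

# Tied-weight auto encoder matching L_ae = ||W sigmoid(W^T X) - X||^2.
# X is (d, n) with one data point per column, W is (d, k); all sizes are made up.
d, k, n = 64, 16, 512
X = torch.randn(d, n)                        # placeholder data
W = (0.1 * torch.randn(d, k)).requires_grad_()

opt = torch.optim.Adam([W], lr=1e-2)
for step in range(500):
    H = torch.sigmoid(W.t() @ X)             # codes: one cheap forward pass per batch
    loss = ((W @ H - X) ** 2).sum()          # reconstruction error
    opt.zero_grad()
    loss.backward()
    opt.step()
```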
Similarities
Note that $\mathcal{L}_{\text{sc}}$ looks almost like $\mathcal{L}_{\text{ae}}$ once we set $H = \sigma(W^TX)$. The differences are that i) auto encoders do not encourage sparsity in their general form, and ii) an auto encoder uses a model for finding the codes, while sparse coding does so by means of optimisation.
For natural image data, regularised auto encoders and sparse coding tend to yield very similar $W$. However, auto encoders are much more efficient and easily generalise to much more complicated models: the decoder can be highly nonlinear, e.g. a deep neural network. Furthermore, one is not tied to the squared loss (on which the estimation of $W$ for $\mathcal{L}_{\text{sc}}$ depends).
Also, the different methods of regularisation yield representations with different characteristics. Denoising auto encoders, for example, have been shown to be equivalent to a certain form of RBM.
But why?
If you want to solve a prediction problem, you will not need auto encoders unless you have only little labeled data and a lot of unlabeled data. In that case, you will generally be better off training a deep auto encoder and putting a linear SVM on top rather than training a deep neural net.
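A rough sketch of that pipeline, assuming made-up data and sizes (in practice you would of course use a real dataset and tune everything):

```python
import torch
import torch.nn as nn
from sklearn.svm import LinearSVC

# Semi-supervised sketch: learn features on unlabeled data with a small deep
# auto encoder, then fit a linear SVM on the codes of the labeled data.
d, k = 100, 10
X_unlab = torch.randn(5000, d)               # plenty of unlabeled data
X_lab = torch.randn(200, d)                  # little labeled data
y_lab = torch.randint(0, 2, (200,)).numpy()  # made-up binary labels

encoder = nn.Sequential(nn.Linear(d, 50), nn.ReLU(), nn.Linear(50, k))
decoder = nn.Sequential(nn.Linear(k, 50), nn.ReLU(), nn.Linear(50, d))
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

for step in range(200):                      # unsupervised pre-training
    recon = decoder(encoder(X_unlab))
    loss = ((recon - X_unlab) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

with torch.no_grad():
    codes = encoder(X_lab).numpy()           # features for the small labeled set
clf = LinearSVC().fit(codes, y_lab)          # linear classifier on top of the codes
```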
However, auto encoders are very powerful models for capturing characteristics of distributions. This is vague, but research turning it into hard statistical facts is currently being conducted. Deep latent Gaussian models (a.k.a. variational auto encoders) and generative stochastic networks are pretty interesting ways of obtaining auto encoders which provably estimate the underlying data distribution.
In the case of CNNs, filters are applied to small patches of an image at each possible location (which also makes them translation invariant).
An autoencoder's hidden layers get the whole image (or the output of the previous layer) as their input, which doesn't look like a good idea for images: usually only spatially local features correlate strongly, whereas more distant ones are less correlated. Also, these hidden neurons are not translation invariant.
Thus, CNNs are like usual ANNs with a special kind of regularization, which zeros out most of the weights (and ties the remaining ones across locations) to make use of locality.
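One quick way to see how strong that constraint is: compare parameter counts of a single fully connected layer and a single convolutional layer over the same image (a sketch with arbitrary sizes):

```python
import torch.nn as nn

# Parameter counts for one layer mapping a 1x28x28 image to 8 feature maps.
fc = nn.Linear(28 * 28, 8 * 28 * 28)               # every hidden unit sees the whole image
conv = nn.Conv2d(1, 8, kernel_size=5, padding=2)   # each unit sees a 5x5 patch, weights shared

n_fc = sum(p.numel() for p in fc.parameters())
n_conv = sum(p.numel() for p in conv.parameters())
print(n_fc, n_conv)                                # ~4.9 million vs 208 parameters
```

The convolutional layer gets away with a few hundred parameters precisely because most connections are zeroed out and the remaining ones are shared across locations.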
Best Answer
PCA is restricted to a linear map, while auto encoders can have nonlinear encoders/decoders.
A single-layer auto encoder with a linear transfer function is nearly equivalent to PCA, where "nearly" means that the $W$ found by the AE and by PCA won't necessarily be the same, but the subspaces spanned by the respective $W$'s will be.
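A small numerical sketch of that claim, with made-up data and an arbitrary optimisation schedule; note it compares the projectors onto the learned subspaces rather than the $W$'s themselves:

```python
import numpy as np
import torch

# Made-up low-rank data plus noise, one data point per row, centred as PCA assumes.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3)) @ rng.standard_normal((3, 10))
X = X + 0.1 * rng.standard_normal((500, 10))
X -= X.mean(axis=0)
k = 3

# PCA: top-k principal directions from the SVD of the centred data matrix.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt[:k].T                                     # (10, k), orthonormal columns

# Linear auto encoder: minimise ||X W_e W_d - X||^2, no nonlinearity, no sparsity.
Xt = torch.tensor(X, dtype=torch.float32)
W_e = torch.randn(10, k, requires_grad=True)
W_d = torch.randn(k, 10, requires_grad=True)
opt = torch.optim.Adam([W_e, W_d], lr=1e-2)
for _ in range(3000):
    loss = ((Xt @ W_e @ W_d - Xt) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Compare the projector onto the decoder's row space with the PCA projector:
# the difference should be small even though W_d itself is not V^T.
Q, _ = np.linalg.qr(W_d.detach().numpy().T)      # orthonormal basis, (10, k)
print(np.abs(Q @ Q.T - V @ V.T).max())
```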