Sparse coding is defined as learning an over-complete set of basis vectors to represent input vectors (why would we want an over-complete basis?). What are the differences between sparse coding and an autoencoder? When should we use sparse coding, and when an autoencoder?
Sparse Coding vs Autoencoder – Key Differences in Machine Learning
Related Solutions
The `autoencoder` package is just an implementation of the autoencoder described in Andrew Ng's class notes, which might be a good starting point for further reading. Now, to tackle your questions:
People sometimes distinguish between *parameters*, which the learning algorithm calculates itself, and *hyperparameters*, which control that learning process and need to be provided to the learning algorithm. **It is important to realise that there are NO MAGIC VALUES** for the hyperparameters. The optimal value will vary, depending on the data you're modeling: you'll have to try them on your data.
a) Lambda ($\lambda$) controls how the weights are updated during backpropagation. Instead of just updating the weights based on the difference between the model's output and the ground truth, the cost function includes a term which penalizes large weights (specifically, the squared value of all weights). Lambda controls the relative importance of this penalty term, which tends to drag weights towards zero and helps avoid overfitting.
b) Rho ($\rho$) and beta ($\beta$) control sparseness. Rho is the expected activation of a hidden unit (averaged across the training set). The representation becomes sparser as rho is made smaller. This sparseness is imposed by adjusting the bias term, and beta controls the size of its updates. (It looks like $\beta$ actually just rescales the overall learning rate $\alpha$.)
c) Epsilon ($\epsilon$) controls the initial weight values, which are drawn at random from $N(0, \epsilon^2)$.
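To make (a)–(c) concrete, here is a minimal numpy sketch of a sparse-autoencoder cost in the style of Ng's notes. I've written the sparsity term as the usual KL penalty (one common way the bias-adjustment idea above is realised); all sizes, values, and function names are illustrative, not the package's actual API.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sparse_ae_cost(W1, W2, b1, b2, X, lam, rho, beta):
    """Illustrative sparse-autoencoder cost on X (features x examples),
    showing where lambda, rho and beta enter."""
    H = sigmoid(W1 @ X + b1)                 # hidden activations
    Xhat = sigmoid(W2 @ H + b2)              # reconstruction
    recon = 0.5 * np.mean(np.sum((Xhat - X) ** 2, axis=0))
    # lambda: penalty on the squared value of all weights
    decay = (lam / 2) * (np.sum(W1 ** 2) + np.sum(W2 ** 2))
    # rho: desired average activation; rho_hat: actual average activation
    rho_hat = H.mean(axis=1)
    kl = np.sum(rho * np.log(rho / rho_hat)
                + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))
    # beta: relative importance of the sparsity term
    return recon + decay + beta * kl

# epsilon: scale of the random initial weights, W ~ N(0, epsilon^2)
rng = np.random.default_rng(0)
eps, n_in, n_hid = 0.01, 8, 4
W1 = rng.normal(0, eps, (n_hid, n_in))
W2 = rng.normal(0, eps, (n_in, n_hid))
b1, b2 = np.zeros((n_hid, 1)), np.zeros((n_in, 1))
X = rng.random((n_in, 20))
cost = sparse_ae_cost(W1, W2, b1, b2, X, lam=1e-4, rho=0.05, beta=3.0)
```

Raising beta (or lowering rho) makes the sparsity term dominate the trade-off, exactly as described above.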
Your rho values don't seem unreasonable since both are near the bottom of the activation function's range (0 to 1 for logistic, -1 to 1 for tanh). However, this obviously depends on the amount of sparseness you want and the number of hidden units you use too.
LeCun's major concern with small weights is that the error surface becomes very flat near the origin if you're using a symmetric sigmoid. Elsewhere in that paper, he recommends initializing with weights randomly drawn from a normal distribution with zero mean and $m^{-1/2}$ standard deviation, where $m$ is the number of connections each unit receives.
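That recommendation is easy to sketch in numpy (the function name is mine, not from the paper):

```python
import numpy as np

def lecun_init(fan_in, fan_out, rng=None):
    """Draw weights from N(0, 1/m), i.e. std = fan_in ** -0.5,
    where fan_in is the number of connections into each unit."""
    if rng is None:
        rng = np.random.default_rng(0)
    return rng.normal(0.0, fan_in ** -0.5, size=(fan_out, fan_in))

W = lecun_init(fan_in=256, fan_out=64)
print(W.std())  # close to 256 ** -0.5 == 0.0625
```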
There are lots of "rules of thumb" for choosing the number of hidden units. Your initial guess (2x input) seems in line with most of them. That said, these guesstimates are much more guesswork than estimation. Assuming you've got the processing power, I would err on the side of more hidden units, then enforce sparseness with a low rho value.
One obvious use of autoencoders is to generate more compact feature representations for other learning algorithms. A raw image might have millions of pixels, but a (sparse) autoencoder can re-represent it in a much smaller space. Geoff Hinton (and others) have shown that autoencoders generate useful features for subsequent classification. Some of the deep learning work uses autoencoders or similar to pretrain the network. Vincent et al. use autoencoders directly to perform classification.
The ability to generate succinct feature representations can be used in other contexts as well. Here's a neat little project where autoencoder-produced states are used to guide a reinforcement learning algorithm through Atari games.
Finally, one can also use autoencoders to reconstruct noisy or degraded input (denoising autoencoders), which can be a useful end in and of itself.
PCA is restricted to a linear map, while auto encoders can have nonlinear encoders/decoders.
A single layer auto encoder with linear transfer function is nearly equivalent to PCA, where "nearly" means that the $W$ found by the AE and by PCA won't necessarily be the same — but the subspaces spanned by the respective $W$'s will be.
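This subspace claim can be checked numerically: if $U_k$ is the PCA basis, then any $W = U_k Q$ with invertible $Q$ (the kind of solution a linear AE can converge to) spans the same subspace, so the orthogonal projectors coincide even though the matrices differ. A small numpy sketch with illustrative sizes:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 200))            # data: 5 features, 200 samples
X -= X.mean(axis=1, keepdims=True)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
Uk = U[:, :2]                            # PCA basis for a 2-D subspace

Q = np.array([[2.0, 1.0],                # an arbitrary invertible mixing
              [0.5, 3.0]])
W = Uk @ Q                               # a weight matrix a linear AE could find

# Orthogonal projectors onto the column spaces of Uk and W
P_pca = Uk @ Uk.T
P_ae = W @ np.linalg.solve(W.T @ W, W.T)

print(np.allclose(P_pca, P_ae))          # same subspace, different matrices
```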
Best Answer
Finding the differences can be done by looking at the models. Let's look at sparse coding first.
Sparse coding
Sparse coding minimizes the objective $$ \mathcal{L}_{\text{sc}} = \underbrace{||WH - X||_2^2}_{\text{reconstruction term}} + \underbrace{\lambda ||H||_1}_{\text{sparsity term}} $$ where $W$ is a matrix of bases, $H$ is a matrix of codes and $X$ is a matrix of the data we wish to represent. $\lambda$ implements a trade-off between sparsity and reconstruction. Note that if we are given $H$, estimation of $W$ is easy via least squares.
In the beginning, we do not have $H$ however. Yet, many algorithms exist that can solve the objective above with respect to $H$. Actually, this is how we do inference: we need to solve an optimisation problem if we want to know the $h$ belonging to an unseen $x$.
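One such algorithm is ISTA (iterative soft-thresholding). A minimal numpy sketch of inferring the code $h$ for a single $x$ given fixed bases $W$; the sizes, $\lambda$, and iteration count are illustrative:

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(W, x, lam, n_iter=200):
    """Solve min_h ||W h - x||^2 + lam * ||h||_1 by iterative soft-thresholding."""
    h = np.zeros(W.shape[1])
    # step size from the Lipschitz constant of the reconstruction gradient
    eta = 1.0 / (2.0 * np.linalg.norm(W, ord=2) ** 2)
    for _ in range(n_iter):
        grad = 2 * W.T @ (W @ h - x)     # gradient of the reconstruction term
        h = soft_threshold(h - eta * grad, eta * lam)
    return h

rng = np.random.default_rng(0)
W = rng.normal(size=(10, 30))            # over-complete: 30 bases for 10 dims
x = rng.normal(size=10)
h = ista(W, x, lam=0.5)
print(np.count_nonzero(h), "of", h.size, "codes are non-zero")
```

Note that this optimisation has to be run for every new $x$ — which is exactly the inference cost that auto encoders avoid.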
Auto encoders
Auto encoders are a family of unsupervised neural networks. There are quite a lot of them, e.g. deep auto encoders or those having different regularisation tricks attached--e.g. denoising, contractive, sparse. There even exist probabilistic ones, such as generative stochastic networks or the variational auto encoder. Their most abstract form is $$ D(d(e(x;\theta^r); \theta^d), x) $$ but we will go along with a much simpler one for now: $$ \mathcal{L}_{\text{ae}} = ||W\sigma(W^TX) - X||^2 $$ where $\sigma$ is a nonlinear function such as the logistic sigmoid $\sigma(x) = {1 \over 1 + \exp(-x)}$.
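A minimal numpy sketch of fitting this simple tied-weight objective by plain gradient descent; the gradient has two terms because $W$ appears in both the encoder and the decoder. Sizes, learning rate, and seed are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.random((8, 100))                 # data: 8 features, 100 examples
W = rng.normal(0, 0.1, (8, 4))           # 4 hidden units, tied weights

losses = []
for _ in range(500):
    Z = W.T @ X                          # pre-activations
    H = sigmoid(Z)                       # codes: H = sigma(W^T X)
    R = W @ H - X                        # reconstruction residual
    losses.append(np.sum(R ** 2))
    # dL/dW: decoder path + encoder path (chain rule through sigma,
    # using sigma'(Z) = H * (1 - H))
    grad = 2 * R @ H.T + 2 * X @ ((W.T @ R) * H * (1 - H)).T
    W -= 1e-3 * grad

print(losses[0], "->", losses[-1])
```

Once trained, inference is just the forward pass `sigmoid(W.T @ X)` — no per-example optimisation, in contrast to sparse coding.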
Similarities
Note that $\mathcal{L}_{\text{sc}}$ looks almost like $\mathcal{L}_{\text{ae}}$ once we set $H = \sigma(W^TX)$. The differences are that i) auto encoders do not encourage sparsity in their general form, and ii) an auto encoder uses a model (the encoder) for finding the codes, while sparse coding does so by means of optimisation.
For natural image data, regularized auto encoders and sparse coding tend to yield very similar $W$. However, auto encoders are much more efficient (inference is a single forward pass rather than an optimisation) and are easily generalized to much more complicated models. For example, the decoder can be highly nonlinear, e.g. a deep neural network. Furthermore, one is not tied to the squared loss (on which the estimation of $W$ for $\mathcal{L}_{\text{sc}}$ depends).
Also, the different methods of regularisation yield representations with different characteristics. Denoising auto encoders have also been shown to be equivalent to a certain form of RBMs, etc.
But why?
If you want to solve a prediction problem, you will not need auto encoders unless you have only a little labeled data and a lot of unlabeled data. Then you will generally be better off training a deep auto encoder and putting a linear SVM on top, rather than training a deep neural net.
However, they are very powerful models for capturing the characteristics of distributions. This is vague, but research turning it into hard statistical facts is ongoing. Deep latent Gaussian models (aka variational auto encoders) and generative stochastic networks are pretty interesting ways of obtaining auto encoders which provably estimate the underlying data distribution.