A sparse autoencoder is an unsupervised learning algorithm that tries to learn an approximation of the identity function of its input. As mentioned in the notes of Andrew Ng's lecture on deep learning, the average activation of the neurons in the hidden layer over the training set is restricted to some small value, say 0.01, called the sparsity parameter ($\rho$). I am confused: why would we want to restrict the activation of the hidden neurons?
Solved – the intuition behind the sparsity parameter in sparse autoencoders
autoencoders, deep learning, unsupervised learning
Related Solutions
The autoencoder package is just an implementation of the autoencoder described in Andrew Ng's class notes, which might be a good starting point for further reading. Now, to tackle your questions:
People sometimes distinguish between *parameters*, which the learning algorithm calculates itself, and *hyperparameters*, which control that learning process and need to be provided to the learning algorithm. **It is important to realise that there are NO MAGIC VALUES** for the hyperparameters. The optimal value will vary, depending on the data you're modeling: you'll have to try them on your data.
a) Lambda ($\lambda$) controls how the weights are updated during backpropagation. Instead of updating the weights based only on the difference between the model's output and the ground truth, the cost function includes a term that penalizes large weights (specifically, the squared value of all weights). Lambda controls the relative importance of this penalty term, which tends to drag weights toward zero and helps avoid overfitting.
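As a sketch of such a penalized cost in numpy (function and variable names here are illustrative, not from the notes):

```python
import numpy as np

def cost_with_weight_decay(output, target, weights, lam):
    """Reconstruction error plus an L2 penalty on all weights.

    `lam` (lambda) trades off data fit against weight size; larger
    values drag the weights toward zero. Names are illustrative.
    """
    reconstruction = 0.5 * np.sum((output - target) ** 2)
    penalty = 0.5 * lam * sum(np.sum(W ** 2) for W in weights)
    return reconstruction + penalty

# With lambda = 0 the penalty vanishes; increasing lambda raises the
# cost for the same weights, so gradient descent prefers smaller ones.
W = [np.array([[2.0, -1.0]]), np.array([[0.5]])]
base = cost_with_weight_decay(np.array([1.0]), np.array([0.0]), W, 0.0)
decayed = cost_with_weight_decay(np.array([1.0]), np.array([0.0]), W, 0.1)
```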
b) Rho ($\rho$) and beta ($\beta$) control sparseness. Rho is the target average activation of a hidden unit (averaged across the training set). The representation becomes sparser as $\rho$ is made smaller. This sparseness is imposed by adjusting the bias term, and beta controls the size of its updates. (It looks like $\beta$ actually just rescales the overall learning rate $\alpha$.)
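One common way to impose the target, used in Ng's notes, is a KL-divergence penalty between $\rho$ and the observed average activation $\hat\rho$ of each unit; a minimal numpy sketch (the function name and array shapes are my own):

```python
import numpy as np

def sparsity_penalty(activations, rho, beta):
    """KL-divergence sparsity penalty (sketch).

    `activations` holds hidden-unit outputs for each training example
    (rows = examples, columns = hidden units). rho_hat is each unit's
    average activation; the penalty is zero when rho_hat equals rho
    and grows as it drifts away. beta scales its weight in the cost.
    """
    rho_hat = activations.mean(axis=0)
    kl = (rho * np.log(rho / rho_hat)
          + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))
    return beta * kl.sum()
```

Hidden units whose average activation already sits at $\rho$ contribute nothing; densely firing units are what the penalty pushes down.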
c) Epsilon ($\epsilon$) controls the initial weight values, which are drawn at random from $N(0, \epsilon^2)$.
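A minimal sketch of that initialization in numpy (the layer sizes and the $\epsilon$ value are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
epsilon = 0.01  # illustrative value; tune on your own data

# Weights drawn from N(0, epsilon^2): small random values that break
# symmetry between units without saturating the sigmoid early on.
W1 = rng.normal(loc=0.0, scale=epsilon, size=(25, 64))  # hidden x input
W2 = rng.normal(loc=0.0, scale=epsilon, size=(64, 25))  # output x hidden
```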
Your rho values don't seem unreasonable since both are near the bottom of the activation function's range (0 to 1 for logistic, -1 to 1 for tanh). However, this obviously depends on the amount of sparseness you want and the number of hidden units you use too.
LeCun's major concern with small weights is that the error surface becomes very flat near the origin if you're using a symmetric sigmoid. Elsewhere in that paper, he recommends initializing with weights randomly drawn from a normal distribution with zero mean and standard deviation $m^{-1/2}$, where $m$ is the number of connections each unit receives.
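A sketch of that recommendation (hypothetical helper name; the fan-in/fan-out sizes are made up):

```python
import numpy as np

rng = np.random.default_rng(1)

def fan_in_init(fan_in, fan_out):
    """Draw weights from N(0, 1/fan_in), i.e. std = fan_in ** -0.5,
    per the recommendation described above (a sketch, not LeCun's code).
    `fan_in` is the number of connections each unit receives."""
    return rng.normal(loc=0.0, scale=fan_in ** -0.5, size=(fan_out, fan_in))

W = fan_in_init(fan_in=64, fan_out=25)  # std of entries ~ 0.125
```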
There are lots of "rules of thumb" for choosing the number of hidden units. Your initial guess (2x input) seems in line with most of them. That said, these guesstimates are much more guesswork than estimation. Assuming you've got the processing power, I would err on the side of more hidden units, then enforce sparseness with a low rho value.
One obvious use of autoencoders is to generate more compact feature representations for other learning algorithms. A raw image might have millions of pixels, but a (sparse) autoencoder can re-represent that in a much smaller space. [Geoff Hinton][2] (and others) have shown that they generate useful features for subsequent classification. Some of the deep learning work uses autoencoders or similar to pretrain the network. [Vincent et al.][3] use autoencoders directly to perform classification.
The ability to generate succinct feature representations can be used in other contexts as well. Here's a neat little project where autoencoder-produced states are used to guide a reinforcement learning algorithm through Atari games.
Finally, one can also use autoencoders to reconstruct noisy or degraded input, like so, which can be a useful end in and of itself.
Are you familiar with the notion of over-fitting? The problem here is that an overcomplete autoencoder has too many degrees of freedom, so it can easily match any input data without constraining how it must behave on unseen data. Without such a constraint, nothing pushes the learning system toward a hypothesis that both matches the training data and makes predictions about unseen data consistent with the "pattern" followed by the training data.
To make this concrete, imagine a very constrained learner that must find a linear relationship between its single input variable x and its output variable y. With a small amount of data, this learner can find the best possible linear relationship between x and y. Contrast this with a learner that is allowed to output an arbitrarily long list of (x, y) pairs: given a query x, it simply returns the y value of the nearest training x.
If the data is very noisy, then even if there is a fairly good linear relationship between x and y, this lookup learner will do a bad job of predicting y: the nearest x value in the training data carries a noisy y, which may be anywhere.
This learning system 'overfit' the data: it modeled the training data very precisely (getting a perfect score by putting an (x, y) pair on top of each training data point), but its hypothesis space was too flexible. The learning algorithm was never forced to select from a constrained set of patterns (like linear relations between x and y), so it never learned anything interesting about the unseen points; it merely copied the exact shape of the training data into the learned model.
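The lookup-table learner can be sketched as a 1-nearest-neighbour regressor on noisy linear data: it scores a perfect zero error on the training set, yet predicts unseen points worse than the constrained linear fit (a toy numpy demo; the data is synthetic):

```python
import numpy as np

rng = np.random.default_rng(42)

# Noisy linear data: y = 2x + Gaussian noise.
x_train = rng.uniform(0, 1, 50)
y_train = 2 * x_train + rng.normal(0, 0.5, 50)
x_test = rng.uniform(0, 1, 200)
y_test = 2 * x_test + rng.normal(0, 0.5, 200)

def nn_predict(x):
    """Memorize the training set: answer with the y of the nearest x."""
    return np.array([y_train[np.argmin(np.abs(x_train - xi))] for xi in x])

# The constrained learner: an ordinary least-squares line.
slope, intercept = np.polyfit(x_train, y_train, 1)

def linear_predict(x):
    return slope * x + intercept

nn_train_err = np.mean((nn_predict(x_train) - y_train) ** 2)  # exactly 0
nn_test_err = np.mean((nn_predict(x_test) - y_test) ** 2)
lin_test_err = np.mean((linear_predict(x_test) - y_test) ** 2)
```

The memorizer's training error is exactly zero, but on fresh noisy points the constrained linear fit generalizes better.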
In an analogous way, dense autoencoders allow perfect encoding of any training data, thus providing no constraint to drive generalization.
Best Answer
An autoencoder attempts to reconstruct its input. During this process it could simply learn the identity function if the size of the hidden layer is at least the number of inputs. However, that is not desirable.
During learning, the autoencoder discovers the most common features in the input. For example if the input is a natural image, it discovers an edge because it is the most common feature in all natural images.
In the simplest case, the autoencoder is constructed with fewer hidden units than its input layer. As hidden units are added, it can enlist more features to represent the input. However, once the number of hidden units exceeds the number of input units, the features become more and more dependent on one another: the autoencoder discovers these redundant features when the hidden units are densely activated.
Sparsity restricts the activation of the hidden units, which reduces the dependency between features. This allows us to increase the number of features, which is desirable.
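To make the constraint concrete, here is a toy numpy sketch (shapes and values are illustrative): the constraint acts on each hidden unit's average activation over the training set, and lowering a unit's bias is one way training drives that average down toward $\rho$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(7)
X = rng.uniform(0, 1, size=(100, 8))   # 100 examples, 8 inputs
W = rng.normal(0, 0.01, size=(20, 8))  # 20 hidden units, small init
b = np.zeros(20)

# Hidden activations for every example, then the per-unit average:
A = sigmoid(X @ W.T + b)
rho_hat = A.mean(axis=0)  # near 0.5 with small weights and zero bias

# With a target like rho = 0.01, the sparsity term pushes rho_hat
# down, so each unit fires strongly for only a few inputs. Lowering
# the biases (done here by hand) is one mechanism that achieves this.
b_sparse = b - 4.0
rho_hat_sparse = sigmoid(X @ W.T + b_sparse).mean(axis=0)
```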