Sparse Autoencoder [Hyper]parameters

autoencoders, deep learning, deep-belief-networks, neural networks, optimization

I have just started using the autoencoder package in R.

Inputs to the autoencode() function include lambda, beta, rho and epsilon.

What are the bounds for these values? Do they vary for each activation function? Are these parameters called "hyperparameters"?

Assuming a sparse autoencoder, is rho=.01 good for the logistic activation function and rho=-.9 good for the hyperbolic tangent activation function?

Why does the manual set epsilon to .001? If I remember correctly, "Efficient BackProp" by LeCun et al. recommends starting values which are not so close to zero.

How much does a "good" value for beta matter?

Is there a "rule of thumb" for choosing the number of neurons in the hidden layer? For example, if the input layer has N nodes, is it reasonable to have 2N neurons in the hidden layer?

Can you recommend some literature on the practical use of autoencoders?

Best Answer

The autoencoder package is just an implementation of the autoencoder described in Andrew Ng's class notes, which might be a good starting point for further reading. Now, to tackle your questions:


People sometimes distinguish between *parameters*, which the learning algorithm calculates itself, and *hyperparameters*, which control that learning process and need to be provided to the learning algorithm. **It is important to realise that there are NO MAGIC VALUES** for the hyperparameters. The optimal value will vary, depending on the data you're modeling: you'll have to try them on your data.

a) Lambda ($\lambda$) controls how strongly large weights are penalized. Instead of updating the weights based only on the difference between the model's output and the ground truth, the cost function includes a term that penalizes large weights (specifically, the sum of the squared weights). Lambda sets the relative importance of this penalty term, which tends to drag weights towards zero and helps avoid overfitting.

b) Rho ($\rho$) and beta ($\beta$) control sparseness. Rho is the expected activation of a hidden unit (averaged across the training set). The smaller rho is, the sparser the representation becomes. This sparseness is imposed by adjusting the bias term, and beta controls the size of its updates. (It looks like $\beta$ actually just rescales the overall learning rate $\alpha$.)

c) Epsilon ($\epsilon$) controls the initial weight values, which are drawn at random from $N(0, \epsilon^2)$.
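To make the roles of the four values concrete, here is a minimal sketch in plain Python. It assumes the cost has the form given in Ng's CS294A notes: reconstruction error, plus $\frac{\lambda}{2}$ times the sum of squared weights, plus $\beta$ times a KL-divergence sparsity term. That KL form is an assumption about the cost, and it differs from the bias-update description in (b). The function names are made up for illustration and are not part of the autoencoder package.

```python
import math
import random


def init_weights(n_out, n_in, epsilon, rng):
    """n_out x n_in matrix with entries drawn from N(0, epsilon^2)."""
    return [[rng.gauss(0.0, epsilon) for _ in range(n_in)] for _ in range(n_out)]


def weight_decay(weight_matrices, lam):
    """(lambda / 2) * sum of all squared weights; biases are not penalized."""
    return 0.5 * lam * sum(w * w for W in weight_matrices for row in W for w in row)


def sparsity_penalty(hidden_activations, rho, beta):
    """beta * sum_j KL(rho || rho_hat_j), rho_hat_j = mean activation of unit j.

    Assumes logistic units, so every mean activation lies strictly in (0, 1).
    """
    n = len(hidden_activations)
    penalty = 0.0
    for j in range(len(hidden_activations[0])):
        rho_hat = sum(a[j] for a in hidden_activations) / n
        penalty += rho * math.log(rho / rho_hat)
        penalty += (1 - rho) * math.log((1 - rho) / (1 - rho_hat))
    return beta * penalty


def total_cost(reconstruction_error, weight_matrices, hidden_activations,
               lam, rho, beta):
    """Reconstruction error plus the two penalty terms."""
    return (reconstruction_error
            + weight_decay(weight_matrices, lam)
            + sparsity_penalty(hidden_activations, rho, beta))
```

The sparsity term is zero when every hidden unit's mean activation equals rho and grows as the means drift away from it. For tanh units, whose activations lie in (-1, 1), this Bernoulli-style KL term does not apply directly; how the package handles that case is outside what this sketch covers.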

Your rho values don't seem unreasonable since both are near the bottom of the activation function's range (0 to 1 for logistic, -1 to 1 for tanh). However, this obviously depends on the amount of sparseness you want and the number of hidden units you use too.
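Since the best values depend on the data, one practical way to act on this is a small grid search scored on held-out data. The following is a generic Python sketch, not the R package's API. `toy_validation_error` is a made-up stand-in for "train the autoencoder with these settings and return the held-out reconstruction error", and the grid values are arbitrary placeholders, not recommendations.

```python
import itertools


def grid_search(score, grid):
    """Return (best_params, best_score) for the lowest score over the grid.

    `score` is any callable that takes the hyperparameters as keyword
    arguments and returns a validation error (lower is better).
    """
    names = list(grid)
    best_params, best_score = None, float("inf")
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        current = score(**params)
        if current < best_score:
            best_params, best_score = params, current
    return best_params, best_score


# Hypothetical stand-in for training plus evaluation; its optimum is arbitrary.
def toy_validation_error(lam, beta, rho):
    return (lam - 1e-4) ** 2 + (beta - 3) ** 2 + (rho - 0.05) ** 2


grid = {"lam": [1e-5, 1e-4, 1e-3], "beta": [1, 3, 6], "rho": [0.01, 0.05, 0.1]}
best, err = grid_search(toy_validation_error, grid)
```

In real use the scoring function would train on one split and evaluate on another, so that the chosen values are not simply the ones that fit the training data best.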


LeCun's major concern with small weights is that the error surface becomes very flat near the origin if you're using a symmetric sigmoid. Elsewhere in that paper, he recommends initializing with weights randomly drawn from a normal distribution with zero mean and $m^{-1/2}$ standard deviation, where $m$ is the number of connections each unit receives.
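To see how far that recommendation is from the manual's default, here is a small sketch that computes the $m^{-1/2}$ standard deviation for a given number of incoming connections. `lecun_std` is an illustrative name, not a function from any library.

```python
import math


def lecun_std(fan_in):
    """Standard deviation m^(-1/2), where m is the number of incoming connections."""
    return 1.0 / math.sqrt(fan_in)


for fan_in in (10, 100, 1000):
    print(fan_in, lecun_std(fan_in))
```

With `fan_in = 100` this gives 0.1, a hundred times the manual's 0.001; a unit would need about a million incoming connections before $m^{-1/2}$ falls to 0.001.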


There are lots of "rules of thumb" for choosing the number of hidden units. Your initial guess (2x input) seems in line with most of them. That said, these guesstimates are much more guesswork than estimation. Assuming you've got the processing power, I would err on the side of more hidden units, then enforce sparseness with a low rho value.

One obvious use of autoencoders is to generate more compact feature representations for other learning algorithms. A raw image might have millions of pixels, but a (sparse) autoencoder can re-represent that in a much smaller space. Geoff Hinton (and others) have shown that they generate useful features for subsequent classification. Some of the deep learning work uses autoencoders or similar to pretrain the network. Vincent et al. use autoencoders directly to perform classification.

The ability to generate succinct feature representations can be used in other contexts as well. For example, autoencoder-produced states have been used to guide a reinforcement learning algorithm through Atari games.

Finally, one can also use autoencoders to reconstruct noisy or degraded input, which can be a useful end in and of itself.
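As a rough illustration of the denoising setup, the sketch below shows only the corruption step, assuming masking noise (randomly zeroing a fraction of the inputs), which is one common choice. The autoencoder would then be trained to map the corrupted vector back to the clean one. `mask_inputs` and the sample values are made up for illustration.

```python
import random


def mask_inputs(x, drop_fraction, rng):
    """Return a copy of x with each entry set to 0 with probability drop_fraction."""
    return [0.0 if rng.random() < drop_fraction else v for v in x]


rng = random.Random(0)
clean = [0.2, 0.9, 0.4, 0.7, 0.1]
noisy = mask_inputs(clean, 0.3, rng)
# Training pair for a denoising autoencoder: input = noisy, target = clean.
```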