The `autoencoder` package is just an implementation of the autoencoder described in Andrew Ng's class notes, which might be a good starting point for further reading. Now, to tackle your questions:
People sometimes distinguish between *parameters*, which the learning algorithm calculates itself, and *hyperparameters*, which control that learning process and need to be provided to the learning algorithm. **It is important to realise that there are NO MAGIC VALUES** for the hyperparameters. The optimal value will vary, depending on the data you're modeling: you'll have to try them on your data.
a) Lambda ($\lambda$) controls how the weights are updated during backpropagation. Instead of just updating the weights based on the difference between the model's output and the ground truth, the cost function includes a term which penalizes large weights (actually, the sum of the squared values of all weights). Lambda controls the relative importance of this penalty term, which tends to drag weights towards zero and helps avoid overfitting.
b) Rho ($\rho$) and beta ($\beta$) control sparseness. Rho is the expected activation of a hidden unit (averaged across the training set). The representation will become sparser and sparser as rho becomes smaller. This sparseness is imposed by adjusting the bias term, and beta controls the size of its updates. (It looks like $\beta$ actually just rescales the overall learning rate $\alpha$.)
c) Epsilon ($\epsilon$) controls the initial weight values, which are drawn at random from $N(0, \epsilon^2)$.
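To make a) and b) a bit more concrete, here is a minimal sketch of how the two penalty terms enter the cost in the formulation from Ng's notes, as I read them: the reconstruction error, plus $\frac{\lambda}{2}$ times the sum of squared weights, plus $\beta$ times a KL-divergence sparsity penalty on each hidden unit's mean activation. The function and argument names are my own inventions, and the package's internals may well differ.

```python
import math

def kl_bernoulli(rho, rho_hat):
    """KL divergence between Bernoulli(rho) and Bernoulli(rho_hat)."""
    return (rho * math.log(rho / rho_hat)
            + (1 - rho) * math.log((1 - rho) / (1 - rho_hat)))

def sparse_autoencoder_cost(recon_error, weights, mean_activations,
                            lam, rho, beta):
    """Reconstruction error + weight decay + sparsity penalty.

    recon_error      -- average reconstruction error (assumed precomputed)
    weights          -- flat list of all weights (biases excluded)
    mean_activations -- mean activation of each hidden unit over the data
    """
    weight_decay = 0.5 * lam * sum(w * w for w in weights)
    sparsity = beta * sum(kl_bernoulli(rho, r) for r in mean_activations)
    return recon_error + weight_decay + sparsity
```

The sparsity term is zero when every hidden unit's mean activation equals rho and grows as the activations drift away from it, so a smaller rho pushes the units to be "off" more of the time.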
Your rho values don't seem unreasonable since both are near the bottom of the activation function's range (0 to 1 for logistic, -1 to 1 for tanh). However, this obviously depends on the amount of sparseness you want and the number of hidden units you use too.
LeCun's major concern with small weights is that the error surface becomes very flat near the origin if you're using a symmetric sigmoid. Elsewhere in that paper, he recommends initializing with weights randomly drawn from a normal distribution with zero mean and $m^{-1/2}$ standard deviation, where $m$ is the number of connections each unit receives.
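As a rough illustration of that recommendation, the sketch below draws a weight matrix with standard deviation $m^{-1/2}$, where $m$ is the fan-in. The function name and the list-of-lists layout are just assumptions for the example, not anything taken from the paper or the package.

```python
import random

def lecun_init(fan_in, fan_out, seed=None):
    """Draw a fan_out x fan_in weight matrix from N(0, 1/fan_in).

    fan_in is m, the number of connections each unit receives,
    so the standard deviation is m ** -0.5.
    """
    rng = random.Random(seed)
    std = fan_in ** -0.5
    return [[rng.gauss(0.0, std) for _ in range(fan_in)]
            for _ in range(fan_out)]

# Hypothetical layer: 400 inputs feeding 200 hidden units.
W = lecun_init(fan_in=400, fan_out=200, seed=0)
```

With 400 inputs the standard deviation comes out to 0.05, so this is equivalent to picking $\epsilon = 0.05$ for that layer rather than using one fixed value regardless of layer size.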
There are lots of "rules of thumb" for choosing the number of hidden units. Your initial guess (2x input) seems in line with most of them. That said, these guesstimates are much more guesswork than estimation. Assuming you've got the processing power, I would err on the side of more hidden units, then enforce sparseness with a low rho value.
One obvious use of autoencoders is to generate more compact feature representations for other learning algorithms. A raw image might have millions of pixels, but a (sparse) autoencoder can re-represent that in a much smaller space. [Geoff Hinton][2] (and others) have shown that they generate useful features for subsequent classification. Some of the deep learning work uses autoencoders or similar to pretrain the network. [Vincent et al.][3] use autoencoders directly to perform classification.
The ability to generate succinct feature representations can be used in other contexts as well. Here's a neat little project where autoencoder-produced states are used to guide a reinforcement learning algorithm through Atari games.
Finally, one can also use autoencoders to reconstruct noisy or degraded input, like so, which can be a useful end in and of itself.
An alternative to dimensionality reduction is to use the hashing trick to train a classifier on the entire feature set without reduction beforehand.* The Vowpal Wabbit pwoject--er, project--is an implementation of various learning algorithms using the hashing trick to speed up computation:
> VW is the essence of speed in machine learning, able to learn from terafeature datasets with ease. Via parallel learning, it can exceed the throughput of any single machine network interface when doing linear learning, a first amongst learning algorithms.
I don't know if VW will end up being right for you (if you have billions of features, a lot of your choices may end up being dictated by software engineering considerations), but hopefully it's a pointer in the right direction!
* Well, the hashing trick is technically a kind of dimensionality reduction, but only in a very silly sense.
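If the hashing trick is new to you, the idea fits in a few lines. This is only a toy sketch: `hash_features` is a made-up name, and `zlib.crc32` is a stand-in for whatever hash function VW uses internally. Each feature name is hashed straight to a column index, so you never need to build or store a dictionary of all features.

```python
import zlib

def hash_features(tokens, n_buckets=2**18):
    """Map a bag of string features to a sparse {index: count} vector.

    Distinct features can collide in the same bucket; with enough
    buckets this usually costs little accuracy.
    """
    vec = {}
    for tok in tokens:
        idx = zlib.crc32(tok.encode("utf-8")) % n_buckets
        vec[idx] = vec.get(idx, 0) + 1
    return vec

# Hypothetical example: three raw features, two of them identical.
x = hash_features(["word=cat", "word=dog", "word=cat"])
```

The resulting sparse vector can be fed to any linear learner, and the memory needed for the weights is fixed by `n_buckets` rather than by the number of distinct features.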
Best Answer
I have never used autoencoders for sparse data, but my first reaction was "why should this matter?". I found the question interesting, so I did a quick Google search, and among the first results I found an answer by Ian Goodfellow to a similar question, who says
I guess this could be considered an authoritative answer to your question.
I don't know about other frameworks, but Keras and TensorFlow support sparse matrices (if memory is the concern). All you need is a dense, convolutional, or recurrent layer (depending on the nature of your data), or several such layers, as the encoder, and the same for the decoder, where on the output layer you would need something like an `exp` function to transform the outputs to non-negative values if the data are counts (think of Poisson regression).
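To show what I mean by the `exp` output, here is a framework-free sketch of just the last step: the decoder's raw outputs go through `exp` to become Poisson rates, and the loss is the Poisson negative log-likelihood (dropping the $\log k!$ term, which does not depend on the model). The helper names are made up for illustration; in Keras or TensorFlow you would use the framework's own activation and loss instead.

```python
import math

def exp_output(pre_activations):
    """Turn unconstrained decoder outputs into strictly positive rates."""
    return [math.exp(z) for z in pre_activations]

def poisson_nll(counts, rates):
    """Mean Poisson negative log-likelihood, without the log(k!) constant."""
    return sum(r - k * math.log(r) for k, r in zip(counts, rates)) / len(counts)

# Hypothetical decoder outputs for one mostly-zero count vector.
counts = [0, 0, 3, 0, 1]
rates = exp_output([-4.0, -3.5, 1.0, -5.0, 0.1])
loss = poisson_nll(counts, rates)
```

Because the rates can get arbitrarily close to zero without ever reaching it, entries that are zero in the input are cheap to reconstruct, which suits sparse count data.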