Hyperparameter choice is not something that can really be answered definitively: sure, there are some procedures that can be followed, but it is largely a matter of trial and error.
A single DA can indeed extract meaningful features; however, when the encoding dimension 'L' (say) is larger than the input dimension 'D' (i.e., overcomplete learning), most of the learned features will end up being random noise.
The reason your autoencoder is not learning meaningful features is that, given the degrees of freedom it has in the encoding layer (i.e., L > D), it becomes quite easy for it to learn an identity mapping of the input.
So to alleviate this problem, you have to impose additional constraints that limit this degree of freedom.
I believe you can try the following and see what the outcome is:
The first and probably easiest step would be to reduce the number of encoding-layer nodes from 1000 to something a little closer to the input dimension, i.e., 784. I would say 800 would be a good start. Then visualize the features and see whether some of them have improved.
Apply additional regularization constraints, e.g., L2 regularization on the weights (and if you are already doing that, increase the corresponding penalty term) and other such penalization techniques.
Tied weights: use tied weights on the encoding layer and the decoding layer if you are not doing so already, i.e., W_decoding = W_encoding.T.
When tied weights are not used, one of the two layers often learns larger, "better" weights (for lack of a better word) and compensates for the poor weights learned by the other. By placing this constraint we force the autoencoder to learn a balanced set of weights. It also often improves training time and gives a pretty good limitation on the degrees of freedom (the number of free, trainable parameters is halved!).
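For concreteness, here is a minimal sketch of what these three suggestions might look like combined (PyTorch is assumed purely for illustration; the 800 hidden units, noise level, and weight-decay strength are starting points, not tuned values):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TiedDenoisingAE(nn.Module):
        def __init__(self, d_in=784, d_hidden=800):
            super().__init__()
            # one shared weight matrix: the decoder reuses its transpose
            self.W = nn.Parameter(torch.randn(d_hidden, d_in) * 0.01)
            self.b_enc = nn.Parameter(torch.zeros(d_hidden))
            self.b_dec = nn.Parameter(torch.zeros(d_in))

        def forward(self, x):
            # masking noise: zero out ~30% of inputs (illustrative level)
            x_noisy = x * (torch.rand_like(x) > 0.3).float()
            h = torch.sigmoid(F.linear(x_noisy, self.W, self.b_enc))       # encoder
            return torch.sigmoid(F.linear(h, self.W.t(), self.b_dec))      # tied decoder

    model = TiedDenoisingAE()
    # weight_decay adds the L2 penalty on the parameters
    opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)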
Give this a try. Might help.
Complicated question. Yes, you could definitely use some other kind of noise, and it doesn't have to be additive. As a matter of fact, the original paper on denoising autoencoders did not use additive Gaussian noise: it randomly set some of the input pixel intensities to 0, which is basically dropout (multiplicative Bernoulli noise) applied to the input layer:
P. Vincent, H. Larochelle, Y. Bengio and P.-A. Manzagol, "Extracting and Composing Robust Features with Denoising Autoencoders"
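To make the distinction concrete, here is a tiny sketch (NumPy; the function names are mine) of the two kinds of corruption just mentioned:

    import numpy as np

    rng = np.random.default_rng(0)

    def masking_noise(x, p=0.3):
        # Vincent et al.'s corruption: set a random fraction p of the inputs to 0
        return x * (rng.random(x.shape) >= p)

    def gaussian_noise(x, sigma=0.1):
        # additive Gaussian corruption, for comparison
        return x + sigma * rng.standard_normal(x.shape)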
However, you ask more specifically whether we could apply rotations, reflections, and, let me add, maybe translations, to the input letters (or, more generally, images). You are afraid, however, that the outputs of such transformations would be very far away from the original images in terms of $L^2$ loss. And here is where things get complicated.
First of all, let's clarify a point: an AE will never be able to exactly learn to invert rotations and/or translations of an input image, i.e., learn an isometry, for the same reason it cannot learn the identity mapping. Because of the bottleneck layer, the AE becomes a form of nonlinear dimensionality reduction, and thus it cannot preserve distances. Perfect reconstruction is therefore impossible, and indeed that is not the goal of the AE. But could it learn to approximately map rotated versions of the same input to a similar (not identical) output? Well, maybe: of course, as you cleverly noted, for the AE to learn that, some loss other than $L^2$ must be introduced.
People have indeed experimented with losses other than $L^2$ for autoencoders, actually not in an attempt to "invert" isometries applied to the input image, but to combat the blurry artifacts introduced by the $L^2$ loss. As a matter of fact, minimizing the squared Euclidean distance between an input image and the autoencoder output obviously favours blurry reconstructions: if the input image contains strong contrasts (for example, an edge separating a strongly illuminated area from a darker one), then a small translation of the edge normal to its direction will lead to large local discrepancies in pixel intensities between input and output, increasing the $L^2$ loss. Thus, the autoencoder "learns" to blur contrasts/edges.
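A tiny numerical illustration of this point (NumPy, with a hypothetical one-dimensional "edge" signal): if the target could be the edge at either of two adjacent positions with equal probability, the output that minimizes the expected $L^2$ loss is the blurry average of the two, not either sharp edge.

    import numpy as np

    edge_a = np.array([0., 0., 0., 0., 1., 1., 1., 1.])  # sharp edge at one position
    edge_b = np.array([0., 0., 0., 1., 1., 1., 1., 1.])  # same edge shifted by one pixel
    blur = 0.5 * (edge_a + edge_b)                        # blurry compromise

    # expected squared error of each candidate output when the target is
    # edge_a or edge_b with equal probability
    for name, out in [("sharp a", edge_a), ("sharp b", edge_b), ("blurred", blur)]:
        err = 0.5 * np.sum((out - edge_a) ** 2) + 0.5 * np.sum((out - edge_b) ** 2)
        print(name, err)  # the blurred output attains the lowest expected L2 loss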
To avoid this, people have tried using a perceptual similarity loss in place of the $L^2$ loss for training autoencoders, i.e., the difference between the two images' hidden representations extracted from a deep CNN, either pretrained (such as AlexNet or VGGNet) or trained during the autoencoder training. Success has been mixed: sometimes even the two kinds of losses ($L^2$ and perceptual) summed together have not been enough to ensure a good reconstruction, and it has been necessary to add a third, adversarial loss as a sort of "natural image prior":
Alexey Dosovitskiy, Thomas Brox, "Generating Images with Perceptual Similarity Metrics based on Deep Networks"
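The core ingredient is simple to sketch, though: compare deep features instead of raw pixels. A rough sketch follows (PyTorch/torchvision assumed; the use of VGG16, the layer cut-off, and other details are my own illustrative choices, not taken from the papers):

    import torch
    import torch.nn as nn
    from torchvision.models import vgg16

    class PerceptualLoss(nn.Module):
        def __init__(self, n_layers=8):
            super().__init__()
            # fixed, pretrained feature extractor (torchvision >= 0.13 assumed);
            # inputs are expected to be 3-channel, ImageNet-normalised images
            self.features = vgg16(weights="IMAGENET1K_V1").features[:n_layers].eval()
            for p in self.features.parameters():
                p.requires_grad_(False)

        def forward(self, reconstruction, target):
            # squared distance between deep features instead of raw pixels
            return torch.mean((self.features(reconstruction) - self.features(target)) ** 2)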
At other times, the perceptual loss "alone" was enough to guarantee a sharp reconstruction: however, this was in the context of variational autoencoders, not denoising autoencoders, where any reconstruction loss, be it pixel-wise or perceptual, is linearly combined with a KL-divergence loss, and thus is never really minimized on its own. Also, the goal of the VAE is not really to reconstruct the input, but to learn to generate new images from the same distribution as the training set, so it is conceptually different from an autoencoder (the VAE is a generative model). A famous example:
Xianxu Hou, Linlin Shen, Ke Sun, Guoping Qiu, "Deep Feature Consistent Variational Autoencoder"
Both these AEs/VAEs may learn to ignore small translations and deformations in the input, because convolutional layers are equivariant to translations (this will always be approximate, both because a CNN is not made only of convolutional layers, and because of the AE/VAE bottleneck layer).
However, convolutional layers are not equivariant to rotations, so two inputs which are identical except for a rotation will not be mapped to the same manifold "point". To avoid that, you may try to use a loss based on the representations extracted by group-equivariant CNNs or steerable CNNs: these are innovative architectures that are equivariant to discrete rotations (group-equivariant CNNs) or to continuous ones (steerable CNNs). I don't think anyone has ever tried to use a "group-equivariant" perceptual loss for an AE, but you could be the first one to test it!
Best Answer
The following paper
P. Vincent, H. Larochelle, Y. Bengio and P.-A. Manzagol, "Extracting and Composing Robust Features with Denoising Autoencoders", Proceedings of the Twenty-fifth International Conference on Machine Learning (ICML '08), pages 1096-1103, ACM, 2008,
contains a variety of images on the MNIST dataset that show how well features are recognized when different levels of noise are added. Especially important are the last pictures, where the authors show that the more noise is added, the better the network learns dependencies between variables. With low noise levels the features do not stand out.
The link to the paper is the following: http://www.iro.umontreal.ca/~lisa/publications2/index.php/attachments/single/176
Also note that the noise added is not really white noise or anything similar: it simply sets 20%, 30%, or 50% of the values to zero at random.
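In code, that corruption is just the following (a minimal sketch in NumPy; the function name is mine):

    import numpy as np

    def corrupt(x, fraction, rng=np.random.default_rng(0)):
        # set the given fraction of input values to zero, chosen uniformly at random
        return x * (rng.random(x.shape) >= fraction)

    for fraction in (0.2, 0.3, 0.5):   # the corruption levels mentioned above
        noisy = corrupt(np.random.rand(784), fraction)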