Solved – “Denoising” autoencoder with distortions other than Gaussian noise

autoencoders · machine learning · neural networks

I watched some talks by Yoshua Bengio, where he often refers to denoising autoencoders (AEs) as a powerful method for learning representations of an input space in an unsupervised way (e.g. here).

The idea, as far as I have understood it, is that valid inputs live only on a manifold of much smaller dimension than the complete input space. E.g. if we use bitmap representations with 1000 pixels, the complete input space is $\mathbb{R}^{1000}$. But if we use only letters and numbers as inputs, the data may occupy a region of much lower dimension, say something comparable to $\mathbb{R}^{50}$. Thus, the first layers of the AE just learn a non-linear projection from $\mathbb{R}^{1000}$ to $\mathbb{R}^{50}$.

Now, by adding (Gaussian) noise to the input during training, performance is increased. The interpretation is straightforward: the noise offsets the data away from the manifold, and the AE learns to map this corrupted data back onto the manifold (here).

My main question is whether adding Gaussian noise is really the only kind of distortion we should consider when training denoising AEs. For humans, it is easy to reconstruct the original letter even if it is rotated or upside down. Could we train an AE to also project upside-down or rotated versions of the same letter to the same point on the manifold?

But rotated and upside-down versions of the same letter are really far apart in the original input space under, e.g., the $L^2$ norm, whereas the same image with added Gaussian noise is quite close. Would an AE still be able to learn the much more complex task of mapping such different inputs to the same point on the manifold? While NNs are universal approximators in general, such a mapping may require prohibitively many neurons or too much training time. Also, to be an AE, the network must have a bottleneck, which violates the assumptions of the universal approximation theorem.

Best Answer

Complicated question. Yes, you could definitely use some other kind of noise, and it doesn't have to be additive. As a matter of fact, the original paper on denoising autoencoders did not use additive Gaussian noise: it randomly set some of the input pixel intensities to 0, which is basically dropout (multiplicative Bernoulli noise) applied to the input layer:

P. Vincent, H. Larochelle, Y. Bengio and P.-A. Manzagol, "Extracting and Composing Robust Features with Denoising Autoencoders"
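
To make that corruption concrete, here is a minimal sketch in PyTorch (my choice of framework; the 1000-to-50 layer sizes just mirror the example from the question and are otherwise arbitrary). The key point is that the reconstruction target is the clean input, not the corrupted one:

```python
import torch
import torch.nn as nn

def masking_noise(x: torch.Tensor, p: float = 0.3) -> torch.Tensor:
    # Keep each entry with probability 1 - p, zero it otherwise:
    # the multiplicative Bernoulli corruption of Vincent et al.
    mask = torch.bernoulli(torch.full_like(x, 1.0 - p))
    return x * mask

# Minimal denoising AE with a 50-unit bottleneck for 1000-dim inputs.
autoencoder = nn.Sequential(
    nn.Linear(1000, 50), nn.ReLU(),    # encoder: projection onto the "manifold"
    nn.Linear(50, 1000), nn.Sigmoid()  # decoder: back to pixel space
)
optimizer = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)

x = torch.rand(64, 1000)            # a stand-in batch of flattened images
x_corrupted = masking_noise(x)      # corrupt the input...
loss = nn.functional.mse_loss(autoencoder(x_corrupted), x)  # ...reconstruct the CLEAN x
optimizer.zero_grad()
loss.backward()
optimizer.step()
```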

However, you ask more specifically whether we could apply rotations, reflections and, let me add, maybe translations to the input letters (or, more generally, images). You're afraid, however, that the outputs of such transformations would be very far from the original images in terms of the $L^2$ loss. And here is where things get complicated.

First of all, let's clarify a point: an AE will never be able to learn to exactly invert rotations and/or translations of an input image, i.e., to learn an isometry, for the same reason why it can't learn the identity mapping. Because of the bottleneck layer, the AE becomes a form of nonlinear dimensionality reduction, and thus it cannot preserve distances. Perfect reconstruction is therefore impossible, and indeed that's not the goal of an AE. But could it learn to approximately map rotated versions of the same input to a similar (not identical) output? Well, maybe: of course, as you cleverly noted, for the AE to learn that, some loss other than $L^2$ must be introduced.

People have indeed experimented with losses other than $L^2$ for autoencoders, not actually in an attempt to "invert" isometries applied to the input image, but to combat the blurry artifacts introduced by the $L^2$ loss. As a matter of fact, minimizing the squared Euclidean distance between an input image and the autoencoder output obviously favours blurry reconstructions: if the input image contains strong contrasts (for example, an edge separating a strongly illuminated area from a darker one), then a small translation of the edge normal to its direction will lead to large local discrepancies in pixel intensities between input and output, increasing the $L^2$ loss. Thus, the autoencoder "learns" to blur contrasts/edges.
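
A quick one-dimensional illustration of this effect (NumPy, with made-up numbers): under the pixel-wise $L^2$ loss, a blurred edge is closer to the original than a sharp edge translated by just two pixels, so the blurry reconstruction wins:

```python
import numpy as np

# A 1-D "image" with a sharp edge: 50 dark pixels, then 50 bright ones.
edge = np.concatenate([np.zeros(50), np.ones(50)])
# The same sharp edge, translated 2 pixels to the right.
shifted = np.concatenate([np.zeros(52), np.ones(48)])
# A blurred version of the edge (5-pixel moving average).
blurred = np.convolve(edge, np.ones(5) / 5, mode="same")

# Squared L2 distances to the original:
print(np.sum((edge - shifted) ** 2))  # 2.0  -- two pixels flip 0 <-> 1
print(np.sum((edge - blurred) ** 2))  # ~0.6 -- smaller, despite looking worse
```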

To avoid this, people have tried using a perceptual similarity loss in place of the $L^2$ loss for training autoencoders, i.e., the distance between the two images' hidden representations extracted from a deep CNN, either pretrained (such as AlexNet or VGGNet) or trained alongside the autoencoder. Success has been mixed: sometimes even the two kinds of losses ($L^2$ and perceptual) summed together haven't been enough to ensure a good reconstruction, and it has been necessary to add a third, adversarial loss as a sort of "natural image prior":

Alexey Dosovitskiy, Thomas Brox, "Generating Images with Perceptual Similarity Metrics based on Deep Networks"
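
As a rough sketch of what such a perceptual loss looks like in practice (assuming PyTorch and a recent torchvision with a pretrained VGG16; the cut-off layer is a free parameter I picked for illustration, not something the paper prescribes):

```python
import torch
import torchvision.models as models

# Fixed feature extractor: the first conv blocks of a pretrained VGG16,
# cut after relu3_3 (an arbitrary but common choice).
vgg_features = models.vgg16(weights=models.VGG16_Weights.DEFAULT).features[:16].eval()
for p in vgg_features.parameters():
    p.requires_grad_(False)  # the perceptual network is not trained

def perceptual_loss(x_hat: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    # L2 distance between hidden representations instead of raw pixels.
    return torch.nn.functional.mse_loss(vgg_features(x_hat), vgg_features(x))

# Usage: x and x_hat are batches of 3-channel images, e.g. (N, 3, 224, 224),
# normalized the way the pretrained network expects.
```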

At other times, the perceptual loss alone was enough to guarantee a sharp reconstruction: however, this was in the context of variational autoencoders (VAEs), not denoising autoencoders. There, any reconstruction loss, be it pixel-wise or perceptual, is linearly combined with a KL-divergence loss, so it isn't really minimized by itself. Also, the goal of the VAE is not really to reconstruct the input, but to learn to generate new images from the same distribution as the training set, so it's conceptually different from an autoencoder (the VAE is a generative model). A famous example:

Xianxu Hou, Linlin Shen, Ke Sun, Guoping Qiu, "Deep Feature Consistent Variational Autoencoder"
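
For reference, the linear combination mentioned above looks like this in a standard VAE formulation (a generic sketch, not the exact objective of the cited paper; the weight `beta` is illustrative):

```python
import torch

def vae_loss(x_hat, x, mu, logvar, beta=1.0):
    # Reconstruction term: pixel-wise here, but could be the perceptual loss above.
    recon = torch.nn.functional.mse_loss(x_hat, x, reduction="sum")
    # Closed-form KL divergence between q(z|x) = N(mu, sigma^2) and N(0, I).
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl
```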

Both these AEs/VAEs may learn to ignore small translations and deformations in the input, because convolutional layers are equivariant to translations (this will always be approximate, both because a CNN is not made only of convolutional layers, and because of the AE/VAE bottleneck layer).

However, convolutional layers are not equivariant to rotations, so two inputs which are identical except for a rotation will not be mapped to the same "point" on the manifold. To avoid that, you may try using a loss based on the representations extracted by group-equivariant CNNs or steerable CNNs: these are innovative architectures that are equivariant to discrete rotations (group-equivariant CNNs) or to continuous ones (steerable CNNs). I don't think anyone has ever tried to use a "group-equivariant" perceptual loss for an AE, but you could be the first one to test it!
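
You can check both equivariance claims numerically with a random convolution kernel (a toy PyTorch check of my own, not from the cited papers): translation commutes with the convolution, while a 90-degree rotation does not, because the kernel itself is never rotated:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 1, 8, 8)    # a toy single-channel image
w = torch.randn(1, 1, 3, 3)    # a random, asymmetric conv kernel
conv = lambda t: F.conv2d(t, w, padding=1)

def shift(t):  # translate 1 pixel to the right, filling with zeros
    return F.pad(t, (1, 0))[..., :-1]

rot = lambda t: torch.rot90(t, 1, dims=(-2, -1))  # rotate 90 degrees

# Convolution commutes with translation (away from the image borders)...
print(torch.allclose(conv(shift(x))[..., 1:-1],
                     shift(conv(x))[..., 1:-1], atol=1e-6))  # True
# ...but not with rotation: the kernel would have to be rotated too.
print(torch.allclose(conv(rot(x)), rot(conv(x)), atol=1e-6))  # False
```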
