Solved – Is it possible to make a multi-layer autoencoder learn to completely repeat its input?

autoencoders · deep learning · machine learning

I'm currently playing with auto-encoders, so the question is more about research than practical implementation. I know that if I reduce the capacity of an auto-encoder by making the hidden layer smaller, I'll get some kind of factorization of the input (adding noise is more effective, but let's set that aside for a while).

However, I also read that, given enough neurons in the hidden layer, an auto-encoder will just learn to repeat its input, so I wanted to check that worst case (to see how it actually does it) by building a small, simple MLP (a minimal sketch of this setup follows the list):

  • no noise at the input
  • number of layers: 3 (inputN=4, contextN=4, outputN=4)
  • activation: ReLU
  • weight init: XAVIER
  • input/output number of features: 4
  • hidden layer size: 4
  • regularization: L2 (weight decay), tried no-regularization
  • SGD with back-propagation, no momentum
  • learning rate: 0.01 (tried 0.1, 0.001, 0.0001)
  • library: deeplearning4j

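For reference, here is what that setup looks like as a framework-free NumPy sketch (not the actual deeplearning4j code; the seed, L2 coefficient and iteration count are arbitrary choices for illustration):

```python
import numpy as np

# Sketch of the setup above: 4 -> 4 -> 4, ReLU everywhere, Xavier (uniform) init,
# MSE loss, L2 weight decay, plain SGD on a single strictly positive sample.
rng = np.random.default_rng(0)
n_in, n_hidden = 4, 4
lr, l2 = 0.01, 1e-4

def xavier(fan_in, fan_out):
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

W1, b1 = xavier(n_in, n_hidden), np.zeros(n_hidden)
W2, b2 = xavier(n_hidden, n_in), np.zeros(n_in)

x = np.array([0.1, 0.3, 0.7, 0.2])            # a single, strictly positive sample

for step in range(20000):
    # forward pass
    h_pre = x @ W1 + b1
    h = np.maximum(h_pre, 0.0)                # hidden ReLU
    y_pre = h @ W2 + b2
    y = np.maximum(y_pre, 0.0)                # output ReLU

    # backward pass: gradient of 0.5 * ||y - x||^2 plus the L2 penalty
    d_y = (y - x) * (y_pre > 0)               # ReLU gradient: 1 if pre-activation > 0, else 0
    gW2 = np.outer(h, d_y) + l2 * W2
    d_h = (d_y @ W2.T) * (h_pre > 0)
    gW1 = np.outer(x, d_h) + l2 * W1

    W1 -= lr * gW1; b1 -= lr * d_h
    W2 -= lr * gW2; b2 -= lr * d_y

print(np.round(y, 3))   # compare with x; any dead unit (zero pre-activation,
                        # hence zero gradient) would show up as a 0 here
```
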
If I pass a single sample (or several samples) over and over, I can see that only some of the output values repeat the input, while others are zeros. So the output becomes sparse: for, say, [0.1, 0.3, 0.7, 0.2], I get something like [0.1, 0.0, 0.7, 0.0]. Running more iterations doesn't improve it beyond some point.

I believe the sparsity of the result is due to vanishing gradients, probably because some of the initial weights were set close or equal to zero by Xavier initialization. I tried other random initializations (including a normal distribution), but it didn't help. Of course, I could set the weights to the values they should have (and I did – it worked), but that wouldn't show anything interesting.

So, is there any example of a multi-layer auto-encoder that just repeats its input completely? Was my guess about weight initialization right, or is there another underlying reason? Does varying the samples and their order play any significant role in this particular case?

P.S. If this problem is not typical for this architecture, I can provide a reproducible example (in case there is a problem with my code or the underlying library).

Best Answer

Why not start with one layer, and get the output to equal the input for that single layer? Then, by induction, you can stack those layers one on top of the other and just pipeline through your trained identity layers.
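
For instance, a single 4→4 linear layer trained on strictly positive data converges to the identity matrix, and stacking copies of it still reproduces the input. A rough NumPy sketch (sample count, learning rate and iteration count are arbitrary):

```python
import numpy as np

# One 4 -> 4 linear layer trained to reproduce its input (MSE + gradient descent).
rng = np.random.default_rng(1)
W = rng.normal(0.0, 0.1, size=(4, 4))
X = rng.uniform(0.0, 1.0, size=(256, 4))       # strictly positive samples

for _ in range(5000):
    Y = X @ W
    grad = X.T @ (Y - X) / len(X)              # gradient of the mean squared error
    W -= 0.5 * grad

print(np.round(W, 2))                          # W converges to (approximately) the identity

# Stacking copies of the trained layer still reproduces the input;
# a ReLU between the layers is a no-op here because all activations stay positive.
out = np.array([0.1, 0.3, 0.7, 0.2])
for _ in range(3):
    out = np.maximum(out @ W, 0.0)
print(np.round(out, 3))                        # ~ [0.1, 0.3, 0.7, 0.2]
```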

Note that if you're using ReLU and you have any negative values in your input/output, then ReLU will not be able to perform the identity operation you seek. So you'd probably want to make sure your input/output is strictly positive.

Vanishing gradients refers to a specific effect of backpropagated gradients being attenuated as they pass through saturating activation functions, like tanh or sigmoid. ReLU doesn't suffer from vanishing gradients much, since its gradient is 1 for inputs in the positive domain.
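
To make both points concrete, a tiny illustrative NumPy check (not tied to any framework): ReLU zeroes out negative inputs, so it cannot act as the identity there, and its gradient is 1 on the positive side and 0 on the negative side.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def relu_grad(x):
    return (x > 0).astype(float)

x = np.array([-0.5, -0.1, 0.0, 0.1, 0.7])
print(relu(x))        # [0.  0.  0.  0.1 0.7]  -> negative values cannot be reproduced
print(relu_grad(x))   # [0.  0.  0.  1.  1. ]  -> no gradient flows where the input is <= 0
```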

So:

  • check your inputs/outputs are strictly positive
  • start with a single layer

Edit: based on your new information that it works with a single layer but not with multiple layers, I reckon that ReLU is blocking your gradient backprop, probably because it has zero gradient for much of its domain. Therefore, you might try an activation function that has a non-zero gradient almost everywhere, e.g. ELU or leaky ReLU (see the sketch after this list). On the whole, I'd try leaky ReLU first, because:

  • it is still technically an activation function (cf. not using an activation function at all)
  • it's piecewise linear (cf. ELU)
  • it has a non-zero gradient almost everywhere, so gradients should backprop OK
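
A minimal sketch of the two suggested activations, assuming their standard definitions (the negative-side slope of 0.01 for leaky ReLU is just a common default, not something from the question):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    return np.where(x > 0, 1.0, alpha)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def elu_grad(x, alpha=1.0):
    return np.where(x > 0, 1.0, alpha * np.exp(x))

x = np.array([-2.0, -0.5, 0.5, 2.0])
print(leaky_relu_grad(x))   # [0.01 0.01 1.   1.  ]  -> small but non-zero on the negative side
print(elu_grad(x))          # [0.135 0.607 1.    1.  ] -> smooth, also non-zero on the negative side
```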