Solved – Denoising Autoencoder not training properly

Tags: autoencoders, hyperparameter, neural networks, tensorflow

I've implemented a denoising autoencoder using TensorFlow. The code is here, and there is also a command-line script to launch it. The code seems to work: the cross-validation error decreases at every iteration, but the autoencoder doesn't seem to be learning good features (I'm using MNIST).
This is an example of learned features:
[image: visualization of the learned features]

The parameters I used are the following:

--n_components 1000 --batch_size 25 --n_iter 100 --verbose 1 --learning_rate 0.01 --weight_images 0 --corr_type masking --corr_frac 0.5 --encode_valid --enc_act_func sigmoid --dec_act_func sigmoid --loss_func cross_entropy --opt momentum --momentum 0.9 --dropout 0.5

number of hidden units: 1000

batch_size: 25

epochs: 100

learning rate: 0.01

input corruption type and frac: masking 0.5 (set 50% of the pixels to zero)

encoder activation function: sigmoid

decoder activation function: sigmoid

loss function: cross entropy

optimizer: momentum, 0.9

encoder layer dropout probability: 0.5
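For context, here is a minimal sketch of roughly this configuration using the Keras API. This is not the linked script: the data loading and the masking helper are illustrative assumptions, and in practice the corruption would usually be re-sampled each epoch rather than fixed once.

```python
import numpy as np
import tensorflow as tf

# Load MNIST and flatten to 784-dimensional vectors in [0, 1]
(x_train, _), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0

def masking_noise(x, corr_frac=0.5):
    # Masking corruption: set a random corr_frac of the pixels to zero
    mask = np.random.binomial(1, 1.0 - corr_frac, size=x.shape)
    return x * mask

x_noisy = masking_noise(x_train, corr_frac=0.5)  # corrupted once, for brevity

# 784 -> 1000 sigmoid encoder with dropout, 1000 -> 784 sigmoid decoder
model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),
    tf.keras.layers.Dense(1000, activation="sigmoid"),  # n_components
    tf.keras.layers.Dropout(0.5),                       # encoder-layer dropout
    tf.keras.layers.Dense(784, activation="sigmoid"),
])

model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),
    loss="binary_crossentropy",  # cross-entropy reconstruction loss
)

# Train to reconstruct the clean input from the corrupted input
model.fit(x_noisy, x_train, batch_size=25, epochs=100, validation_split=0.1)
```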

The question is: what is a good choice of hyperparameters for the MNIST dataset?

Best Answer

Hyperparameter choice is something that can't really be answered definitively: there are procedures that can be followed, but it is largely a matter of trial and error.

A single DA can indeed extract meaningful features. However, when the encoding dimension L (say) is larger than the input dimension D (i.e. an overcomplete representation), most of the learned features tend to end up as random noise. The reason your autoencoder is not learning meaningful features is that, with the degree of freedom it has in the encoding layer (L > D), it becomes quite easy for it to learn something close to an identity mapping of the input.

To alleviate this problem, you have to add constraints that limit this degree of freedom. I would try the following and see what the outcome is:

  1. The first, and probably the easiest, step is to reduce the number of encoding-layer nodes from 1000 to something a little closer to the input dimension of 784; I would say 800 is a good start. Then visualize the features again and see whether they have improved.

  2. Apply additional regularization constraints, e.g. L2 regularization on the weights (and if you are already doing that, increase the corresponding penalty term) or other penalization techniques.

  3. Tied weights. Use tied weights for the encoding and decoding layers if you are not doing so already, i.e. W_decoding = W_encoding.T. Without tied weights, one of the two layers often learns larger, better weights (for lack of a better term) and compensates for the poor weights learned by the other. Tying them forces the autoencoder to learn a balanced set of weights; it also tends to shorten training time and sharply limits the degrees of freedom, since the number of free, trainable parameters is halved. A sketch combining this with the two previous suggestions follows this list.
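Here is a minimal sketch of how suggestions 1–3 can be combined: 800 hidden units, an L2 penalty on the weights, and a tied-weight decoder. It assumes the same Keras-style setup as in the question; the class name and penalty strength are illustrative, and dropout is omitted for brevity.

```python
import tensorflow as tf

class TiedDenoisingAutoencoder(tf.keras.Model):
    """Sketch of a DAE with tied weights (W_dec = W_enc.T) and an L2 penalty."""

    def __init__(self, n_input=784, n_hidden=800, l2=1e-4):
        super().__init__()
        # One shared weight matrix; the regularizer adds an L2 penalty to the loss.
        self.W = self.add_weight(
            name="W", shape=(n_input, n_hidden),
            initializer="glorot_uniform",
            regularizer=tf.keras.regularizers.l2(l2),
        )
        self.b_enc = self.add_weight(name="b_enc", shape=(n_hidden,), initializer="zeros")
        self.b_dec = self.add_weight(name="b_dec", shape=(n_input,), initializer="zeros")

    def call(self, x):
        h = tf.nn.sigmoid(tf.matmul(x, self.W) + self.b_enc)                   # encoder
        return tf.nn.sigmoid(tf.matmul(h, tf.transpose(self.W)) + self.b_dec)  # tied decoder

model = TiedDenoisingAutoencoder()
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9),
    loss="binary_crossentropy",
)
# model.fit(x_noisy, x_train, batch_size=25, epochs=100)  # same corrupted/clean pairs as before
```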

Give this a try. Might help.