Solved – Optimizing parameters for CNN autoencoder based on training and validation loss

autoencoders · convolution · hyperparameter · neural networks

I have designed an autoencoder with an encoder and decoder consisting of 2D convolutional layers (the input is 40,000 2D images). I train the autoencoder with the Adam optimizer. The autoencoder has the following hyperparameters, which I would like to tune (my default values are in brackets; a code sketch follows the list):

  • Number of layers in encoder and decoder (I start with 2 in decoder and encoder)
  • Number of filters in the convolutional layers (I start with 32 and 64)
  • Convolutional kernel size (I start with 3×3)
  • Stride size (I start with 2×2)
  • Dropout (I start with 0.25 after each layer)
  • Learning rate (I start with 0.001)
  • Learning rate decay (I start with 0)
  • Latent dimension (I start with 8)
  • Number of units in the dense layer (layer before creating latent space, I start with 16)
  • Batch size (I start with 128)
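
For concreteness, here is a minimal Keras sketch of this setup with the default values above; the input image size is not stated, so 64×64 single-channel images are assumed:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

IMG_SHAPE = (64, 64, 1)  # assumed size; the question only says 40,000 2D images
LATENT_DIM = 8

# Encoder: two strided convolutions (32 and 64 filters, 3x3 kernels, 2x2 strides)
# with dropout after each, then a 16-unit dense layer before the latent space.
encoder_input = layers.Input(shape=IMG_SHAPE)
x = layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(encoder_input)
x = layers.Dropout(0.25)(x)
x = layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(x)
x = layers.Dropout(0.25)(x)
x = layers.Flatten()(x)
x = layers.Dense(16, activation="relu")(x)
latent = layers.Dense(LATENT_DIM, name="latent")(x)
encoder = models.Model(encoder_input, latent, name="encoder")

# Decoder: mirror of the encoder, built from transposed convolutions.
decoder_input = layers.Input(shape=(LATENT_DIM,))
x = layers.Dense(16 * 16 * 64, activation="relu")(decoder_input)
x = layers.Reshape((16, 16, 64))(x)
x = layers.Conv2DTranspose(64, 3, strides=2, padding="same", activation="relu")(x)
x = layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu")(x)
decoder_output = layers.Conv2DTranspose(1, 3, padding="same", activation="sigmoid")(x)
decoder = models.Model(decoder_input, decoder_output, name="decoder")

# Learning rate 0.001 with no decay, i.e. the Adam defaults.
autoencoder = models.Model(encoder_input, decoder(encoder(encoder_input)))
autoencoder.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3), loss="mse")
```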

One possibility would be to use grid or random search, but with so many hyperparameters this is inefficient and takes a long time. Instead, I would like to observe the training and validation loss (using TensorBoard) and adjust the parameters accordingly. For example, the loss curves might reveal overfitting or underfitting (or an increasing loss, etc.).
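
Logging both losses to TensorBoard takes a single Keras callback; here is a minimal sketch, assuming the `autoencoder` model sketched above and random arrays standing in for the real images:

```python
import numpy as np

# Stand-in data; replace with your 40,000 images (assumed 64x64, 1 channel).
x = np.random.rand(40000, 64, 64, 1).astype("float32")
x_train, x_val = x[:36000], x[36000:]

# Writes training and validation loss per epoch to the log directory.
tensorboard_cb = tf.keras.callbacks.TensorBoard(log_dir="logs/ae_baseline")

autoencoder.fit(
    x_train, x_train,                # autoencoder: the target equals the input
    validation_data=(x_val, x_val),
    epochs=50,
    batch_size=128,
    callbacks=[tensorboard_cb],
)
# Inspect the curves with: tensorboard --logdir logs
```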

Are there general rules or hints for how the hyperparameters could be adjusted based on the observed losses, or based on other criteria?

Best Answer

As you have correctly assumed, there are simply too many hyperparameters to perform a full grid search. Luckily, many of them have a relatively small effect compared to the others:

  • Convolution kernel size (3×3 works well almost universally)
  • Convolution stride (2×2)
  • Batch size
  • Dropout (not needed for convolutional layers; for dense layers, use it only if you observe overfitting)

There are some rules of thumb for setting the learning rate and its decay. A popular one is to gradually increase the learning rate during a short training run and record the loss, which should yield a curve like the following (taken from this article): [Figure: loss plotted against learning rate, first decreasing, then growing fast.]

The region of the steepest loss descent hints at a good learning-rate choice. Alternatively, optimizers like Adam are relatively robust to the choice of learning rate (since the size of each update is adapted per parameter), but they have their own parameters that can be tuned for optimal performance.
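
As a concrete illustration (not from the answer itself), such a learning-rate range test can be scripted as a Keras callback that raises the rate each batch and records the loss; the bounds and schedule below are illustrative choices:

```python
import numpy as np
import tensorflow as tf

class LRRangeTest(tf.keras.callbacks.Callback):
    """Exponentially increase the learning rate each batch and record the loss."""

    def __init__(self, min_lr=1e-6, max_lr=1.0, num_steps=1000):
        super().__init__()
        self.lrs = np.geomspace(min_lr, max_lr, num_steps)  # exponential schedule
        self.history = []  # (learning_rate, loss) pairs for plotting

    def on_train_batch_begin(self, batch, logs=None):
        step = min(len(self.history), len(self.lrs) - 1)
        self.current_lr = float(self.lrs[step])
        self.model.optimizer.learning_rate.assign(self.current_lr)

    def on_train_batch_end(self, batch, logs=None):
        self.history.append((self.current_lr, logs["loss"]))

# One short pass is enough; afterwards, plot loss against learning rate on a
# log scale and pick a rate from the region of steepest descent:
# autoencoder.fit(x_train, x_train, epochs=1, batch_size=128,
#                 callbacks=[LRRangeTest()])
```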

What remains are:

  • number of layers,
  • number of units per layer,
  • latent dimension

These should be easy to tune: start with a small network and increase its size as long as your validation error keeps improving.
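
One way to put that rule into practice is a loop over configurations of increasing capacity that stops once the validation loss no longer improves. A rough sketch, with `build_autoencoder` as a hypothetical helper returning a compiled model for a given configuration:

```python
# build_autoencoder is a hypothetical helper that returns a compiled model
# for the given configuration (e.g. assembled like the sketch in the question).
configs = [
    {"num_layers": 1, "dense_units": 8,  "latent_dim": 4},
    {"num_layers": 2, "dense_units": 16, "latent_dim": 8},
    {"num_layers": 3, "dense_units": 32, "latent_dim": 16},
]

best_val_loss = float("inf")
for cfg in configs:
    model = build_autoencoder(**cfg)
    history = model.fit(x_train, x_train,
                        validation_data=(x_val, x_val),
                        epochs=30, batch_size=128, verbose=0)
    val_loss = min(history.history["val_loss"])
    print(cfg, "best val_loss:", val_loss)
    if val_loss >= best_val_loss:
        break  # the larger model no longer helps; keep the previous one
    best_val_loss = val_loss
```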