Solved – Optimizing parameters for CNN autoencoder based on training and validation loss

autoencoders · convolution · hyperparameter · neural networks

I have designed an autoencoder with an encoder and decoder consisting of 2D convolutional layers (the input is 40,000 2D images). I train the autoencoder with the Adam optimizer. The autoencoder has the following hyperparameters, which I would like to tune (my default values are in brackets; a code sketch follows the list):

  • Number of layers in encoder and decoder (I start with 2 in decoder and encoder)
  • Number of filters in the convolutional layers (I start with 32 and 64)
  • Convolutional kernel size (I start with 3×3)
  • Stride size (I start with 2×2)
  • Dropout (I start with 0.25 after each layer)
  • Learning rate (I start with 0.001)
  • Learning rate decay (I start with 0)
  • Latent dimension (I start with 8)
  • Number of units in the dense layer (layer before creating latent space, I start with 16)
  • Batch size (I start with 128)
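
For concreteness, here is a minimal Keras sketch of this setup with the default values above; the input image size is not stated, so 64×64 single-channel images are assumed:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

IMG_SHAPE = (64, 64, 1)  # assumed size; the question only says 40,000 2D images
LATENT_DIM = 8

# Encoder: two strided convolutions (32 and 64 filters, 3x3 kernels, 2x2 strides)
# with dropout after each, then a 16-unit dense layer before the latent space.
encoder_input = layers.Input(shape=IMG_SHAPE)
x = layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(encoder_input)
x = layers.Dropout(0.25)(x)
x = layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(x)
x = layers.Dropout(0.25)(x)
x = layers.Flatten()(x)
x = layers.Dense(16, activation="relu")(x)
latent = layers.Dense(LATENT_DIM, name="latent")(x)
encoder = models.Model(encoder_input, latent, name="encoder")

# Decoder: mirror of the encoder, built from transposed convolutions.
decoder_input = layers.Input(shape=(LATENT_DIM,))
x = layers.Dense(16 * 16 * 64, activation="relu")(decoder_input)
x = layers.Reshape((16, 16, 64))(x)
x = layers.Conv2DTranspose(64, 3, strides=2, padding="same", activation="relu")(x)
x = layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu")(x)
decoder_output = layers.Conv2DTranspose(1, 3, padding="same", activation="sigmoid")(x)
decoder = models.Model(decoder_input, decoder_output, name="decoder")

# Learning rate 0.001 with no decay, i.e. the Adam defaults.
autoencoder = models.Model(encoder_input, decoder(encoder(encoder_input)))
autoencoder.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3), loss="mse")
```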

One possibility would be to use grid or random search, but with so many hyperparameters this is inefficient and takes a long time. Instead, I would like to observe the training and validation loss (using TensorBoard) and adjust the parameters accordingly. For example, the loss curves might reveal overfitting or underfitting (or an increasing loss, etc.).
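
Logging both losses to TensorBoard takes a single Keras callback; here is a minimal sketch, assuming the `autoencoder` model sketched above and random arrays standing in for the real images:

```python
import numpy as np

# Stand-in data; replace with your 40,000 images (assumed 64x64, 1 channel).
x = np.random.rand(40000, 64, 64, 1).astype("float32")
x_train, x_val = x[:36000], x[36000:]

# Writes training and validation loss per epoch to the log directory.
tensorboard_cb = tf.keras.callbacks.TensorBoard(log_dir="logs/ae_baseline")

autoencoder.fit(
    x_train, x_train,                # autoencoder: the target equals the input
    validation_data=(x_val, x_val),
    epochs=50,
    batch_size=128,
    callbacks=[tensorboard_cb],
)
# Inspect the curves with: tensorboard --logdir logs
```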

Are there general rules or hints for how the hyperparameters could be adjusted based on the observed losses, or based on other criteria?

Best Answer

As you have correctly assumed, there are simply too many hyperparameters to perform a full grid search. Luckily, many of them have a relatively small effect compared to the others:

  • Convolution kernel size (3×3 works well almost universally)
  • Convolution stride (2×2)
  • Batch size
  • Dropout (not needed for convolutional layers; for dense layers, use it only if you observe overfitting)

There are some rules of thumb for setting the learning rate and its decay. A popular one is to gradually increase the learning rate during a short training run and record the loss, which should yield a curve like the following (taken from this article): [Figure: loss plotted against learning rate, first decreasing, then growing fast.]

The region of the steepest loss descent hints at a good learning-rate choice. Alternatively, optimizers like Adam are relatively robust to the choice of learning rate (since the size of each update is adapted per parameter), but they have their own parameters that can be tuned for optimal performance.
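
As a concrete illustration (not from the answer itself), such a learning-rate range test can be scripted as a Keras callback that raises the rate each batch and records the loss; the bounds and schedule below are illustrative choices:

```python
import numpy as np
import tensorflow as tf

class LRRangeTest(tf.keras.callbacks.Callback):
    """Exponentially increase the learning rate each batch and record the loss."""

    def __init__(self, min_lr=1e-6, max_lr=1.0, num_steps=1000):
        super().__init__()
        self.lrs = np.geomspace(min_lr, max_lr, num_steps)  # exponential schedule
        self.history = []  # (learning_rate, loss) pairs for plotting

    def on_train_batch_begin(self, batch, logs=None):
        step = min(len(self.history), len(self.lrs) - 1)
        self.current_lr = float(self.lrs[step])
        self.model.optimizer.learning_rate.assign(self.current_lr)

    def on_train_batch_end(self, batch, logs=None):
        self.history.append((self.current_lr, logs["loss"]))

# One short pass is enough; afterwards, plot loss against learning rate on a
# log scale and pick a rate from the region of steepest descent:
# autoencoder.fit(x_train, x_train, epochs=1, batch_size=128,
#                 callbacks=[LRRangeTest()])
```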

What remains are:

  • number of layers,
  • number of units per layer,
  • latent dimension

These should be easy to tune: start with a small network and increase its size as long as your validation error keeps improving.
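
One way to put that rule into practice is a loop over configurations of increasing capacity that stops once the validation loss no longer improves. A rough sketch, with `build_autoencoder` as a hypothetical helper returning a compiled model for a given configuration:

```python
# build_autoencoder is a hypothetical helper that returns a compiled model
# for the given configuration (e.g. assembled like the sketch in the question).
configs = [
    {"num_layers": 1, "dense_units": 8,  "latent_dim": 4},
    {"num_layers": 2, "dense_units": 16, "latent_dim": 8},
    {"num_layers": 3, "dense_units": 32, "latent_dim": 16},
]

best_val_loss = float("inf")
for cfg in configs:
    model = build_autoencoder(**cfg)
    history = model.fit(x_train, x_train,
                        validation_data=(x_val, x_val),
                        epochs=30, batch_size=128, verbose=0)
    val_loss = min(history.history["val_loss"])
    print(cfg, "best val_loss:", val_loss)
    if val_loss >= best_val_loss:
        break  # the larger model no longer helps; keep the previous one
    best_val_loss = val_loss
```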