Neural Networks – Addressing Distorted Validation Loss with Batch Normalization in Convolutional Autoencoders

autoencoders, batch-normalization, keras, neural-networks

I have implemented a variational autoencoder with convolutional layers in Keras. I have around 40,000 training images and 4,000 validation images; the images are heat maps. The encoder and decoder are symmetric. In total I have three convolutional layers (32, 64, and 128 feature maps, each with stride 2). After each layer I apply batch normalization, placed after the ReLU activation.
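For reference, here is a minimal Keras sketch of the kind of encoder described above (three strided Conv2D blocks with 32, 64, and 128 filters, and BatchNormalization placed after the ReLU). The input shape, kernel size, and latent dimension are assumed values, not taken from the original model:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_encoder(input_shape=(128, 128, 1), latent_dim=32):
    """Encoder sketch: three strided conv blocks (32/64/128 filters),
    each followed by ReLU and then batch normalization.
    input_shape, kernel_size and latent_dim are assumptions."""
    inputs = keras.Input(shape=input_shape)
    x = inputs
    for filters in (32, 64, 128):
        x = layers.Conv2D(filters, kernel_size=3, strides=2, padding="same")(x)
        x = layers.Activation("relu")(x)
        x = layers.BatchNormalization()(x)  # BN after the ReLU, as in the question
    x = layers.Flatten()(x)
    # VAE head: mean and log-variance of the approximate posterior
    z_mean = layers.Dense(latent_dim, name="z_mean")(x)
    z_log_var = layers.Dense(latent_dim, name="z_log_var")(x)
    return keras.Model(inputs, [z_mean, z_log_var], name="encoder")
```

The decoder would mirror this with Conv2DTranspose layers in the reverse order (128, 64, 32 filters, stride 2).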

The problem is that without batch normalization the training and validation losses decrease as expected and are smooth, but when I insert batch normalization I either get one huge peak in the validation loss (see left image) or the validation loss becomes very bumpy (see right image). I have experimented with a momentum of 0.99 and 0.9 for the batch normalization layers. With a momentum of 0.9, only the behaviour shown in the left image appears.

[Plots of training and validation loss: the left panel shows a single large spike in the validation loss, the right panel shows a very bumpy validation loss.]
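The momentum values mentioned above correspond to the `momentum` argument of Keras' `BatchNormalization` layer, which controls how quickly the moving mean and variance used at inference time are updated. A short sketch of the two settings tried:

```python
from tensorflow.keras import layers

# The moving statistics are updated roughly as:
#   moving_mean = moving_mean * momentum + batch_mean * (1 - momentum)
# so a lower momentum makes the inference-time statistics track
# recent batches more closely.
bn_default = layers.BatchNormalization(momentum=0.99)  # Keras default
bn_faster  = layers.BatchNormalization(momentum=0.9)   # value tried in the question
```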

What can I do about this? Should I drop batch normalization entirely? As mentioned, without batch normalization the validation loss behaves like the training loss, but I had the impression that batch normalization is used almost everywhere these days.

Best Answer

It is common practice to avoid batch normalization when training VAEs, since the additional stochasticity from normalizing over mini-batches can aggravate instability on top of the stochasticity already introduced by sampling the latent variable.

Source: https://www.tensorflow.org/beta/tutorials/generative/cvae