Solved – Dropout before Batch Normalization

deep learning, dropout, keras

In the last course of the Deep Learning Specialization on Coursera by Andrew Ng, you can see that he uses the following sequence of layers on the output of an LSTM layer:

Dropout -> BatchNorm -> Dropout.
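For concreteness, here is a minimal Keras sketch of that ordering. The timestep count, feature size, LSTM width, dropout rates, and output head are placeholders of my own, not the values used in the course:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Sketch of the Dropout -> BatchNorm -> Dropout ordering in question.
# All sizes and rates below are illustrative assumptions.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, 64)),            # (timesteps, features)
    layers.LSTM(128, return_sequences=True),
    layers.Dropout(0.2),                         # dropout applied to the LSTM output
    layers.BatchNormalization(),                 # normalizes the dropped-out activations
    layers.Dropout(0.2),                         # second dropout after the BatchNorm
    layers.TimeDistributed(layers.Dense(1, activation="sigmoid")),
])
model.summary()
```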

To be honest, I do not see any sense in this. I don't think dropout should be used before batch normalization: depending on the implementation in Keras, which I am not completely familiar with, dropout there either has no effect or has a harmful one, since the batch statistics are estimated on activations that have been randomly zeroed during training but not at inference.
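As a toy illustration of what I mean (my own sketch, not from the course): with inverted dropout, the variance of the activations a following BatchNorm layer sees during training differs from the variance of the same activations at inference, when dropout is switched off, so the running statistics it accumulates don't match what it later normalizes with. The numbers below are just made-up placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
keep_prob = 0.5                                  # assumed dropout keep probability
x = rng.normal(loc=1.0, scale=1.0, size=1_000_000)

# Inverted dropout as applied at training time: drop, then rescale by 1/keep_prob.
mask = rng.random(x.shape) < keep_prob
x_train = x * mask / keep_prob

print("variance seen by BatchNorm during training:", x_train.var())  # roughly 3.0
print("variance seen by BatchNorm at inference:   ", x.var())        # roughly 1.0
```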

I might be missing something here, though, so if anyone knows why an arrangement like this could be useful, I'd love to hear it.

Best Answer

The way I see it, this introduces much more noise into the model than a single batch normalization layer would. And as shown in https://arxiv.org/pdf/1801.05134.pdf, dropout doesn't go well with batch normalization anyway. No one says Andrew Ng is infallible.
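For contrast, a commonly used arrangement (and roughly in the spirit of what that paper suggests, if I remember it right) is to keep batch normalization inside the recurrent stack and apply a single dropout only after the last BatchNormalization layer, near the output head. A hedged sketch, with all sizes and rates made up:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Sketch of an alternative ordering: BatchNorm inside the stack, one Dropout
# only after the final BatchNorm. Sizes and rates are arbitrary placeholders.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(None, 64)),
    layers.LSTM(128, return_sequences=True),
    layers.BatchNormalization(),
    layers.LSTM(128, return_sequences=True),
    layers.BatchNormalization(),
    layers.Dropout(0.2),                         # dropout after the last BatchNorm
    layers.TimeDistributed(layers.Dense(1, activation="sigmoid")),
])
model.summary()
```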