Solved – Why does dropout increase the training time per epoch in a neural network

dropout | keras | neural-networks

I'm training an MLP with one hidden layer using batch gradient descent in Keras/TensorFlow.
Applying dropout to the input layer increased the training time per epoch by about 25%, independent of the dropout rate.
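
A minimal sketch of the kind of setup being described (the exact architecture, data, and hyperparameters are my assumptions, not the code from the question): a one-hidden-layer MLP in Keras with dropout on the inputs, trained with full-batch gradient descent.

```python
import numpy as np
import tensorflow as tf

# Placeholder data standing in for the real training set.
X = np.random.rand(1000, 20).astype("float32")
y = np.random.randint(0, 2, size=(1000, 1))

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(20,)),
    tf.keras.layers.Dropout(0.2),                   # dropout applied to the input layer
    tf.keras.layers.Dense(64, activation="relu"),   # the single hidden layer
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="sgd", loss="binary_crossentropy")
model.fit(X, y, batch_size=len(X), epochs=10)       # batch GD: one batch = the whole dataset
```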

That dropout increases the number of epochs needed to reach the validation-loss minimum is clear, but I thought that dropping out units would decrease the training time per epoch.

Does anyone know the reason?

Best Answer

but I thought that dropping out units would decrease the training time per epoch.

That's not the case. I understand your rationale though. You thought that zeroing out components would make for less computation. That would be the case for sparse matrices, but not for dense matrices.

TensorFlow, like any other deep learning framework, uses vectorized operations on dense matrices*. This means the number of zeros makes no difference: the matrix operations are computed over all entries regardless of their values.
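
As a rough illustration (my own sketch, not from the original answer, with NumPy standing in for the dense kernels TensorFlow dispatches to): a dense matrix product takes essentially the same time whether or not most of the entries are zero.

```python
import timeit
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((2048, 2048))
w = rng.standard_normal((2048, 2048))

x_zeroed = x.copy()
x_zeroed[rng.random(x.shape) < 0.9] = 0.0  # zero out roughly 90% of the entries

dense_t = timeit.timeit(lambda: x @ w, number=20)
zeroed_t = timeit.timeit(lambda: x_zeroed @ w, number=20)
print(f"no zeros: {dense_t:.2f}s   ~90% zeros: {zeroed_t:.2f}s")  # nearly identical
```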

In reality, the opposite is true, because each training step with dropout requires (as sketched below)

  • extra matrices for the dropout masks,
  • drawing a random number for each entry of these matrices, and
  • multiplying the masks with the corresponding activations.
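
A minimal sketch of this extra per-step work (my own illustration in NumPy, not Keras internals): an inverted-dropout forward pass that builds a mask, fills it with random draws, and multiplies it into the activations.

```python
import numpy as np

def dropout_forward(a, rate, rng):
    # Extra work 1: allocate a mask with the same shape as the activations.
    # Extra work 2: draw one random number per entry.
    mask = (rng.random(a.shape) >= rate).astype(a.dtype)
    # Extra work 3: elementwise multiply, rescaled so expectations match inference.
    return a * mask / (1.0 - rate)

rng = np.random.default_rng(0)
activations = rng.standard_normal((128, 512))  # a batch of hidden activations
dropped = dropout_forward(activations, rate=0.5, rng=rng)
```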

* They also support sparse matrices, but those don't make sense for most weight and activation matrices: sparse formats only pay off when far more than half of the entries are zero.
