I'm training an MLP neural network with one hidden layer and batch gradient descent using Keras/Tensorflow.
Applying dropout to the input layer increased the training time per epoch by about 25 %, independent of the dropout rate.
That dropout increases the number of epochs needed to reach a validation loss minimum is clear, but I thought that the training time per epoch would decrease by dropping out units.
Does anyone know the reason?
Best Answer
That's not the case, though I understand your rationale: you thought that zeroing out components would mean less computation. That would be true for sparse matrices, but not for dense ones.
TensorFlow, like any deep learning framework, uses vectorized operations on dense matrices*. This means the number of zeros makes no difference: the matrix operations are computed over all entries either way.
In reality, the opposite is true, because dropout requires extra work on top of the usual forward and backward passes: every batch, the framework has to sample a fresh random mask, multiply the activations by it, and rescale the surviving units. None of that cost depends on the dropout rate, which is why you see a roughly constant ~25 % overhead regardless of the rate you choose.
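A minimal NumPy sketch of inverted dropout (the scaling scheme Keras uses at training time) makes the extra work visible; the array shape and rate here are just illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, rate):
    # Extra per-batch work compared to a plain forward pass:
    # 1) sample a random mask the same shape as the activations,
    mask = rng.random(x.shape) >= rate
    # 2) multiply by the mask, 3) rescale survivors by 1/(1-rate)
    # so the expected activation is unchanged.
    return x * mask / (1.0 - rate)

x = np.ones((4, 3))
y = dropout(x, rate=0.5)
# Surviving entries are scaled to 2.0, dropped entries are 0.0,
# but the mask sampling and elementwise ops run over ALL entries
# no matter what the rate is.
```

Note that steps 1–3 touch every entry of the activation array regardless of how many units end up dropped, so the overhead is independent of the dropout rate.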
* They also support sparse matrices, but these don't make sense for most weight matrices: sparse formats pay off only when the vast majority of entries are zero.
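To illustrate the footnote, a quick sketch with SciPy's CSR format shows why sparsity only pays off at very high zero fractions: the sparse representation stores the nonzero values plus their indices, so it is smaller than the dense array only when almost everything is zero (99 % here, chosen for illustration):

```python
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)

# Dense 1000x1000 matrix where ~99% of entries are zero.
dense = rng.random((1000, 1000))
dense[dense < 0.99] = 0.0

sparse = csr_matrix(dense)

dense_bytes = dense.nbytes
# CSR stores the nonzero values plus column indices and row pointers,
# so its footprint grows with the number of nonzeros, not the shape.
sparse_bytes = (sparse.data.nbytes
                + sparse.indices.nbytes
                + sparse.indptr.nbytes)
```

With dropout rates of 0.2–0.5, roughly half or more of the entries are still nonzero, so a sparse representation would cost more memory and slower kernels than just running the dense operation.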