Solved – Python Lasagne tutorial: validation error lower than training error

dropout, machine learning, neural networks, python

In the Lasagne tutorial (here and source code here) a simple multilayer perceptron is trained on the MNIST dataset.
The data is split into a training set and a validation set, and at the end of each epoch the training loop computes the validation error, expressed as the average cross-entropy loss per batch.

However, the validation error is always lower than the training error. Why does that happen? Shouldn't the training error be lower, since it is the data the network is trained on? Could this be an effect of the dropout layers (enabled during training, but disabled when the validation error is computed)?

Output of the first few epochs:

Epoch 1 of 500 took 1.858s
  training loss:                1.233348
  validation loss:              0.405868
  validation accuracy:          88.78 %
Epoch 2 of 500 took 1.845s
  training loss:                0.571644
  validation loss:              0.310221
  validation accuracy:          91.24 %
Epoch 3 of 500 took 1.845s
  training loss:                0.471582
  validation loss:              0.265931
  validation accuracy:          92.35 %
Epoch 4 of 500 took 1.847s
  training loss:                0.412204
  validation loss:              0.238558
  validation accuracy:          93.05 %

Best Answer

Here is a possible explanation (it could very well be wrong, however); you could try modifying the tutorial code to see whether it holds.

With minibatch gradient descent the model parameters are updated after every minibatch. In the code you posted, the training loss reported for an epoch is therefore an average of per-batch losses that were each computed with a different, still-improving set of weights: batches early in the epoch are scored by weights that are worse than the final ones.

The validation loss, by contrast, is computed in one pass with a single fixed set of weights, namely the ones obtained at the end of the epoch.
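
To see why averaging over a moving target inflates the number, here is a toy illustration with made-up loss values (nothing Lasagne-specific):

    import numpy as np

    # Suppose the per-batch loss falls steadily within one epoch as the
    # weights improve after every update (hypothetical numbers):
    batch_losses = np.linspace(1.8, 0.6, num=100)

    # What the tutorial reports as "training loss": the average of losses
    # recorded while the weights were still moving.
    print(batch_losses.mean())        # ~1.2

    # What the validation loss is compared against: the end-of-epoch
    # weights. The last few batches of the epoch approximate their loss.
    print(batch_losses[-10:].mean())  # ~0.65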

And perhaps more importantly, the MLP is trained with dropout. When the training loss is computed, dropout is left on, unlike for the validation loss. Concretely, the tutorial builds the validation expressions with deterministic=True (which disables dropout), while this flag is absent from the training expressions.
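
For reference, the relevant part of the tutorial's model-building code looks roughly like this (paraphrased; network and target_var are the tutorial's output layer and target variable):

    import lasagne

    # Training expressions: dropout stays active, so each loss value is
    # computed on a randomly thinned network.
    prediction = lasagne.layers.get_output(network)
    loss = lasagne.objectives.categorical_crossentropy(prediction, target_var)
    loss = loss.mean()

    # Validation expressions: deterministic=True disables dropout, so the
    # full network is used.
    test_prediction = lasagne.layers.get_output(network, deterministic=True)
    test_loss = lasagne.objectives.categorical_crossentropy(test_prediction,
                                                            target_var)
    test_loss = test_loss.mean()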

Note in particular that the very purpose of dropout is to prevent overfitting, i.e. a training error that is much lower than the validation error. By that measure dropout is doing its job, since it is very easy these days for even relatively shallow models to overfit MNIST.

So here's what you can try: after each epoch of training, run val_fn on the training set as well as the validation set. This costs an extra full forward pass over the training data, but for MNIST and a simple MLP it is cheap, and it is worth it to get your hands dirty modifying some Lasagne code and to build some general intuition about minibatch + dropout training. If the explanation above is right, the training loss measured this way, with dropout off and a single fixed set of weights, should come out at or below the validation loss. A sketch of the change follows.
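
A minimal sketch of that change, inserted at the end of each epoch in the tutorial's training loop; it assumes the tutorial's val_fn (compiled with deterministic=True) and its iterate_minibatches helper, with X_train/y_train coming from load_dataset():

    # Extra pass: score the training set with dropout off and the
    # end-of-epoch weights, exactly as the validation set is scored.
    det_err, det_acc, det_batches = 0, 0, 0
    for batch in iterate_minibatches(X_train, y_train, 500, shuffle=False):
        inputs, targets = batch
        err, acc = val_fn(inputs, targets)
        det_err += err
        det_acc += acc
        det_batches += 1

    print("  deterministic training loss:\t\t{:.6f}".format(
        det_err / det_batches))
    print("  deterministic training accuracy:\t{:.2f} %".format(
        det_acc / det_batches * 100))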