Here is a possible explanation (it could very well be wrong, though); maybe you could try modifying their tutorial code to see whether it holds.
For minibatch descent methods, the parameters of the model are updated after each minibatch. It's important to note that in the code you posted, the training error of each minibatch is therefore computed with a different set of weights.
The validation error, on the other hand, is computed with a single, fixed set of weights.
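To make that concrete, here is a minimal, self-contained numpy sketch (a toy linear model, not the tutorial's or the OP's code) showing how the "training loss" reported for an epoch mixes losses computed with many different weight vectors, while the validation loss is computed with one fixed weight vector:

```python
import numpy as np

# Toy data and a toy linear model, purely for illustration.
rng = np.random.RandomState(0)
X_train, y_train = rng.randn(256, 5), rng.randn(256)
X_val, y_val = rng.randn(64, 5), rng.randn(64)
w = np.zeros(5)
lr, batch_size = 0.05, 32

def mse(w, X, y):
    return np.mean((X.dot(w) - y) ** 2)

batch_losses = []
for start in range(0, len(X_train), batch_size):
    xb, yb = X_train[start:start + batch_size], y_train[start:start + batch_size]
    batch_losses.append(mse(w, xb, yb))            # loss uses the weights *before* this update
    grad = 2 * xb.T.dot(xb.dot(w) - yb) / len(xb)  # gradient of the batch MSE
    w -= lr * grad                                 # weights change before the next batch is scored

train_loss = np.mean(batch_losses)  # average over losses from many *different* weight vectors
val_loss = mse(w, X_val, y_val)     # one fixed (end-of-epoch) weight vector
print(train_loss, val_loss)
```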
And perhaps more importantly, the MLP is being trained with dropout. When the training error is computed, dropout is not turned off, unlike for the validation error. In particular, note that in the code for the validation function we have deterministic=True, while this is absent from the training function.
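For reference, this is roughly the pattern the Lasagne tutorial uses; the toy MLP and variable names below are illustrative assumptions on my part, not the OP's exact code:

```python
import theano
import theano.tensor as T
import lasagne

input_var = T.matrix('inputs')
target_var = T.ivector('targets')

# A small MLP with dropout, just to have something concrete.
network = lasagne.layers.InputLayer((None, 784), input_var=input_var)
network = lasagne.layers.DropoutLayer(network, p=0.2)
network = lasagne.layers.DenseLayer(network, 256, nonlinearity=lasagne.nonlinearities.rectify)
network = lasagne.layers.DropoutLayer(network, p=0.5)
network = lasagne.layers.DenseLayer(network, 10, nonlinearity=lasagne.nonlinearities.softmax)

# Training loss: dropout is ACTIVE (stochastic forward pass).
prediction = lasagne.layers.get_output(network)
loss = lasagne.objectives.categorical_crossentropy(prediction, target_var).mean()
params = lasagne.layers.get_all_params(network, trainable=True)
updates = lasagne.updates.nesterov_momentum(loss, params, learning_rate=0.01, momentum=0.9)
train_fn = theano.function([input_var, target_var], loss, updates=updates)

# Validation loss: dropout is DISABLED via deterministic=True.
test_prediction = lasagne.layers.get_output(network, deterministic=True)
test_loss = lasagne.objectives.categorical_crossentropy(test_prediction, target_var).mean()
test_acc = T.mean(T.eq(T.argmax(test_prediction, axis=1), target_var), dtype=theano.config.floatX)
val_fn = theano.function([input_var, target_var], [test_loss, test_acc])
```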
Note also that the very purpose of using dropout is to prevent overfitting, i.e. a training error that is lower than the validation error. To that end, dropout appears to be doing its job, since it is very easy these days for even relatively shallow models to overfit the MNIST dataset.
So here's what you can try: after each epoch of training, run the val_fn on the training set as well as the validation set. This will come at the cost of an additional full forward pass through the training set. But for MNIST and a simple model like MLP, it isn't going to cost too much in terms of computation and it might be worth it to get your hands dirty modifying some Lasagne code as well as building some general intuition about minibatch + Dropout training.
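A sketch of how that extra pass could look, assuming the tutorial's train_fn, val_fn (returning loss and accuracy), iterate_minibatches, X_train/y_train, X_val/y_val and num_epochs are already defined as in the tutorial (this is a modification sketch, not the tutorial's verbatim loop):

```python
for epoch in range(num_epochs):
    # Usual training pass: dropout active, weights change after every minibatch.
    train_err, train_batches = 0, 0
    for inputs, targets in iterate_minibatches(X_train, y_train, 500, shuffle=True):
        train_err += train_fn(inputs, targets)
        train_batches += 1

    # Extra pass: score the *training* set with val_fn, i.e. with the
    # end-of-epoch weights and dropout switched off (deterministic=True).
    clean_train_err, clean_train_batches = 0, 0
    for inputs, targets in iterate_minibatches(X_train, y_train, 500, shuffle=False):
        err, _ = val_fn(inputs, targets)
        clean_train_err += err
        clean_train_batches += 1

    # Usual validation pass.
    val_err, val_batches = 0, 0
    for inputs, targets in iterate_minibatches(X_val, y_val, 500, shuffle=False):
        err, _ = val_fn(inputs, targets)
        val_err += err
        val_batches += 1

    print("epoch {}: train (during SGD) {:.4f}, train (clean) {:.4f}, val {:.4f}".format(
        epoch, train_err / train_batches,
        clean_train_err / clean_train_batches, val_err / val_batches))
```

If the "clean" training loss ends up below the validation loss while the during-SGD training loss stays above it, that would support the explanation above.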
A model is considered to be 'underfitting' when both the training and validation error are high. When your validation error is lower than your training error, the fit is ambiguous and inconclusive.
There are multiple reasons why this can happen, and one can't be completely certain which applies. It could be that the validation cases happen to be the type of data that influenced the model the most during training, or that they are the type of data your model is good at predicting. If the difference is very small, it could simply be random.
That said, based on your question it is difficult to comment more specifically without knowing the train-test split method or the evaluation metric.
Regarding underfitting and overfitting you can keep these in mind:
- Underfitting: High Validation and Training Error
- Overfitting: High Validation Error and Low Training Error
Best Answer
It is difficult to be certain without knowing your actual methodology (e.g. cross-validation method, performance metric, data splitting method, etc.).
Generally speaking though, training error will almost always underestimate your validation error. However, it is possible for the validation error to be less than the training error. You can think of it two ways:
- Your training set happened to contain many 'hard' cases to learn.
- Your validation set happened to contain mostly 'easy' cases to predict.
That is why it is important to really evaluate your model-training methodology. If you don't split your data properly, your results will lead to confusing, if not simply incorrect, conclusions.
I think of model evaluation in four different categories:
- Underfitting – Validation and training error high
- Overfitting – Validation error is high, training error low
- Good fit – Validation error low, slightly higher than the training error
- Unknown fit – Validation error low, training error 'high'
I say 'unknown' fit because the result is counterintuitive to how machine learning works. The essence of ML is to predict the unknown. If you are better at predicting the unknown than predicting what you have 'learned', then AFAIK the data must differ in some way between training and validation. This could mean you need to re-evaluate your data splitting method, add more data, or possibly change your performance metric (are you actually measuring the performance you want?).
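As a rough illustration only, the four categories above could be expressed as a rule of thumb in a few lines of Python. The thresholds ("high", the allowed margin) are arbitrary assumptions chosen for the example; in practice they depend entirely on your metric and task:

```python
def diagnose_fit(train_err, val_err, high=0.20, margin=0.05):
    """Toy rule of thumb for the four fit categories discussed above."""
    if train_err >= high and val_err >= high:
        return "underfitting"          # both errors high
    if val_err >= high and train_err < high:
        return "overfitting"           # validation high, training low
    if val_err < high and val_err >= train_err - margin:
        return "good fit"              # validation low, close to (or above) training
    return "unknown fit"               # validation notably lower than training

print(diagnose_fit(train_err=0.30, val_err=0.32))  # underfitting
print(diagnose_fit(train_err=0.03, val_err=0.25))  # overfitting
print(diagnose_fit(train_err=0.04, val_err=0.06))  # good fit
print(diagnose_fit(train_err=0.15, val_err=0.05))  # unknown fit
```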
EDIT
To address the OP's reference to a previous Python Lasagne question:
This suggests that you have sufficient data not to require cross-validation, and can simply keep separate training, validation, and testing subsets. Now, if you look at the Lasagne tutorial, you can see the same behavior at the top of the page. I would find it hard to believe the authors would post such results if they were strange, but instead of simply assuming they are correct, let's look further. The section of most interest here is the training loop; just above the bottom of that section you can see how the losses are calculated.
The training loss is calculated over the entire training dataset, and likewise the validation loss is calculated over the entire validation dataset. The training set is typically at least four times as large as the validation set (an 80-20 split). Given that the error is accumulated over all samples, you could expect up to roughly 4x the loss measure of the validation set. You will notice, however, that the training loss and validation loss approach one another as training continues. This is intentional: if your training error began to drop below your validation error, you would be beginning to overfit your model!
I hope this clarifies these errors.