First, I think you're mistaken about what the three partitions do. You don't make any choices based on the test data. Your algorithms adjust their parameters based on the training data. You then run them on the validation data to compare your algorithms (and their trained parameters) and decide on a winner. You then run the winner on your test data to give you a forecast of how well it will do in the real world.
You don't validate on the training data because the models have already fit to it, so that estimate would be optimistically biased. You don't stop at the winning score from the validation step because you've been iteratively adjusting things to win on that very data, so you need an independent test set (one you haven't specifically been tuning towards) to give you an idea of how well you'll do outside of the current arena.
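As a minimal sketch of that workflow (scikit-learn and a toy dataset are used here purely for illustration; none of the library or model choices are from your setup):

```python
# Illustrative three-way split: train on the training set, pick a winner on
# the validation set, report the winner's score on the held-out test set.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)

# Carve out the test set first, then split the remainder into train/validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

candidates = {
    "logreg": LogisticRegression(max_iter=2000),
    "tree": DecisionTreeClassifier(random_state=0),
}

# Fit each candidate on the training data and compare them on the validation data.
val_scores = {name: m.fit(X_train, y_train).score(X_val, y_val) for name, m in candidates.items()}
winner = max(val_scores, key=val_scores.get)

# Only the winner touches the test set, once, as the final real-world estimate.
print("winner:", winner, "test accuracy:", candidates[winner].score(X_test, y_test))
```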
Second, I would think that one limiting factor here is how much data you have. Most of the time we don't even want to split the data into fixed partitions at all, hence cross-validation (CV).
Here is a possible explanation (it could very well be wrong); you could try modifying their tutorial code to see whether it holds.
For minibatch gradient descent the parameters of the model are updated after every minibatch. It's important to note that in the code you posted, the training error of each minibatch is therefore computed using a different set of weights, so the reported per-epoch training error is an average over many slightly different models.
The validation error, on the other hand, is computed with a single, fixed set of weights (the weights at the end of the epoch).
And perhaps more importantly, the MLP is being trained with dropout. When the training error is computed, dropout is not turned off, unlike for the validation error. In particular, note that in the code for the validation function we have `deterministic=True`, while this is absent from the training function.
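For concreteness, here is roughly how the two loss expressions look in Lasagne; the tiny network below is only an illustrative stand-in, not your actual model:

```python
import lasagne
import theano.tensor as T

input_var = T.matrix('inputs')
target_var = T.ivector('targets')

# A small MLP with dropout, just to illustrate the two expressions.
network = lasagne.layers.InputLayer(shape=(None, 784), input_var=input_var)
network = lasagne.layers.DenseLayer(lasagne.layers.DropoutLayer(network, p=0.2), num_units=800)
network = lasagne.layers.DenseLayer(lasagne.layers.DropoutLayer(network, p=0.5),
                                    num_units=10,
                                    nonlinearity=lasagne.nonlinearities.softmax)

# Training expression: dropout is active, units are randomly dropped.
train_prediction = lasagne.layers.get_output(network)
train_loss = lasagne.objectives.categorical_crossentropy(train_prediction, target_var).mean()

# Validation expression: deterministic=True switches dropout off, so the full
# network is used and the resulting loss is not directly comparable to the
# training loss above.
test_prediction = lasagne.layers.get_output(network, deterministic=True)
test_loss = lasagne.objectives.categorical_crossentropy(test_prediction, target_var).mean()
```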
Note in particular that the very purpose of using dropout is to prevent overfitting, i.e. a training error that is much lower than the validation error. And to that end dropout is doing a good job, since it is very easy these days for even relatively shallow models to overfit the MNIST dataset.
So here's what you can try: after each epoch of training, run val_fn on the training set as well as the validation set. This comes at the cost of an additional full forward pass through the training set, but for MNIST and a simple model like an MLP it isn't going to cost much computation, and it's worth it to get your hands dirty modifying some Lasagne code and to build some general intuition about minibatch + dropout training.
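Here is a rough sketch of that modification. The names train_fn, val_fn and iterate_minibatches follow the Lasagne MNIST tutorial; they are passed in as arguments so the snippet stands on its own, so adjust it to match your actual code:

```python
# Sketch of the suggested experiment: alongside the usual minibatch training
# error, also compute a "clean" training error with fixed weights and dropout off.
def train_with_clean_train_error(num_epochs, X_train, y_train, X_val, y_val,
                                 train_fn, val_fn, iterate_minibatches,
                                 batchsize=500):
    for epoch in range(num_epochs):
        # Usual training pass: weights change after every minibatch and
        # dropout is active, so this "training error" mixes many models.
        train_err, train_batches = 0.0, 0
        for inputs, targets in iterate_minibatches(X_train, y_train, batchsize, shuffle=True):
            train_err += train_fn(inputs, targets)
            train_batches += 1

        # Extra pass: evaluate the *training* set with val_fn, i.e. with the
        # weights at the end of the epoch and dropout switched off, so it is
        # directly comparable to the validation error below.
        clean_err, clean_batches = 0.0, 0
        for inputs, targets in iterate_minibatches(X_train, y_train, batchsize, shuffle=False):
            err, acc = val_fn(inputs, targets)
            clean_err += err
            clean_batches += 1

        # Standard validation pass from the tutorial.
        val_err, val_batches = 0.0, 0
        for inputs, targets in iterate_minibatches(X_val, y_val, batchsize, shuffle=False):
            err, acc = val_fn(inputs, targets)
            val_err += err
            val_batches += 1

        print("epoch %d: minibatch train loss %.4f | deterministic train loss %.4f | val loss %.4f"
              % (epoch + 1, train_err / train_batches,
                 clean_err / clean_batches, val_err / val_batches))
```

If the explanation above is right, the deterministic training loss should come out noticeably lower than the raw minibatch training loss, and lower than (or close to) the validation loss.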
Best Answer
Training error tends to be lower than cross-validation error. Here is an intuitive explanation, ignoring random effects: in cross-validation you divide the training set $T$ into two parts $T_1$ and $T_2$, train on $T_1$ and test on $T_2$. You tune the parameters to minimize the error on $T_2$, but the validation error on $T_2$ still tends to be higher than the training error on $T_1$, $$\mathrm{er}(T_1) < \mathrm{er}(T_2),$$ because you train the model on $T_1$ and so have more opportunity to fit it there. On the other hand, $\mathrm{er}(T_1) \approx \mathrm{er}(T)$, since you train the model with the same tuned parameters on $T_1$ as on $T$. Altogether $$\mathrm{er}(T_2) > \mathrm{er}(T),$$ which is what you also see in the diagram.
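As a quick illustration of this gap (scikit-learn and the digits dataset are chosen here purely for convenience; they are not from the original discussion):

```python
# Training error is typically lower than cross-validation error for the same model.
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

# Accuracy measured on the data the model was fit to: close to 1.0 for a deep tree.
train_acc = model.fit(X, y).score(X, y)

# Accuracy estimated by 5-fold cross-validation: noticeably lower,
# i.e. er(T_2) > er(T_1) in the notation above.
cv_acc = cross_val_score(model, X, y, cv=5).mean()

print("train accuracy: %.3f   cross-val accuracy: %.3f" % (train_acc, cv_acc))
```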