Solved – Training performance jumps up after epoch, Dev performance jumps down

machine learning, neural networks

I train a neural network on some data using minibatches. I evaluate it as follows: for the training set, I check the performance only on the current minibatch. For the Dev (or test) set, I check the performance on the entire set.

I get two phenomena (image attached, each point is the evaluation after a minibatch finishes, red arrows denote start of new epochs):

1. For the training error, at the start of each new epoch I get a serious 'bump' up in performance (i.e. overfitting). I assume this is because of my incomplete evaluation: after each minibatch I evaluate only the current minibatch, not the entire training set.
2. For the Dev error, which is my real concern, I get some deterioration immediately at the start of the second epoch (and the phenomenon repeats itself in later epochs). While I know that the Dev performance should indeed diminish at some point, I don't understand why it happens exactly at the start of an epoch.

Is this reasonable (especially item 2)? Or do I have a bug in my code?

Thanks.

Best Answer

Yes, this reduction is normal

When training any kind of machine learning algorithm, if you continue training long enough, your algorithm overfits the training set and starts learning the details of the noise in the training set instead of using the generalizable information. When it does this, your algorithm often loses its generality and gets worse on other similar sets, specifically your dev set.

I have seen jumps like this myself when fitting an algorithm to my training set. It finds a pattern in the noise of the training set, but this pattern does not generalize to the test set.

There are a few ways to reduce overfitting:

Cross-validation - Hold out a subset of your training data (most use ~10%) and then use that to compare your algorithm and find a good stopping point. This prevents you from potentially overfitting your dev set as well.
Regularization - Add a penalty to your loss function on the size of the network's weights (for example an L1 or L2 penalty), so that it focuses on the most important connections.
Dropout - Randomly drop various nodes and their connections as you train. This forces the NN to reduce reliance on particular connection webs and makes each node more robust by itself, since it cannot rely on other nodes that might be dropped at any training step.
Early stopping - Stop training at the point when the dev error starts increasing over several successive evaluations. That might be the point at which the algorithm has learned what it can and it is no longer beneficial to train with the training data that you have.
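
The stopping rule in the last item above can be sketched in a few lines. This is only a minimal illustration, not a definitive implementation: the function name `early_stopping_epoch`, the `patience` parameter and the list of dev errors are all made up for the example, and in real training you would compute the dev error after each evaluation and save the model weights whenever it improves.

```python
def early_stopping_epoch(dev_errors, patience=3):
    """Scan dev errors in order and stop once the error has failed to
    improve for `patience` consecutive evaluations.

    Returns (index of the best evaluation seen, its dev error).
    """
    best_error = float("inf")
    best_index = -1
    bad_evals = 0  # consecutive evaluations without improvement

    for i, err in enumerate(dev_errors):
        if err < best_error:
            # New best: remember it (this is where you would save weights).
            best_error = err
            best_index = i
            bad_evals = 0
        else:
            bad_evals += 1
            if bad_evals >= patience:
                break  # dev error has stopped improving; stop training

    return best_index, best_error


# Hypothetical dev errors, one per evaluation (illustrative numbers only).
dev_errors = [0.40, 0.31, 0.27, 0.25, 0.26, 0.28, 0.29, 0.24]
best_index, best_error = early_stopping_epoch(dev_errors, patience=3)
print(best_index, best_error)
```

Note the trade-off in choosing `patience`: with a small value, a temporary bump in the dev error (such as the one at the start of an epoch) is enough to stop training, while a larger value tolerates the bump at the cost of training longer.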

I hope this helps!
