What would be a good validation frequency? Should I check my model on the validation data at the end of each epoch? (My batch size is 1)
There is no golden rule; computing the validation error after each epoch is quite common. Since your validation set is much smaller than your training set, it will not slow down training much.
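A minimal sketch of validating once per epoch. The `train_one_epoch` and `validation_loss` functions here are hypothetical stand-ins for your actual training and evaluation code; in this toy version they just simulate a shrinking loss.

```python
# Toy sketch: evaluate on the validation set once per epoch.
# `train_one_epoch` and `validation_loss` are hypothetical stand-ins.

def train_one_epoch(state):
    state["loss"] *= 0.9          # pretend one epoch of training reduces the loss
    return state

def validation_loss(state):
    return state["loss"]          # cheap to compute: the validation set is small

state = {"loss": 1.0}
val_history = []
for epoch in range(5):
    state = train_one_epoch(state)          # with batch size 1: one pass over all samples
    val_history.append(validation_loss(state))
```

The point is simply that the validation pass sits at the end of the epoch loop; its cost is proportional to the (small) validation set size.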
Is it the case that the first few epochs might yield worse results before the model starts converging to a better value?
Yes.
In that case, should we train our network for several epochs before checking for early stopping?
You could, but then the issue is how many epochs to skip. So in practice, most of the time people do not skip any epochs.
How should I handle the case where the validation loss goes up and down? In that case, early stopping might prevent my model from learning further, right?
People typically define a patience, i.e. the number of epochs to wait after the last improvement on the validation set before stopping early. The patience is often set somewhere between 10 and 100 (10 or 20 is most common), but it really depends on your dataset and network.
Example with patience = 10:
![enter image description here](https://i.stack.imgur.com/No38I.png)
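The patience logic above can be sketched as a small helper. The `val_losses` sequence here is made up for illustration; the stand-in function returns the epoch at which training would stop and the epoch with the best validation loss.

```python
# Sketch of patience-based early stopping (patience = 10, as in the figure).

PATIENCE = 10

def early_stop_epoch(val_losses, patience=PATIENCE):
    """Return (stop_epoch, best_epoch) for a sequence of per-epoch val losses."""
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch   # progress: reset the wait
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch              # no progress for `patience` epochs
    return len(val_losses) - 1, best_epoch        # patience never exhausted

# Made-up losses: they dip to a minimum at epoch 3, then keep rising,
# so stopping fires 10 epochs after the minimum.
losses = [1.0, 0.8, 0.6, 0.5] + [0.5 + 0.01 * i for i in range(1, 15)]
stop, best = early_stop_epoch(losses)
```

Note that the model you keep is the one from `best_epoch`, not the one from `stop_epoch`, so in practice you also save a snapshot whenever the validation loss improves.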
The bottom line is:
As soon as you use a portion of the data to choose which model performs better, you are already biasing your model towards that data.<sup>1</sup>
**Machine learning in general**
In general machine learning scenarios, you would use cross-validation to find the optimal combination of your hyperparameters, then fix them and train on the whole training set. In the end, you would evaluate on the test set only once, to get a realistic estimate of the performance on new, unseen data.
If you then trained a different model and selected whichever of the two performs better on the test set, you would already be using the test set as part of your model-selection loop, so you would need yet another, independent test set to estimate the true test performance.
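The cross-validation splitting can be sketched in plain Python (in practice you would likely reach for something like scikit-learn's `KFold` or `GridSearchCV` instead). This toy generator yields k train/validation index pairs so that every sample lands in the validation fold exactly once.

```python
# Toy k-fold index splitter for hyperparameter search.

def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs covering the data in k folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))                       # one fold held out
        train_idx = [i for i in range(n_samples) if not start <= i < start + size]
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(10, 5))
```

For each hyperparameter candidate you would average the validation score over the k folds, pick the best candidate, and then retrain on all the data.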
**Neural networks**
Neural networks are a bit special in that their training is usually very long, so cross-validation is not used very often (if training takes one day, 10-fold cross-validation already takes over a week on a single machine). Moreover, one of the important hyperparameters is the number of training epochs. The optimal training length varies with different initializations and different training sets, so fixing the number of epochs to a single value and then training on all the training data (training + validation) for that fixed number is not a very reliable approach.
Instead, as you mentioned, some form of early stopping is used: the model is potentially trained for a long time, "snapshots" are saved periodically, and eventually the snapshot with the best performance on some validation set is picked. To enable this, you always have to keep a portion of the data aside for validation.<sup>2</sup> Therefore, you will never train the neural net on all of the samples.
Finally, there are plenty of other hyperparameters, such as the learning rate, weight decay, and dropout ratios, but also the network architecture itself (depth, number of units, size of conv. kernels, etc.). You could potentially tune these with the same validation set you use for early stopping, but then you are overfitting to that set twice, so it gives you a biased estimate. Ideally, you would use yet another, separate validation set. Once you have fixed all the remaining hyperparameters, you can merge this second validation set into your final training set.
To wrap it up:
1. Split all your data into *training* + *validation 1* + *validation 2* + *testing*.
2. Train the network on *training*, using *validation 1* for early stopping.
3. Evaluate on *validation 2*, change the hyperparameters, and repeat step 2.
4. Select the best hyperparameter combination from step 3, then train the network on *training* + *validation 2*, using *validation 1* for early stopping.
5. Evaluate on *testing*. This is your final (real) model performance.
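The four-way split in step 1 can be sketched as follows; the 10% fractions for each held-out set are illustrative, not a recommendation from the answer above.

```python
# Sketch of the four-way split (training + validation 1 + validation 2 + testing).
# The 10%/10%/10% fractions are just an example.
import random

def four_way_split(indices, seed=0):
    idx = list(indices)
    random.Random(seed).shuffle(idx)               # shuffle before splitting
    n = len(idx)
    n_held = n // 10                               # 10% each for test, val1, val2
    test  = idx[:n_held]
    val1  = idx[n_held:2 * n_held]
    val2  = idx[2 * n_held:3 * n_held]
    train = idx[3 * n_held:]                       # the remaining 70%
    return train, val1, val2, test

train, val1, val2, test = four_way_split(range(100))
```

For step 4 you would simply concatenate `train + val2` and rerun training with `val1` still reserved for early stopping.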
<sup>1</sup> This is exactly the reason why Kaggle challenges have two test sets: a public one and a private one. You can use the public test set to check the performance of your model, but in the end it is the performance on the private test set that matters, and if you overfit to the public test set, you lose.
<sup>2</sup> Amari et al. (1997), in their article *Asymptotic Statistical Theory of Overtraining and Cross-Validation*, recommend setting the fraction of samples held out for early stopping to $1/\sqrt{2N}$, where $N$ is the size of the training set.
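To put the footnote's formula in concrete numbers (the training-set size of 50,000 is just an example):

```python
# Footnote 2 in numbers: hold out a fraction 1/sqrt(2N) of the N training
# samples for early stopping (per Amari et al., 1997).
import math

def early_stop_fraction(n_train):
    return 1 / math.sqrt(2 * n_train)

n = 50_000
frac = early_stop_fraction(n)     # a small fraction, about 0.3%
held_out = round(frac * n)        # roughly 158 samples for this n
```

Note how small this recommended hold-out is for large datasets; many practitioners use a larger validation set anyway, for less noisy early-stopping decisions.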
**Best Answer**
This is likely due to the ordering of your dataset. If there are many observations of the same class in a row, the weights of the network will move too far in the direction of classifying that class.
A common cause is balancing the classes in your dataset by resampling observations and appending them to the end of the dataset. Shuffle your dataset; that should help you avoid the fluctuations in accuracy (and perhaps achieve higher accuracy overall).
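A minimal sketch of the fix: shuffle samples and labels together so that runs of the same class are broken up (with a framework like PyTorch you would typically just pass `shuffle=True` to the `DataLoader` instead).

```python
# Sketch: shuffle (sample, label) pairs together so same-class runs are broken up.
import random

def shuffled(samples, labels, seed=0):
    pairs = list(zip(samples, labels))            # keep each sample with its label
    random.Random(seed).shuffle(pairs)
    xs, ys = zip(*pairs)
    return list(xs), list(ys)

# Toy data: three samples of class "a" followed by three of class "b".
xs, ys = shuffled(list(range(6)), ["a", "a", "a", "b", "b", "b"])
```

Shuffling should happen once per epoch, not just once at the start, so that each epoch sees the classes in a different order.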