What would be a good validation frequency? Should I check my model on the validation data at the end of each epoch? (My batch size is 1)
There is no golden rule; computing the validation error after each epoch is quite common. Since your validation set is much smaller than your training set, it will not slow down training much.
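A minimal sketch of validating once per epoch. The `train_one_epoch` and `validation_loss` functions here are hypothetical stand-ins for your actual training and evaluation code; in this toy version they just simulate a shrinking loss.

```python
# Toy sketch: evaluate on the validation set once per epoch.
# `train_one_epoch` and `validation_loss` are hypothetical stand-ins.

def train_one_epoch(state):
    state["loss"] *= 0.9          # pretend one epoch of training reduces the loss
    return state

def validation_loss(state):
    return state["loss"]          # cheap to compute: the validation set is small

state = {"loss": 1.0}
val_history = []
for epoch in range(5):
    state = train_one_epoch(state)          # with batch size 1: one pass over all samples
    val_history.append(validation_loss(state))
```

The point is simply that the validation pass sits at the end of the epoch loop; its cost is proportional to the (small) validation set size.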
Is it the case that the first few epochs might yield worse results before the model starts converging to a better value?
Yes.
In that case, should we train our network for several epochs before checking for early stopping?
You could, but then the issue is how many epochs to skip. So in practice, most of the time people do not skip any epochs.
How should I handle the case where the validation loss goes up and down? In that case, early stopping might prevent my model from learning further, right?
People typically define a patience, i.e. the number of epochs to wait after the last improvement on the validation set before stopping early. The patience is often set somewhere between 10 and 100 (10 or 20 is most common), but it really depends on your dataset and network.
Example with patience = 10:
![enter image description here](https://i.stack.imgur.com/No38I.png)
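The patience logic above can be sketched as a small helper. The `val_losses` sequence here is made up for illustration; the stand-in function returns the epoch at which training would stop and the epoch with the best validation loss.

```python
# Sketch of patience-based early stopping (patience = 10, as in the figure).

PATIENCE = 10

def early_stop_epoch(val_losses, patience=PATIENCE):
    """Return (stop_epoch, best_epoch) for a sequence of per-epoch val losses."""
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch   # progress: reset the wait
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch              # no progress for `patience` epochs
    return len(val_losses) - 1, best_epoch        # patience never exhausted

# Made-up losses: they dip to a minimum at epoch 3, then keep rising,
# so stopping fires 10 epochs after the minimum.
losses = [1.0, 0.8, 0.6, 0.5] + [0.5 + 0.01 * i for i in range(1, 15)]
stop, best = early_stop_epoch(losses)
```

Note that the model you keep is the one from `best_epoch`, not the one from `stop_epoch`, so in practice you also save a snapshot whenever the validation loss improves.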
The bottom line is:
As soon as you use a portion of the data to choose which model performs better, you are already biasing your model towards that data.<sup>1</sup>
**Machine learning in general**
In general machine learning scenarios, you would use cross-validation to find the optimal combination of your hyperparameters, then fix them and train on the whole training set. In the end, you would evaluate on the test set only once, to get a realistic estimate of the performance on new, unseen data.
If you then trained a different model and selected whichever of the two performs better on the test set, you would already be using the test set as part of your model-selection loop, so you would need yet another, independent test set to estimate the true test performance.
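The cross-validation splitting can be sketched in plain Python (in practice you would likely reach for something like scikit-learn's `KFold` or `GridSearchCV` instead). This toy generator yields k train/validation index pairs so that every sample lands in the validation fold exactly once.

```python
# Toy k-fold index splitter for hyperparameter search.

def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs covering the data in k folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))                       # one fold held out
        train_idx = [i for i in range(n_samples) if not start <= i < start + size]
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(10, 5))
```

For each hyperparameter candidate you would average the validation score over the k folds, pick the best candidate, and then retrain on all the data.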
**Neural networks**
Neural networks are a bit special in that their training is usually very long, so cross-validation is not used very often (if training takes one day, 10-fold cross-validation already takes over a week on a single machine). Moreover, one of the important hyperparameters is the number of training epochs. The optimal training length varies with different initializations and different training sets, so fixing the number of epochs to a single value and then training on all the training data (training + validation) for that fixed number is not a very reliable approach.
Instead, as you mentioned, some form of early stopping is used: the model is potentially trained for a long time, "snapshots" are saved periodically, and eventually the snapshot with the best performance on some validation set is picked. To enable this, you always have to keep a portion of the data aside for validation.<sup>2</sup> Therefore, you will never train the neural net on all of the samples.
Finally, there are plenty of other hyperparameters, such as the learning rate, weight decay, and dropout ratios, but also the network architecture itself (depth, number of units, size of conv. kernels, etc.). You could potentially tune these with the same validation set you use for early stopping, but then you are overfitting to that set twice, so it gives you a biased estimate. Ideally, you would use yet another, separate validation set. Once you have fixed all the remaining hyperparameters, you can merge this second validation set into your final training set.
To wrap it up:
1. Split all your data into *training* + *validation 1* + *validation 2* + *testing*.
2. Train the network on *training*, using *validation 1* for early stopping.
3. Evaluate on *validation 2*, change the hyperparameters, and repeat step 2.
4. Select the best hyperparameter combination from step 3, then train the network on *training* + *validation 2*, using *validation 1* for early stopping.
5. Evaluate on *testing*. This is your final (real) model performance.
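The four-way split in step 1 can be sketched as follows; the 10% fractions for each held-out set are illustrative, not a recommendation from the answer above.

```python
# Sketch of the four-way split (training + validation 1 + validation 2 + testing).
# The 10%/10%/10% fractions are just an example.
import random

def four_way_split(indices, seed=0):
    idx = list(indices)
    random.Random(seed).shuffle(idx)               # shuffle before splitting
    n = len(idx)
    n_held = n // 10                               # 10% each for test, val1, val2
    test  = idx[:n_held]
    val1  = idx[n_held:2 * n_held]
    val2  = idx[2 * n_held:3 * n_held]
    train = idx[3 * n_held:]                       # the remaining 70%
    return train, val1, val2, test

train, val1, val2, test = four_way_split(range(100))
```

For step 4 you would simply concatenate `train + val2` and rerun training with `val1` still reserved for early stopping.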
<sup>1</sup> This is exactly the reason why Kaggle challenges have two test sets: a public one and a private one. You can use the public test set to check the performance of your model, but in the end it is the performance on the private test set that matters, and if you overfit to the public test set, you lose.
<sup>2</sup> Amari et al. (1997), in their article *Asymptotic Statistical Theory of Overtraining and Cross-Validation*, recommend setting the fraction of samples held out for early stopping to $1/\sqrt{2N}$, where $N$ is the size of the training set.
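To put the footnote's formula in concrete numbers (the training-set size of 50,000 is just an example):

```python
# Footnote 2 in numbers: hold out a fraction 1/sqrt(2N) of the N training
# samples for early stopping (per Amari et al., 1997).
import math

def early_stop_fraction(n_train):
    return 1 / math.sqrt(2 * n_train)

n = 50_000
frac = early_stop_fraction(n)     # a small fraction, about 0.3%
held_out = round(frac * n)        # roughly 158 samples for this n
```

Note how small this recommended hold-out is for large datasets; many practitioners use a larger validation set anyway, for less noisy early-stopping decisions.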
**Best Answer**
This is likely due to the ordering of your dataset. If there are many observations of the same class in a row, the weights of the network will move too far in the direction of classifying that class.
A common cause is balancing the classes in your dataset by resampling observations and appending them to the end of the dataset. Shuffle your dataset; that should help you avoid the fluctuations in accuracy (and perhaps achieve higher accuracy overall).
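A minimal sketch of the fix: shuffle samples and labels together so that runs of the same class are broken up (with a framework like PyTorch you would typically just pass `shuffle=True` to the `DataLoader` instead).

```python
# Sketch: shuffle (sample, label) pairs together so same-class runs are broken up.
import random

def shuffled(samples, labels, seed=0):
    pairs = list(zip(samples, labels))            # keep each sample with its label
    random.Random(seed).shuffle(pairs)
    xs, ys = zip(*pairs)
    return list(xs), list(ys)

# Toy data: three samples of class "a" followed by three of class "b".
xs, ys = shuffled(list(range(6)), ["a", "a", "a", "b", "b", "b"])
```

Shuffling should happen once per epoch, not just once at the start, so that each epoch sees the classes in a different order.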