Think of accuracy on the validation set as an estimate of accuracy on future data, given the value of some hyperparameter. In this case, the hyperparameter of interest is the number of training epochs. So: for each CV fold, train the network (e.g. up to some maximum number of epochs). After each epoch, record accuracy on the validation set. Compute the average validation set accuracy (across CV folds) for each number of epochs. Choose the number of epochs that maximizes this value.
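To make the averaging step concrete, here is a minimal sketch. It assumes you have already recorded one validation accuracy per epoch for every fold; the function name `best_epoch_from_folds` and the numbers are made up for illustration.

```python
from statistics import mean

def best_epoch_from_folds(val_acc_per_fold):
    """val_acc_per_fold[f][e] is the validation accuracy of fold f after epoch e + 1."""
    n_epochs = len(val_acc_per_fold[0])
    # Average the validation accuracy across folds, separately for each epoch count.
    mean_acc = [mean(fold[e] for fold in val_acc_per_fold) for e in range(n_epochs)]
    # Pick the epoch count with the highest average (ties go to the earlier epoch).
    best_index = max(range(n_epochs), key=lambda e: mean_acc[e])
    return best_index + 1, mean_acc

# Hypothetical per-epoch validation accuracies for 3 folds and 4 epochs.
val_acc_per_fold = [
    [0.60, 0.70, 0.75, 0.74],
    [0.62, 0.72, 0.71, 0.70],
    [0.58, 0.69, 0.76, 0.73],
]
best_epoch, mean_acc = best_epoch_from_folds(val_acc_per_fold)
print(best_epoch, mean_acc)
```

Note that fold 2 peaks at epoch 2 on its own, but the averaged curve peaks at epoch 3, which is the value this approach would choose.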
Determining the number of epochs by, e.g., averaging the number of epochs over the folds and using it for the test run later on?
Shortest possible answer: Yes!
But let me add some context...
I believe you are referring to Section 7.8, pages 246ff, on Early Stopping in the Deep Learning book. The procedure described there, however, is significantly different from yours. Goodfellow et al. suggest splitting your data into three sets first: a training, a dev, and a test set. Then, you train (on the training set) until the error from that model increases (on the dev set), at which point you stop. Finally, you use the trained model that had the lowest dev set error and evaluate it on the test set. No cross-validation involved at all.
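As a rough sketch of that stopping rule (not the book's exact algorithm): the function below takes a sequence of dev set errors, one per epoch, standing in for a real training loop. The `patience` parameter, i.e. how many non-improving epochs to tolerate before stopping, is an assumption added here; `patience=1` corresponds to stopping as soon as the dev error goes up.

```python
def early_stopping(dev_errors, patience=1):
    """Return (best_epoch, best_error, last_epoch_run) for a stream of dev errors."""
    best_error, best_epoch = float("inf"), 0
    bad_epochs, last_epoch = 0, 0
    for last_epoch, error in enumerate(dev_errors, start=1):
        if error < best_error:
            # New best model: in real training you would checkpoint the weights here.
            best_error, best_epoch, bad_epochs = error, last_epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break  # dev error stopped improving, so stop training
    return best_epoch, best_error, last_epoch

# Hypothetical dev errors after epochs 1..5.
dev_errors = [0.50, 0.40, 0.35, 0.37, 0.33]
print(early_stopping(dev_errors, patience=1))
print(early_stopping(dev_errors, patience=2))
```

With `patience=1` training stops at epoch 4 and keeps the epoch 3 model; with `patience=2` it survives the bump and finds the lower error at epoch 5. Only the model kept at the end is then evaluated on the test set.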
However, you seem to be trying to do early stopping (ES), cross-validation (CV), and model evaluation all on the same set. That is, you seem to be using all your data for CV, training on each split with ES, and then using the average performance over those CV splits as your final evaluation result. If that is the case, that is indeed stark over-fitting (and certainly not what is described by Goodfellow et al.), and your approach gives you exactly the opposite of what ES is meant for, namely a regularization technique to prevent over-fitting. If it is not clear why: because you have "peeked" at your final evaluation instances during training time to figure out when to stop training ("early"). That is, you are optimizing against the evaluation instances during training, which is, by definition, over-fitting your model to that evaluation data.
So by now, I hope to have answered your other [two] questions.
The answer by the higgs broson (to your last question, as cited above) already gives a meaningful way to combine CV and ES to save you some training time: you could split your full data into two sets only, a dev set and a test set, and use the dev set to do CV while applying ES on each split. That is, you train on each split of your dev set and stop once the error on the instances you held out for evaluating that split has reached its minimum [1]. Then you average, over the splits, the number of epochs needed to reach that lowest error, and train on the full dev set for that (averaged) number of epochs. Finally, you validate that outcome on the test set you set aside and haven't touched yet.
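A minimal sketch of the "average the stopping epochs" step, assuming you have one held-out error curve per CV split of the dev set. The curves and the helper name `best_epoch` are invented for illustration, and rounding the average to the nearest integer is my own choice.

```python
def best_epoch(errors):
    """1-based epoch at which the held-out error of one split is lowest."""
    return min(range(len(errors)), key=lambda i: errors[i]) + 1

# Hypothetical held-out error curves, one per CV split of the dev set.
# Splits may stop after different numbers of epochs because of ES.
fold_errors = [
    [0.50, 0.40, 0.30, 0.35],
    [0.60, 0.45, 0.38, 0.36, 0.40],
    [0.55, 0.42, 0.33, 0.34],
]

best_epochs = [best_epoch(errors) for errors in fold_errors]
# Average over splits; this is the epoch budget for the final run on the full dev set.
n_epochs_final = round(sum(best_epochs) / len(best_epochs))
print(best_epochs, n_epochs_final)
```

The final model is then trained on the full dev set for `n_epochs_final` epochs and evaluated once on the untouched test set.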
[1] Though unlike the higgs broson, I would recommend evaluating after every epoch. Two reasons for that: (1) compared to training, the evaluation time will be negligible. (2) Imagine your minimum error is at epoch 51, but you evaluate only at epochs 50 and 60. It isn't unlikely that the error at epoch 60 will be lower than at epoch 50; you would then choose 60 as your epoch parameter, which clearly is sub-optimal and in fact even goes a bit against the purpose of using ES in the first place.
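To illustrate reason (2) with numbers: the error curve below is entirely hypothetical, built so that the error drops quickly until epoch 51 and then creeps up slowly, which is a common shape in practice.

```python
def dev_error(epoch):
    """Hypothetical held-out error: fast decrease until epoch 51, slow increase after."""
    if epoch <= 51:
        return 0.2 + 0.05 * (51 - epoch)
    return 0.2 + 0.001 * (epoch - 51)

def best_checked_epoch(every, max_epochs=100):
    """Best epoch among those actually evaluated (every `every`-th epoch)."""
    checked = range(every, max_epochs + 1, every)
    return min(checked, key=dev_error)

print(best_checked_epoch(every=1))   # evaluating after every epoch
print(best_checked_epoch(every=10))  # evaluating only at epochs 10, 20, ..., 100
```

Evaluating every epoch finds the true minimum at 51, while evaluating every tenth epoch settles on 60, i.e. nine extra epochs past the point where ES should have stopped.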
Best Answer
I cannot say I fully understand your pseudocode; however, the usual procedure is the following:
1. Test set: Take a part of your data (if needed, use stratified sampling or a similar technique to ensure this test set is a good representative of your data) and put it aside. Do not use this data for learning or for model selection.
2. Parameter tuning: Use k-fold cross-validation to find the best parameters.
3. Training: Train the model with the parameters selected in step 2 on the whole training dataset (everything except the test set from step 1).
4. Testing: See how your model performs on the testing data put aside in step 1.
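The four steps can be sketched end to end with a toy example. Everything here is an illustrative assumption: the synthetic 1-D regression data, the k-nearest-neighbour "model" (chosen only because it needs no training loop), the candidate values for `k`, and the 80/20 split. Step 3 uses everything except the held-out test set.

```python
import random

def knn_predict(train, x, k):
    """Mean target of the k training points closest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)

def mse(train, held_out, k):
    """Mean squared error of k-NN fitted on `train`, measured on `held_out`."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in held_out) / len(held_out)

rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(100)]
data = [(x, 2.0 * x + rng.gauss(0, 1)) for x in xs]  # synthetic data

# Step 1: put a test set aside and do not touch it until step 4.
rng.shuffle(data)
test, dev = data[:20], data[20:]

# Step 2: k-fold CV on the remaining data to pick the hyperparameter k.
n_folds, candidates = 5, [1, 3, 5, 9]
folds = [list(range(i, len(dev), n_folds)) for i in range(n_folds)]
cv_score = {}
for k in candidates:
    fold_scores = []
    for fold in folds:
        held = set(fold)
        train_part = [dev[i] for i in range(len(dev)) if i not in held]
        val_part = [dev[i] for i in fold]
        fold_scores.append(mse(train_part, val_part, k))
    cv_score[k] = sum(fold_scores) / n_folds
best_k = min(candidates, key=cv_score.get)

# Step 3: "train" on all non-test data (for k-NN this just means storing it).
final_train = dev

# Step 4: a single evaluation on the untouched test set.
test_mse = mse(final_train, test, best_k)
print(best_k, test_mse)
```

The test set enters the script only in the last two lines, which is the property that matters; the model and data are interchangeable.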
I hope this answers both of your questions. But, more specifically and using some fancy-like math notation:
1. Split your training data $D$ into $K$ splits, $\{D_k;\ k\in[1;K]\}$.
2. Repeat $K$ times (kinda `for k in range(K)`):
   a. Train the network using the splits $\bigcup_{j\neq k} D_j$ as training data.
   b. Evaluate the trained network on the remaining part, $D_k$. Do not use this part for evaluation after each epoch. If you want to do early stopping, define yet another validation set within each $\bigcup_{j\neq k} D_j$.
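As an index-level sketch of this splitting scheme: for each $k$, the held-out split $D_k$ is used only for the final evaluation, and a separate early-stopping validation set is carved out of the other splits. The function names and the 20% `val_fraction` are assumptions for illustration; in practice you would shuffle the indices first.

```python
def k_splits(n, K):
    """Partition indices 0..n-1 into K splits D_0, ..., D_{K-1}."""
    return [list(range(k, n, K)) for k in range(K)]

def cv_plan_with_early_stopping(n, K, val_fraction=0.2):
    """For each k, return (train, es_val, eval_part) as three disjoint index lists."""
    splits = k_splits(n, K)
    plan = []
    for k in range(K):
        eval_part = splits[k]  # D_k: used once, after training has finished
        rest = [i for j in range(K) if j != k for i in splits[j]]  # union of D_j, j != k
        n_val = max(1, int(len(rest) * val_fraction))
        es_val = rest[:n_val]   # monitored after each epoch for early stopping
        train = rest[n_val:]    # actually used for gradient updates
        plan.append((train, es_val, eval_part))
    return plan

plan = cv_plan_with_early_stopping(n=20, K=4)
for train, es_val, eval_part in plan:
    print(len(train), len(es_val), len(eval_part))
```

Each fold thus has three disjoint parts, so the early-stopping decision never sees the instances used to score that fold.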
I think your second code sample reflects this.
Regarding your second question, always keep a separate test set.
Also, this is a very common topic on this site. See some related answers: