I am not completely clear on what the question is asking, but I think the answer is no. The thing you need to think hard about with cross-validation is that no part of your algorithm can have any access to the test set. If it does, your cross-validation results will be tainted and will not be an accurate measure of the 'true' error.
From your question, I assume you are using some kind of iterative learning algorithm, such as GBM, and that you are using the validation set to determine when your GBM has enough models in its ensemble and has started to overfit. If that is the case, then what you are doing is not optimal.
The way to think of this is that the stopping criterion is part of your learning algorithm. If it is part of the algorithm, then it cannot use the test set in any way.
You may need to do nested cross-validation. In your outer loop, you divide the data into test and training sets; in your inner loop, you further divide the training set into sub-training and sub-validation sets and proceed as you have been. The inner-loop cross-validation can be used to learn, from the training data alone, when to stop. To get an accurate generalization error, you then apply that stopping point to the outer-loop test set, which the inner loop has never touched. To be clear: say the inner-loop cross-validation found that the best number of iterations was 10. In your outer loop, you learn a model on the full outer training set, iterating 10 times, then see how that performs on the test set.
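To make that concrete, here is a minimal sketch of the nested loop, assuming a gradient-boosted classifier where the only quantity being tuned is the number of boosting iterations. The dataset, fold counts, and the 200-iteration cap are illustrative placeholders, not part of the original question:

```python
# Nested CV sketch: the inner loop picks the stopping point, the outer
# loop measures generalization on folds the inner loop never saw.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=500, random_state=0)

outer = KFold(n_splits=5, shuffle=True, random_state=0)
outer_scores = []

for train_idx, test_idx in outer.split(X):
    X_tr, y_tr = X[train_idx], y[train_idx]
    X_te, y_te = X[test_idx], y[test_idx]

    # Inner loop: use only the outer training set to choose the number
    # of iterations (the "stopping point").
    inner = KFold(n_splits=5, shuffle=True, random_state=1)
    best_iters = []
    for sub_tr, sub_val in inner.split(X_tr):
        gbm = GradientBoostingClassifier(n_estimators=200)
        gbm.fit(X_tr[sub_tr], y_tr[sub_tr])
        # Score the staged predictions on the inner validation fold and
        # keep the iteration with the lowest error.
        val_errors = [
            np.mean(pred != y_tr[sub_val])
            for pred in gbm.staged_predict(X_tr[sub_val])
        ]
        best_iters.append(int(np.argmin(val_errors)) + 1)

    n_iter = int(np.mean(best_iters))  # e.g. "10" in the example above

    # Outer loop: retrain on the full outer training set for that many
    # iterations, then score once on the untouched outer test fold.
    final = GradientBoostingClassifier(n_estimators=n_iter)
    final.fit(X_tr, y_tr)
    outer_scores.append(final.score(X_te, y_te))

print("Estimated generalization accuracy:", np.mean(outer_scores))
```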
Does this make sense?
Note that depending on the models in use and the dataset, this may or may not be a big issue. The downside is that nested cross-validation can be very computationally expensive, so doing things the way you have been may well be an appropriate trade-off between accuracy and computation time in your circumstances. The strictest answer to your question is no, it is not completely valid cross-validation. Whether it is passable for your circumstances is a different question.
Determining the number of epochs by, e.g., averaging the number of epochs over the folds and using it for the test run later on?
Shortest possible answer: Yes!
But let me add some context...
I believe you are referring to Section 7.8 (pages 246ff.) on early stopping in the Deep Learning book. The procedure described there, however, is significantly different from yours. Goodfellow et al. suggest splitting your data into three sets first: a training, a dev, and a test set. You then train on the training set until the model's error on the dev set starts to increase, at which point you stop. Finally, you take the trained model that had the lowest dev-set error and evaluate it on the test set. No cross-validation is involved at all.
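In code, that procedure looks roughly like the sketch below. The `train_one_epoch` and `evaluate` arguments are hypothetical stand-ins for whatever training loop you use, and the patience window is my own addition; the book's description is simply to stop once the dev error starts to increase:

```python
# Early-stopping sketch: train on the training set, monitor the dev set,
# and keep the model snapshot with the lowest dev error.
import copy

def fit_with_early_stopping(model, train_set, dev_set,
                            train_one_epoch, evaluate,
                            max_epochs=200, patience=10):
    best_error = float("inf")
    best_model = copy.deepcopy(model)
    epochs_since_best = 0
    for _ in range(max_epochs):
        train_one_epoch(model, train_set)      # one pass over the training set
        dev_error = evaluate(model, dev_set)   # error on the dev set
        if dev_error < best_error:
            best_error = dev_error
            best_model = copy.deepcopy(model)  # snapshot the best model so far
            epochs_since_best = 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:  # dev error stopped improving
                break
    # The returned (best) model is evaluated on the test set only afterwards.
    return best_model
```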
However, you seem to be trying to do early stopping (ES), cross-validation (CV), and model evaluation all on the same set. That is, you seem to be using all your data for CV, training on each split with ES, and then using the average performance over those CV splits as your final evaluation result. If that is the case, it is indeed stark over-fitting (and certainly not what Goodfellow et al. describe), and your approach gives you exactly the opposite of what ES is meant for, namely a regularization technique to prevent over-fitting. If it is not clear why: you have "peeked" at your final evaluation instances during training to figure out when to ("early") stop. That is, you are optimizing against the evaluation instances during training, which is (over-)fitting your model to that evaluation data, by definition.
So by now, I hope to have answered your other [two] questions.
The answer by the higgs broson (to your last question, as cited above) already gives a meaningful way to combine CV and ES and save you some training time: split your full data into two sets only, a dev and a test set, and use the dev set to do CV while applying ES on each split. That is, you train on each split of your dev set and stop once the error on the instances you set aside for evaluating that split reaches its minimum [1]. Then you average the number of epochs needed to reach that lowest error over the splits and train on the full dev set for that (averaged) number of epochs. Finally, you validate that model on the test set you set aside and haven't touched yet. A sketch of the whole recipe follows the footnote below.
[1] Though unlike the higgs broson, I would recommend evaluating after every epoch, for two reasons: (1) compared to training, the evaluation time will be negligible; (2) imagine your minimum error is at epoch 51, but you only evaluate at epochs 50 and 60. It isn't unlikely that the error at epoch 60 will be lower than at epoch 50, so you would choose 60 as your epoch parameter, which is clearly sub-optimal and in fact works somewhat against the purpose of using ES in the first place.
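Here is a minimal sketch of that recipe, assuming an epoch-based learner (sklearn's `SGDClassifier` trained via `partial_fit` serves as a stand-in). The dataset, the 80/20 split, the fold count, and the epoch cap are all illustrative choices, not part of the original answer:

```python
# CV + ES sketch: per-fold early stopping on the dev set, average the
# per-fold stopping epochs, retrain on the full dev set, test once.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import KFold, train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

MAX_EPOCHS = 100
classes = np.unique(y_dev)
best_epochs = []

for tr_idx, val_idx in KFold(n_splits=5, shuffle=True,
                             random_state=1).split(X_dev):
    clf = SGDClassifier(random_state=0)
    errors = []
    for epoch in range(MAX_EPOCHS):
        clf.partial_fit(X_dev[tr_idx], y_dev[tr_idx], classes=classes)
        # Evaluate after *every* epoch, per footnote [1].
        errors.append(1.0 - clf.score(X_dev[val_idx], y_dev[val_idx]))
    best_epochs.append(int(np.argmin(errors)) + 1)

# Average the per-fold stopping points and retrain on the full dev set...
n_epochs = int(round(np.mean(best_epochs)))
final = SGDClassifier(random_state=0)
for _ in range(n_epochs):
    final.partial_fit(X_dev, y_dev, classes=classes)

# ...and only now touch the held-out test set, exactly once.
print("Test accuracy:", final.score(X_test, y_test))
```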
Something similar has already been discussed in this question:
To summarize the result: you should probably keep a few samples aside and use them as a validation set. The benefit of knowing whether your model is still improving and not yet overfitting will probably outweigh the benefit of having a few more samples for training.
Also, don't forget that if you change the size of the training set, an "epoch count" stops being a meaningful quantity (see the above thread).
Alternatively, see also OAA mentioned in this answer.