"To produce the estimates on the test set do I simply average the weights and biases from each of the 10 different calibrated models and use this parametrization to produce outputs to compare with my test set for the target function?"
No. Cross-validation is a procedure for estimating the test performance of a method for producing a model, rather than of the model itself. So the best thing to do is to perform k-fold cross-validation to determine the best hyper-parameter settings, e.g. the number of hidden units, the values of the regularisation parameters, etc. Then train a single network on the whole calibration set (or train several and pick the one with the best value of the regularised training criterion, to guard against local minima). Evaluate the performance of that model using the test set. A sketch of this workflow follows below.
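For concreteness, here is a minimal sketch of that workflow in Python, using scikit-learn's `KFold` and `MLPRegressor`. The variable names (`X_cal`, `y_cal`, `X_test`, `y_test`) and the hyper-parameter grid are my own illustrative assumptions, not part of the answer above:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neural_network import MLPRegressor

def cv_score(n_hidden, alpha, X, y, n_splits=10):
    """Average validation MSE of one hyper-parameter setting."""
    errors = []
    for train_idx, val_idx in KFold(n_splits=n_splits, shuffle=True,
                                    random_state=0).split(X):
        model = MLPRegressor(hidden_layer_sizes=(n_hidden,), alpha=alpha,
                             max_iter=2000, random_state=0)
        model.fit(X[train_idx], y[train_idx])
        errors.append(np.mean((model.predict(X[val_idx]) - y[val_idx]) ** 2))
    return np.mean(errors)

# X_cal, y_cal, X_test, y_test: your calibration and test arrays (assumed given).
# Score each candidate setting with 10-fold CV on the calibration set ...
grid = [(h, a) for h in (5, 10, 20) for a in (1e-4, 1e-2)]
best_h, best_alpha = min(grid, key=lambda p: cv_score(*p, X_cal, y_cal))

# ... then retrain ONCE on the whole calibration set with the chosen setting.
final_model = MLPRegressor(hidden_layer_sizes=(best_h,), alpha=best_alpha,
                           max_iter=2000, random_state=0).fit(X_cal, y_cal)
test_mse = np.mean((final_model.predict(X_test) - y_test) ** 2)
```

The important point is that the fold-models are only used to score hyper-parameter settings; the model you actually evaluate on the test set is trained once, from scratch, on all the calibration data.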
In the case of neural networks, averaging the weights and biases of individual models won't work, as different models will settle on different internal representations, so the corresponding hidden units of different networks will represent different (distributed) concepts. If you average their weights, the mean of those concepts will be meaningless.
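A quick way to see this is the permutation symmetry of the hidden layer: relabelling the hidden units of a one-hidden-layer network leaves its function unchanged, yet averaging the original and permuted weights gives a different function. A small numpy demonstration (my own illustration, not from the answer above):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)   # input -> hidden
w2, b2 = rng.normal(size=4), rng.normal()              # hidden -> output

def net(x, W1, b1, w2, b2):
    return np.tanh(W1 @ x + b1) @ w2 + b2

perm = [2, 0, 3, 1]                        # relabel the hidden units
W1p, b1p, w2p = W1[perm], b1[perm], w2[perm]

x = rng.normal(size=3)
print(net(x, W1, b1, w2, b2))              # original network
print(net(x, W1p, b1p, w2p, b2))           # identical output
print(net(x, (W1 + W1p) / 2, (b1 + b1p) / 2,
          (w2 + w2p) / 2, b2))             # averaged weights: different output
```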
"Determining the number of epochs by, e.g., averaging the number of epochs across the folds and using it for the test run later on?"
Shortest possible answer: Yes!
But let me add some context...
I believe you are referring to Section 7.8, pages 246ff, on early stopping in the Deep Learning book. The procedure described there, however, is significantly different from yours. Goodfellow et al. suggest splitting your data into three sets first: a training, a dev, and a test set. Then you train (on the training set) until the model's error (on the dev set) starts to increase, at which point you stop. Finally, you take the trained model that had the lowest dev set error and evaluate it on the test set. No cross-validation is involved at all.
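For concreteness, a minimal sketch of that procedure, assuming pre-split arrays `X_train`/`y_train`, `X_dev`/`y_dev`, `X_test`/`y_test` and a patience window of 10 epochs (both my own assumptions):

```python
import copy
import numpy as np
from sklearn.neural_network import MLPRegressor

model = MLPRegressor(hidden_layer_sizes=(10,), random_state=0)
best_err, best_model, since_best = np.inf, None, 0
for epoch in range(500):                        # hard cap on epochs
    model.partial_fit(X_train, y_train)         # one pass over the training data
    dev_err = np.mean((model.predict(X_dev) - y_dev) ** 2)
    if dev_err < best_err:                      # new best: snapshot the model
        best_err, best_model, since_best = dev_err, copy.deepcopy(model), 0
    else:
        since_best += 1
        if since_best >= 10:                    # patience exhausted: stop early
            break

# Only the snapshot with the lowest dev error ever touches the test set.
test_mse = np.mean((best_model.predict(X_test) - y_test) ** 2)
```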
However, you seem to be trying to do early stopping (ES), cross-validation (CV), and model evaluation all on the same set. That is, you seem to be using all your data for CV, training on each split with ES, and then using the average performance over those CV splits as your final evaluation result. If that is the case, that is indeed stark over-fitting (and certainly not what Goodfellow et al. describe), and your approach gives you exactly the opposite of what ES is meant for, namely a regularization technique to prevent over-fitting. If it is not clear why: because you have "peeked" at your final evaluation instances during training time to figure out when to stop ("early") training. That is, you are optimizing against the evaluation instances during training, which is (over-)fitting your model on that evaluation data, by definition.
So by now, I hope to have answered your other [two] questions.
The answer by the higgs broson (to your last question, as cited above) already gives a meaningful way to combine CV and ES that will save you some training time: you could split your full data into only two sets, a dev and a test set, and use the dev set for CV while applying ES on each split. That is, you train on each split of your dev set and stop once the error on the instances you set aside for evaluating that split has reached its lowest point [1]. Then you average the number of epochs needed to reach that lowest error across the splits and train on the full dev set for that (averaged) number of epochs. Finally, you validate the outcome on the test set you set aside and haven't touched yet.
[1] Though unlike the higgs broson, I would recommend evaluating after every epoch, for two reasons: (1) compared to training, the evaluation time will be negligible; (2) imagine your minimum error occurs at epoch 51, but you only evaluate at epochs 50 and 60. It isn't unlikely that the error at epoch 60 will be lower than at epoch 50; yet you would then choose 60 as your epoch parameter, which is clearly sub-optimal and in fact runs somewhat against the purpose of using ES in the first place.
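Putting that recipe together, a rough sketch (the variable names `X_dev`, `y_dev`, `X_test`, `y_test` and the architecture are my own assumptions), evaluating after every epoch as recommended in [1]:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neural_network import MLPRegressor

MAX_EPOCHS = 500
best_epochs = []
# X_dev/y_dev, X_test/y_test: assumed dev and test arrays.
for train_idx, val_idx in KFold(n_splits=10, shuffle=True,
                                random_state=0).split(X_dev):
    model = MLPRegressor(hidden_layer_sizes=(10,), random_state=0)
    val_errors = []
    for _ in range(MAX_EPOCHS):                 # one partial_fit = one epoch
        model.partial_fit(X_dev[train_idx], y_dev[train_idx])
        pred = model.predict(X_dev[val_idx])
        val_errors.append(np.mean((pred - y_dev[val_idx]) ** 2))
    best_epochs.append(np.argmin(val_errors) + 1)   # epoch of the lowest error

# Retrain on the FULL dev set for the averaged number of epochs ...
n_epochs = int(round(np.mean(best_epochs)))
final = MLPRegressor(hidden_layer_sizes=(10,), random_state=0)
for _ in range(n_epochs):
    final.partial_fit(X_dev, y_dev)

# ... and only now touch the held-out test set.
test_mse = np.mean((final.predict(X_test) - y_test) ** 2)
```

In practice you would break each fold's loop once the error has stopped improving for a while rather than always running `MAX_EPOCHS` epochs, but the full loop keeps the sketch short.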
Best Answer
Your pseudocode looks right to me. It's also possible to use cross-validation at the top level as well, giving you a double loop: in the outer loop you create train/test sets from the folds, and in the inner loop you further split each training set into train/validate portions. I would only recommend this if your dataset is very small. It reduces the variance in the estimate of the final best model's performance, at the cost of roughly 10x the running time.
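A minimal sketch of that double loop (the hyper-parameter candidates and the model class are illustrative assumptions on my part):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.neural_network import MLPRegressor

# X, y: the full dataset (assumed given).
outer_scores = []
for trainval_idx, test_idx in KFold(n_splits=10, shuffle=True,
                                    random_state=0).split(X):
    X_tv, y_tv = X[trainval_idx], y[trainval_idx]

    def inner_score(n_hidden):
        """Inner loop: 10-fold CV score of one setting on the train/validate data."""
        errs = []
        for tr, va in KFold(n_splits=10, shuffle=True,
                            random_state=1).split(X_tv):
            m = MLPRegressor(hidden_layer_sizes=(n_hidden,),
                             max_iter=2000, random_state=0)
            m.fit(X_tv[tr], y_tv[tr])
            errs.append(np.mean((m.predict(X_tv[va]) - y_tv[va]) ** 2))
        return np.mean(errs)

    best = min((5, 10, 20), key=inner_score)    # pick hyper-parameters inside the fold
    model = MLPRegressor(hidden_layer_sizes=(best,),
                         max_iter=2000, random_state=0).fit(X_tv, y_tv)
    outer_scores.append(np.mean((model.predict(X[test_idx]) - y[test_idx]) ** 2))

print(np.mean(outer_scores))    # estimate of the whole model-building procedure
```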
Some CV implementations use heuristics to choose the model. Instead of taking the one with the lowest validation loss, they take the one with the lowest complexity (in some sense, such as the number of hidden nodes) that is within one or two standard errors of the best.
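That heuristic, often called the "one-standard-error rule", could be sketched like this (my own illustration):

```python
import numpy as np

def one_se_choice(complexities, fold_errors):
    """fold_errors[i][j]: validation error of complexities[i] on fold j.
    complexities are assumed sorted from simplest to most complex."""
    means = np.array([np.mean(e) for e in fold_errors])
    ses = np.array([np.std(e, ddof=1) / np.sqrt(len(e)) for e in fold_errors])
    best = np.argmin(means)
    threshold = means[best] + ses[best]
    # Return the simplest model whose mean error is within 1 SE of the best.
    for c, m in zip(complexities, means):
        if m <= threshold:
            return c
    return complexities[best]

# e.g. hidden-unit counts 5, 10, 20 with their 10-fold error lists:
# one_se_choice([5, 10, 20], [errs_5, errs_10, errs_20])
```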