I suspect this is what Bishop means:
If you think of a neural net as a function that maps inputs to an output, then when you first initialize it with small random weights, it looks a lot like a linear function. The sigmoid activation function is close to linear around zero (just do a Taylor expansion), and small incoming weights guarantee that the effective domain of each hidden unit is just a small interval around zero, so the entire neural net, regardless of how many layers it has, will look very much like a linear function. So you can heuristically describe the neural net as having a small number of degrees of freedom (equal to the dimension of the input). As you train the neural net, the weights can become arbitrarily large, and the neural net can better approximate arbitrary non-linear functions. So as training progresses, you can heuristically describe that change as an increase in the number of degrees of freedom, or, more specifically, an increase in the size of the class of functions that the neural net can closely approximate.
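To make the "looks linear at initialization" claim concrete, here is a minimal NumPy sketch. It is not from Bishop; the layer sizes, the weight scales (0.01 and 5.0) and the helper names `mlp`, `random_weights` and `nonlinearity` are all assumptions chosen for illustration. It builds a small sigmoid network, fits the best affine function to its outputs by least squares, and reports how much of the output variation that affine fit fails to explain.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp(x, weights):
    """Sigmoid hidden layers followed by a linear output layer."""
    h = x
    for W, b in weights[:-1]:
        h = sigmoid(h @ W + b)
    W, b = weights[-1]
    return h @ W + b

def random_weights(scale, sizes, rng):
    """Gaussian weights and biases with the given standard deviation."""
    return [(scale * rng.standard_normal((m, n)), scale * rng.standard_normal(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def nonlinearity(scale, seed=0):
    """Relative RMS residual of the best affine fit to the network output.

    Values near 0 mean the network is almost exactly an affine function
    of its inputs on the sampled region.
    """
    rng = np.random.default_rng(seed)
    weights = random_weights(scale, [3, 16, 16, 1], rng)  # hypothetical sizes
    x = rng.uniform(-1.0, 1.0, size=(2000, 3))
    y = mlp(x, weights)
    A = np.hstack([x, np.ones((len(x), 1))])        # inputs plus intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)    # best affine approximation
    resid = y - A @ coef
    return float(np.sqrt(np.mean(resid ** 2)) / np.std(y))

print("small weights:", nonlinearity(0.01))
print("large weights:", nonlinearity(5.0))
```

The residual for the small-weight network should be far smaller than for the large-weight one, which is the heuristic "few effective degrees of freedom at initialization" picture described above.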
Determining the number of epochs by e.g. averaging the number of epochs for the folds and using it for the test run later on?
Shortest possible answer: Yes!
But let me add some context...
I believe you are referring to Section 7.8, pages 246ff, on Early Stopping in the Deep Learning book. The procedure described there, however, is significantly different from yours. Goodfellow et al. suggest first splitting your data into three sets: a training, a dev, and a test set. Then you train (on the training set) until the error of the model increases (on the dev set), at which point you stop. Finally, you take the trained model that had the lowest dev set error and evaluate it on the test set. No cross-validation is involved at all.
However, you seem to be trying to do early stopping (ES), cross-validation (CV), and model evaluation all on the same set. That is, you seem to be using all your data for CV, training on each split with ES, and then using the average performance over those CV splits as your final evaluation result. If that is the case, it is indeed stark over-fitting (and certainly not what Goodfellow et al. describe), and your approach gives you exactly the opposite of what ES is meant for, namely a regularization technique to prevent over-fitting. If it is not clear why: you have "peeked" at your final evaluation instances during training to figure out when to stop ("early"). That is, you are optimizing against the evaluation instances during training, which is (over-)fitting your model to that evaluation data, by definition.
So by now, I hope to have answered your other [two] questions.
The answer by the higgs broson (to your last question, as cited above) already gives a meaningful way to combine CV and ES to save you some training time: You could split your full data into two sets only - a dev and a test set - and use the dev set to do CV while applying ES on each split. That is, you train on each split of your dev set, and stop once the lowest error on the instances you held out for evaluating that split has been reached [1]. Then you average the number of epochs needed to reach that lowest error over the splits and train on the full dev set for that (averaged) number of epochs. Finally, you validate that outcome on the test set you set aside and haven't touched yet.
[1] Though unlike the higgs broson I would recommend evaluating after every epoch. Two reasons for that: (1) compared to training, the evaluation time will be negligible. (2) Imagine your minimum error is at epoch 51, but you evaluate only at epochs 50 and 60. It isn't unlikely that the error at epoch 60 will be lower than at epoch 50; you would then choose 60 as your epoch parameter, which clearly is sub-optimal and in fact even goes a bit against the purpose of using ES in the first place.
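The combined CV + ES procedure above can be sketched as follows. This is only an illustration under stated assumptions: the model is a plain linear regression trained by full-batch gradient descent on synthetic data, and the helper names (`gd_epoch`, `best_epoch_with_early_stopping`, `cv_epochs`), the learning rate, the patience, and the 80/20 dev/test split are all hypothetical choices, not anything prescribed by the book or the other answer.

```python
import numpy as np

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

def gd_epoch(w, X, y, lr):
    """One full-batch gradient descent step on the squared error."""
    grad = 2.0 * X.T @ (X @ w - y) / len(y)
    return w - lr * grad

def best_epoch_with_early_stopping(X_tr, y_tr, X_val, y_val,
                                   lr=0.05, max_epochs=500, patience=20):
    """Return the epoch with the lowest held-out error, checked every epoch."""
    w = np.zeros(X_tr.shape[1])
    best_err, best_epoch = np.inf, 0
    for epoch in range(1, max_epochs + 1):
        w = gd_epoch(w, X_tr, y_tr, lr)
        err = mse(w, X_val, y_val)          # evaluate after every epoch
        if err < best_err:
            best_err, best_epoch = err, epoch
        elif epoch - best_epoch >= patience:
            break                           # no improvement for `patience` epochs
    return best_epoch

def cv_epochs(X_dev, y_dev, k=5, seed=0):
    """Run ES on each of k folds of the dev set; return the best epoch per fold."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y_dev)), k)
    epochs = []
    for i in range(k):
        val = folds[i]
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        epochs.append(best_epoch_with_early_stopping(
            X_dev[tr], y_dev[tr], X_dev[val], y_dev[val]))
    return epochs

# Synthetic data (assumption): noisy linear target, 80% dev / 20% test.
rng = np.random.default_rng(42)
X = rng.standard_normal((300, 20))
y = X @ rng.standard_normal(20) + rng.standard_normal(300)
X_dev, y_dev, X_test, y_test = X[:240], y[:240], X[240:], y[240:]

fold_epochs = cv_epochs(X_dev, y_dev)
n_epochs = int(round(np.mean(fold_epochs)))   # averaged epoch count

# Retrain on the full dev set for the averaged number of epochs.
w = np.zeros(X_dev.shape[1])
for _ in range(n_epochs):
    w = gd_epoch(w, X_dev, y_dev, 0.05)

# The test set is touched exactly once, at the very end.
test_mse = mse(w, X_test, y_test)
print(fold_epochs, n_epochs, test_mse)
```

Note that the test set never influences the choice of `n_epochs`; that separation is the whole point of the procedure.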
Best Answer
Adding any regularization (including L2) will increase the error on the training set. That is exactly the point of regularization: we accept more bias in exchange for lower variance. If the regularization strength is chosen well, the hope is that the test error goes down as a result.
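As a small illustration, here is a sketch using closed-form ridge regression (L2-regularized least squares) on synthetic data. The data, the grid of penalty values, and the helper name `ridge_fit` are assumptions made for the example. For ridge, the training error cannot decrease as the penalty grows; whether the test error improves depends on the data and on the penalty chosen, so the sketch simply prints both.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution: (X^T X + lam * I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

# Synthetic data (assumption): few noisy training points, many features.
rng = np.random.default_rng(0)
w_true = rng.standard_normal(30)
X_train = rng.standard_normal((40, 30))
y_train = X_train @ w_true + 2.0 * rng.standard_normal(40)
X_test = rng.standard_normal((1000, 30))
y_test = X_test @ w_true + 2.0 * rng.standard_normal(1000)

lambdas = [0.0, 0.1, 1.0, 10.0, 100.0]
train_errors, test_errors = [], []
for lam in lambdas:
    w = ridge_fit(X_train, y_train, lam)
    train_errors.append(mse(w, X_train, y_train))
    test_errors.append(mse(w, X_test, y_test))

for lam, tr, te in zip(lambdas, train_errors, test_errors):
    print(f"lambda={lam:7.1f}  train MSE={tr:8.3f}  test MSE={te:8.3f}")
```

The training column rises monotonically with the penalty; the test column is the one to watch when choosing the penalty (on a validation set, in practice).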
Here are some related topics.
What problem do shrinkage methods solve?
How to know if a learning curve from SVM model suffers from bias or variance?