A batch-normalisation layer is "bypassed" when it does not normalise the activations by the statistics of the current batch.
In the newer versions (since beta-18 I believe), the population statistics are computed during training as another parameter and then used during test time (e.g. see documentation here and here).
Removing the batch normalisation means folding its additive and multiplicative constants into the closest convolution layer. You can see how this is done in the cnn-imagenet-deploy script in the ImageNet examples.
Sorry for the misunderstanding. We will update the documentation to make it clearer.
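To make the folding step concrete, here is a minimal sketch in plain Python. It is not the code from the deploy script: the function names, the per-channel list layout, and the `eps` value are assumptions made for illustration, and it assumes the batch-normalisation layer directly follows the convolution. Because a convolution is linear, scaling every coefficient of an output channel's filter by that channel's multiplicative constant has the same effect as scaling the output.

```python
import math

def fold_batchnorm(weights, biases, gamma, beta, mean, var, eps=1e-5):
    """Fold per-channel batch-norm constants into the preceding layer.

    weights: one flat list of filter coefficients per output channel.
    biases, gamma, beta, mean, var: one value per output channel.
    (Hypothetical layout, chosen only for this sketch.)
    """
    folded_w, folded_b = [], []
    for w, b, g, bt, m, v in zip(weights, biases, gamma, beta, mean, var):
        scale = g / math.sqrt(v + eps)          # multiplicative constant
        folded_w.append([scale * wi for wi in w])
        folded_b.append(scale * (b - m) + bt)   # additive constant
    return folded_w, folded_b

def linear(x, weights, biases):
    """Stand-in for a 1x1 convolution at a single pixel."""
    return [sum(wi * xi for wi, xi in zip(w, x)) + b
            for w, b in zip(weights, biases)]

def batchnorm(y, gamma, beta, mean, var, eps=1e-5):
    """Test-time batch norm using fixed population statistics."""
    return [g * (yi - m) / math.sqrt(v + eps) + bt
            for yi, g, bt, m, v in zip(y, gamma, beta, mean, var)]

# Made-up numbers: 3 input channels, 2 output channels.
weights = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]
biases = [0.1, -0.2]
gamma, beta = [1.2, 0.8], [0.3, -0.1]
mean, var = [0.4, -0.6], [2.0, 0.5]
x = [1.0, -2.0, 0.5]

reference = batchnorm(linear(x, weights, biases), gamma, beta, mean, var)
folded_w, folded_b = fold_batchnorm(weights, biases, gamma, beta, mean, var)
folded = linear(x, folded_w, folded_b)
```

Here `reference` (convolution followed by batch norm) and `folded` (the single merged layer) agree up to floating-point error, which is why the batch-normalisation layer can be dropped at deployment.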
Determining the number of epochs by, e.g., averaging the number of epochs across the folds and using it for the test run later on?
Shortest possible answer: Yes!
But let me add some context...
I believe you are referring to Section 7.8, pages 246ff, on Early Stopping in the Deep Learning book. The procedure described there, however, is significantly different from yours. Goodfellow et al. suggest splitting your data into three sets first: a training, a dev, and a test set. Then, you train (on the training set) until the error of that model increases (on the dev set), at which point you stop. Finally, you take the trained model that had the lowest dev set error and evaluate it on the test set. No cross-validation is involved at all.
However, you seem to be trying to do early stopping (ES), cross-validation (CV), and model evaluation all on the same set. That is, you seem to be using all your data for CV, training on each split with ES, and then using the average performance over those CV splits as your final evaluation result. If that is the case, that is indeed stark over-fitting (and certainly not what is described by Goodfellow et al.), and your approach gives you exactly the opposite of what ES is meant for, namely a regularization technique to prevent over-fitting. If it is not clear why: you have "peeked" at your final evaluation instances during training to figure out when to stop ("early"). That is, you are optimizing against the evaluation instances during training, which is, by definition, (over-)fitting your model to that evaluation data.
So by now, I hope to have answered your other [two] questions.
The answer by the higgs broson (to your last question, as cited above) already gives a meaningful way to combine CV and ES to save you some training time: You could split your full data into two sets only - a dev and a test set - and use the dev set to do CV while applying ES on each split. That is, you train on each split of your dev set, and stop once the error on the instances you set aside for evaluating that split has reached its minimum [1]. Then you average the number of epochs needed to reach that lowest error across the splits and train on the full dev set for that (averaged) number of epochs. Finally, you validate that outcome on the test set you set aside and haven't touched yet.
[1] Though unlike the higgs broson I would recommend evaluating after every epoch. Two reasons for that: (1) compared to training, the evaluation time will be negligible. (2) Imagine your minimum error is at epoch 51, but you evaluate at epochs 50 and 60. It isn't unlikely that the error at epoch 60 will be lower than at epoch 50. Yet you would then choose 60 as your epoch parameter, which clearly is sub-optimal and in fact even goes a bit against the purpose of using ES in the first place.
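As a rough illustration of the averaging step and of the point in [1], here is a small Python sketch. It assumes you have already recorded one held-out error curve per CV split (one value per epoch); the helper names `best_epoch` and `averaged_epochs` and all error values are made up for this example.

```python
def best_epoch(val_errors, eval_every=1):
    """1-based epoch with the lowest held-out error among evaluated epochs.

    val_errors[i] is the error after epoch i + 1. With eval_every=10 only
    epochs 10, 20, 30, ... are looked at.
    """
    evaluated = range(eval_every, len(val_errors) + 1, eval_every)
    return min(evaluated, key=lambda epoch: val_errors[epoch - 1])

def averaged_epochs(fold_curves, eval_every=1):
    """Average the per-split best epochs into a single epoch budget."""
    per_fold = [best_epoch(curve, eval_every) for curve in fold_curves]
    return round(sum(per_fold) / len(per_fold)), per_fold

# Hypothetical held-out error curves for three CV splits of the dev set.
fold_curves = [
    [0.90, 0.60, 0.40, 0.50, 0.70],        # minimum at epoch 3
    [0.80, 0.70, 0.50, 0.40, 0.30, 0.35],  # minimum at epoch 5
    [1.00, 0.70, 0.50, 0.45, 0.60],        # minimum at epoch 4
]
budget, per_fold = averaged_epochs(fold_curves)

# Footnote [1]: a curve that falls steeply until epoch 51, then rises slowly.
curve = [0.2 + 0.02 * (51 - e) if e < 51 else 0.2 + 0.001 * (e - 51)
         for e in range(1, 81)]
dense = best_epoch(curve)                  # evaluated after every epoch
sparse = best_epoch(curve, eval_every=10)  # evaluated every 10th epoch
```

With these numbers `budget` is 4, so you would train on the full dev set for 4 epochs before touching the test set. On the second curve, `dense` finds epoch 51 while `sparse` settles on epoch 60, because the error at epoch 60 is lower than at epoch 50.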
Parameter count isn't the only thing you need to track when measuring memory usage. You also need to store the data as it goes through the network, which could require a lot of storage, depending on what you're doing. Consider a large number of small filters applied to a large image. The memory consumption of the filters in total is small, but storing lots of filtered versions of a large input image will consume a lot of memory.
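A back-of-the-envelope calculation shows how lopsided this can get. The sketch below is an illustration under stated assumptions: a single convolution layer with stride 1 and "same" padding (so the output keeps the input's height and width), one bias per filter, and 4-byte float32 values. The function name and the layer sizes are made up.

```python
def conv_layer_memory(in_ch, out_ch, kernel, height, width, bytes_per_value=4):
    """Return (parameter_bytes, activation_bytes) for one conv layer.

    Assumes stride 1 and 'same' padding, so each filter produces one
    height x width output map for a single input image.
    """
    params = out_ch * (in_ch * kernel * kernel + 1)  # +1 for the bias
    activations = out_ch * height * width            # one map per filter
    return params * bytes_per_value, activations * bytes_per_value

# 64 filters of size 3x3 applied to a 3-channel 1024x1024 image.
param_bytes, act_bytes = conv_layer_memory(3, 64, 3, 1024, 1024)
ratio = act_bytes / param_bytes
```

Here the filters take 7,168 bytes, while their outputs take 268,435,456 bytes (256 MiB) for a single image. During training these outputs are typically kept for the backward pass and multiplied by the batch size, so activation storage can dominate memory usage.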