Solved – Does memory ever really matter for mini-batch size selection

gradient descent, hyperparameter, machine learning, neural networks, stochastic gradient descent

I'm new to machine learning and am confused about some aspects of stochastic gradient descent.

I've read in several places that, when using vectorized code, the reason mini-batching is needed in stochastic gradient descent is to avoid the performance degradation that sets in when memory limits are hit by too-large mini-batches. But in my code, long before I see any such degradation in processing rate (seconds per epoch), I see a completely different effect: a dramatic decrease in the rate at which the network learns.


This is no surprise, since for a nearly flat processing rate the number of gradient-descent updates per second will be inversely proportional to batch size: the seconds per mini-batch must be increasing. And since (for fixed $\eta$) it is this update rate that determines how quickly the network learns, larger mini-batches must, barring some dramatic, unexpected and unobserved increase in processing rate with mini-batch size, always be worse than smaller ones.
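To spell out that proportionality (writing $N$ for the number of training examples, $B$ for the mini-batch size, and $T(B)$ for the measured seconds per epoch, labels I'm introducing just for this argument):

$$\text{updates per second} \;=\; \frac{N/B}{T(B)} \;\propto\; \frac{1}{B} \qquad \text{whenever } T(B) \text{ is roughly flat in } B.$$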


(Moreover, any per-epoch analysis of performance, especially on the training set, adds further time that grows with mini-batch size.)

It seems, then, that choosing a mini-batch size comes down to selecting a value just large enough for vectorization to "kick in" (very small batches perform poorly in that regard) and not going any further than that, except perhaps to reduce noise in the learning.
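Concretely, what I imagine is a crude sweep like the one below (a sketch only, in NumPy; the layer sizes, batch sizes and forward pass are stand-ins for whatever one's network actually does), looking for where seconds per epoch levels off:

```python
import time
import numpy as np

rng = np.random.default_rng(0)
N, D, H = 50_000, 784, 30            # training-set size, input dim, hidden units: all stand-ins
X = rng.standard_normal((N, D)).astype(np.float32)
W1 = (0.01 * rng.standard_normal((D, H))).astype(np.float32)
W2 = (0.01 * rng.standard_normal((H, 10))).astype(np.float32)

def epoch_seconds(batch_size):
    """Time one epoch of vectorized forward passes at the given mini-batch size."""
    start = time.perf_counter()
    for i in range(0, N, batch_size):
        xb = X[i:i + batch_size]
        hidden = np.maximum(xb @ W1, 0.0)   # hidden layer with ReLU, purely illustrative
        _ = hidden @ W2                     # output layer
    return time.perf_counter() - start

for b in (1, 10, 30, 100, 300, 1000, 3000):
    t = epoch_seconds(b)
    # one would-be parameter update per mini-batch
    print(f"batch {b:5d}: {t:7.2f} s/epoch, {N / b / t:9.1f} mini-batches/s")
```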

Is this, in fact, how one chooses a batch size in practice: by picking the lowest possible "well behaved" batch size (e.g. perhaps where the "level off" point is for processing rate) for one’s algorithm, and using that for all networks and training sets? If so, will that number remain the same for my code on all hardware, or is it an unpredictable function of hardware and software? And — more importantly — what of the story that the reason for mini-batching is to avoid memory issues?


This is consistent with what I see across the board, for both training and test accuracies, for a range of network sizes (the number attached to each plot is the number of nodes in the hidden layer of a simple 3-layer "beginner" network):

[training and test accuracy plots for each network size]

Best Answer

Deciding on a batch size, or more generally, deciding between full-batch learning and breaking the training data into mini-batches, is really about two questions: how fast will the model converge overall, and how biased or over-fitted will it end up being? Of course, hardware limitation is another factor. For example, if you have 64 GB of training data, then full-batch learning on an ordinary machine is not a good option; you get much better processing performance by training on mini-batches that fit in memory, rather than having to shuttle data back and forth to RAM on every iteration. This question explains it very well.
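As a rough sketch of that last point (the file name, dtype and array shape below are purely hypothetical), mini-batching lets you stream a training set that never fits in RAM, for instance through a memory-mapped array:

```python
import numpy as np

# Hypothetical: the large training matrix has been saved to disk as raw float32
# of shape (n_examples, n_features), e.g. tens of GB in total.
n_examples, n_features = 10_000_000, 1_000
X = np.memmap("train_data.bin", dtype=np.float32, mode="r",
              shape=(n_examples, n_features))

batch_size = 256
for start in range(0, n_examples, batch_size):
    batch = np.asarray(X[start:start + batch_size])   # only this slice is read into RAM
    # ... forward/backward pass and parameter update on `batch` go here ...
```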

You have not mentioned which training data or which cost-minimization algorithm you used in your analysis, but all of these factors shape how any regime will perform.

But limitations aside, you can look at this problem in a slightly different way. What gives larger batches their edge is simply that your implementation can vectorize the computation over a batch of examples and perform the matrix calculations in parallel; to the extent that your processing unit can parallelize these operations, you get shorter epochs (NumPy in Python, for example, is backed by implementations that can do exactly this). Here your algorithm and your processing capability define the optimal batch size.
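To make the vectorization point concrete, here is a toy comparison (the layer and batch sizes are arbitrary) of the same forward pass computed example by example in a Python loop versus as a single matrix product over the whole mini-batch:

```python
import time
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((784, 30)).astype(np.float32)          # one weight matrix
batch = rng.standard_normal((1000, 784)).astype(np.float32)    # a mini-batch of 1000 examples

t0 = time.perf_counter()
out_loop = np.stack([x @ W for x in batch])   # one example at a time, in a Python loop
t1 = time.perf_counter()
out_vec = batch @ W                           # the whole mini-batch as one matrix product
t2 = time.perf_counter()

assert np.allclose(out_loop, out_vec, atol=1e-4)
print(f"loop: {t1 - t0:.4f} s   vectorized: {t2 - t1:.4f} s")
```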

In stochastic learning you can also achieve distributed processing simply by spreading the mini-batches across multiple processing units. Additionally, in stochastic learning the examples are picked at random in each iteration to avoid a biased model. It can be argued that this regime converges faster than batch learning (reaching a good solution in fewer iterations), since the parameters are updated after training on fewer examples.
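A minimal sketch of that regime (a plain linear model with squared loss and an arbitrary learning rate, chosen only to keep the example self-contained): shuffle, slice into mini-batches, and update the parameters after each one.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 5_000, 20
X = rng.standard_normal((N, D))
true_w = rng.standard_normal(D)
y = X @ true_w + 0.1 * rng.standard_normal(N)   # noisy linear data, purely illustrative

w = np.zeros(D)
eta, batch_size = 0.01, 32

for epoch in range(20):
    order = rng.permutation(N)                  # new random order each epoch -> unbiased mini-batches
    for start in range(0, N, batch_size):
        idx = order[start:start + batch_size]
        xb, yb = X[idx], y[idx]
        grad = 2.0 * xb.T @ (xb @ w - yb) / len(idx)   # gradient of the mean squared error on this batch
        w -= eta * grad                                # update after every mini-batch, not every epoch
    print(f"epoch {epoch:2d}: training MSE = {np.mean((X @ w - y) ** 2):.5f}")
```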