Deciding on a batch size, or more generally, deciding between full-batch learning and breaking the training data into mini-batches, comes down to the question of how fast the model will converge overall and how biased/over-fitted the resulting model ends up being. Of course, hardware limitation is another factor. E.g. if you have training data of size 64 GB, then full-batch learning on an ordinary machine is not a practical option; you get much better processing performance training on mini-batches that fit in memory, since you are not shuttling data in and out of RAM on every iteration.
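For illustration, here is a minimal numpy sketch (my own, with hypothetical file names and shapes) of streaming mini-batches from a dataset too large to hold in memory, using a memory-mapped array so only the slice being trained on is loaded into RAM:

```python
import numpy as np

# Hypothetical on-disk dataset, memory-mapped so only the slices we touch are read into RAM.
X = np.load("features.npy", mmap_mode="r")   # assumed shape (n_samples, n_features)
y = np.load("labels.npy", mmap_mode="r")     # assumed shape (n_samples,)

def minibatches(X, y, batch_size):
    """Yield successive mini-batches that each fit comfortably in memory."""
    n = X.shape[0]
    for start in range(0, n, batch_size):
        stop = start + batch_size
        # Copy just this slice into RAM; the rest of the file stays on disk.
        yield np.asarray(X[start:stop]), np.asarray(y[start:stop])

for X_batch, y_batch in minibatches(X, y, batch_size=256):
    pass  # one gradient update per mini-batch goes here
```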
This question explains it very well.
You have not mentioned which training data or which cost-minimization algorithm you used in your analysis, but all of these factors determine how each regime will perform.
But limitations aside, you could look at this problem in a slightly different way. What gives larger batches their edge is the simple fact that your implementation can vectorize over the batch and perform the matrix calculations in parallel; to the extent that your processing unit can parallelize these operations, you get shorter epochs (numpy in Python, for example, is backed by implementations that can do this). Here, your algorithm and processing capability define the optimum batch size.
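As a rough illustration of that point (a sketch of my own, not from any particular library), here is the same least-squares gradient computed per example in a Python loop and as a single vectorized matrix product; the vectorized version is the one that benefits from parallel BLAS routines:

```python
import numpy as np

def gradient_loop(X, y, w):
    """Per-example gradient of mean squared error, accumulated in a Python loop."""
    grad = np.zeros_like(w)
    for xi, yi in zip(X, y):
        grad += (xi @ w - yi) * xi
    return grad / len(y)

def gradient_vectorized(X, y, w):
    """Same gradient computed as one matrix product, which BLAS can parallelize."""
    return X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 50))
w_true = rng.normal(size=50)
y = X @ w_true
w = np.zeros(50)
assert np.allclose(gradient_loop(X, y, w), gradient_vectorized(X, y, w))
```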
In stochastic learning you can also achieve distributed processing simply by spreading the mini-batches across multiple processing units. Additionally, in stochastic learning the examples are picked at random in each iteration to avoid a biased model. It can be shown that this regime can converge towards the optimum faster than batch learning (in fewer epochs), since the parameters are updated after training on far fewer examples.
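A minimal sketch of mini-batch SGD with per-epoch shuffling (my own code, with assumed names and a least-squares objective); note that batch_size=1 gives per-example SGD and batch_size=n gives full-batch gradient descent:

```python
import numpy as np

def sgd(X, y, lr=0.01, batch_size=32, epochs=10, seed=0):
    """Mini-batch SGD on least squares; examples are reshuffled every epoch."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)              # random order avoids a biased update sequence
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            grad = Xb.T @ (Xb @ w - yb) / len(idx)
            w -= lr * grad                      # update after seeing only this mini-batch
    return w
```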
Sure, one update with a big mini-batch is "better" (in terms of accuracy) than one update with a small mini-batch. This can be seen in the table you copied into your question (call $N$ the sample size):
- batch size 1: number of updates $27N$
- batch size 20,000: number of updates $8343\times\frac{N}{20000}\approx 0.42N$

You can see that with bigger batches you need far fewer updates for the same accuracy.
But the two can't be compared directly, because they are not processing the same amount of data. I'm quoting the first article:
"We compare the effect of executing $k$ SGD iterations with small minibatches
$B_j$ versus a single iteration with a large minibatch $\displaystyle\bigcup_{1\leq j\leq k} B_j$"
Here it's about processing the same amount of data: while there is a small overhead for running multiple mini-batches, the two take comparable processing resources.
There are several ways to understand why several updates are better (for the same amount of data being read). It's the key idea of stochastic gradient descent vs. gradient descent. Instead of reading everything and then correcting yourself at the end, you correct yourself along the way, which makes the next reads more useful since you are now correcting from a better guess. Geometrically, several updates are better because you are drawing several segments, each in the direction of the (approximate) gradient at the start of that segment, while a single big update is one segment from the very start in the direction of the (exact) gradient. It's better to change direction several times, even if each direction is less precise.
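To make the comparison concrete, here is a small toy sketch (mine, not the experiment from the article) that spends the same $k \cdot b$ examples either as $k$ sequential small-batch updates or as one update on their union; on this convex least-squares problem the sequential updates typically land much closer to the optimum:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(2_000, 20))
w_true = rng.normal(size=20)
y = X @ w_true + 0.1 * rng.normal(size=2_000)

def grad(Xb, yb, w):
    return Xb.T @ (Xb @ w - yb) / len(yb)

lr, k, b = 0.05, 50, 40          # k small batches of size b, covering k*b examples

# k sequential updates, one per small mini-batch B_j
w_seq = np.zeros(20)
for j in range(k):
    Xb, yb = X[j*b:(j+1)*b], y[j*b:(j+1)*b]
    w_seq -= lr * grad(Xb, yb, w_seq)

# a single update on the union of the same mini-batches
w_one = np.zeros(20)
w_one -= lr * grad(X[:k*b], y[:k*b], w_one)

print("loss after k small updates:", np.mean((X @ w_seq - y) ** 2))
print("loss after one big update :", np.mean((X @ w_one - y) ** 2))
```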
The size of the mini-batches essentially sets the frequency of updates: the smaller the mini-batches, the more updates. At one extreme (mini-batch = dataset) you have gradient descent. At the other extreme (mini-batch = one line) you have full per-line SGD. Per-line SGD is better anyway, but bigger mini-batches are suited to more efficient parallelization.
Towards the end of the convergence process, SGD becomes less precise than (batch) GD. But at this point the extra precision is (usually) useless. While you get a slightly smaller loss on the training set, you don't get real predictive power: you are only chasing a very precise optimum, and it does not help. If the loss function is correctly regularized (which prevents over-fitting), you don't exactly "over"-fit, you just uselessly "hyper"-fit. This shows up as a non-significant change in accuracy on the test set.
Yes. Before mini-batching existed, SGD referred specifically to batch size equal to one.
You can actually use a bigger effective batch size though; you just need to add up the gradients from sequential samples within a batch before applying the update. This is called Gradient Accumulation. See link.
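As a rough numpy sketch of the idea (my own, not tied to any particular framework's API), single-sample gradients are summed over several steps and applied as one update, emulating a larger batch size without forming the whole batch at once:

```python
import numpy as np

def sgd_with_accumulation(X, y, lr=0.01, accum_steps=8, epochs=5, seed=0):
    """Single-sample gradients are accumulated for accum_steps samples,
    then averaged and applied as one update, emulating a batch of size accum_steps."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        accum = np.zeros(d)
        for count, i in enumerate(rng.permutation(n), start=1):
            accum += (X[i] @ w - y[i]) * X[i]      # gradient for one sample
            if count % accum_steps == 0:
                w -= lr * accum / accum_steps       # apply the accumulated update
                accum[:] = 0.0                      # reset for the next "virtual" batch
    return w
```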