Solved – Stochastic gradient descent vs. mini-batch size 1

Tags: gradient descent, machine learning, neural networks, stochastic gradient descent

Is stochastic gradient descent basically the name given to mini-batch training where the batch size is 1 and the training rows are selected at random? That is, is it the same as 'normal' gradient descent, with only the manner in which the training data is supplied making the difference?

One thing that confuses me is that I've seen people say that even with SGD you can supply more than one data point and use larger batches, so won't that just make it 'normal' mini-batch gradient descent?

Best Answer

Standard gradient descent and batch gradient descent were originally used to describe taking the gradient over all data points, and by some definitions, mini-batch gradient descent corresponds to taking a small number of data points (the mini-batch size) to approximate the gradient at each iteration. Strictly speaking, then, stochastic gradient descent is the case where the mini-batch size is 1.
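Written out, all three variants are the same update rule; the only thing that changes is the size of the index set $B$ sampled at each iteration (the notation below is mine, not from the question):

$$
\theta \leftarrow \theta - \eta \,\frac{1}{|B|}\sum_{i \in B} \nabla_\theta\, \ell(\theta;\, x_i, y_i),
\qquad |B| = N \;(\text{batch}), \quad 1 < |B| < N \;(\text{mini-batch}), \quad |B| = 1 \;(\text{SGD}).
$$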

However, perhaps in an attempt to avoid the clunky term "mini-batch", stochastic gradient descent almost always actually refers to mini-batch gradient descent, and we talk about the "batch size" to refer to the mini-batch size. Gradient descent with a batch size > 1 is still stochastic, so I think it's not an unreasonable renaming, and pretty much no one uses true SGD with a batch size of 1, so nothing of value was lost.
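To make that concrete, here is a minimal sketch (Python/NumPy, illustrative names; least-squares regression is just a stand-in objective) where one routine covers all three cases depending on the batch size you pass:

```python
import numpy as np

def minibatch_gd(X, y, batch_size, lr=0.01, n_iters=1000, rng=None):
    """Mini-batch gradient descent for least-squares linear regression.

    batch_size=1 recovers "true" SGD; batch_size=len(X) recovers
    full-batch gradient descent. Names and defaults are illustrative.
    """
    rng = rng or np.random.default_rng(0)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        # Sample a random mini-batch of rows (with replacement, for simplicity).
        idx = rng.integers(0, n, size=batch_size)
        Xb, yb = X[idx], y[idx]
        # Gradient of the mean squared error on this mini-batch.
        grad = (2.0 / batch_size) * Xb.T @ (Xb @ w - yb)
        w -= lr * grad
    return w

# The same routine covers all three cases:
# minibatch_gd(X, y, batch_size=len(X))  -> (full-)batch gradient descent
# minibatch_gd(X, y, batch_size=32)      -> mini-batch gradient descent
# minibatch_gd(X, y, batch_size=1)       -> SGD in the strict sense
```

With batch_size=1 you get the strict textbook SGD; with batch_size=len(X) you get full-batch gradient descent; anything in between is mini-batch, and the gradient estimate is stochastic in every case except the full batch.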