Solved – Difference between batch_size=1 and SGD optimisers in Keras

keras · python · tensorflow

I have a question: to implement stochastic gradient descent we set batch_size=1, whatever the optimizer. So what does the SGD optimizer do that is different from setting batch_size=1 with any other optimizer? Thank you in advance for helping.

from keras import optimizers

sgd = optimizers.SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='mean_squared_error', optimizer=sgd)

vs
model.compile(loss='mean_squared_error', optimizer='rmsprop')
model.fit(X,y,batch_size=1)

Best Answer

In practice, Keras decouples the hyperparameters that are specific to the optimizer (for instance, the learning rate and the momentum value) from the general training parameters (for instance, the number of epochs and the batch size). This makes sense because the two can then be handled independently. That is, just by using,

sgd = optimizers.SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)

one can define different variants of the Gradient Descent (GD) algorithm: Batch GD, where batch_size = number of training samples (m); Mini-Batch (Stochastic) GD, where 1 < batch_size < m; and finally online (Stochastic) GD, where batch_size = 1. Here, batch_size refers to the argument passed to model.fit(), as illustrated in the sketch below.
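As a minimal sketch (the toy dataset, layer sizes and epoch counts here are illustrative assumptions, not from the original post), the same SGD optimizer yields each GD variant purely through the batch_size argument of model.fit():

import numpy as np
from keras import optimizers
from keras.models import Sequential
from keras.layers import Dense

m = 100                                   # number of training samples
X = np.random.rand(m, 3)
y = np.random.rand(m, 1)

model = Sequential([Dense(1, input_dim=3)])
sgd = optimizers.SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='mean_squared_error', optimizer=sgd)

# Batch GD: one weight update per epoch, computed over all m samples.
model.fit(X, y, batch_size=m, epochs=5)

# Mini-batch (stochastic) GD: 1 < batch_size < m.
model.fit(X, y, batch_size=32, epochs=5)

# Online (stochastic) GD: one weight update per sample.
model.fit(X, y, batch_size=1, epochs=5)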

While it is true that, in theory, SGD is nothing but setting batch_size=1, that particular setting has fallen out of favor mainly because it is expensive in terms of training time (there are simply too many weight updates to perform). With how the community has progressed, SGD therefore mostly refers to mini-batch SGD, with batch_size=32 being the default in Keras.

batch_size: Integer or None. Number of samples per gradient update. If unspecified, batch_size will default to 32.
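So, as a sketch (reusing the assumed X, y and model from above), leaving batch_size unspecified is equivalent to passing 32 explicitly:

model.compile(loss='mean_squared_error', optimizer='rmsprop')
model.fit(X, y, epochs=5)                 # implicit batch_size=32
model.fit(X, y, batch_size=32, epochs=5)  # explicit, same mini-batch updates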
