Solved – Difference between batch_size=1 and SGD optimisers in Keras

keras · python · tensorflow

I have a question: to implement stochastic gradient descent we set batch_size=1, whatever the optimizer. So what does the SGD optimizer do that is different from setting batch_size=1 with any other optimizer? Thank you in advance for helping.

from keras import optimizers

sgd = optimizers.SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='mean_squared_error', optimizer=sgd)

vs
model.compile(loss='mean_squared_error', optimizer='rmsprop')
model.fit(X,y,batch_size=1)

Best Answer

In practice, Keras decouples the hyperparameters that are specific to the optimizer (for instance, the learning rate and the momentum value) from the general training parameters (for instance, the number of epochs and the batch size). This makes sense because the two can then be handled independently. That is, just by using,

sgd = optimizers.SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)

one can define different variants of the Gradient Descent (GD) algorithm: Batch GD, where batch_size = number of training samples (m); Mini-Batch (Stochastic) GD, where 1 < batch_size < m; and finally online (Stochastic) GD, where batch_size = 1. Here, batch_size refers to the argument passed to model.fit(), as illustrated in the sketch below.
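As a minimal sketch (the toy dataset, layer sizes and epoch counts here are illustrative assumptions, not from the original post), the same SGD optimizer yields each GD variant purely through the batch_size argument of model.fit():

import numpy as np
from keras import optimizers
from keras.models import Sequential
from keras.layers import Dense

m = 100                                   # number of training samples
X = np.random.rand(m, 3)
y = np.random.rand(m, 1)

model = Sequential([Dense(1, input_dim=3)])
sgd = optimizers.SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
model.compile(loss='mean_squared_error', optimizer=sgd)

# Batch GD: one weight update per epoch, computed over all m samples.
model.fit(X, y, batch_size=m, epochs=5)

# Mini-batch (stochastic) GD: 1 < batch_size < m.
model.fit(X, y, batch_size=32, epochs=5)

# Online (stochastic) GD: one weight update per sample.
model.fit(X, y, batch_size=1, epochs=5)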

While it is true that, in theory, SGD is nothing but setting batch_size=1, that particular setting has fallen out of favor mainly because it is expensive in terms of training time (there are simply too many weight updates to perform). With how the community has progressed, SGD therefore mostly refers to mini-batch SGD, with batch_size=32 being the default in Keras.

batch_size: Integer or None. Number of samples per gradient update. If unspecified, batch_size will default to 32.
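So, as a sketch (reusing the assumed X, y and model from above), leaving batch_size unspecified is equivalent to passing 32 explicitly:

model.compile(loss='mean_squared_error', optimizer='rmsprop')
model.fit(X, y, epochs=5)                 # implicit batch_size=32
model.fit(X, y, batch_size=32, epochs=5)  # explicit, same mini-batch updates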
