Solved – Can small SGD batch size lead to faster overfitting

deep learning, keras, neural networks, overfitting, stochastic gradient descent

I have a feedforward neural net, trained on approximately 34k samples and tested on 8k samples. There are 139 features in the dataset. The ANN does binary classification between two labels, 0 and 1, so I am using a sigmoid activation in the last layer, and two hidden layers, each with 400 units. The network is built with the following Keras code:

from keras.models import Sequential
from keras.layers import Dense, Activation, Dropout
from keras.callbacks import ModelCheckpoint, LambdaCallback

model = Sequential()
model.add(Dense(units=139, input_dim=len(list(X))))  # input layer, one unit per feature

model.add(Dense(units=400))
model.add(Activation('relu'))
model.add(Dropout(0.2))

model.add(Dense(units=400))
model.add(Activation('relu'))
model.add(Dropout(0.2))

model.add(Dense(units=1))
model.add(Activation('sigmoid'))

model.compile(optimizer='sgd',
              loss='binary_crossentropy',
              metrics=['accuracy'])

# save the model and evaluate it on the test set after every epoch
checkpoint = ModelCheckpoint('tempmodelKeras.h5', period=1)
custom = LambdaCallback(on_epoch_end=lambda epoch, logs: test_callback_wrapper())
model.fit(X, y, epochs=500, batch_size=128, callbacks=[checkpoint, custom])

test_callback_wrapper() just evaluates the model on the test dataset after each epoch and computes the average precision score over different thresholds.
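For reference, here is a minimal sketch of what such a callback might look like (the actual function is not shown in the question; the names X_test and y_test and the use of sklearn.metrics.average_precision_score are assumptions):

from sklearn.metrics import average_precision_score

def test_callback_wrapper():
    # evaluate the current model on the held-out test set after each epoch;
    # average precision summarises precision/recall over all score thresholds
    y_score = model.predict(X_test).ravel()
    print('test average precision: %.4f' % average_precision_score(y_test, y_score))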

Now the part I need help with:
The following plot shows the epoch number on the X axis and the average precision on the test set on the Y axis.

Average precision for different batch sizes over epochs

I tried three different batch sizes (32, 128, 256). Looking at the plot, smaller batch sizes seem 'faster' in terms of the number of epochs needed to reach the maximum average precision, but also more prone to overfitting. However, I have read several articles stating that larger batches usually lead to overfitting and that smaller batches are better. How is it possible that my plot shows the opposite, with larger batches performing better?

Best Answer

Using "epoch" on the x-axis to compare "speed" of convergence across different batch sizes makes no sense, since the number of weight updates per epoch depends on the batch size. With smaller batches, the learning algorithm performs more weight updates per epoch, so it naturally appears to converge faster.
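To make this concrete, with the roughly 34k training samples mentioned in the question, an epoch contains ceil(n_samples / batch_size) weight updates:

import math

n_samples = 34000  # approximate training set size from the question
for batch_size in (32, 128, 256):
    updates = math.ceil(n_samples / batch_size)
    print('batch size %3d -> %4d weight updates per epoch' % (batch_size, updates))
# batch size  32 -> 1063 weight updates per epoch
# batch size 128 ->  266 weight updates per epoch
# batch size 256 ->  133 weight updates per epoch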

In your image, the gray curve (batch size = 32) starts decreasing (overfitting) around epoch 40. The orange curve (batch size = 128, four times larger) seems to peak around epoch 160, i.e. four times later, which is after exactly the same number of weight updates. You might well see a similar peak on the blue curve (batch size = 256) around epoch 320.
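An easy way to check this is to replot the curves against the number of weight updates rather than epochs; the peaks should then roughly line up. A sketch of such a helper (ap_per_epoch stands for the per-epoch average-precision values you logged, and the ap_32/ap_128 names in the usage comment are placeholders):

import numpy as np
import matplotlib.pyplot as plt

def plot_against_updates(ap_per_epoch, batch_size, n_samples=34000, **plot_kwargs):
    # convert the per-epoch metric to an x-axis of cumulative weight updates
    epochs = np.arange(1, len(ap_per_epoch) + 1)
    updates = epochs * np.ceil(n_samples / batch_size)
    plt.plot(updates, ap_per_epoch, label='batch size %d' % batch_size, **plot_kwargs)

# e.g. plot_against_updates(ap_32, 32); plot_against_updates(ap_128, 128); ...
# then: plt.xlabel('weight updates'); plt.ylabel('average precision'); plt.legend(); plt.show()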

Generally, smaller batches lead to noisier gradient estimates, which makes them better able to escape poor local minima and can help prevent overfitting. On the other hand, very small batches may be too noisy for good learning. In the end, batch size is just another hyperparameter that needs to be tuned on the specific dataset. In your case it does not seem to play a big role, since all curves peak at roughly the same average precision, and early stopping should lead to similar performance regardless of the batch size.
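If you want to automate the early stopping, Keras provides an EarlyStopping callback. A minimal sketch (assuming a held-out validation set X_val, y_val is passed to fit, and noting that restore_best_weights requires a reasonably recent Keras version):

from keras.callbacks import EarlyStopping

# stop once the validation loss has not improved for `patience` epochs
# and roll back to the best weights seen so far
early_stop = EarlyStopping(monitor='val_loss', patience=20,
                           restore_best_weights=True)

model.fit(X, y, epochs=500, batch_size=128,
          validation_data=(X_val, y_val),
          callbacks=[checkpoint, custom, early_stop])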