Solved – Why scale cost functions by 1/n in a neural network

gradient-descent, neural-networks

In neural networks, I always see cost functions scaled by $1/n$, where $n$ is the total number of training examples. I also see the same scaling applied when a regularization term is added onto the cost. Why is this done?

For example, here is the cross-entropy loss with a regularization term at the end:
$$C = -\frac{1}{n}\sum_x \sum_j \left[ y_j \ln a_j^L + (1 - y_j)\ln(1 - a_j^L) \right] + \frac{\lambda}{2n}\sum_w w^2$$

This scaling means that the gradients used to update the weights are scaled as well, so increasing the amount of training data slows down learning: if you double your training data, you learn at half the speed. With 2000 examples (which is nothing!) I need an insane learning rate of 1.0 just to get anywhere, which makes me certain I am misunderstanding something here.

When I take out this $1/n$ scaling, I have no problems at all, and I don't have to adjust my learning rate when I add more training data, which makes me even more sure that I am missing something here.
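To make the issue concrete, here is a minimal sketch (NumPy, with a made-up squared-error model; the names and numbers are purely illustrative) of what I mean: if the summed batch gradient is scaled by the total dataset size rather than the batch size, growing the dataset shrinks the update.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=5)

def batch_grad(X, y, w, scale):
    # Gradient of a squared-error loss summed over the batch, then scaled.
    return scale * X.T @ (X @ w - y)

X = rng.normal(size=(32, 5))   # one fixed mini-batch of 32 examples
y = rng.normal(size=32)

for n_total in (2_000, 4_000, 8_000):
    g = batch_grad(X, y, w, scale=1.0 / n_total)  # scale by total dataset size
    print(n_total, np.linalg.norm(g))
    # The gradient norm halves each time n_total doubles, even though the
    # batch itself never changes -- hence the need for a huge learning rate.
```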

Another problem is that I'm using very dynamic data augmentation, where every mini-batch is randomly rotated, scaled, corrupted with noise, etc., so that the network never really sees the same input twice. How would you even apply this $1/n$ scaling in that case?

Best Answer

In terms of mini-batch learning, $n$ should be the size of the batch rather than the total amount of training data (which in your case is effectively infinite, given the dynamic augmentation).

Gradients are scaled by $1/n$ because we are taking the average over the batch, so the same learning rate can be used regardless of the batch size.
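As a quick illustration (a NumPy sketch with a hypothetical squared-error model, not your actual network): averaging over the batch keeps the gradient magnitude roughly independent of the batch size, whereas summing makes it grow linearly with the batch size.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=5)

for m in (8, 64, 512):
    X = rng.normal(size=(m, 5))
    y = rng.normal(size=m)
    g_sum = X.T @ (X @ w - y)   # gradient summed over the batch
    g_mean = g_sum / m          # gradient averaged over the batch (1/n with n = m)
    print(m, np.linalg.norm(g_sum), np.linalg.norm(g_mean))
    # g_sum grows roughly linearly with m; g_mean stays about the same,
    # so one learning rate works for any batch size.
```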

Edit
$$w \rightarrow \left(1 - \frac{\eta\lambda}{n}\right)w - \frac{\eta}{m}\sum_x \frac{\partial C_x}{\partial w}, \qquad b \rightarrow b - \frac{\eta}{m}\sum_x \frac{\partial C_x}{\partial b}$$
where the sums run over the training examples $x$ in a mini-batch of size $m$, and $n$ is the total training-set size.
I found this later on that page; it shows how to set the hyper-parameters in the context of mini-batch learning. So it seems the formula and the paragraph you refer to are about full-batch learning, in which case $n$ should be the total number of training examples.
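In code, the update above might look like this (a minimal NumPy sketch; `sgd_step` and its arguments are illustrative, not taken from the book):

```python
import numpy as np

def sgd_step(w, grad_sum, eta, lam, m, n):
    """One regularized mini-batch SGD step.

    w        -- weight vector
    grad_sum -- sum over the mini-batch of per-example gradients dC_x/dw
    eta      -- learning rate; lam -- L2 regularization strength
    m        -- mini-batch size; n -- total number of training examples
    """
    # The data term is averaged over the batch (divide by m), while the
    # weight-decay factor uses the full training-set size n, as in the
    # equations above.
    return (1.0 - eta * lam / n) * w - (eta / m) * grad_sum

# Example step with illustrative values:
w = np.zeros(5)
grad_sum = np.ones(5)  # stand-in for an actual summed batch gradient
w = sgd_step(w, grad_sum, eta=0.5, lam=5.0, m=10, n=50_000)
```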

Edit

So now the question becomes: why use the total training-set size in the regularization term, instead of the batch size?

I don't know whether there's a theoretical interpretation for this; for me, it just keeps the hyper-parameters independent of the batch size, since we are not summing the regularization term across the batch.
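To spell that out in the notation above, differentiating the mini-batch cost gives

$$C = \frac{1}{m}\sum_{x \in B} C_x + \frac{\lambda}{2n}\sum_w w^2 \quad\Longrightarrow\quad w \leftarrow \left(1 - \frac{\eta\lambda}{n}\right)w - \frac{\eta}{m}\sum_{x \in B}\frac{\partial C_x}{\partial w},$$

and the decay factor $(1 - \eta\lambda/n)$ contains no $m$, so the same $\lambda$ works for any batch size.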