I am training a network on a problem where the magnitude of the loss is quite small. I observed that the network didn't really train well until I started scaling up the loss, for example multiplying it by 1e5. My loss function is the quadratic cost.
My intuition is that this helped because larger error magnitudes are propagated back through the network, which counteracts vanishing gradients in the earlier layers. Removing the scaling and raising the learning rate instead did not help; it made the network diverge.
Any ideas why this helped? Or does this behavior indicate mistakes in my implementation?
Best Answer
Short answer:
It depends on the optimizer and the regularization term:
Without regularization, using the SGD optimizer: scaling the loss by $\alpha$ is equivalent to scaling SGD's learning rate by $\alpha$.
Without regularization, using the Nadam optimizer: scaling the loss by $\alpha$ has no effect.
With regularization, using either the SGD or the Nadam optimizer: changing the scale of the prediction loss changes the trade-off between the prediction loss and the regularization term.
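The first two claims can be checked numerically on a toy one-parameter quadratic cost. This is a minimal NumPy sketch, not the asker's network: the learning rates and step counts are illustrative, and a plain bias-corrected Adam loop stands in for Nadam, since both share the scale-cancelling $\hat m / \sqrt{\hat v}$ update:

```python
import numpy as np

# Toy model: a single weight w with quadratic cost L(w) = 0.5 * w**2,
# so dL/dw = w. Scaling the loss by `scale` scales the gradient by `scale`.

def sgd_run(scale, lr, steps=10, w=3.0):
    """Plain SGD on the loss scaled by `scale`."""
    for _ in range(steps):
        g = scale * w                 # gradient of the scaled loss
        w = w - lr * g
    return w

def adam_run(scale, lr=0.01, steps=50, w=3.0, eps=1e-8):
    """Bias-corrected Adam (stand-in for Nadam) on the scaled loss."""
    b1, b2 = 0.9, 0.999
    m = v = 0.0
    for t in range(1, steps + 1):
        g = scale * w
        m = b1 * m + (1 - b1) * g     # first-moment estimate
        v = b2 * v + (1 - b2) * g * g # second-moment estimate
        m_hat = m / (1 - b1 ** t)
        v_hat = v / (1 - b2 ** t)
        # `scale` cancels in m_hat / sqrt(v_hat), up to eps.
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

# Claim 1: for SGD, scaling the loss by alpha == scaling the lr by alpha.
alpha = 1e5
w_scaled_loss = sgd_run(scale=alpha, lr=1e-7)
w_scaled_lr = sgd_run(scale=1.0, lr=1e-7 * alpha)
print(np.isclose(w_scaled_loss, w_scaled_lr))   # same trajectory

# Claim 2: for Adam-family optimizers, the loss scale has (almost) no
# effect; the only leftover dependence is the tiny eps in the denominator.
print(np.isclose(adam_run(scale=1.0), adam_run(scale=1e5)))
```

This also explains why raising the learning rate alone did not reproduce the effect for an adaptive optimizer: the update direction is normalized by the gradient's running magnitude, so the loss scale cancels rather than being absorbed into the step size.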
Full answer: