What’s the effect of scaling a loss function in deep learning?

deep-learning, loss-functions, optimization

I am training a network on a problem where the magnitude of the loss is quite small. I observed that the network didn't really train well until I started scaling up the loss function, for example multiplying it by 1e5. My loss function is the quadratic cost.

My intuition is that this helped because larger error magnitudes are propagated back through the network, which essentially counteracts vanishing gradients in the earlier layers. Removing the scaling and raising the learning rate instead did not help; it made the network diverge.

Any ideas why this helped? Or does this behavior indicate mistakes in my implementation?

Best Answer

Short answer:

It depends on the optimizer and the regularization term:

  • Without regularization, using the SGD optimizer: scaling the loss by $\alpha$ is equivalent to scaling SGD's learning rate by $\alpha$.

  • Without regularization, using Nadam: scaling the loss by $\alpha$ has essentially no effect.

  • With regularization, using either the SGD or Nadam optimizer: changing the scale of the prediction loss changes the trade-off between the prediction loss and the regularization term.


Full answer:

  • No regularization + SGD: Assuming your total loss consists of a prediction loss (e.g. mean squared error) and no regularization loss (such as L2 weight decay), scaling the output of the loss function by $\alpha$ is equivalent to scaling the learning rate $\eta$ by $\alpha$ when using SGD (a numerical check follows this list): $$L_{\text{new}} = \alpha L_{\text{old}} \;\Rightarrow\; \nabla_{W_{t}}L_{\text{new}} = \alpha\nabla_{W_{t}}L_{\text{old}}\\ W_{t+1} = W_{t} - \eta \nabla_{W_{t}}L_{\text{new}} = W_{t} - \eta\alpha\nabla_{W_{t}}L_{\text{old}} = W_{t} - \eta'\nabla_{W_{t}}L_{\text{old}}, \quad \text{where } \eta'=\eta\alpha$$
  • No regularization + Nadam: Using Nadam without a regularization term, I do not believe multiplying the loss by $\alpha$ throughout training changes the training procedure in any meaningful way. In the update rule, $\hat{m}_t$ gets multiplied by $\alpha$ and $\hat{n}_t$ by $\alpha^2$, so the scales cancel (up to the small $\epsilon$ term): $$ W_{t+1} = W_{t} - \frac{\eta}{\sqrt{\hat{n}_t+\epsilon}}\hat{m}_t $$ A sketch after this list illustrates the cancellation numerically.
  • With regularization: This is most likely the case you are facing. If your loss contains an explicit regularization term such as L2 weight decay, then scaling the output of your prediction loss changes the trade-off between the prediction loss and the regularization loss (see the last sketch below): $$L_\text{old} = \text{MSE} + \lambda\cdot\text{weight decay}\\ L_\text{new} = \alpha\,\text{MSE} + \lambda\cdot\text{weight decay} = \alpha\left(\text{MSE} + \frac{\lambda}{\alpha}\cdot\text{weight decay}\right)$$
  • Using SGD, this would be equivalent to using $\alpha\eta$ as your new learning rate instead of $\eta$, and using $\frac{\lambda}{\alpha}$ as your new L2 regularization scale.
  • Using Nadam, the overall factor of $\alpha$ has no effect (as shown above), so you effectively end up with $\frac{\lambda}{\alpha}$ as your new L2 regularization scale.
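
To make the SGD case concrete, here is a minimal numerical sketch (made-up data and a hand-computed MSE gradient, not your network): one SGD step on a loss scaled by $\alpha$ gives exactly the same weights as one step with the learning rate scaled by $\alpha$.

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(32, 4)), rng.normal(size=32)   # toy regression data
w0 = rng.normal(size=4)                                 # weights of a linear model
eta, alpha = 1e-3, 1e5

def mse_grad(w, scale=1.0):
    # gradient of scale * mean((X @ w - y)**2) with respect to w
    return scale * 2.0 * X.T @ (X @ w - y) / len(y)

# one step on the scaled loss vs. one step with the scaled learning rate
w_scaled_loss = w0 - eta * mse_grad(w0, scale=alpha)
w_scaled_lr   = w0 - (eta * alpha) * mse_grad(w0)

print(np.allclose(w_scaled_loss, w_scaled_lr))  # True
```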
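
For the adaptive-optimizer case, the sketch below uses a plain Adam-style update rather than the full Nadam rule; the cancellation argument is the same because both normalize $\hat{m}_t$ by $\sqrt{\hat{n}_t+\epsilon}$. The update computed from the gradient scaled by $\alpha$ matches the unscaled one up to the tiny $\epsilon$ term.

```python
import numpy as np

def adam_step(w, g, m, n, t, eta=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # one simplified Adam-style step (illustrative, not the exact Nadam rule)
    m = b1 * m + (1 - b1) * g
    n = b2 * n + (1 - b2) * g**2
    m_hat = m / (1 - b1**t)   # scales linearly with the gradient
    n_hat = n / (1 - b2**t)   # scales quadratically with the gradient
    return w - eta * m_hat / np.sqrt(n_hat + eps)

rng = np.random.default_rng(0)
X, y = rng.normal(size=(32, 4)), rng.normal(size=32)
w0 = rng.normal(size=4)
alpha = 1e5

g = 2.0 * X.T @ (X @ w0 - y) / len(y)   # MSE gradient
zeros = np.zeros(4)

w_plain  = adam_step(w0, g,         zeros, zeros, t=1)
w_scaled = adam_step(w0, alpha * g, zeros, zeros, t=1)

# tiny: the alpha factor has cancelled, only eps prevents exact equality
print(np.max(np.abs(w_plain - w_scaled)))
```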
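
Finally, the regularization trade-off is just a refactoring of the loss, but writing it out numerically may help (again with illustrative names): scaling only the MSE term by $\alpha$ leaves an overall factor of $\alpha$, which SGD absorbs into the learning rate, times a loss whose effective L2 coefficient is $\lambda/\alpha$.

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(32, 4)), rng.normal(size=32)
w = rng.normal(size=4)
lam, alpha = 1e-2, 1e5

mse = np.mean((X @ w - y) ** 2)   # prediction loss
l2  = np.sum(w ** 2)              # weight-decay penalty

L_new        = alpha * mse + lam * l2               # scaling only the MSE term
L_refactored = alpha * (mse + (lam / alpha) * l2)   # same number, rewritten

print(np.isclose(L_new, L_refactored))  # True: only lambda/alpha sets the trade-off
```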