Solved – Neural networks – Switching loss function during training for better gradients

deep learning, loss-functions, neural networks

I'm training a neural network $D$ for binary classification using the binary cross-entropy loss (where $y_i$ is either 1 or 0, and $D(x_i)$ is a value in $[0,1]$):

$$-y_i\log(D(x_i)) - (1 - y_i)\log(1 - D(x_i))$$

Early in training, when classification performance is relatively poor, this loss function gives good gradients. However, as performance improves, the gradients become progressively worse. For a discriminator $D$ that performs well, the following loss function produces better gradients:

$$y_i\log(1 - D(x_i)) + (1 - y_i)\log(D(x_i))$$

Note that both loss functions have their minimum at the same value of $D(x)$.
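To make the gradient behaviour concrete, here is a minimal numerical sketch (assuming NumPy) that compares the gradients of the two losses with respect to $D(x)$ for a positive example ($y_i = 1$); the probed values of $D(x)$ are illustrative, not from the question:

```python
import numpy as np

# Gradient of each loss with respect to D(x) for a positive example (y = 1).
# Standard BCE:  L1 = -log(D)     =>  dL1/dD = -1 / D
# Swapped loss:  L2 = log(1 - D)  =>  dL2/dD = -1 / (1 - D)
d_vals = np.array([0.1, 0.5, 0.9, 0.99])  # illustrative discriminator outputs

grad_bce     = -1.0 / d_vals           # shrinks in magnitude as D -> 1
grad_swapped = -1.0 / (1.0 - d_vals)   # grows in magnitude as D -> 1

for d, g1, g2 in zip(d_vals, grad_bce, grad_swapped):
    print(f"D(x) = {d:.2f}:  dBCE/dD = {g1:7.2f},  dSwapped/dD = {g2:8.2f}")
```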

My question is:

  • Once $D$ reaches a certain performance, can I switch out the loss functions to get better gradients?
  • Has this been done before?
  • Would it be worth it?

Best Answer

  1. Yes, you can do it. It's your network; anything you can code, you can do to it. As was mentioned in the comments, switching essentially gives you a starting point for a different optimization problem (see the sketch after this list). The fact that both losses have the same minimum doesn't matter much, because the whole problem is nonconvex and you might get stuck in different local minima or plateaus.

  2. Have you heard of pre-training? Although I don't know of any published network that uses this method to get faster convergence, it is common to pre-train a network in some way before training it with the final loss function. Usually, though, this is done in an unsupervised way.

  3. You answered this yourself: it gets you better gradients, so it's worth it with respect to gradients. The question is whether you could get better results by doing something else, e.g. using momentum, adjusting your learning rate along the way, or, as mentioned before, using some pre-training method.
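Regarding point 1, a minimal PyTorch-style sketch of such a switch might look like the following; the model, the running accuracy, and the switching threshold of 0.8 are hypothetical placeholders, not anything prescribed by the question:

```python
import torch

def bce_loss(d_out, y, eps=1e-7):
    # Standard binary cross-entropy: -y*log(D) - (1-y)*log(1-D)
    d_out = d_out.clamp(eps, 1 - eps)
    return -(y * torch.log(d_out) + (1 - y) * torch.log(1 - d_out)).mean()

def swapped_loss(d_out, y, eps=1e-7):
    # Alternative loss from the question: y*log(1-D) + (1-y)*log(D)
    d_out = d_out.clamp(eps, 1 - eps)
    return (y * torch.log(1 - d_out) + (1 - y) * torch.log(d_out)).mean()

def training_step(model, optimizer, x, y, running_accuracy, switch_at=0.8):
    # Hypothetical switching rule: use standard BCE until the running
    # accuracy crosses a threshold, then switch to the alternative loss.
    d_out = model(x)  # assumed to output probabilities in [0, 1]
    loss_fn = bce_loss if running_accuracy < switch_at else swapped_loss
    loss = loss_fn(d_out, y)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the two losses share the same minimizer but not the same scale, it may also be worth re-tuning the learning rate at the moment of the switch.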