Solved – If I cannot differentiate the loss function, what is the best method for optimizing the weights in the neural network?

backpropagation, gradient descent, machine learning, neural networks

Suppose the output of my network is $y \in [0, 1]$ for a given input $x$.

The loss function to be minimized is $f(x) = -\sum_i (w_i y_i + |y_i - y_{i-1}|)$, with real weights $w_i$.

The dependency of one output on the other means I can't calculate derivatives and plug this into a backpropagation algorithm. (I've tried and failed.)

I see two questions:

  • Have I given up on calculus too early? Is it possible to differentiate $f(x)$ for use in backpropagation?
  • If not, what efficient methods can I use to find an optimal solution? Right now I randomly tinker with weights and biases, and accept them if I find a new minimum.

Best Answer

Let $\theta$ denote a vector of network parameters to learn (weights/biases) and $g$ denote the function the network implements, so $y_i = g(x_i)$ (implicitly depends on $\theta$). It looks like the gradient of your loss function w.r.t. $\theta$ is:

$$-\sum_{i=1}^n [ w_i \nabla_\theta g(x_i) + \text{sgn}(g(x_i) - g(x_{i-1})) [\nabla_\theta g(x_i) - \nabla_\theta g(x_{i-1})] ]$$

On one hand, this seems pretty clean, because you can easily find the gradient of $g$, evaluate it for the different inputs, and plug in those values. There are two tricky parts:

Because of the absolute value in the loss function, its gradient is undefined when two subsequent outputs are identical. Depending on your data, that might be a very unlikely circumstance. Otherwise, I suppose you could call the expression above a subgradient and use it to do subgradient descent, which looks pretty much like regular gradient descent. The sgn function will return 0 where subsequent outputs are identical. Consider that something similar occurs with ReLUs; the activation function is non-differentiable at 0, but the subgradient is typically declared to be 0, 0.5, or 1 at this point, and standard gradient-based learning methods work fine (e.g. see here).
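As a concrete illustration of that point, here is a minimal NumPy sketch of the subgradient of your loss with respect to the outputs $y$ (which you would then chain with $\nabla_\theta g$ in backpropagation). The function names `loss` and `loss_subgrad` are my own; note that `np.sign` already returns 0 at ties, giving exactly the subgradient convention described above:

```python
import numpy as np

def loss(y, w):
    # f(y) = -sum_i (w_i * y_i + |y_i - y_{i-1}|), diffs over consecutive outputs
    return -(np.dot(w, y) + np.abs(np.diff(y)).sum())

def loss_subgrad(y, w):
    # Subgradient of f w.r.t. the outputs y; chain with dy/dtheta for backprop.
    s = np.sign(np.diff(y))      # sgn(y_i - y_{i-1}); 0 where outputs tie
    g = -w.astype(float)
    g = g.copy()
    g[1:] -= s                   # d/dy_i of -|y_i - y_{i-1}|
    g[:-1] += s                  # d/dy_{i-1} of -|y_{i+1} - y_i| term
    return g
```

Away from ties this agrees with the ordinary gradient, which you can confirm against finite differences.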

As you mention, standard online training methods (e.g. stochastic gradient descent) might not work so well because of the dependence between sequential outputs. You could try batch training instead: sweep through all data points to compute the full gradient, then update the parameters once per sweep.
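To make the batch idea concrete, here is a hedged sketch of one full-batch subgradient step, using a toy one-parameter "network" $g(x) = \sigma(\theta x)$ as a stand-in for your real model (the function `batch_step` and the learning rate are assumptions for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def batch_step(theta, x, w, lr=0.1):
    """One full-batch subgradient step for the toy network g(x) = sigmoid(theta*x)."""
    y = sigmoid(theta * x)            # all outputs for the whole sweep
    dy_dtheta = y * (1.0 - y) * x     # derivative of each output w.r.t. theta
    s = np.sign(np.diff(y))           # 0 at tied consecutive outputs
    df_dy = -w.astype(float)
    df_dy = df_dy.copy()
    df_dy[1:] -= s
    df_dy[:-1] += s
    grad = np.dot(df_dy, dy_dtheta)   # chain rule, summed over the batch
    return theta - lr * grad
```

With a real multi-parameter network, `dy_dtheta` becomes the per-output parameter gradients computed by backpropagation, but the structure of the update is the same.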
