Neural Network Optimization – Cost Function of a Neural Network Is Non-Convex

loss-functions, machine-learning, neural-networks

The cost function of a neural network is $J(W,b)$, and it is claimed to be non-convex. I don't quite understand why that is, since it seems quite similar to the cost function of logistic regression, right?

If it is non-convex, then the second-order derivative $\frac{\partial^2 J}{\partial W^2} < 0$, right?

UPDATE

Thanks to the answers below as well as @gung's comment, I get the point now: if there are no hidden layers at all, it's convex, just like logistic regression. But if there are hidden layers, then by permuting the nodes in the hidden layers, together with the weights in the subsequent connections, we can obtain multiple sets of weights that result in the same loss.
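For the no-hidden-layer case, the convexity claim can be checked directly: the Hessian of the logistic-regression negative log-likelihood is $X^\top S X$ with $S = \operatorname{diag}(p_i(1-p_i))$, which is positive semidefinite at every parameter value. Below is a minimal numpy sketch of that check, using made-up data.

```python
# Minimal sketch, assuming toy data: the logistic-regression Hessian
# X^T S X with S = diag(p_i (1 - p_i)) is positive semidefinite at any
# parameter vector, so its eigenvalues are never negative.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))        # made-up design matrix
w = rng.normal(size=3)               # an arbitrary parameter vector

p = 1.0 / (1.0 + np.exp(-X @ w))     # predicted probabilities
S = np.diag(p * (1.0 - p))
H = X.T @ S @ X                      # Hessian of the negative log-likelihood

print(np.linalg.eigvalsh(H))         # all eigenvalues >= 0 (up to rounding)
```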

Now, a couple more questions:

1) There are multiple local minima, and some of them should have the same value, since they correspond to permutations of the nodes and weights, right?

2) If the nodes and weights weren't permuted at all, then it's convex, right? And the minima will be the global minima. If so, the answer to 1) is that all those local minima have the same value, correct?

Best Answer

The cost function of a neural network is in general neither convex nor concave. This means that the matrix of all second partial derivatives (the Hessian) is neither positive semidefinite nor negative semidefinite. Since the second derivative is a matrix, it is possible for it to be neither one nor the other.
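This is easy to probe numerically. The sketch below (numpy only, with a made-up tiny architecture of 1 input, 2 tanh hidden units, 1 linear output, and squared-error loss on toy data) estimates the Hessian of the loss by finite differences at a few random parameter points; finding eigenvalues of both signs at even one point already rules out both convexity and concavity.

```python
# Rough numerical check, assuming a toy network and toy data: estimate the
# Hessian of the loss by central finite differences and inspect the signs
# of its eigenvalues at a few random parameter points.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(20, 1))                 # toy inputs
y = np.sin(3 * x)                            # toy targets

def loss(theta):
    W1, b1 = theta[0:2].reshape(2, 1), theta[2:4]    # input -> hidden
    W2, b2 = theta[4:6].reshape(1, 2), theta[6]      # hidden -> output
    h = np.tanh(x @ W1.T + b1)
    pred = h @ W2.T + b2
    return np.mean((pred - y) ** 2)

def numerical_hessian(f, theta, eps=1e-3):
    n = theta.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            tpp = theta.copy(); tpp[i] += eps; tpp[j] += eps
            tpm = theta.copy(); tpm[i] += eps; tpm[j] -= eps
            tmp = theta.copy(); tmp[i] -= eps; tmp[j] += eps
            tmm = theta.copy(); tmm[i] -= eps; tmm[j] -= eps
            H[i, j] = (f(tpp) - f(tpm) - f(tmp) + f(tmm)) / (4 * eps ** 2)
    return H

for _ in range(5):
    theta = rng.normal(size=7)
    eig = np.linalg.eigvalsh(numerical_hessian(loss, theta))
    # Eigenvalues of both signs at a point mean the Hessian there is
    # indefinite; typically several of these random points show exactly that.
    print(eig.min(), eig.max())
```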

To make this analogous to one-variable functions, one could say that the cost function is neither shaped like the graph of $x^2$ nor like the graph of $-x^2$. Another example of a non-convex, non-concave function is $\sin(x)$ on $\mathbb{R}$. One of the most striking differences is that $\pm x^2$ has only one extremum, whereas $\sin$ has infinitely many maxima and minima.

How does this relate to our neural network? The cost function $J(W,b)$ also has a number of local maxima and minima, as you can see in a plot of a typical loss surface, for example.

The fact that $J$ has multiple minima can also be interpreted in a nice way. In each layer you use multiple nodes, which are assigned different parameters to make the cost function small. Apart from the values of their parameters, these nodes are identical. So you could exchange the parameters of the first node in one layer with those of the second node in the same layer, and account for this change in the subsequent layers. You'd end up with a different set of parameters, but the value of the cost function would be indistinguishable (basically you just moved a node to another place, while keeping all of its inputs/outputs the same).
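To make the node-swapping argument concrete, here is a small numpy sketch (hypothetical one-hidden-layer network, made-up data): permuting two hidden units, i.e. their incoming weights and biases together with the matching output weights, yields a different parameter vector but exactly the same network function and loss.

```python
# Small sketch, assuming a toy one-hidden-layer network and toy data:
# swapping hidden units together with their outgoing weights leaves the
# loss unchanged, so distinct parameter sets give the same cost.
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=(50, 3))                            # toy inputs
y = rng.normal(size=(50, 1))                            # toy targets

W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)    # input -> hidden
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=1)    # hidden -> output

def loss(W1, b1, W2, b2):
    h = np.tanh(x @ W1.T + b1)
    return np.mean((h @ W2.T + b2 - y) ** 2)

perm = [1, 0, 2, 3]                                     # swap hidden units 0 and 1
print(loss(W1, b1, W2, b2))
print(loss(W1[perm], b1[perm], W2[:, perm], b2))        # identical value
```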
