Machine Learning – How Loss Functions of Neural Networks are Non-Convex

convex-analysis, functions, machine learning, neural networks

I have heard the following argument being made regarding Neural Networks:

  • A Neural Network is a composition of several Activation Functions
  • Sigmoid Activation Functions are Non-Convex Functions
  • The composition of Non-Convex Functions can produce a Non-Convex Function
  • Thus, Loss Functions for Neural Networks that contain several Sigmoid Activation Functions can be Non-Convex

Using the R programming language, I plotted the second derivative of the Sigmoid Function and we can see that it fails the Convexity Test (i.e. the second derivative can take both positive and negative values):

# Second derivative of the sigmoid sigma(x) = 1/(1 + exp(-x))
eq = function(x){ (-exp(-x)) * (1 + exp(-x))^-2 + exp(-x) * (-2 * (1 + exp(-x))^-3 * (-exp(-x))) }

x = -100:100
plot(x, eq(x), type='l', main = "Plot of Second Derivative of the Sigmoid Function")

[Plot: the second derivative of the sigmoid function takes both positive and negative values]

My Question: (If the above argument is in fact true) Can the same argument be extended to the lack of Convexity of Loss Functions of Neural Networks containing several "ReLU Activation Functions"?

On its own, the ReLU function is Convex.
Mathematically, we can show that the composition of Convex Functions is Convex provided the outer function is also non-decreasing (i.e. $g \circ f$ is convex when $g$ is convex and non-decreasing and $f$ is convex). I understand that the composition of two Convex Functions can produce a Concave Function (e.g. $-x^2$ is the composition of the convex functions $x \mapsto -x$ and $x \mapsto x^2$) – but it seems to me that the composition of two Convex Functions should not produce a "classic type of Non-Convex Function" (e.g. a function with several local minima and saddle points).
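For instance, here is a minimal R sketch (my own illustration, not part of the original argument) of why the monotonicity condition matters: $g(x) = e^{-x}$ is convex but decreasing, $f(x) = x^2$ is convex, and yet the composition $g(f(x)) = e^{-x^2}$ is a bell-shaped bump and hence not convex.

# g is convex but decreasing, f is convex; their composition is not convex.
g = function(x){ exp(-x) }
f = function(x){ x^2 }
h = function(x){ g(f(x)) }

x = seq(-3, 3, by = 0.01)
plot(x, h(x), type='l', main = "h(x) = exp(-x^2): a Non-Convex Composition of Convex Functions")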

However, for Neural Networks that contain compositions of (only) ReLU Activation Functions, it is unclear to me how a Loss Function built from (only) "ReLU Activation Functions" would be Non-Convex. (The only thing I can think of is that the Loss Function in Neural Networks is made of linear combinations of function compositions – and even though compositions of convex functions are convex under the condition above, perhaps the linear combination of compositions of convex functions might not necessarily be convex, for example when some of the coefficients are negative … but I am not sure about this.)
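As a toy illustration of that last point (my own sketch, not something from the original post): a linear combination of ReLU terms is guaranteed to be convex only when all coefficients are non-negative. With a negative coefficient, even a combination of two ReLU terms can be neither convex nor concave.

relu = function(x){ pmax(0, x) }

# n(x) = relu(x) - relu(x - 1) is 0 for x < 0, equal to x on [0, 1], and 1 for x > 1,
# so it has a convex kink (at x = 0) and a concave kink (at x = 1).
n = function(x){ relu(x) - relu(x - 1) }

x = seq(-2, 3, by = 0.01)
plot(x, n(x), type='l', main = "relu(x) - relu(x - 1): Neither Convex nor Concave")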


Can someone please comment on this? If compositions of Convex Functions can only produce Convex Functions – does this mean that the Loss Function of a Neural Network containing only ReLU Activation Functions can never be Non-Convex?

Thanks!

References:

Note: Using some informal logic, I do not think that the Loss Functions of Neural Networks containing ReLU Activation Functions are generally Convex. This is because ReLU (style) Activation Functions are some of the most common types of activation functions in use – yet the same difficulties concerning non-convex optimization still remain. Thus, I would like to think that Neural Networks with ReLU Activation Functions are still generally non-convex.

For example, below we can see the Loss Surfaces of (Modern) Neural Networks that clearly look Non-Convex:

[Image: loss surface visualization of a modern neural network]

(https://www.cs.umd.edu/~tomg/projects/landscapes/)

Best Answer

In the following, I present an example showing that the loss of a ReLU network is in general not a convex function of the weights of the network. This is the practically relevant statement, since what you are doing in practice is to choose the weights of the network in such a way that the (empirical) loss is minimized.

I will consider the setting of shallow ReLU networks (with one hidden layer) with input dimension $d=1$ and output dimension $k=1$ and with two neurons in the hidden layer, and using the identity function as the activation function in the last layer. I will denote the ReLU by $\rho$, i.e., $\rho(x) = \max \{0,x\}$.

The weights $\theta$ of such a network are arranged as $\theta = \big( (A_1, b_1), (A_2, b_2) \big)$, where $A_1 \in \mathbb{R}^{2 \times 1}$, $b_1 \in \mathbb{R}^2$, $A_2 \in \mathbb{R}^{1 \times 2}$, and $b_2 \in \mathbb{R}^1$; the output of the network is then given by $$ N_\theta (x) = b_2 + (A_2)_{1,1} \cdot \rho\big( (A_1)_{1,1} x + (b_1)_1 \big) + (A_2)_{1,2} \cdot \rho\big( (A_1)_{2,1} x + (b_1)_2 \big) . $$
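For concreteness, here is a small R sketch of this network (my own addition, not part of the original answer; the list names A1, b1, A2, b2 are simply my labels for the weights above):

relu = function(x){ pmax(0, x) }

# theta is a list with A1 (2 x 1 matrix), b1 (length-2 vector), A2 (1 x 2 matrix), b2 (scalar).
N = function(theta, x){
  hidden = relu(theta$A1 %*% x + theta$b1)    # two hidden ReLU neurons
  as.numeric(theta$A2 %*% hidden + theta$b2)  # identity activation on the output
}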

I will consider the square loss (any other reasonable loss would also work) with target labels $(x_1, y_1) = (1, 1)$ and $(x_2, y_2) = (-1, -1)$. Thus, the function we want to minimize is $$ \mathcal{L}: \quad \mathbb{R}^{7} \cong \mathbb{R}^{2 \times 1} \times \mathbb{R}^2 \times \mathbb{R}^{1 \times 2} \times \mathbb{R}^1 \to \mathbb{R}, \quad \theta \mapsto \mathcal{L}(\theta) = \frac{1}{2} \sum_{i=1}^2 \bigl(N_\theta (x_i) - y_i\bigr)^2 . $$
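Continuing the sketch above (again my own illustration), the corresponding empirical square loss on the two training points is:

# Square loss on the points (x1, y1) = (1, 1) and (x2, y2) = (-1, -1).
L = function(theta){
  xs = c(1, -1); ys = c(1, -1)
  0.5 * sum(mapply(function(x, y){ (N(theta, x) - y)^2 }, xs, ys))
}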

Now, note that $x = \rho(x) - \rho(-x)$ for all $x \in \mathbb{R}$. By a direct calculation, this shows that if we set $$ \theta^{(j)} = \big( (A_1^{(j)}, b_1^{(j)}), (A_2^{(j)}, b_2^{(j)}) \big) $$ for $j \in \{ 1,2 \}$, where $$ A_1^{(1)} = (1, -1)^T, \quad b_1^{(1)} = (0, 0)^T, \quad A_2^{(1)} = (1, -1), \quad b_2^{(1)} = (0) $$ and $$ A_1^{(2)} = (-1, 1)^T, \quad b_1^{(2)} = (0, 0)^T, \quad A_2^{(2)} = (-1, 1), \quad b_2^{(2)} = (0) , $$ then $N_{\theta^{(1)}} (x) = x = N_{\theta^{(2)}}(x)$ for all $x \in \mathbb{R}$, and this implies $\mathcal{L}(\theta^{(1)}) = 0 = \mathcal{L}(\theta^{(2)})$, meaning that both $\theta^{(1)}$ and $\theta^{(2)}$ are global minimizers of $\mathcal{L}$.

If $\mathcal{L}$ were convex, this would imply that $\theta^{\ast} := \frac{1}{2} \theta^{(1)} + \frac{1}{2} \theta^{(2)}$ is also a global minimizer of $\mathcal{L}$. But it is easy to see that $\theta^{\ast} = \big( (A_1,b_1), (A_2,b_2) \big)$ with $$ A_1 = (0,0)^T, \quad b_1 = (0,0)^T, \quad A_2 = (0,0), \quad b_2 = (0) , $$ from which it follows that $N_{\theta^\ast}(x) = 0$ for all $x \in \mathbb{R}$, and this easily implies $\mathcal{L}(\theta^\ast) > 0$ (in fact, $\mathcal{L}(\theta^\ast) = 1$).
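Using the helper functions N and L sketched above, this argument can also be checked numerically (my own verification, not part of the original answer):

theta1 = list(A1 = matrix(c(1, -1), 2, 1), b1 = c(0, 0), A2 = matrix(c(1, -1), 1, 2), b2 = 0)
theta2 = list(A1 = matrix(c(-1, 1), 2, 1), b1 = c(0, 0), A2 = matrix(c(-1, 1), 1, 2), b2 = 0)

# Midpoint of the two global minimizers: every weight averages out to zero.
theta_mid = list(A1 = matrix(0, 2, 1), b1 = c(0, 0), A2 = matrix(0, 1, 2), b2 = 0)

L(theta1)    # 0
L(theta2)    # 0
L(theta_mid) # 1, strictly larger, so L cannot be convex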
