In the following, I present an example showing that the loss
of a ReLU network is in general not a convex function of the weights of the network.
This is the practically relevant statement, since in practice one chooses the weights
of the network precisely so as to minimize the (empirical) loss.
I will consider the setting of shallow ReLU networks (with one hidden layer)
with input dimension $d=1$ and output dimension $k=1$ and with two neurons in the hidden layer,
and using the identity function as the activation function in the last layer.
I will denote the ReLU by $\rho$, i.e., $\rho(x) = \max \{0,x\}$.
The weights $\theta$ of such a network are arranged as
$\theta = \big( (A_1, b_1), (A_2, b_2) \big)$, where $A_1 \in \mathbb{R}^{2 \times 1}$, $b_1 \in \mathbb{R}^2$,
$A_2 \in \mathbb{R}^{1 \times 2}$, and $b_2 \in \mathbb{R}^1$; the output of the network is then given by
$$
N_\theta (x)
= b_2
+ (A_2)_{1,1} \cdot \rho\big( (A_1)_{1,1} x + (b_1)_1 \big)
+ (A_2)_{1,2} \cdot \rho\big( (A_1)_{2,1} x + (b_1)_2 \big)
.
$$
I will consider the square loss (any other reasonable loss would work just as well) on the two training samples
$(x_1, y_1) = (1, 1)$ and $(x_2, y_2) = (-1, -1)$.
Thus, the function we want to minimize is
$$
\mathcal{L}: \quad
\mathbb{R}^{7}
\cong \mathbb{R}^{2 \times 1} \times \mathbb{R}^2 \times \mathbb{R}^{1 \times 2} \times \mathbb{R}^1
\to \mathbb{R},
\quad \theta \mapsto \mathcal{L}(\theta) = \frac{1}{2} \sum_{i=1}^2 \bigl(N_\theta (x_i) - y_i\bigr)^2
.
$$
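As a sanity check, this setup can be written down in a few lines of Python (a minimal sketch; the parameter layout mirrors the definition of $\theta$ above):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def network(theta, x):
    """Shallow ReLU network with one hidden layer of 2 neurons, d = k = 1."""
    A1, b1, A2, b2 = theta  # shapes: (2,1), (2,), (1,2), (1,)
    hidden = relu(A1.flatten() * x + b1)   # shape (2,)
    return b2[0] + A2.flatten() @ hidden   # scalar output

def loss(theta, data=((1.0, 1.0), (-1.0, -1.0))):
    """Square loss over the training samples (x1,y1) = (1,1), (x2,y2) = (-1,-1)."""
    return 0.5 * sum((network(theta, x) - y) ** 2 for x, y in data)
```

With the weights $\theta^{(1)}$ defined below, `network(theta, x)` evaluates to $\rho(x) - \rho(-x) = x$, so the loss vanishes.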
Now, note that $x = \rho(x) - \rho(-x)$ for all $x \in \mathbb{R}$.
By a direct calculation, this shows that if we set
$$
\theta^{(j)} = \big( (A_1^{(j)}, b_1^{(j)}), (A_2^{(j)}, b_2^{(j)}) \big)
$$
for $j \in \{ 1,2 \}$, where
$$
A_1^{(1)} = (1, -1)^T, \quad
b_1^{(1)} = (0, 0)^T, \quad
A_2^{(1)} = (1, -1), \quad
b_2^{(1)} = (0)
$$
and
$$
A_1^{(2)} = (-1, 1)^T, \quad
b_1^{(2)} = (0, 0)^T, \quad
A_2^{(2)} = (-1, 1), \quad
b_2^{(2)} = (0)
,
$$
then $N_{\theta^{(1)}} (x) = x = N_{\theta^{(2)}}(x)$ for all $x \in \mathbb{R}$,
and this implies $\mathcal{L}(\theta^{(1)}) = 0 = \mathcal{L}(\theta^{(2)})$,
meaning that both $\theta^{(1)}$ and $\theta^{(2)}$ are global minimizers of $\mathcal{L}$.
If $\mathcal{L}$ were convex, this would imply that
$\theta^{\ast} := \frac{1}{2} \theta^{(1)} + \frac{1}{2} \theta^{(2)}$
would also be a global minimizer of $\mathcal{L}$.
But it is easy to see that $\theta^{\ast} = \big( (A_1,b_1), (A_2,b_2) \big)$ with
$$
A_1 = (0,0)^T, \quad
b_1 = (0,0)^T, \quad
A_2 = (0,0), \quad
b_2 = (0)
,
$$
from which it follows that $N_{\theta^\ast}(x) = 0$ for all $x \in \mathbb{R}$, and hence
$\mathcal{L}(\theta^\ast) = \frac{1}{2} \big( (0 - 1)^2 + (0 - (-1))^2 \big) = 1 > 0$,
so $\mathcal{L}$ is not convex.
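The whole argument can also be verified numerically (a small self-contained sketch; the midpoint of the two minimizers is taken component-wise, exactly as in the convexity argument above):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def network(theta, x):
    A1, b1, A2, b2 = theta
    return b2[0] + A2.flatten() @ relu(A1.flatten() * x + b1)

def loss(theta):
    data = ((1.0, 1.0), (-1.0, -1.0))
    return 0.5 * sum((network(theta, x) - y) ** 2 for x, y in data)

# The two global minimizers from the text.
theta1 = (np.array([[1.0], [-1.0]]), np.zeros(2), np.array([[1.0, -1.0]]), np.zeros(1))
theta2 = (np.array([[-1.0], [1.0]]), np.zeros(2), np.array([[-1.0, 1.0]]), np.zeros(1))

# Their component-wise midpoint: every parameter vanishes.
theta_mid = tuple(0.5 * (p1 + p2) for p1, p2 in zip(theta1, theta2))

print(loss(theta1), loss(theta2), loss(theta_mid))  # 0.0 0.0 1.0
```

The midpoint network is identically zero, so its loss is strictly positive while both endpoints attain loss zero, contradicting convexity.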
I think you might be mixing some things here. In general, gradient descent is probably the simplest optimization algorithm there is, and it requires very few assumptions. You do have to assume that the function is continuously differentiable. However, if you seek to ensure convergence to a global minimum or you wish to derive the complexity of the algorithm, things get a bit more complicated.
If you make no further assumptions on the function, gradient descent can only be guaranteed to approach a stationary point (a local minimum, a local maximum, or a saddle point), if such a point exists. Stronger assumptions, such as convexity or Lipschitz continuity of the gradient, yield additional guarantees, for example convergence to a global minimum or explicit convergence rates.
Note that the cost functions arising in neural-network training satisfy none of these stronger assumptions; they may not even be continuously differentiable. Neural networks therefore come with no such theoretical guarantees, although in practice training typically converges to a local minimum or a saddle point, with an unknown rate of convergence. Moreover, in practice one uses a variant of gradient descent called stochastic gradient descent (SGD), which processes the data in mini-batches at every iteration, since computing and storing the full gradient may be too expensive in terms of memory and computation.
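The mini-batch idea behind SGD can be sketched as follows (a toy implementation on a simple least-squares problem rather than the neural-network loss above; the batch size, step size, and number of epochs are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linear-regression data: y = 2x + small noise.
X = rng.normal(size=200)
y = 2.0 * X + 0.1 * rng.normal(size=200)

w, lr, batch_size = 0.0, 0.1, 32
for epoch in range(50):
    perm = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        # Gradient of the squared error, computed on the current mini-batch only.
        grad = np.mean((w * X[idx] - y[idx]) * X[idx])
        w -= lr * grad

print(w)  # close to 2.0
```

Each update uses only a small batch of samples, so the per-step cost is independent of the dataset size; the price is that each step follows a noisy estimate of the true gradient.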