In the following, I present an example showing that the loss
of a ReLU network is in general not a convex function of the weights of the network.
This is the practically relevant statement, since in practice one chooses the weights of the network precisely so as to minimize the (empirical) loss.
I will consider the setting of shallow ReLU networks (with one hidden layer) with input dimension $d=1$, output dimension $k=1$,
two neurons in the hidden layer, and the identity function as the activation function in the last layer.
I will denote the ReLU by $\rho$, i.e., $\rho(x) = \max \{0,x\}$.
The weights $\theta$ of such a network are arranged as
$\theta = \big( (A_1, b_1), (A_2, b_2) \big)$, where $A_1 \in \mathbb{R}^{2 \times 1}$, $b_1 \in \mathbb{R}^2$,
$A_2 \in \mathbb{R}^{1 \times 2}$, and $b_2 \in \mathbb{R}^1$; the output of the network is then given by
$$
N_\theta (x)
= b_2
+ (A_2)_{1,1} \cdot \rho\big( (A_1)_{1,1} x + (b_1)_1 \big)
+ (A_2)_{1,2} \cdot \rho\big( (A_1)_{2,1} x + (b_1)_2 \big)
.
$$
I will consider the square loss (any other reasonable loss would also work) with the two training samples
$(x_1, y_1) = (1, 1)$ and $(x_2, y_2) = (-1, -1)$.
Thus, the function we want to minimize is
$$
\mathcal{L}: \quad
\mathbb{R}^{7}
\cong \mathbb{R}^{2 \times 1} \times \mathbb{R}^2 \times \mathbb{R}^{1 \times 2} \times \mathbb{R}^1
\to \mathbb{R},
\quad \theta \mapsto \mathcal{L}(\theta) = \frac{1}{2} \sum_{i=1}^2 \bigl(N_\theta (x_i) - y_i\bigr)^2
.
$$
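Written out for these two samples, this is
$$
\mathcal{L}(\theta)
= \frac{1}{2} \bigl(N_\theta (1) - 1\bigr)^2
+ \frac{1}{2} \bigl(N_\theta (-1) + 1\bigr)^2
.
$$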
Now, note that $x = \rho(x) - \rho(-x)$ for all $x \in \mathbb{R}$.
By a direct calculation, this shows that if we set
$$
\theta^{(j)} = \big( (A_1^{(j)}, b_1^{(j)}), (A_2^{(j)}, b_2^{(j)}) \big)
$$
for $j \in \{ 1,2 \}$, where
$$
A_1^{(1)} = (1, -1)^T, \quad
b_1^{(1)} = (0, 0)^T, \quad
A_2^{(1)} = (1, -1), \quad
b_2^{(1)} = (0)
$$
and
$$
A_1^{(2)} = (-1, 1)^T, \quad
b_1^{(2)} = (0, 0)^T, \quad
A_2^{(2)} = (-1, 1), \quad
b_2^{(2)} = (0)
,
$$
then $N_{\theta^{(1)}} (x) = x = N_{\theta^{(2)}}(x)$ for all $x \in \mathbb{R}$,
and this implies $\mathcal{L}(\theta^{(1)}) = 0 = \mathcal{L}(\theta^{(2)})$,
meaning that both $\theta^{(1)}$ and $\theta^{(2)}$ are global minimizers of $\mathcal{L}$.
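Indeed, a direct calculation gives
$$
N_{\theta^{(1)}}(x)
= 0 + 1 \cdot \rho(x + 0) + (-1) \cdot \rho(-x + 0)
= \rho(x) - \rho(-x)
= x ,
$$
and $\theta^{(2)}$ merely swaps the roles of the two hidden neurons, so the same calculation applies.
Since $\mathcal{L} \geq 0$ everywhere, any choice of weights with zero loss is automatically a global minimizer.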
If $\mathcal{L}$ were convex, then also
$\theta^{\ast} := \frac{1}{2} \theta^{(1)} + \frac{1}{2} \theta^{(2)}$
would be a global minimizer of $\mathcal{L}$, since convexity would give
$\mathcal{L}(\theta^{\ast}) \leq \frac{1}{2} \mathcal{L}(\theta^{(1)}) + \frac{1}{2} \mathcal{L}(\theta^{(2)}) = 0$.
But it is easy to see that $\theta^{\ast} = \big( (A_1,b_1), (A_2,b_2) \big)$ with
$$
A_1 = (0,0)^T, \quad
b_1 = (0,0)^T, \quad
A_2 = (0,0), \quad
b_2 = (0)
,
$$
from which it follows that $N_{\theta^\ast}(x) = 0$ for all $x \in \mathbb{R}$, and hence
$\mathcal{L}(\theta^\ast) = \frac{1}{2} \bigl( (0 - 1)^2 + (0 - (-1))^2 \bigr) = 1 > 0$.
Thus $\theta^{\ast}$ is not a global minimizer of $\mathcal{L}$, and therefore $\mathcal{L}$ is not convex.
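As a quick numerical sanity check, here is a minimal NumPy sketch of this seven-parameter network and its loss, evaluated at $\theta^{(1)}$, $\theta^{(2)}$, and their midpoint; the helper names `relu`, `net`, and `loss` are my own, not from any library.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def net(theta, x):
    # theta = (A1, b1, A2, b2) with shapes (2, 1), (2,), (1, 2), (1,)
    A1, b1, A2, b2 = theta
    return b2[0] + A2[0, :] @ relu(A1[:, 0] * x + b1)

def loss(theta):
    # square loss on the two samples (1, 1) and (-1, -1)
    return 0.5 * sum((net(theta, x) - y) ** 2 for x, y in [(1.0, 1.0), (-1.0, -1.0)])

theta1 = (np.array([[1.0], [-1.0]]), np.zeros(2), np.array([[1.0, -1.0]]), np.zeros(1))
theta2 = (np.array([[-1.0], [1.0]]), np.zeros(2), np.array([[-1.0, 1.0]]), np.zeros(1))
theta_mid = tuple(0.5 * (p + q) for p, q in zip(theta1, theta2))  # entrywise midpoint

print(loss(theta1), loss(theta2), loss(theta_mid))  # prints 0.0 0.0 1.0
```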
Best Answer
Suppose $f,g$ are twice differentiable, convex, nonnegative, and either both nondecreasing or both nonincreasing. Then $fg$ is convex.
Indeed, by the general Leibniz rule we have
$$
\frac{d^2}{dx^2}(fg) = f''g + f g'' + 2 f' g' \geq 0 ,
$$
since $f'', g'' \geq 0$ by convexity, $f' g' \geq 0$ because the two functions have the same monotonicity, and $f, g \geq 0$ by hypothesis. (At least in one dimension, the condition of being twice differentiable is not necessary; see Exercise 3.32 (a) of Convex Optimization by S. Boyd and L. Vandenberghe.)
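As an illustration (my own example, not part of the original argument) of why the same-monotonicity requirement is needed: $f(x) = x^2$ and $g(x) = (x-1)^2$ are both convex and nonnegative on $\mathbb{R}$, but they are not both nondecreasing or both nonincreasing, and their product $h(x) = x^2 (x-1)^2 = x^4 - 2x^3 + x^2$ has $h''(x) = 12x^2 - 12x + 2$, so $h''(1/2) = -1 < 0$ and $h$ is not convex.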
Another sufficient (but not necessary) condition is that the two functions are log-convex, since in that case you can write the product as
$$
f(x) g(x) = e^{\log(f(x)) + \log(g(x))}
.
$$
As the sum of two convex functions is convex, the product is log-convex, and hence convex.
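For instance (again my own illustration), $f(x) = e^{-x}$ and $g(x) = e^{x}$ are both log-convex but have opposite monotonicity, so the first criterion does not apply; nevertheless their product $f(x) g(x) = 1$ is convex, as guaranteed by log-convexity.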
I don't think one can do much better than this.