[Math] Study the convexity of Mean Squared Error with regularization

convex-optimization · convex-analysis · neural-networks · non-convex-optimization

I want to study the convexity of the Mean Squared Error loss function with regularization. The output is computed by an artificial neural network.

$$E(w) = MSE(w) = \frac{1}{\mid D \mid}\sum_{d \in D} E_d(w)$$

$E_d$ is the error for a pattern $d$, defined as

$$E_{d}(w)=\frac{1}{2}\sum_{j \in outputLayer}^{} (t_{j}-o_{j}(w))^2 + \frac{1}{2}\lambda||w||^2$$

where $t_j$ and $o_j$ are respectively the target and the output of the $j^{th}$ output neuron of the network, $w$ is the vector of network parameters, and $\lambda > 0$ is the regularization coefficient.
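For concreteness, here is a minimal sketch of how $E_d(w)$ could be evaluated for a hypothetical one-hidden-layer network (the layer sizes, the tanh hidden activation, the linear output layer, and the choice to regularize only the weight matrices are my own assumptions, not given above):

```python
import numpy as np

def per_pattern_error(W1, b1, W2, b2, x, t, lam=0.1):
    """E_d(w) for a hypothetical one-hidden-layer tanh network."""
    h = np.tanh(W1 @ x + b1)             # hidden layer: f(W.x + b), element-wise
    o = W2 @ h + b2                      # linear output layer, giving o_j(w)
    sq_err = 0.5 * np.sum((t - o) ** 2)  # 1/2 * sum_j (t_j - o_j(w))^2
    reg = 0.5 * lam * (np.sum(W1 ** 2) + np.sum(W2 ** 2))  # 1/2 * lambda * ||w||^2 (weights only, by assumption)
    return sq_err + reg

# Tiny example: 3 inputs, 4 hidden units, 2 output neurons.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x, t = rng.normal(size=3), np.array([1.0, -1.0])
print(per_pattern_error(W1, b1, W2, b2, x, t))
```

The full objective $E(w)$ is then just the average of this quantity over all patterns $d \in D$.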

Convexity study

Intuitively, I would say that the function is neither convex nor concave, since the loss surface of a neural network typically has several distinct local minima. In detail:

  • $E_d(w)$ is a nonnegative weighted sum of two terms, and such a sum preserves convexity when both terms are convex.
  • The second term, $\frac{1}{2}\lambda||w||^2$, is convex, since the squared Euclidean norm is convex and $\lambda$ is positive.
  • The first term, $\frac{1}{2}\sum_{j \in outputLayer}^{} (t_{j}-o_{j}(w))^2$, is neither convex nor concave because of the nonlinearities inside the neural network…

For example, the first layer of the neural network computes

$$g(x) = f(W \cdot x + b)$$

where $W \in R^{m \times n}$ is the first weight matrix, $x \in R^n$ is the network input, $b \in R^m$ is the bias and $f: R^m \rightarrow R^m$ is the element-wise activation function.
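As a small sketch of this first-layer map (the sizes and values below are made up purely for illustration):

```python
import numpy as np

def first_layer(W, b, x, f):
    """g(x) = f(W @ x + b), with f applied element-wise."""
    return f(W @ x + b)

relu = lambda z: np.maximum(z, 0.0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

W = np.array([[1.0, -2.0],
              [0.5,  0.3],
              [-1.0, 1.0]])        # m = 3, n = 2
b = np.array([0.1, -0.2, 0.0])
x = np.array([1.0, 2.0])
print(first_layer(W, b, x, relu))
print(first_layer(W, b, x, sigmoid))
```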

This is where my analysis breaks down: I simply say that the whole function is neither convex nor concave, depending on the activation function:

  • if $f := identity$, then $g$ is affine, hence both convex and concave on $R^n$
  • if $f := ReLU$, then each component of $g$ is convex on $R^n$ (a convex, nondecreasing function composed with an affine map)
  • if $f := Sigmoid$, then each component of $g$ is convex where the pre-activation $W \cdot x + b$ is negative and concave where it is positive
  • if $f := tanh$, then each component of $g$ behaves the same way: convex for negative pre-activations, concave for positive ones (see the numerical check after this list)
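One simple way to back up the last two claims is to look at the sign of the activation's second derivative, e.g. $\sigma''(z)=\sigma(z)(1-\sigma(z))(1-2\sigma(z))$, which is positive for $z<0$ and negative for $z>0$. A quick numerical sketch using finite differences (the sample points are arbitrary):

```python
import numpy as np

def second_derivative(f, z, h=1e-4):
    """Central finite-difference estimate of f''(z)."""
    return (f(z + h) - 2.0 * f(z) + f(z - h)) / h ** 2

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for f, name in [(sigmoid, "sigmoid"), (np.tanh, "tanh")]:
    for z in (-2.0, -0.5, 0.5, 2.0):
        print(f"{name}''({z:+.1f}) = {second_derivative(f, z):+.4f}")
# The estimates are positive for z < 0 (convex there) and negative for z > 0 (concave there).
```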

I don't believe this analysis is sufficient; how can I improve it?

Best Answer

Here's at least a partial answer to what you're asking, I think.

You want to know whether or not the function $E$ is convex in $w$ if we take the activation function $f$ to be $\operatorname{ReLU}$.

The answer is no: consider a very simple network with two inputs (and a bias $b=1$) and one output. Suppose we have just one input pattern (i.e. $|D|=1$), which is the vector $x=(1,\ 1)$. Suppose the desired (target) output is $t=10$. The network is parameterized by the weight vector $w=(w_1,\ w_2)\in\mathbb{R}^2$. If we take $\lambda=1$, the error function is given by

\begin{align*} E(w)&=\frac{1}{2}\big(t-\max\{w^\text{T}x+b,0\}\big)^2+\frac{1}{2}\lambda\|w\|^2\\ &=\frac{1}{2}\big(10-\max\{w_1+w_2+1,0\}\big)^2+\frac{1}{2}(w_1^2+w_2^2). \end{align*}

However, we can plainly see that this function is not convex:

(Plot: non-convex error surface of $E$ for this simple network.)
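Besides the plot, non-convexity can be verified numerically by exhibiting two points whose midpoint violates Jensen's inequality $E\big(\frac{u+v}{2}\big) \le \frac{E(u)+E(v)}{2}$. A minimal sketch for the error function above (the two test points are simply ones I picked so that the segment between them crosses the ReLU kink):

```python
import numpy as np

def E(w, x=np.array([1.0, 1.0]), b=1.0, t=10.0, lam=1.0):
    """E(w) = 1/2*(t - max(w.x + b, 0))^2 + 1/2*lam*||w||^2, from the example above."""
    return 0.5 * (t - max(w @ x + b, 0.0)) ** 2 + 0.5 * lam * (w @ w)

u = np.array([-1.0, -1.0])    # pre-activation w.x + b = -1, clipped to 0 by the ReLU
v = np.array([0.0, 0.0])      # pre-activation = 1
mid = 0.5 * (u + v)           # midpoint sits right on the ReLU kink

print(E(mid))                  # 50.25
print(0.5 * (E(u) + E(v)))     # 45.75 < E(mid), so Jensen's inequality fails and E is not convex
```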
