Neural Network Optimization – Why Can’t a Penalty Make the Neural Network Objective Convex?

machine-learning, neural-networks, optimization, references

When we use neural networks to solve various tasks, we define an objective that the network parameters $\theta=(\theta_i)_{1\le i\le N}\in\Theta^N$ have to minimize. So, for a neural network $f(\cdot|\theta)$ parameterized by the array $(\theta_i)_i$, the objective can be written as follows:
$$\hat \theta:=\arg\min_{\theta\in\Theta^N} \mathcal L[f(\cdot|\theta )]\tag1$$
where $\mathcal L$ is a loss function that we would like to be as small as possible.

A well-known fact about neural networks is that, in essentially all non-trivial cases, the objective $(1)$ is highly non-convex and thus hard to optimize. It is true that SGD and its variants do a very good job, but we basically have no guarantees (as far as I know) against getting trapped in a bad local minimum, or getting stuck in a saddle-point region, which can make training very slow.

A (probably naive) "solution" to these issues that occurred to me is to "convexify" the objective by adding a large enough strictly convex regularization term. In other words, solving the following objective:
$$\hat \theta:=\arg\min_{\theta\in\Theta^N} \mathcal L[f(\cdot|\theta) ] + \lambda\mathcal P[\theta ]\tag2 $$
where $\lambda$ is a real hyperparameter and $\mathcal P$ is a strictly convex penalty (typically, $\mathcal P:\theta\mapsto\|\theta\|_2^2$).
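To fix ideas, here is a minimal numpy sketch of objective $(2)$, assuming a one-hidden-layer $\tanh$ network with squared loss; the function and variable names are illustrative, not from any library:

```python
import numpy as np

def objective(theta, X, y, lam, n_hidden):
    """Objective (2) for a tiny 1-hidden-layer tanh network:
    squared loss plus the strictly convex penalty lam * ||theta||_2^2."""
    d = X.shape[1]
    # Unpack the flat parameter vector theta into the two weight layers
    # (biases omitted to keep the sketch short).
    W1 = theta[:d * n_hidden].reshape(d, n_hidden)  # input -> hidden
    w2 = theta[d * n_hidden:]                       # hidden -> output
    preds = np.tanh(X @ W1) @ w2          # network output f(X | theta)
    loss = np.mean((preds - y) ** 2)      # L[f(. | theta)]
    return loss + lam * np.sum(theta ** 2)  # + lam * P[theta]
```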

Now, if the set $\Theta$ is compact (bounded, really) and $\mathcal L$ is twice differentiable w.r.t. $\theta$, it is not hard to show that, for $\lambda$ large enough, the problem $(2)$ becomes strictly convex, which is appealing for a number of reasons:

  • Convex optimization is much easier than non-convex optimization, so optimization algorithms could achieve much better performance in this case.
  • We are guaranteed that $(2)$ has a unique global minimizer and that we converge to it, so we are guaranteed to find the network with the "best possible performance".
  • Aside from convexity, the penalty term $\lambda\mathcal P$ acts as a regularizer: it penalizes overly complex solutions and makes the network less likely to overfit, which is a very desirable property.

Due to these observations, I was expecting this method (which I would call strongly regularized networks) to have been well studied in the literature, but surprisingly I haven't been able to find anything so far. I am aware that this method is perhaps a bit naive; here are some points against it that I can think of:

  • Although $\Theta$, and thus the Hessian of $\mathcal L$, are bounded, in practice the minimum $\lambda$ which ensures strong convexity of $(2)$ might have to be far too large for the solutions to be usable. This makes sense, although I still think one could find practical applications in which $\lambda$ wouldn't have to be that large.
  • A strongly convex penalty such as $\theta \mapsto \|\theta\|_2^2$ is not that desirable because it does not encourage sparsity, which seems to be one of the main things people are looking for when dealing with deep nets.
  • This is, in some sense, already equivalent to existing and well-studied regularization methods (here I am particularly thinking about the equivalence between Tikhonov and Ivanov regularization). If that is the case, I would like some references highlighting this, since I haven't been able to find any.

So, my question is: Has the setting $\mathbf{(2)}$ been well studied in the literature? If not, why? If so, can you please provide some references?

Thank you.


Update: Following @Sycorax's great answer below, I still have some concerns:

  • I understand the non-identifiability argument given below, and it seemingly implies that even if $\mathcal P$ is the squared $\ell_2$ penalty, $(2)$ will still not be convex. I would however like to know what is wrong with the following argument:

Denote by $\nabla^2\mathcal L$ the Hessian matrix of $\mathcal L$. It is well defined and bounded on $\Theta^N$ (again, not a big assumption: for common loss functions and activations, the second derivative is known to exist and be continuous). If $\mathcal P$ is the squared $\ell_2$ norm, then the Hessian matrix of $\lambda \mathcal P$ is $2\lambda\mathbf I_N$. Therefore, the Hessian matrix of the objective $(2)$ is $\nabla^2\mathcal L + 2\lambda\mathbf I_N$. Now, a sufficient condition for $(2)$ to be a strictly convex problem is that this Hessian be positive definite everywhere on the domain, i.e. that
$$v^\top\big(\nabla^2 \mathcal L(\theta)+2\lambda\mathbf I_N\big)v >0,\quad\forall\theta\in\Theta^N,\ \forall v\in\mathbb R^N\setminus\{0\}$$
Now, because the domain $\Theta^N$ is bounded, the eigenvalues of $\nabla^2 \mathcal L$ are bounded on the domain as well, so taking $\lambda$ such that $2\lambda + \psi^*>0$, where $\psi^*:=\inf_{\theta\in\Theta^N}\min\operatorname{eig}\big(\nabla^2 \mathcal L(\theta)\big)$, guarantees that the above inequality is always satisfied, and thus strict convexity of the problem. However, this directly contradicts the non-identifiability argument given in the answer below. What is wrong with this argument?
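As a numerical sanity check on the eigenvalue-shift step, here is a small numpy sketch, with a random symmetric matrix standing in for $\nabla^2\mathcal L$ at one fixed $\theta$ (compactness is what would let the same $\lambda$ work uniformly over $\Theta^N$):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
H = (A + A.T) / 2                      # symmetric stand-in for the Hessian of L
psi_min = np.linalg.eigvalsh(H).min()  # smallest eigenvalue (may be negative)

lam = max(0.0, -psi_min / 2) + 1e-3    # any lam with 2*lam + psi_min > 0
H_reg = H + 2 * lam * np.eye(5)        # Hessian of L + lam * ||theta||_2^2
print(np.linalg.eigvalsh(H_reg).min() > 0)  # True: shifted Hessian is positive definite
```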

  • Even if the squared $\ell_2$ penalty doesn't solve the problem of convexity, my question still stands. Indeed, the non-identifiability issue could easily be fixed by letting $\epsilon = (\epsilon_i)_{1\le i\le N}$ be fixed (small) hyperparameters and defining
    $$\mathcal P : \theta \mapsto \sum_{i=1}^N (\theta_i-\epsilon_i)^2$$
    This way, with a large enough $\lambda$, convexity would be guaranteed by a similar argument as above, and if the $\epsilon_i$ are distinct then the non-identifiability issue disappears as well (in a way, it's like slightly perturbing a matrix to ensure it is invertible). Has this been studied? Is this a ridiculous idea?
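For what it's worth, here is a quick numpy check of the symmetry-breaking idea: under a permutation of coordinates the plain squared norm is unchanged, while the shifted penalty with distinct $\epsilon_i$ generally is not (the $\epsilon$ values below are arbitrary illustrations):

```python
import numpy as np

theta = np.array([1.0, -2.0, 0.5])
perm = np.array([2, 0, 1])          # a permutation of the coordinates
eps = np.array([0.01, 0.02, 0.03])  # fixed, distinct offsets

l2 = lambda t: np.sum(t ** 2)
shifted = lambda t: np.sum((t - eps) ** 2)

print(l2(theta) == l2(theta[perm]))            # True: ||theta||^2 is permutation-invariant
print(shifted(theta) == shifted(theta[perm]))  # False: the shifted penalty is not
```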

Best Answer

Your example of $\mathcal{P}=\| \theta \|_2^2$ is called L2 regularization (or "weight decay"), and it has been very well studied in the context of neural networks (for example, Goodfellow et al., Deep Learning (2016), or Hastie et al., The Elements of Statistical Learning (2009)). It can make problems that are not strictly convex into problems that are strictly convex. One example of this is a neural network with 0 hidden layers and 1 or more output neurons; this is typically called ridge regression, and for any $\lambda >0$ there is exactly one $\theta$ that minimizes the loss in $(2)$. Ridge regression is one method to estimate unique regression coefficients when the design matrix is rank-deficient.
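To make the 0-hidden-layer case concrete, here is a minimal numpy sketch (the data are random placeholders): even when the design matrix is rank-deficient, the ridge system $(X^\top X + \lambda I)\theta = X^\top y$ has exactly one solution for any $\lambda > 0$.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((10, 3))
X = np.column_stack([X, X[:, 0]])  # duplicate a column: rank-deficient design
y = rng.standard_normal(10)

lam = 0.1
# X^T X is singular, but X^T X + lam * I is positive definite, hence invertible:
theta_hat = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
print(theta_hat)  # the unique ridge solution
```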

But you probably have in mind the more usual neural network setting, where the network is composed of several hidden layers with nonlinear, monotonic activations between them. In this setting, $\mathcal{P}$ doesn't fix non-convexity, because the problem doesn't arise from rank deficiency; it arises from non-identifiability. Consider a neural network with 1 hidden layer and 2 neurons, and suppose you have found a minimum $\theta_{(1)}^*$. You can permute neuron 1 and neuron 2 without changing the likelihood. For odd activations such as $\tanh$, you can also negate the weights into neuron 1 together with the corresponding weights out of it. If the loss is a minimum at $\theta_{(1)}^*$, then it must also be a minimum at any such permutation or reflection of the weights $\theta_{(2)}^*$ (in both cases, the two parameter vectors compute the same function). Notably, in both cases, the value of $\| \theta_{(i)}^* \|_2^2$ is identical (a change of sign or a permutation of elements doesn't change the norm), so this choice of $\mathcal{P}$ does not remove the non-convexity of $\mathcal L$.
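Here is a minimal numpy sketch of this non-identifiability, assuming $\tanh$ activations (the sign-flip symmetry needs an odd activation): swapping the two hidden neurons, or flipping one neuron's signs, leaves both the network output and $\|\theta\|_2^2$ unchanged.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((4, 3))
W1 = rng.standard_normal((3, 2))  # input -> 2 hidden neurons
w2 = rng.standard_normal(2)       # hidden -> output

f = lambda W1, w2: np.tanh(X @ W1) @ w2

# Permutation: swap the two hidden neurons (columns of W1, entries of w2).
out_perm = f(W1[:, ::-1], w2[::-1])
# Reflection: negate neuron 1's incoming and outgoing weights (tanh is odd).
W1_neg, w2_neg = W1.copy(), w2.copy()
W1_neg[:, 0] *= -1
w2_neg[0] *= -1
out_neg = f(W1_neg, w2_neg)

print(np.allclose(f(W1, w2), out_perm))  # True: same function
print(np.allclose(f(W1, w2), out_neg))   # True: same function
# ||theta||_2^2 is unchanged by either transformation as well.
```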

In other words, making an unrestricted neural network into a convex problem essentially means discarding the hidden layers altogether. However, this would mean losing all of the power of composing latent representations: essentially, we're back to doing regression.

We can generalize this result by using the definition of convexity. What follows is a proof by contradiction. If we denote the objective in $(2)$ as $$ f(\theta) = \mathcal L(\theta; y, X) + \lambda \mathcal P(\theta) $$ then we see that $f$ has two parts, $\mathcal L(\theta)$ and $\mathcal P (\theta)$. Assuming for the moment that $f$ is convex, we know that for all $0 \le t \le 1$ and all $\theta_1, \theta_2$ in a convex set, we can write

$$\begin{align} f (t\theta_1 + (1-t)\theta_2) &\le t f (\theta_1) + (1-t) f(\theta_2) \\ \mathcal L (t\theta_1 + (1-t)\theta_2) + \lambda \mathcal P (t\theta_1 + (1-t)\theta_2) &\le t \lambda \mathcal P (\theta_1) + (1-t) \lambda \mathcal P(\theta_2) + t \mathcal L (\theta_1) + (1-t) \mathcal L(\theta_2) \end{align}$$

From the statement of the problem, $\mathcal P$ is convex, which implies $ 0 \le t \lambda \mathcal P (\theta_1) + (1-t)\lambda \mathcal P(\theta_2) - \lambda \mathcal P (t\theta_1 + (1-t)\theta_2) = \lambda C$, so we can rewrite this as

$$\begin{align} \mathcal L (t\theta_1 + (1-t)\theta_2) &\le t \mathcal L (\theta_1) + (1-t) \mathcal L(\theta_2) + \lambda C \end{align}$$

so we recover the definition of convexity, except with a non-negative constant $\lambda C$ added to the right-hand side. However, we know from the statement of the problem that $\mathcal{L}$ is not convex, so this inequality cannot hold in general: there is some pair $\theta_1, \theta_2$ that violates the definition of convexity.

By the non-convexity of $\mathcal L$, we know that $$\begin{align} \mathcal L (t\theta_1 + (1-t)\theta_2) &\gt t \mathcal L (\theta_1) + (1-t) \mathcal L(\theta_2) + \lambda C \end{align},$$ for at least one pair $\theta_1, \theta_2$. At this level of generality, there's not necessarily anything that can be done. We can give a flavor of this by considering some simpler functions. If we take $\mathcal L(\theta) = - \theta^3$ and $\lambda \mathcal P(\theta) =\lambda \theta^2$, then $f(\theta) = -\theta^3 + \lambda \theta^2$, and for any fixed $\lambda$ there exist $\theta_1, \theta_2$ that violate the definition of convexity. In other words, the second derivative changes with $\theta$, so you can't bound it.
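To see this counterexample concretely, differentiate $f$ twice:
$$f''(\theta) = \frac{d^2}{d\theta^2}\left(-\theta^3 + \lambda \theta^2\right) = -6\theta + 2\lambda,$$
which is negative as soon as $\theta > \lambda/3$. So on an unbounded domain, no fixed $\lambda$ keeps the second derivative non-negative everywhere.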

But your question asks about some restricted set $\theta \in \Theta$. That's not the usual framing for a neural network optimization task; usually, this is an unconstrained optimization.

But let's see where a constrained optimization over some convex set $\Theta$ takes us. In this setting, we can bound $\mathcal L (t\theta_1 + (1-t)\theta_2)$, so we could just increase $\lambda$ until the inequality flips. That's all well and good from an optimization perspective. But it's possible, or perhaps even likely, that this will result in under-fitting. Under-fitting (and its cousin, over-fitting) are discussed all over the machine learning and neural networks literature, including in the Hastie et al. text cited above. In the context of this problem, under-fitting means at least one of the following:

  • The chosen $\lambda$ is so large that the neural network does not fit the data well enough to solve any particular problem. As with ridge regression, setting $\lambda$ too large will give a function that is too biased towards $\mathcal P$ to be helpful. (It may also be hard to know the smallest $\lambda$ such that the Hessian of $f$ is positive definite, depending on the network and the data.)
  • The set $\Theta$ is so restricted that the neural network does not fit the data well enough to solve any particular problem. (It might also be challenging to know how to choose $\Theta$ in this way, depending on the form of the network.)

In the specific case of $\mathcal{P} = \| \theta \|_2^2$, the fact that you can reflect or permute the weights in certain ways without changing $f$ tells us that adjusting $\lambda$ by itself is not enough to make $f$ convex. If we want a convex $f$, then we need to choose a specific convex set $\Theta$ that excludes reflections and permutations of the weights. In general, this seems challenging.

At the end of the day, people don't use neural networks for the joy of optimizing functions; people use them to solve some problem with data. If the restricted $\Theta$ or the large $\lambda$ leaves a model that is not flexible enough to solve a real problem, we haven't really made much progress. Conversely, if some (possibly non-optimal) $\theta$ found by solving the non-convex problem $(1)$ solves your problem, you've done all the work that you need to do. Most research focuses on $(1)$ because it yields the most flexible models and provides a generic framework for training them.
