[Math] Smooth Hinge Loss Lipschitz Constant

convex optimization, derivatives, machine learning, partial derivative

Given the smooth hinge loss $L_\delta$ defined as follows,

$L_\delta(y_i (w^T x_i + b)) = \begin{cases} 0 & y_i (w^T x_i + b) > 1 \\ \frac{(1-y_i (w^T x_i + b))^2}{2 \delta} & 1 - \delta < y_i (w^T x_i + b) \le 1 \\ 1 - y_i (w^T x_i + b) & y_i (w^T x_i + b) \le 1 - \delta \end{cases}$

and the objective function for an SVM, which I want to minimize with respect to $w$ and $b$, where $x_i$ is the $i$th training example with label $y_i \in \{-1, 1\}$:

$f = \frac{C}{N} \sum_{i = 1}^N L_\delta(y_i (w^T x_i + b)) + \frac{1}{2} \|w\|^2$
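For concreteness, here is a minimal NumPy sketch of the loss and the objective above; the names `smooth_hinge`, `objective`, and the row-wise data matrix `X` are just illustrative choices, not part of the question.

```python
import numpy as np

def smooth_hinge(z, delta):
    """Piecewise smooth hinge loss, applied elementwise to the margins z = y_i (w^T x_i + b)."""
    z = np.asarray(z, dtype=float)
    out = np.zeros_like(z)                     # zero where z > 1
    mid = (z > 1 - delta) & (z <= 1)           # quadratic region
    low = z <= 1 - delta                       # linear region
    out[mid] = (1 - z[mid]) ** 2 / (2 * delta)
    out[low] = 1 - z[low]
    return out

def objective(w, b, X, y, C, delta):
    """f(w, b) = (C/N) * sum_i L_delta(y_i (w^T x_i + b)) + 0.5 * ||w||^2."""
    z = y * (X @ w + b)
    return (C / len(y)) * smooth_hinge(z, delta).sum() + 0.5 * (w @ w)
```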

I want to compute the Lipschitz constant of the gradient and the strong convexity parameter of this function, so that I can use Nesterov's minimization algorithm to compute the optimal solution. These values are given by the largest and smallest eigenvalues of the Hessian, respectively.
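For reference, here is a minimal sketch of one standard form of Nesterov's accelerated gradient method, namely the variant that needs only the Lipschitz constant $L$ of the gradient (not the strong convexity parameter); `grad_f`, `x0`, and the iteration count are placeholders supplied by the caller.

```python
import numpy as np

def nesterov(grad_f, x0, L, iters=500):
    """Accelerated gradient descent with step size 1/L and the usual t_k momentum schedule.

    grad_f : callable returning the gradient of a convex, L-smooth objective
    x0     : initial point, e.g. the stacked vector (w, b)
    L      : an upper bound on the Lipschitz constant of grad_f
    """
    x = np.asarray(x0, dtype=float)
    y_ = x.copy()
    t = 1.0
    for _ in range(iters):
        x_next = y_ - grad_f(y_) / L                        # gradient step from the extrapolated point
        t_next = (1 + np.sqrt(1 + 4 * t * t)) / 2
        y_ = x_next + ((t - 1) / t_next) * (x_next - x)     # momentum / extrapolation step
        x, t = x_next, t_next
    return x
```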

I have computed the Hessian as follows:

$\nabla^2 L_\delta = \begin{pmatrix} \frac{\partial^2 L_\delta}{\partial w^2} & \frac{\partial^2 L_\delta}{\partial w \partial b} \\ \left(\frac{\partial^2 L_\delta}{\partial w \partial b}\right)^T & \frac{\partial^2 L_\delta}{\partial b^2}\end{pmatrix}$

with

$\frac{\partial^2 L_\delta}{\partial w^2} = \begin{cases} 0 & y_i (w^T x_i + b) > 1 \\ \frac{y_i^2 x_i x_i^T}{\delta} & 1 - \delta < y_i (w^T x_i + b) \le 1 \\ 0 & y_i (w^T x_i + b) \le 1 - \delta \end{cases}$

$\frac{\partial^2 L_\delta}{\partial b^2} = \begin{cases} 0 & y_i (w^T x_i + b) > 1 \\ \frac{y_i^2}{\delta} & 1 - \delta < y_i (w^T x_i + b) \le 1 \\ 0 & y_i (w^T x_i + b) \le 1 - \delta \end{cases}$

$\frac{\partial^2 L_\delta}{\partial w \partial b} = \begin{cases} 0 & y_i (w^T x_i + b) > 1 \\ \frac{y_i^2 x_i}{\delta} & 1 - \delta < y_i (w^T x_i + b) \le 1 \\ 0 & y_i (w^T x_i + b) \le 1 - \delta \end{cases}$

So the Hessian of $f$ becomes

$\nabla^2 f = \frac{C}{N} \sum_{i = 1}^N \nabla^2 L_\delta + \begin{pmatrix} I & 0 \\ 0 & 0 \end{pmatrix}$

since the regularizer $\frac{1}{2} \|w\|^2$ contributes an identity block in $w$ only.
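One way to sanity-check these expressions is to compare the analytic Hessian against a central finite-difference Hessian of the objective, evaluated at a point where no margin $y_i (w^T x_i + b)$ sits exactly on one of the two kinks. The sketch below reuses the `objective` function from the earlier snippet; the helper names are illustrative only.

```python
import numpy as np

def analytic_hessian(w, b, X, y, C, delta):
    """Hessian of f in the stacked variable v = (w, b), using the piecewise formulas above."""
    d = len(w)
    H = np.zeros((d + 1, d + 1))
    H[:d, :d] = np.eye(d)                        # contribution of 0.5 * ||w||^2
    z = y * (X @ w + b)
    active = (z > 1 - delta) & (z <= 1)          # only the quadratic region contributes
    for xi in X[active]:
        zi = np.append(xi, 1.0)                  # stacked vector (x_i, 1); note y_i^2 = 1
        H += (C / (len(y) * delta)) * np.outer(zi, zi)
    return H

def numerical_hessian(f, v, eps=1e-5):
    """Central finite-difference Hessian of a scalar function f at v (for checking only)."""
    n = len(v)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i, e_j = eps * np.eye(n)[i], eps * np.eye(n)[j]
            H[i, j] = (f(v + e_i + e_j) - f(v + e_i - e_j)
                       - f(v - e_i + e_j) + f(v - e_i - e_j)) / (4 * eps ** 2)
    return H
```

For example, with `f_stacked = lambda v: objective(v[:-1], v[-1], X, y, C, delta)` the check `np.allclose(analytic_hessian(w, b, X, y, C, delta), numerical_hessian(f_stacked, np.append(w, b)), atol=1e-4)` should pass away from the kinks.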

But now I'm stuck. For given training examples $\{x_i\}_{i = 1}^N$, can I compute the Hessian by always taking the second case, or is something wrong in my Hessian?

Best Answer

The function $L_\delta$ is not strongly convex, because for $|b|$ or $\|w\|$ sufficiently large it is locally linear, i.e. $\nabla^2 L_\delta = 0$. Therefore the objective $f$ is not strongly convex with respect to $b$. It is strongly convex with respect to $w$ (with strong convexity constant 1), solely due to the $\frac{1}{2}\|w\|^2$ term.

Computing the Lipschitz constant exactly may be more difficult than the original problem. However, there are different ways to bound it. The Hessian of the objective can be written as
$$ \nabla^2 f = \begin{pmatrix} I & 0 \\ 0 & 0 \end{pmatrix} + \frac{C}{N \delta} \sum_{i} [\, 1 - \delta < y_i(w^T x_i + b) \le 1 \,] \begin{pmatrix} x_i \\ 1 \end{pmatrix} \begin{pmatrix} x_i \\ 1 \end{pmatrix}^T, $$
where $[\cdot]$ is the indicator of the bracketed condition and $y_i^2 = 1$ has been used.

We can bound the spectral norm of the Hessian by noting that, in the worst case, all the terms in the sum are active:
$$ \|\nabla^2 f\| \le 1 + \frac{C}{N \delta} \left\| \sum_{i} \begin{pmatrix} x_i \\ 1 \end{pmatrix} \begin{pmatrix} x_i \\ 1 \end{pmatrix}^T \right\|. $$
This in turn can be bounded by quantities that are easier to compute. For example, applying the triangle inequality gives a crude bound:
$$ \|\nabla^2 f\| \le 1 + \frac{C}{N \delta} \sum_{i=1}^N \left( \|x_i\|^2 + 1 \right). $$
Or you can bound the spectral norm by the Frobenius norm:
$$ \|\nabla^2 f\| \le 1 + \frac{C}{N \delta} \sqrt{\sum_{i=1}^N \sum_{j=1}^N (x_i^T x_j + 1)^2}. $$

However, I expect that using these bounds in the basic Nesterov algorithm will result in overly conservative step sizes. The difficulty of (efficiently) estimating the Lipschitz constant is why it is useful to design methods that do not require knowing it up front, e.g. methods with a backtracking line search.
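To get a feel for how loose these bounds are on a given data set, one can evaluate all three worst-case estimates numerically. The sketch below (illustrative names only) returns the spectral-norm expression together with the triangle-inequality and Frobenius bounds from above.

```python
import numpy as np

def lipschitz_bounds(X, C, delta):
    """Worst-case bounds on the Lipschitz constant of the gradient of f, as derived above."""
    N = X.shape[0]
    Z = np.hstack([X, np.ones((N, 1))])          # rows are the stacked vectors (x_i, 1)
    G = Z.T @ Z                                  # sum_i (x_i, 1)(x_i, 1)^T
    scale = C / (N * delta)
    spectral = 1 + scale * np.linalg.norm(G, 2)                         # worst case: all terms active
    triangle = 1 + scale * np.sum(np.linalg.norm(X, axis=1) ** 2 + 1)   # crude triangle-inequality bound
    frobenius = 1 + scale * np.sqrt(np.sum((Z @ Z.T) ** 2))             # Frobenius-norm bound
    return spectral, triangle, frobenius
```

Since the summed matrix is positive semidefinite, the triangle-inequality bound equals its trace and the Frobenius bound is at least its largest eigenvalue, so neither of the simpler bounds can be smaller than the spectral-norm expression, and in practice they are usually much larger.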
