Solved – Gradient of Hinge loss

loss-functions

I'm trying to implement basic gradient descent and I'm testing it with a hinge loss function, i.e. $l_{\text{hinge}} = \max(0,1-y\ \boldsymbol{x}\cdot\boldsymbol{w})$. However, I'm confused about the gradient of the hinge loss. I'm under the impression that it is

$$
\frac{\partial }{\partial w}l_{\text{hinge}} =
\begin{cases}
-y\ \boldsymbol{x} &\text{if } y\ \boldsymbol{x}\cdot\boldsymbol{w} < 1 \\
0&\text{if } y\ \boldsymbol{x}\cdot\boldsymbol{w} \geq 1
\end{cases}
$$

But doesn't this return a matrix the same size as $\boldsymbol{x}$? I thought we were looking to return a vector of the same length as $\boldsymbol{w}$? Clearly, I've got something confused somewhere. Can someone point me in the right direction here?

I've included some basic code in case my description of the task was not clear.

#Run standard gradient descent
gradient_descent<-function(fw, dfw, n, lr=0.01)
{
    #Data to be used
    x<-t(matrix(c(1,3,6,1,4,2,1,5,4,1,6,1), nrow=3))
    y<-c(1,1,-1,-1)
    w<-matrix(0, nrow=ncol(x))

    print(sprintf("loss: %f,x.w: %s",sum(fw(w,x,y)),paste(x%*%w, collapse=',')))
    #update the weights 'n' times
    for (i in 1:n)
    {
      w<-w-lr*dfw(w,x,y)
      print(sprintf("loss: %f,x.w: %s",sum(fw(w,x,y)),paste(x%*%w,collapse=',')))
    }
}
#Hinge loss
hinge<-function(w,x,y) max(1-y%*%x%*%w, 0)
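#Gradient of the above: zeroed out entirely (the 1x1 logical recycles over dw) once the scalar margin reaches 1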
d_hinge<-function(w,x,y){ dw<-t(-y%*%x); dw[y%*%x%*%w>=1]<-0; dw}
gradient_descent(hinge, d_hinge, 100, lr=0.01)

Update:
While the answer below helped my understanding of the problem, the output of this algorithm is still incorrect for the given data. The loss function decreases by 0.25 at each step, but it converges too quickly and the resulting weights do not produce a good classification. Currently the output looks like

#y=1,1,-1,-1
"loss: 1.000000, x.w: 0,0,0,0"
"loss: 0.750000, x.w: 0.06,-0.1,-0.08,-0.21"
"loss: 0.500000, x.w: 0.12,-0.2,-0.16,-0.42"
"loss: 0.250000, x.w: 0.18,-0.3,-0.24,-0.63"
"loss: 0.000000, x.w: 0.24,-0.4,-0.32,-0.84"
"loss: 0.000000, x.w: 0.24,-0.4,-0.32,-0.84"
"loss: 0.000000, x.w: 0.24,-0.4,-0.32,-0.84"
...  

Best Answer

To get the gradient, we differentiate the loss with respect to the $i$th component of $\mathbf{w}$.

Rewrite the hinge loss as a composition $f(g(\mathbf{w}))$, where $f(z)=\max(0,1-y\ z)$ and $g(\mathbf{w})=\mathbf{x}\cdot \mathbf{w}$.

Using the chain rule we get

$$\frac{\partial}{\partial w_i} f(g(w))=\frac{\partial f}{\partial z} \frac{\partial g}{\partial w_i} $$

The first factor, $\frac{\partial f}{\partial z}$, evaluated at $z=g(\mathbf{w})=\mathbf{x}\cdot \mathbf{w}$, is $-y$ when $y\ \mathbf{x}\cdot \mathbf{w}<1$ and $0$ when $y\ \mathbf{x}\cdot \mathbf{w}>1$. The second factor, $\frac{\partial g}{\partial w_i}$, is $x_i$. So in the end you get $$ \frac{\partial f(g(\mathbf{w}))}{\partial w_i} = \begin{cases} -y\ x_i &\text{if } y\ \mathbf{x}\cdot \mathbf{w} < 1 \\ 0&\text{if } y\ \mathbf{x}\cdot \mathbf{w} > 1 \end{cases} $$ (At $y\ \mathbf{x}\cdot \mathbf{w} = 1$ the hinge loss is not differentiable, so one picks a subgradient there; taking $0$, as the $\geq$ in your question does, is a standard choice.)
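As a quick sanity check, take the first training example from your data, $\mathbf{x}=(1,3,6)$ with $y=1$, at $\mathbf{w}=\mathbf{0}$: then $y\ \mathbf{x}\cdot \mathbf{w}=0<1$, so $$\frac{\partial}{\partial w_i} f(g(\mathbf{w})) = -y\ x_i \quad\Rightarrow\quad \left(-1,\ -3,\ -6\right),$$ one entry per weight.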

Since $i$ ranges over the components of $\mathbf{x}$ (and hence of $\mathbf{w}$), you can collect the partials above into a vector and write $\frac{\partial}{\partial w}$ as shorthand for $(\frac{\partial}{\partial w_1},\frac{\partial}{\partial w_2},\ldots)$. So the gradient is a vector of the same length as $\mathbf{w}$, not a matrix.
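This also points at why the code in the question stalls: `y%*%x%*%w` collapses all four training examples into one scalar margin, so it minimizes $\max(0,\,1-\sum_i y_i\,\boldsymbol{x}_i\cdot\boldsymbol{w})$ rather than the sum of per-example hinge losses, which is exactly why the printed loss drops by a constant 0.25 per step and then freezes. Below is a minimal sketch of a per-example version, assuming the total loss is the unweighted sum over the rows of `x` (it reuses the names and calling convention of the question's code; the function bodies are mine, not from the original post):

#Per-example hinge losses; gradient_descent's sum() totals them
hinge<-function(w,x,y) pmax(1-y*(x%*%w), 0)
#Total subgradient: sum of -y_i*x_i over examples with margin < 1 (0 at the kink)
d_hinge<-function(w,x,y)
{
    active<-as.numeric(y*(x%*%w) < 1)  #indicator of margin-violating examples
    t(x)%*%(-y*active)                 #column vector, one entry per weight
}
gradient_descent(hinge, d_hinge, 100, lr=0.01)

With this version each margin-violating example contributes its own $-y_i\,\boldsymbol{x}_i$ term, so the returned gradient has length `ncol(x)`, matching the derivation above.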
