Solved – Gradient for multiclass hinge loss

gradient descent, loss-functions

I am a little confused when trying to find the gradient for the multiclass hinge loss:

$$l(y) = \max( 0, 1 + \underset{r \neq y_i}{ \max } W_r \cdot x_i - W_{y_i} \cdot x_i)$$

where $W \in \mathbb{R}^{k \times n}$ is the matrix whose rows $W_r$ are the classifiers of the $k$ classes.

Unless my math is wrong, the gradient of the function is:
\begin{equation}
\frac{\partial l}{\partial w} = \begin{cases}
0, & W_{y_i}\cdot x_i > 1 + W_r \cdot x_i \\
0 + 0 - x_i, & \text{otherwise}
\end{cases}
\end{equation}

Is this ok?

So if I wanted to find the $W$ that minimizes the function using stochastic gradient descent, I would need to do:
\begin{equation}
W_y^{(t+1)} = W_y^{(t)} - \eta x_i
\end{equation}

with $\eta$ the learning rate.

Is this a valid procedure to optimize the $l(y)$ function?

Best Answer

Let's use the example of the SVM loss function for a single datapoint:

$L_i = \sum_{j\neq y_i} \left[ \max(0, w_j^Tx_i - w_{y_i}^Tx_i + \Delta) \right]$

where $\Delta$ is the desired margin.
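For concreteness, here is a minimal numpy sketch of this per-example loss. The function name and signature are my own illustration, not code from the notes:

```python
import numpy as np

def svm_loss_single(W, x_i, y_i, delta=1.0):
    """Per-example multiclass SVM loss L_i (names are illustrative).

    W     : (k, n) weight matrix, one row per class
    x_i   : (n,) feature vector
    y_i   : index of the correct class
    delta : desired margin
    """
    scores = W.dot(x_i)                    # (k,) class scores w_j^T x_i
    margins = scores - scores[y_i] + delta
    margins[y_i] = 0.0                     # drop the j == y_i term from the sum
    return np.maximum(0.0, margins).sum()
```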

We can differentiate the function with respect to the weights. For example, taking the gradient with respect to $w_{y_i}$ we obtain:

$\nabla_{w_{y_i}} L_i = - \left( \sum_{j\neq y_i} \mathbb{1}(w_j^Tx_i - w_{y_i}^Tx_i + \Delta > 0) \right) x_i$

where $\mathbb{1}(\cdot)$ is the indicator function that equals one if the condition inside is true and zero otherwise. While the expression may look intimidating written out, implementing it in code is simple: count the number of classes that did not meet the desired margin (and hence contributed to the loss), and the gradient is the data vector $x_i$ scaled by minus that count. Notice that this is the gradient only with respect to the row of $W$ that corresponds to the correct class. For the other rows, where $j \neq y_i$, the gradient is:

$\nabla_{w_j} L_i = \mathbb{1}(w_j^Tx_i - w_{y_i}^Tx_i + \Delta > 0) x_i$
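Putting both cases together, the counting interpretation translates almost line for line into numpy. This is only a sketch, using the same made-up names and assumptions as above:

```python
import numpy as np

def svm_grad_single(W, x_i, y_i, delta=1.0):
    """Gradient of the per-example loss L_i with respect to W (shape (k, n))."""
    scores = W.dot(x_i)
    margins = scores - scores[y_i] + delta
    margins[y_i] = 0.0

    dW = np.zeros_like(W, dtype=float)
    violated = margins > 0            # indicator 1(w_j^T x_i - w_{y_i}^T x_i + delta > 0)
    dW[violated] = x_i                # rows j != y_i that contributed to the loss get +x_i
    dW[y_i] = -violated.sum() * x_i   # correct-class row: -(number of violations) * x_i
    return dW
```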

Once you derive the expression for the gradient, it is straightforward to implement it and use it to perform the gradient update.
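As a usage example, one stochastic gradient descent step on a single example $(x_i, y_i)$ then looks like this, reusing the sketch above (the learning rate value is arbitrary):

```python
eta = 0.01  # learning rate (arbitrary illustrative value)

# one SGD step on the example (x_i, y_i)
W -= eta * svm_grad_single(W, x_i, y_i)
```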

Taken from the Stanford CS231n optimization notes posted on GitHub.