How did the author take the gradient to get $\overline{W} \Leftarrow \overline{W} - \alpha \nabla_{W} L_i$

gradient-descent, machine-learning, multivariable-calculus, neural-networks, vector-analysis

I am currently studying the textbook Neural Networks and Deep Learning by Charu C. Aggarwal. Chapter 1.2.1.1 What Objective Function Is the Perceptron Optimizing? says the following:

Can we find a smooth loss function, whose gradient turns out to be the perceptron update? The number of classification errors in a binary classification problem can be written in the form of a $0/1$ loss function for training data point $(\overline{X_i}, y_i)$ as follows:

$$L_i^{(0/1)} = \dfrac{1}{2} (y_i - \text{sign}\{ \overline{W} \cdot \overline{X_i} \})^2 = 1 - y_i \cdot \text{sign} \{ \overline{W} \cdot \overline{X_i} \} \tag{1.7}$$

The simplification to the right-hand side of the above objective function is obtained by setting both $y_i^2$ and $\text{sign}\{ \overline{W} \cdot \overline{X_i} \}^2$ to $1$, since they are obtained by squaring a value drawn from $\{ -1, +1 \}$. However, this objective function is not differentiable, because it has a staircase-like shape, especially when it is added over multiple points. Note that the $0/1$ loss above is dominated by the term $- y_i \cdot \text{sign} \{ \overline{W} \cdot \overline{X_i} \}$, in which the sign function causes most of the problems associated with non-differentiability. Since neural networks are defined by gradient-based optimization, we need to define a smooth objective function that is responsible for the perceptron updates. It can be shown that the updates of the perceptron implicitly optimize the perceptron criterion. This objective function is defined by dropping the sign function in the above $0/1$ loss and setting negative values to $0$ in order to treat all correct predictions in a uniform and lossless way:
$$L_i = \max\{ - y_i ( \overline{W} \cdot \overline{X_i} ), 0 \} \tag{1.8}$$
The reader is encouraged to use calculus to verify that the gradient of this smoothed objective function leads to the perceptron update, and the update of the perceptron is essentially $\overline{W} \Leftarrow \overline{W} - \alpha \nabla_{W} L_i$.

How did the author take the gradient to get $\overline{W} \Leftarrow \overline{W} - \alpha \nabla_{W} L_i$?

Best Answer

One strategy I have found useful when self-studying ML (before I found out about Stack Exchange) is, whenever you are struggling with a particular author's presentation, to consult other reputable sources on the same topic; another author may present it in a slightly different way that is more amenable to how you are thinking about the problem. Furthermore, I've found that studying this way gives you a more well-rounded perspective, as different authors, depending on their field of specialisation, will often emphasise different aspects of the same topic.

I haven't seen this particular presentation before, but much of the discussion of how the 0/1 loss is incompatible with gradient-based methods, due to its discontinuity/non-differentiability, coheres with what I know from the sources I used to learn this.

Loss function - intuition.

Where there is a minor difference, but one which has consequences for the intuition of what is going on, is in the specification of the perceptron update criterion in terms of the loss function per training point outlined here:

$$L_i = \max\{-y_i (\bar{W} \cdot \bar{X}_i), 0 \}$$

Crucially, the author has not been entirely explicit in emphasising that contributions to gradient updates only occur for misclassified points (although that can be deduced from the above). As that might be unclear at this stage, here is how it works. Let's rewrite the loss per training point piecewise:

$$L_i =\max\{-y_i (\bar{W} \cdot \bar{X}_i), 0 \} = \begin{cases} -y_i(\bar{W} \cdot \bar{X}_i) &\text{if} \space -y_i(\bar{W} \cdot \bar{X}_i) > 0 \\ 0 &\text{otherwise} \\ \end{cases}$$

Now for correctly classified points, $L_i = 0$. Why? Because your label $y_i \in \{-1, +1\}$ will have the same sign as your prediction $\bar{W} \cdot \bar{X}_i$, due to the classification being correct. This means that $y_i(\bar{W} \cdot \bar{X}_i) > 0$ and hence $-y_i(\bar{W} \cdot \bar{X}_i) < 0$, and $L_i = 0$.

Now for incorrectly classified points, $L_i = -y_i(\bar{W} \cdot \bar{X}_i)$. Why? The sign of the label $y_i$ will be different to your prediction $\bar{W} \cdot \bar{X}_i$, and so $y_i(\bar{W} \cdot \bar{X}_i) < 0$, meaning that $-y_i(\bar{W} \cdot \bar{X}_i) > 0$, and hence $L_i = -y_i(\bar{W} \cdot \bar{X}_i)$.

Hopefully, this clarifies the fact that the loss function $L_i$ is assigning a loss of $0$ to correctly classified points, and a loss of $-y_i(\bar{W} \cdot \bar{X}_i)$ to incorrectly classified points.
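
To make the case analysis above concrete, here is a minimal Python/NumPy sketch of the per-point loss (the values of $\bar{W}$ and $\bar{X}_i$ below are made up purely for illustration):

```python
import numpy as np

def perceptron_loss(W, X_i, y_i):
    """Per-point perceptron criterion: max(-y_i * (W . X_i), 0)."""
    return max(-y_i * np.dot(W, X_i), 0.0)

W = np.array([0.5, -1.0])   # illustrative weight vector
X_i = np.array([1.0, 2.0])  # illustrative training point; W . X_i = -1.5

print(perceptron_loss(W, X_i, y_i=-1))  # correctly classified -> 0.0
print(perceptron_loss(W, X_i, y_i=+1))  # misclassified -> 1.5
```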

Stochastic gradient descent and the gradient of the loss function.

Recall we want to set $\bar{W}$ to minimise the sum of individual losses over the entire data set, that is, we want to minimise $\sum^n_{i=1} L_i$. In the case of the perceptron, because we cannot just minimise $\sum^n_{i=1} L_i$ analytically by solving $\nabla_{W}\sum^n_{i=1} L_i = 0$, we use stochastic gradient descent.

So the expression $\bar{W} \leftarrow \bar{W} - \alpha \nabla_{W} L_i$ just refers to iteratively updating the parameter vector using stochastic gradient descent. It is stochastic because you are using only one training point $(\bar{X}_i, y_i)$ at a time to update the parameter vector $\bar{W}$. $\alpha$ is just a learning rate parameter.

The place where a gradient is actually taken is in $\nabla_{W} L_i$, the gradient of the loss on an individual training point.

Now, to see that contributions to gradient updates of the parameter only occur for misclassified points, notice that when a point is correctly classified $L_i = 0 \implies \nabla_{W} L_i = 0$, and that when a point is incorrectly classified $L_i = -y_i (\bar{W} \cdot \bar{X}_i) \implies \nabla_{W} L_i = -y_i \bar{X}_i$.
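
To see explicitly how that gradient is taken (the heart of the question), write the dot product out in components, $\bar{W} \cdot \bar{X}_i = \sum_k w_k x_{ik}$, and differentiate with respect to each weight $w_j$:

$$\frac{\partial L_i}{\partial w_j} = \frac{\partial}{\partial w_j}\left( -y_i \sum_{k} w_k x_{ik} \right) = -y_i x_{ij} \quad \implies \quad \nabla_{W} L_i = -y_i \bar{X}_i$$

(Strictly speaking, the max makes $L_i$ non-differentiable exactly on the decision boundary, where $y_i (\bar{W} \cdot \bar{X}_i) = 0$; the expression above is a subgradient there.)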

Yielding, for misclassified points only:

$$\bar{W}^{(t+1)} = \bar{W}^{(t)} - \alpha \nabla_{W}L_i = \bar{W}^{(t)} + \alpha y_i \bar{X}_i$$

where $t$ refers to the iteration number. How does this yield the perceptron algorithm? Start with an initial guess $\bar{W}^{(0)}$. Out of all misclassified points, i.e. those points where the sign of the label $y_i$ is not the same as the sign of your prediction $\bar{W}^{(0)} \cdot \bar{X}_i$, randomly select one and update $\bar{W}$ using the above formula to get $\bar{W}^{(1)}$, and repeat...
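
Putting the whole procedure together, here is a minimal sketch of that loop (the function name and toy dataset are my own, invented for illustration; the data is linearly separable, so the loop terminates):

```python
import numpy as np

def train_perceptron(X, y, alpha=1.0, max_iters=1000, seed=0):
    """Perceptron training by stochastic gradient descent:
    W <- W - alpha * (-y_i * X_i) for a randomly chosen misclassified
    point, repeated until no misclassified points remain."""
    rng = np.random.default_rng(seed)
    W = np.zeros(X.shape[1])  # initial guess W^(0)
    for _ in range(max_iters):
        # indices of misclassified points: y_i * (W . X_i) <= 0
        wrong = np.flatnonzero(y * (X @ W) <= 0)
        if wrong.size == 0:
            break                        # every point correctly classified
        i = rng.choice(wrong)            # randomly select one misclassified point
        W = W + alpha * y[i] * X[i]      # the update W <- W + alpha * y_i * X_i
    return W

# Toy linearly separable data (made up for this sketch)
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([+1, +1, -1, -1])
W = train_perceptron(X, y)
print(W, np.sign(X @ W))  # learned weights and resulting predictions
```

Note that only misclassified points ever contribute an update to $\bar{W}$, exactly as derived above.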

References.

Unless the data is linearly separable, there is no guarantee that the algorithm converges. For further info on convergence and a clear exposition on the perceptron to supplement your readings, see pp. 192-196 of "Pattern Recognition and Machine Learning" by Christopher Bishop. Lecture 12 of this series is also excellent. In particular, the latter contains a neat result that can be proved using elementary tools: an upper bound on the total number of misclassifications the perceptron can make on a dataset under certain conditions, due to Block (1962) and Novikoff (1962).
