Show that the gradient of the smoothed surrogate loss function leads to the perceptron update

gradient, gradient descent, loss-functions, perceptron, smoothing

This question is about Sections 1.2.1 and 1.2.1.1 of the book "Neural Networks and Deep Learning: A Textbook". It arises from the following sentence on pages 9-10, underlined in red in the screenshot below:

[Screenshot of the underlined passage from the book, stating that gradient descent on the smoothed surrogate loss leads to the perceptron update.]
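In case the screenshot is hard to read, here are the formulas involved as I transcribe them from the book (so the notation below is my reading and may differ slightly from the text). For the $i$-th training sample $(\bar X_i, y_i)$ with $y_i\in\{-1,+1\}$, the smoothed surrogate (perceptron criterion) is

$$L_i=\max\{-y_i(\bar W\cdot\bar X_i),\,0\},\qquad\text{(Equation 1.8)}$$

the perceptron update is

$$\bar W\Leftarrow\bar W+\alpha(y_i-\hat y_i)\bar X_i,\qquad\text{(Equation 1.4)}$$

and gradient descent on the surrogate is $\bar W\Leftarrow\bar W-\alpha\nabla_W L_i$.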

I would like to derive the gradient of Equation (1.8) and show that it leads to the perceptron update. As the underlined sentence says, the update based on the smoothed surrogate is the gradient-descent step $\bar W\Leftarrow\bar W-\alpha\nabla_W L_i$, while the other form of the perceptron update is Equation (1.4) on page 7. For the two to coincide we would need $(y_i-\hat y_i)\bar X_i=-\nabla_W L_i$ (adding the subscript for the $i$-th sample). However, my derivation does not lead to this equation. It goes as follows:

From (1.8), I drop the $\max\{\ldots\}$ operator and take the partial derivative of $-y_i(\bar W\cdot\bar X_i)$ with respect to the parameter $w_j$, which gives $-y_ix_{ij}$. Stacking over all $j$, the gradient should be $\nabla_W L_i=(-y_ix_{i1}, -y_ix_{i2},\ldots, -y_ix_{id})^T=-y_i\bar X_i$. But then $-\nabla_W L_i=y_i\bar X_i$, which is not the $(y_i-\hat y_i)\bar X_i$ appearing on the right-hand side of Equation (1.4), unless the prediction $\hat y_i$ were always $0$, which cannot be the case.
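As a sanity check on this gradient, here is a throwaway numpy sketch (made-up numbers, nothing from the book) that compares a central finite difference of $L_i$ at a deliberately misclassified point against $-y_i\bar X_i$:

```python
import numpy as np

# Perceptron criterion for one sample: L_i = max(-y_i * (W . X_i), 0)
def L(W, X, y):
    return max(-y * float(np.dot(W, X)), 0.0)

rng = np.random.default_rng(0)
W = rng.normal(size=3)
X = rng.normal(size=3)
y = -np.sign(np.dot(W, X))          # choose the label so the point is misclassified

analytic = -y * X                   # the claimed gradient on the misclassified branch
eps = 1e-6
numeric = np.array([(L(W + eps * e, X, y) - L(W - eps * e, X, y)) / (2 * eps)
                    for e in np.eye(3)])
print(np.allclose(analytic, numeric, atol=1e-4))   # expect True
```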

So, my questions are: 1) How do I derive the gradient of Equation (1.8) so that it leads to the perceptron update? 2) What is the predicted value $\hat y_i$? Should it be $\mathrm{sign}\{\bar W\cdot\bar X_i\}$ or just $\bar W\cdot\bar X_i$? 3) May I drop the $\max\{\ldots\}$ operator when taking the derivative? If so, why? The function is not differentiable at $0$, and the two expressions are equal only where the argument is positive.

I know the book offers this only as an encouragement to the reader, but I would still like to understand how to derive it. Thank you.

Best Answer

OK, let me answer my own questions. Let's start with the third question.

3) The reason we can drop the $\max\{\ldots\}$ is that we only "sample" the value of the gradient during training. Just as with ReLU, there are points where the function is not differentiable, but we do not need it to be differentiable everywhere; we only need the value of the gradient locally, at the current $\bar W$. In a small neighbourhood of the current point (which exists whenever $\bar W\cdot\bar X_i\neq 0$), $\max\{-y_i(\bar W\cdot\bar X_i),0\}$ equals either $-y_i(\bar W\cdot\bar X_i)$ (when $\bar X_i$ is misclassified) or $0$ (when it is correctly classified).
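Written out, the gradient that is actually sampled is piecewise:

$$\nabla_W L_i=\begin{cases}-\,y_i\bar X_i, & y_i(\bar W\cdot\bar X_i)<0\ \ \text{(misclassified)},\\ \bar 0, & y_i(\bar W\cdot\bar X_i)>0\ \ \text{(correctly classified)},\end{cases}$$

and at the single non-differentiable point $\bar W\cdot\bar X_i=0$ one simply picks one of the two one-sided values, just as is done for ReLU at $0$.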

1) My derivation in the question is correct: $\nabla_WL_i=-y_i\bar X_i$. Plugging it into the gradient descent formula $\bar W\Leftarrow\bar W-\alpha\nabla_WL_i$ gives $\bar W\Leftarrow\bar W+\alpha y_i\bar X_i$. At first glance this differs from Equation (1.4) of the text, so let me reconcile the two by cases.

When the classification is correct, $y_i=\hat y_i=\mathrm{sign}\{\bar W\cdot\bar X_i\}$. In this case the perceptron criterion $L_i$ is $0$ (locally), and so is its gradient $\nabla_WL_i$, so $\bar W$ is not updated. The same happens in Equation (1.4), because $(y_i-\hat y_i)=0$; the two rules are consistent.

The derivation in my question applies under misclassification, where the perceptron criterion is $L_i=-y_i(\bar W\cdot\bar X_i)$. How do we show that $\bar W\Leftarrow\bar W+\alpha y_i\bar X_i$ is consistent with the perceptron update $\bar W\Leftarrow\bar W+\alpha(y_i-\hat y_i)\bar X_i$ of Equation (1.4) in this case? One misclassification case is the observation $y_i=+1$ with prediction $\hat y_i=-1$. Then $\alpha y_i\bar X_i=\alpha\bar X_i$, while the corresponding term in Equation (1.4) is $\alpha(y_i-\hat y_i)\bar X_i=2\alpha\bar X_i$. The constant factor $2$ can be absorbed into the learning rate $\alpha$, so the two update formulas are effectively the same. The same holds for the other misclassification case, $y_i=-1$ with $\hat y_i=+1$.

In sum, gradient descent $\bar W\Leftarrow\bar W-\alpha\nabla_WL_i$ on the smoothed surrogate leads to the perceptron update of Equation (1.4).
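To double-check the case analysis, here is a small numpy sketch (the data, learning rates, and names below are made up for illustration) that runs both update rules over the same samples and verifies that, once the factor of $2$ is absorbed into $\alpha$, they produce the same weights:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 3))
y = np.sign(X @ np.array([1.0, -2.0, 0.5]))      # made-up linearly separable labels
W0 = rng.normal(size=3)                           # common (nonzero) starting weights

def train(W0, X, y, alpha, rule, epochs=5):
    W = W0.copy()
    for _ in range(epochs):
        for Xi, yi in zip(X, y):
            if rule == "gradient":
                # W <- W - alpha * grad L_i, where grad L_i = -y_i X_i if misclassified, else 0
                if yi * (W @ Xi) < 0:
                    W = W + alpha * yi * Xi
            else:
                # Equation (1.4): W <- W + alpha * (y_i - yhat_i) X_i, yhat_i = sign(W . X_i)
                W = W + alpha * (yi - np.sign(W @ Xi)) * Xi
    return W

W_grad = train(W0, X, y, alpha=0.1,  rule="gradient")
W_eq14 = train(W0, X, y, alpha=0.05, rule="eq1.4")   # half the learning rate: factor 2 absorbed
print(np.allclose(W_grad, W_eq14))                    # expect True
```

The step-by-step trajectories coincide as well: on a correctly classified point neither rule moves $\bar W$, and on a misclassified one both add $0.1\,y_i\bar X_i$.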

2) As mentioned in the answer to 1), the predicted value is $\hat y_i=\mathrm{sign}\{\bar W\cdot\bar X_i\}$.
