Solved – Derivation of Perceptron weight update formula

gradient descent, machine learning, perceptron

I've started studying machine learning and am currently reading up on how a single perceptron works. From the Wikipedia page, my understanding is as follows: suppose we have an input sample $\mathbf{x} = [x_1, \ldots, x_n]^T$ and an initial weight vector $\mathbf{w} = [w_1, \ldots, w_n]^T$. Let the true output corresponding to $\mathbf{x}$ be $y'$.

The output given by the perceptron is $y = f(\sum_{i=0}^n w_ix_i)$, where $w_0$ is the bias and $x_0=1$. If $\eta$ is the learning rate, the weights are updated according to the following rule:
$$\Delta w_i = \eta x_i(y'-y)$$
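To make sure I'm reading this correctly, here is the rule as a small NumPy sketch (the function name and the choice of a step activation are just my own illustration, not from the Wikipedia article):

```python
import numpy as np

def perceptron_update(w, x, y_true, eta=0.1):
    """One step of the perceptron rule above.

    w and x are arrays that already include the bias term,
    i.e. x[0] = 1 and w[0] = w_0.
    """
    y_pred = 1.0 if np.dot(w, x) >= 0 else 0.0  # y = f(S) with a step activation
    return w + eta * (y_true - y_pred) * x      # Delta w_i = eta * x_i * (y' - y)
```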

This is according to Wikipedia. But I know the weights are updated using gradient descent, and I found another nice explanation based on gradient descent HERE. The derivation there results in the following final expression for the weight update:

$$\Delta w_i = \eta x_i(y'-y)\frac{df(S)}{dS}$$

where $S = \sum_{i=0}^{n}w_ix_i$. Is there a reason why this derivative term is ignored? There was another book that mentioned the same weight update formula as Wikipedia, without the derivative term. I'm pretty sure we can't just assume $f(S) = S$.
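And here is how I read the gradient-descent version, with a sigmoid as just one example of a differentiable $f$ (a sketch, not the exact derivation from the linked page):

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def gradient_update(w, x, y_true, eta=0.1):
    """Gradient-descent step: Delta w_i = eta * x_i * (y' - y) * f'(S)."""
    S = np.dot(w, x)                   # S = sum_i w_i x_i, bias folded in via x[0] = 1
    y_pred = sigmoid(S)                # y = f(S)
    f_prime = y_pred * (1.0 - y_pred)  # f'(S) for the sigmoid
    return w + eta * (y_true - y_pred) * f_prime * x
```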

Best Answer

The difference is that the first formula is the update derived from just the (linear) output of the perceptron, while the second is derived through the non-linear activation of the perceptron.

When stacking perceptron layers (MLP - multi-layered perceptron), you have to add some non-linearity to the output of each layer; otherwise the whole process is linear (and can be modeled with a single layer).
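To see why, take two stacked linear layers with weight matrices $W_1$ and $W_2$ and no activation in between:

$$W_2(W_1\mathbf{x}) = (W_2 W_1)\mathbf{x} = W\mathbf{x}, \qquad W = W_2 W_1,$$

which is just a single linear layer again.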

So the output of the perceptron (or more accurately, the input to the next layer) becomes:

$$y = f(S), \qquad S = \sum_{i=0}^{n} w_i x_i,$$

and differentiating this gives your second formula, with the extra $\frac{df(S)}{dS}$ factor.

If you are not using a non-linear activation (single layer), the output is:

$$y = S = \sum_{i=0}^{n} w_i x_i,$$

and differentiating gives your first formula.
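In that case $\frac{df(S)}{dS} = \frac{dS}{dS} = 1$, so the general update

$$\Delta w_i = \eta x_i(y'-y)\frac{df(S)}{dS}$$

reduces to

$$\Delta w_i = \eta x_i(y'-y),$$

which is exactly the Wikipedia formula: the derivative term is not ignored, it just equals one.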