[Math] What do weights do in the perceptron rule

machine-learning, neural-networks

In learning about the perceptron rule, it appears that $w$ is a vector of real-valued weights. My question is: how can I intuitively understand what these weights are doing to each input value $x_i$? Example:

$$z = w_1x_1 + \dots + w_mx_m$$

Why is it that initializing the weights to 0 or some other small number makes sense? Also, how do these get updated (in the context of a training set)?

Best Answer

Weights are used so that we can scale individual inputs. If input $x_{3}$, for example, isn't contributing to the right classification, the perceptron will assign it a small weight to diminish its contribution to the output signal.
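As a tiny numerical sketch of this (the inputs and weights below are made-up values, not anything specific to the question), a near-zero weight all but removes its input from the net input $z$:

```python
# Hypothetical inputs and weights: w_3 is tiny, so x_3 barely affects z.
x = [2.0, 1.5, 4.0]   # inputs x_1, x_2, x_3
w = [0.8, 0.5, 0.01]  # weights w_1, w_2, w_3

# z = w_1*x_1 + ... + w_m*x_m
z = sum(w_i * x_i for w_i, x_i in zip(w, x))
print(z)              # 0.8*2.0 + 0.5*1.5 + 0.01*4.0 = 2.39
```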

Here's a geometrical way of thinking of it.

Taking the dot product of two vectors such as $a \cdot u$ is essentially projecting one onto the other. So you can think of it like this: $a$ casts a shadow onto $u$.

(Figure: the projection of $a$ onto $u$.)

What we are interested in, in the perceptron case, is how the weight vector shadows the input vector, or to put it simply, how it scales our input. If the dot product is negative, the input casts no shadow onto $w$. That same value is our signal at the end, and it gives us the rule: if $w$ is blocking even a tiny fraction of our sunlight (if the dot product is positive), the output is true.
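Here is a small sketch of that picture in NumPy (the two vectors are arbitrary examples): the dot product is $\lVert w \rVert$ times the signed length of the input's shadow on $w$, and its sign is the classification.

```python
import numpy as np

w = np.array([1.0, 2.0])   # weight vector (made-up)
x = np.array([3.0, 0.5])   # input vector (made-up)

z = np.dot(w, x)                           # net input: 1.0*3.0 + 2.0*0.5 = 4.0
projection_length = z / np.linalg.norm(w)  # signed length of x's shadow on w

prediction = 1 if z > 0 else 0             # positive shadow -> fire
print(z, projection_length, prediction)
```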

To your second question: weights are initialized like that because it's faster to train this way. Back to our 'shadow' analogy: it is easier to move $w$ a little bit to block the sunlight, and if that turns out to be a false prediction, move it a little the other way and you're done! So it's a good thing to start at the origin. In terms of backpropagation, the reason for initializing weights close to zero is that with large weights the net input becomes very large, and there a non-linear activation function has a very small derivative (the curve is almost parallel to its asymptote), hence slow learning.
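As a small illustration of the two starting points mentioned above (the array size and the 0.01 scale are arbitrary choices of mine, not something prescribed by the perceptron rule):

```python
import numpy as np

n_features = 4

w_zeros = np.zeros(n_features)            # start exactly at the origin
w_small = np.random.normal(loc=0.0,       # or start very close to it
                           scale=0.01,
                           size=n_features)
print(w_zeros, w_small)
```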

And finally, your last question about the update rule. The delta rule is commonly used and it's really simple:

Change in Weight i = Learning Rate × Current Value of Input i × (Expected Output - Current Output)

or in mathematical language:

$\Delta w_{i} = \varepsilon \cdot x_{i} \cdot (e - y)$

updated weight:

$w_{i} \leftarrow w_{i} + \Delta w_{i}$

If the prediction is correct, the $(e-y)$ term will be 0 and no update will be made to the weight.
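Putting the rule into code, here is a minimal sketch of a perceptron trained with this update; the AND dataset, the learning rate, the epoch count, and the explicit bias term are my own illustrative additions, not part of the answer above.

```python
import numpy as np

# Toy dataset: logical AND (made-up for illustration).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
e = np.array([0, 0, 0, 1], dtype=float)   # expected outputs
w = np.zeros(X.shape[1])                  # weights start at zero
b = 0.0                                   # bias (threshold) term
learning_rate = 0.1

for epoch in range(10):
    for x_i, e_i in zip(X, e):
        y_i = 1.0 if np.dot(w, x_i) + b > 0 else 0.0  # current output
        delta = learning_rate * (e_i - y_i)           # 0 when prediction is correct
        w += delta * x_i                              # Δw_i = ε · x_i · (e − y)
        b += delta

print(w, b)   # after training, w·x + b > 0 only for x = [1, 1]
```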
