In terms of linear separability: using a bias allows the hyperplane that separates the feature space into two regions to not have to go through the origin. Without a bias, any such hyperplane would have to go through the origin, and that may prevent the separability we want.
Simple example: suppose we have two inputs $x$ and $y$ that can take on the values $0$ or $1$, and we want the output to be a $1$ when both inputs are $1$ (an $\land$ logic circuit, basically). This means that the separating hyperplane cannot go through the origin of the feature space (where $x$ and $y$ are $0$). But without using a bias (or, equivalently, by having a threshold of $0$), you can't move the hyperplane (the set of points $(x,y)$ for which $x\cdot w_1 + y \cdot w_2=0$ given some weights $w_1$ and $w_2$) away from the origin, because for any weights, the origin is a solution to this equation.
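The AND example above can be checked numerically. Below is a minimal sketch (the data and `train` helper are my own illustration, not from the original text): folding the bias in as a constant-1 feature lets the perceptron learn AND, while the bias-free version can never classify all four points correctly, since the origin $(0,0)$ always lies exactly on the hyperplane.

```python
import numpy as np

# The four input patterns of the AND gate, with labels in {-1, +1}.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, -1, -1, 1])  # +1 only when both inputs are 1

def train(X, y, use_bias, epochs=100, lr=1.0):
    # Append a constant-1 feature so the bias becomes an ordinary weight.
    if use_bias:
        X = np.hstack([X, np.ones((len(X), 1))])
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:    # misclassified (or on the boundary)
                w += lr * yi * xi     # perceptron update
    errors = sum(yi * (w @ xi) <= 0 for xi, yi in zip(X, y))
    return w, errors

_, err_bias = train(X, y, use_bias=True)     # converges: 0 errors
_, err_nobias = train(X, y, use_bias=False)  # (0,0) can never be separated
print(err_bias, err_nobias)
```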
One strategy I have found useful when self-studying ML (before I found out about Stack Exchange): if you are struggling with a particular author's presentation, consult other reputable sources on the same topic, as another author may present it in a slightly different way that is more amenable to how you are thinking about the problem. I've also found that studying this way gives you a more well-rounded perspective, as different authors, depending on their field of specialisation, will often emphasise different aspects of the same topic.
I haven't seen this particular presentation before, but much of it, for example the part on how 0/1 loss is incompatible with gradient-based methods due to its discontinuity/non-differentiability, coheres with what I know from the sources I used to learn this.
Loss function - intuition.
One place where there is a minor difference, but one with consequences for the intuition of what is going on, is the specification of the perceptron update criterion in terms of the loss function per training point, outlined here:
$$L_i = \max\{-y_i (\bar{W} \cdot \bar{X}_i), 0 \}$$
Crucially, the author hasn't been entirely explicit about emphasising the fact that contributions to gradient updates only occur for misclassified points (although that can be deduced from the above). As that might be unclear at this stage, here is how it works. Let's rewrite the loss per training point piece-wise:
$$L_i =\max\{-y_i (\bar{W} \cdot \bar{X}_i), 0 \} =
\begin{cases}
-y_i(\bar{W} \cdot \bar{X}_i) &\text{if} \space -y_i(\bar{W} \cdot \bar{X}_i) > 0 \\
0 &\text{otherwise} \\
\end{cases}$$
Now for correctly classified points, $L_i = 0$. Why? Because your label $y_i \in \{-1, +1\}$ will have the same sign as your prediction $\bar{W} \cdot \bar{X}_i$, due to the classification being correct. This means that $y_i(\bar{W} \cdot \bar{X}_i) > 0$ and hence $-y_i(\bar{W} \cdot \bar{X}_i) < 0$, and $L_i = 0$.
Now for incorrectly classified points, $L_i = -y_i(\bar{W} \cdot \bar{X}_i)$. Why? The sign of the label $y_i$ will be different to your prediction $\bar{W} \cdot \bar{X}_i$, and so $y_i(\bar{W} \cdot \bar{X}_i) < 0$, meaning that $-y_i(\bar{W} \cdot \bar{X}_i) > 0$, and hence $L_i = -y_i(\bar{W} \cdot \bar{X}_i)$.
Hopefully, this clarifies the fact that the loss function $L_i$ is assigning a loss of $0$ to correctly classified points, and a loss of $-y_i(\bar{W} \cdot \bar{X}_i)$ to incorrectly classified points.
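The piecewise behaviour can be verified directly. Here is a small illustrative check (the weights and points are my own example, not from the original text): a correctly classified point incurs zero loss, while a misclassified point incurs $-y_i(\bar{W} \cdot \bar{X}_i) > 0$.

```python
import numpy as np

# Per-point perceptron loss: L_i = max(-y_i * (w . x_i), 0).
def perceptron_loss(w, x, y):
    return max(-y * np.dot(w, x), 0.0)

w = np.array([1.0, -2.0])
x_correct = np.array([3.0, 1.0])  # w.x = 1 > 0, label +1: correct
x_wrong = np.array([1.0, 2.0])    # w.x = -3 < 0, label +1: wrong

print(perceptron_loss(w, x_correct, +1))  # 0.0 (correct -> no loss)
print(perceptron_loss(w, x_wrong, +1))    # 3.0, i.e. -(+1)(-3)
```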
Stochastic gradient descent and the gradient of the loss function.
Recall we want to set $\bar{W}$ to minimise the sum of individual losses over the entire data set, that is we want to minimise $\sum^n_{i=1} L_i$. In the case of the perceptron, because we cannot just minimise $\sum^n_{i=1} L_i$ analytically by solving $\nabla_{W}\sum^n_{i=1} L_i = 0$, we use stochastic gradient descent.
So the expression $\bar{W} \leftarrow \bar{W} - \alpha \nabla_{W} L_i$ just refers to iteratively updating the parameter vector using stochastic gradient descent. It is stochastic because you are using only one training point $(\bar{X}_i, y_i)$ at a time to update the parameter vector $\bar{W}$. $\alpha$ is just a learning rate parameter.
Where gradients have been taken is in $\nabla_{W} L_i$, which is the gradient of the loss on an individual training point.
Now, to see that contributions to gradient updates of the parameter only occur for misclassified points, notice that when a point is correctly classified $L_i = 0 \implies \nabla_{W} L_i = 0$, and that when a point is incorrectly classified $L_i = -y_i (\bar{W} \cdot \bar{X}_i) \implies \nabla_{W} L_i = -y_i \bar{X}_i$.
Yielding, for misclassified points only:
$$\bar{W}^{(t+1)} = \bar{W}^{(t)} - \alpha \nabla_{W}L_i = \bar{W}^{(t)} + \alpha y_i \bar{X}_i$$
Where $t$ refers to iteration number. How does this yield the perceptron algorithm? Start with an initial guess $\bar{W}^{(0)}$. Out of all misclassified points, i.e. those points where the sign of the label $y_i$ is not the same as the sign of your prediction $\bar{W}^{(0)} \cdot \bar{X}_i$, randomly select one and update $\bar{W}$ using the above formula to get $\bar{W}^{(1)}$, and repeat...
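The single-point update step above can be sketched as follows (the numbers are an illustrative example of my own, not from the original text). Note how the update only fires for a misclassified point, and how it moves $\bar{W} \cdot \bar{X}_i$ toward the sign of $y_i$.

```python
import numpy as np

# One SGD step: W <- W + alpha * y * x, applied only when the point is
# misclassified (sign of W.x differs from y). Otherwise the gradient is 0.
def sgd_step(W, x, y, alpha=0.1):
    if y * np.dot(W, x) <= 0:    # misclassified: gradient is -y * x
        W = W + alpha * y * x    # so W - alpha * grad = W + alpha * y * x
    return W

W = np.array([0.5, -1.0])
x, y = np.array([1.0, 1.0]), +1  # W.x = -0.5, but the label is +1
W_new = sgd_step(W, x, y, alpha=0.5)
print(np.dot(W_new, x))          # 0.5: now on the correct side
```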
References.
The algorithm is only guaranteed to converge when the training data are linearly separable. For further information on convergence, and a clear exposition of the perceptron to supplement your readings, see pp. 192-196 of "Pattern Recognition and Machine Learning" by Christopher Bishop. Lecture 12 of this series here is also excellent. In particular, the latter contains a neat result that can be proved using elementary tools: an upper bound on the total number of misclassifications the perceptron can make on a dataset under certain conditions, due to Novikoff (1962) and Block (1962).
Best Answer
Weights are used so that we can scale individual inputs. If input $x_{3}$, for example, isn't contributing enough to the correct classification, the perceptron will assign it a small weight to diminish its output signal.
Here's a geometrical way of thinking of it.
Taking the dot product of two vectors, such as $a \cdot u$, is essentially projecting one onto the other. So you can think of it like this: $a$ casts a shadow onto $u$.
What we are interested in, in the perceptron case, is how the weight vector shadows the input vector, or, to put it simply, how it scales our input. If the value is negative, then it won't cast a shadow. That same value will be our signal at the end. So you end up with a rule: if $w$ is blocking even a tiny fraction of our sunlight (if the value is positive), then the output is true.
To your second question: weights are initialised like that because it's faster to train this way. Back to our 'shadow' analogy: it will be easier for you to move $w$ a little bit and block the sunlight, and if it turns out to be a false prediction, move it a bit to the left and you're done! So it's a good thing to start at the origin. In terms of backpropagation, the reason for initialising weights close to zero is that when you have a really big input, the non-linear activation function will have a really small derivative, as the slope will be almost parallel to the asymptote, hence slow learning.
And finally, your last question about the update rule. The delta rule is commonly used, and it's really simple.
Change in Weight i = Learning Rate × Current Value of Input i × (Expected Output - Current Output)
or in mathematical language :
$\Delta w_{i}= \varepsilon \cdot x_{i} \cdot (e-y)$
updated weight :
$w_{i} \leftarrow w_{i} + \Delta w_{i}$
If the prediction is correct, the $(e-y)$ term will be $0$ and no update will be made to the weight.
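The delta rule above can be sketched in a few lines (the helper name and values are my own illustration, not from the original text): $\varepsilon$ is the learning rate, $x_i$ the input, $e$ the expected output, and $y$ the current output.

```python
# Delta rule: delta_w_i = epsilon * x_i * (e - y), then w_i <- w_i + delta_w_i.
def delta_update(w_i, x_i, e, y, epsilon=0.1):
    return w_i + epsilon * x_i * (e - y)

# Correct prediction: e - y = 0, so the weight is unchanged.
print(delta_update(0.5, 1.0, e=1, y=1))  # 0.5
# Wrong prediction: the weight moves in the direction that reduces the error.
print(delta_update(0.5, 1.0, e=1, y=0))  # 0.6
```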