I have a geometric explanation. Think of the SVM as a maximum margin classifier. In that sense we seek a separating hyperplane that is equidistant from the closest negative and the closest positive examples. This means that the distance from the hyperplane to its closest negative example should be as large as the distance to its closest positive example. Let $w^*$ be known. Then
$$\max_{i: y^{(i)}=-1} w^{*T}x^{(i)}$$
is the worst case over the negative examples: the projection of the negative example closest to the hyperplane. Similarly
$$\min_{i: y^{(i)}=1} w^{*T}x^{(i)}$$
is the worst case over the positive examples. How can we choose the intercept so that the margin at the worst-case examples on both sides is maximal? We take the average of the two:
$$b^* = -\frac{1}{2}\left(\max_{i: y^{(i)}=-1} w^{*T}x^{(i)} + \min_{i: y^{(i)}=1} w^{*T}x^{(i)}\right)$$
About the '-' sign: strictly speaking, $\max_{i: y^{(i)}=-1} w^{*T}x^{(i)}$ is not a distance, because it is negative, while $\min_{i: y^{(i)}=1} w^{*T}x^{(i)}>0$. So in order to shift the hyperplane from the worst negative example toward the worst positive one, we need the '-' sign.
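If it helps, here is a minimal numeric sketch of that rule. The toy data and the fixed direction `w_star` are hypothetical; the point is only that the negative average of the two worst-case projections places the hyperplane midway between them.

```python
import numpy as np

# Hypothetical toy data: rows are examples, labels y in {-1, +1}.
X = np.array([[1.0, 2.0], [2.0, 3.0],   # negative examples
              [4.0, 5.0], [5.0, 5.0]])  # positive examples
y = np.array([-1, -1, 1, 1])
w_star = np.array([1.0, 1.0])           # assume the direction is already known

proj = X @ w_star                       # w*^T x^(i) for every example
worst_neg = proj[y == -1].max()         # negative example closest to the hyperplane
worst_pos = proj[y == 1].min()          # positive example closest to the hyperplane

b_star = -(worst_neg + worst_pos) / 2   # the averaging rule with the '-' sign
print(proj + b_star)                    # the two worst cases land at -2 and +2:
                                        # equidistant from the hyperplane
```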
First, let's calculate the squared norm $||w||^2$, using the substitution $w = \sum_i\alpha_iy_ix_i$:
$$||w||^2 = \sum_i \alpha_iy_i\big(\sum_j\alpha_jy_j\langle x_i,x_j\rangle\big)$$
which evidently can be rearranged to $\sum_i\sum_j\alpha_i\alpha_jy_iy_j\langle x_i,x_j\rangle$.
The $\langle x_i, x_j\rangle$ construct is present because the norm is defined in terms of the inner product - every inner product induces a norm via $||z||^2 = \langle z,z \rangle$ - so when we calculate $||w||^2$ (making the substitution from above) we end up with $\langle x_i,x_j \rangle$ terms. The reason we don't write something like
$$\sum_i \sum_j \langle \alpha_i y_i x_i, \alpha_j y_j x_j \rangle$$
is that the inner product is defined on the $x$'s, and everything else is just a scalar multiplier, which, by bilinearity of the inner product, can be moved outside of $\langle \cdot,\cdot \rangle$.
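As a sanity check, the expansion can be verified numerically. The $\alpha$ values below are arbitrary placeholders, not the multipliers of a trained SVM; the point is only that $\langle w, w\rangle$ with $w = \sum_i\alpha_iy_ix_i$ matches the double sum over the Gram matrix of inner products.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))              # 5 points in R^3
y = rng.choice([-1, 1], size=5)
alpha = rng.uniform(size=5)              # arbitrary, not optimal multipliers

w = (alpha * y) @ X                      # w = sum_i alpha_i y_i x_i
lhs = w @ w                              # ||w||^2 = <w, w>

K = X @ X.T                              # Gram matrix K[i, j] = <x_i, x_j>
coef = alpha * y
rhs = coef @ K @ coef                    # sum_i sum_j alpha_i alpha_j y_i y_j <x_i, x_j>

print(np.isclose(lhs, rhs))              # True
```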
Now, substituting $w$ into $\sum_i\alpha_i[y_i(\langle w, x_i\rangle+b)-1]$ can be done in parts:
$$\sum_i\alpha_i[y_i(\langle w, x_i\rangle+b)-1] = \sum_i\alpha_iy_i\langle w, x_i\rangle + b\sum_i\alpha_iy_i - \sum_i\alpha_i$$
The last term on the r.h.s. is already in its final form, $-\sum_i\alpha_i$, and the middle term equals $0$, because the second constraint is $\sum_i\alpha_iy_i = 0$. Substituting $w = \sum_j\alpha_jy_jx_j$ into the first term gives:
$$\sum_i\alpha_iy_i\langle w, x_i\rangle =\sum_i\alpha_iy_i\langle \sum_j\alpha_jy_jx_j, x_i\rangle = \sum_i\sum_j\alpha_iy_i\alpha_jy_j\langle x_i, x_j \rangle$$
where the last step again uses bilinearity of the inner product. Note that this is exactly the expression we obtained for $||w||^2$ above.
Having gotten this far, we need to remember to a) multiply $||w||^2$ by $1/2$, b) multiply the long second term by $-1$, and c) combine them:
$${1\over 2}\sum_i \sum_j \alpha_iy_i\alpha_jy_j\langle x_i,x_j\rangle - \sum_i \sum_j \alpha_iy_i\alpha_jy_j\langle x_i,x_j\rangle - 0 + \sum_i\alpha_i $$
which evidently reduces to the desired result
$$-{1\over 2}\sum_i \sum_j \alpha_iy_i\alpha_jy_j\langle x_i,x_j\rangle + \sum_i\alpha_i $$
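The whole reduction can also be checked numerically, under the two conditions used above: $w = \sum_i\alpha_iy_ix_i$ and $\sum_i\alpha_iy_i = 0$. The $\alpha$ values and $b$ below are hypothetical; the positive-class $\alpha$'s are rescaled only so that the second condition holds.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(6, 2))
y = np.array([-1, -1, -1, 1, 1, 1])
alpha = rng.uniform(0.1, 1.0, size=6)

# Rescale the positive-class alphas so that sum_i alpha_i y_i = 0.
alpha[y == 1] *= alpha[y == -1].sum() / alpha[y == 1].sum()
assert np.isclose(alpha @ y, 0.0)

w = (alpha * y) @ X                      # w = sum_i alpha_i y_i x_i
b = 0.37                                 # arbitrary: the b term vanishes anyway

# Lagrangian: (1/2)||w||^2 - sum_i alpha_i [y_i(<w, x_i> + b) - 1]
lagrangian = 0.5 * (w @ w) - np.sum(alpha * (y * (X @ w + b) - 1))

# Dual objective: -(1/2) sum_i sum_j alpha_i alpha_j y_i y_j <x_i, x_j> + sum_i alpha_i
K = X @ X.T
coef = alpha * y
dual = -0.5 * coef @ K @ coef + alpha.sum()

print(np.isclose(lagrangian, dual))      # True
```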
Because $\zeta$ is a vector; in the first link you will see they derive it element by element, more specifically: $$ \frac{\partial L}{\partial \zeta_i} = C - \alpha_i - \beta_i $$
It can perhaps be better understood if we look at the Lagrangian again:
$L(w,b,\zeta,\alpha,\beta) = ... + C\sum_{i=1}^l\zeta_i - \sum_{i=1}^l\alpha_i\zeta_i - \sum_{i=1}^l\beta_i\zeta_i$
$L(w,b,\zeta,\alpha,\beta) = ... + \sum_{i=1}^l(C\zeta_i -\alpha_i\zeta_i - \beta_i\zeta_i) = ... + \sum_{i=1}^l\zeta_i(C - \alpha_i - \beta_i)$
And from here we can see that the derivative with respect to $\zeta_i$ is $0$ if and only if $C - \alpha_i - \beta_i = 0$.
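A finite-difference check of that per-element derivative, with hypothetical values for $C$, $\alpha$, $\beta$, and $\zeta$ (only the $\zeta$-dependent part of the Lagrangian is needed):

```python
import numpy as np

rng = np.random.default_rng(2)
C = 1.5
alpha = rng.uniform(size=4)
beta = rng.uniform(size=4)
zeta = rng.uniform(size=4)

def L_zeta(z):
    # zeta-dependent part: C * sum_i z_i - sum_i alpha_i z_i - sum_i beta_i z_i
    return C * z.sum() - alpha @ z - beta @ z

i, eps = 2, 1e-6
z_plus = zeta.copy()
z_plus[i] += eps
grad_fd = (L_zeta(z_plus) - L_zeta(zeta)) / eps     # numeric d L / d zeta_i

print(np.isclose(grad_fd, C - alpha[i] - beta[i]))  # True
```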