[Math] Novikoff's Proof for Perceptron Convergence

algorithms, machine learning, proof-explanation

In machine learning, the perceptron algorithm converges on linearly separable data in a finite number of steps. One can prove that $(R/\gamma)^2$ is an upper bound on the number of errors the algorithm makes. Here $R=\max_{i=1}^{n}||\vec{x}_i||$ is the radius of a sphere containing the data $\mathcal{X}=\{(\vec{x}_i,y_i):1\le i\le n\}$, and $\gamma>0$ is the separation margin (the data is assumed to be linearly separable).
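As a quick sanity check of this claim, here is a minimal NumPy sketch of the perceptron mistake-counting loop on a toy linearly separable dataset; the data, the helper name `perceptron_mistakes`, and the choice of $\vec{w}_*$ are illustrative assumptions of mine, and the last line compares the observed number of updates against the bound $(R/\gamma)^2$.

```python
import numpy as np

def perceptron_mistakes(X, y, max_epochs=1000):
    """Run the perceptron update rule and count the updates (mistakes) it makes."""
    w = np.zeros(X.shape[1])
    mistakes = 0
    for _ in range(max_epochs):
        made_mistake = False
        for x_i, y_i in zip(X, y):
            if y_i * np.dot(w, x_i) <= 0:   # misclassified (or on the boundary)
                w += y_i * x_i              # update rule: w <- w + y*x
                mistakes += 1
                made_mistake = True
        if not made_mistake:                # converged: every sample classified correctly
            break
    return w, mistakes

# Toy linearly separable data; labels follow the sign of the first coordinate.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.where(X[:, 0] >= 0, 1.0, -1.0)
X[:, 0] += y                                # push the classes apart to create a margin

w, t = perceptron_mistakes(X, y)
w_star = np.array([1.0, 0.0])               # a unit-norm separating direction
gamma = np.min(y * (X @ w_star))            # margin attained by that direction
R = np.max(np.linalg.norm(X, axis=1))
print(t, "<=", (R / gamma) ** 2)            # Novikoff's bound: t <= (R/gamma)^2
```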

I'm looking at Novikoff's proof from 1962. Let $\phi$ be the angle between $\vec{w}_t$ (the weight vector after $t$ update steps) and $\vec{w}_*$ (the optimal weight vector), where $||\vec{w}_*||$ is normalized to $1$.
The number of steps is then bounded using
$$1=\max(\cos^2\phi)\ge\cos^2\phi= \left( \dfrac{\langle\vec{w}_t , \vec{w}_*\rangle}{||\vec{w}_t||\underbrace{||\vec{w}_*||}_{=1}} \right)^2.$$
He then expands the numerator as
$$\langle\vec{w}_t , \vec{w}_*\rangle^2 = \langle\vec{w}_{t-1}+y\vec{x} , \vec{w}_*\rangle^2\stackrel{(1)}{\ge} (\langle\vec{w}_{t-1} , \vec{w}_*\rangle+\gamma)^2\stackrel{(2)}{\ge}t^2\gamma^2.$$
The first equality holds because it just unrolls the most recent update, i.e. adds back the last misclassified sample. Why (1) is true is the first thing that puzzles me a bit. Is it because $\langle\vec{w}_*,y\vec{x}\rangle\ge\gamma$, i.e. the inner product of the optimal weights with any sample is at least the minimal margin $\gamma$? And at (2) I'm completely lost as to why this must hold. My lecture notes just say "induction over $t,\vec{w}_0=0$".

As for the denominator, I have
$$||\vec{w}_t||^2=||\vec{w}_{t-1}+y\vec{x}||^2\stackrel{(3)}{\le}||\vec{w}_{t-1}||^2+R^2\stackrel{(2)}{\le}tR^2,$$
which again uses the induction at (2) and also a new relation at (3) that is unclear to me.

In the end we obtain $$1\ge\dfrac{t^2\gamma^2}{tR^2}=t\left(\dfrac{\gamma}{R}\right)^2\Leftrightarrow t\le \left(\dfrac{R}{\gamma}\right)^2$$
which is what we wanted to prove.
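To put a concrete (made-up) number on the bound: if, say, $R=2$ and $\gamma=0.5$, then
$$t\le\left(\dfrac{R}{\gamma}\right)^2=\left(\dfrac{2}{0.5}\right)^2=16,$$
so the algorithm makes at most $16$ updates, independently of the number of samples and of the order in which they are presented.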

tl;dr: Explain steps (1), (2), and (3).

Best Answer

In Novikoff's theorem, we assume that

i) The data is linearly separable: $$\exists \vec{w}_* \text{ } \exists \gamma > 0 \text{ } \forall(\vec{x}, y) \in \mathcal{X}: \langle\vec{w}_*, \vec{x}\rangle y \ge \gamma .$$ ii) The weights are updated following Hebb's rule: $$\text{if } \langle\vec{w}_{t-1},\vec{x}\rangle y < 0, \text{ then } \vec{w}_t \leftarrow \vec{w}_{t-1} + y\vec{x} .$$

Thus,

  1. $$\langle\vec{w}_t , \vec{w}_*\rangle^2 = \langle\vec{w}_{t-1}+y\vec{x} , \vec{w}_*\rangle^2 = (\langle\vec{w}_{t-1}, \vec{w}_*\rangle + \langle\vec{w}_*, y\vec{x}\rangle)^2 = (\langle\vec{w}_{t-1}, \vec{w}_*\rangle + \langle\vec{w}_*, \vec{x}\rangle y)^2 \ge (\langle\vec{w}_{t-1} , \vec{w}_*\rangle+\gamma)^2 .$$

  2. $$(\langle\vec{w}_{t-1}, \vec{w}_*\rangle + \gamma)^2 \ge (\langle\vec{w}_{t-2}, \vec{w}_*\rangle + 2\gamma)^2 \ge \ldots \ge$$

$$\ge (\langle\vec{w}_{0}, \vec{w}_*\rangle + t\gamma)^2 = (\langle\vec{0}, \vec{w}_*\rangle + t\gamma)^2 = t^2\gamma^2 ,$$

where each "$\ge$" applies step 1 again, each time to the sample misclassified at that earlier update (these samples generally differ from update to update), and $\vec{w}_0 = 0$.
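A cleaner way to run the induction in step 2 (this reformulation is mine, not part of the original answer) is to work with the unsquared quantities: by induction over $t$ with $\vec{w}_0 = 0$,

$$\langle\vec{w}_t, \vec{w}_*\rangle = \langle\vec{w}_{t-1}, \vec{w}_*\rangle + \langle\vec{w}_*, \vec{x}\rangle y \ge \langle\vec{w}_{t-1}, \vec{w}_*\rangle + \gamma \ge \ldots \ge \langle\vec{w}_{0}, \vec{w}_*\rangle + t\gamma = t\gamma ,$$

where $\vec{x}$ always denotes the sample misclassified at the respective update. Since $t\gamma \ge 0$, both sides are nonnegative, so squaring gives $\langle\vec{w}_t, \vec{w}_*\rangle^2 \ge t^2\gamma^2$; this is exactly (2), and it also justifies squaring the intermediate inequalities above.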

  3. $$||\vec{w}_t||^2 = ||\vec{w}_{t-1} + y\vec{x}||^2 = ||\vec{w}_{t-1}||^2 + 2\langle\vec{w}_{t-1}, \vec{x}\rangle y + ||\vec{x}||^2 \le$$

$$\le ||\vec{w}_{t-1}||^2 + ||\vec{x}||^2 \le ||\vec{w}_{t-1}||^2 + R^2 \le \ldots \le ||\vec{w}_0||^2 + tR^2 = tR^2 ,$$

where the cross term can be dropped because $\langle\vec{w}_{t-1}, \vec{x}\rangle y < 0$ by (ii): an update only happens on a misclassified sample. This explains (3), and iterating the last inequality $t$ times (again with $\vec{w}_0 = 0$) gives (2) for the denominator.
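To see both invariants numerically, here is a small check (my own sketch, using the same illustrative toy data as in the code above, so none of these names come from the answer): after every update $t$ it asserts $\langle\vec{w}_t,\vec{w}_*\rangle\ge t\gamma$ and $||\vec{w}_t||^2\le tR^2$.

```python
import numpy as np

# Toy separable data, illustrative only (same construction as in the earlier sketch).
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 2))
y = np.where(X[:, 0] >= 0, 1.0, -1.0)
X[:, 0] += y

w_star = np.array([1.0, 0.0])            # unit-norm separating direction
gamma = np.min(y * (X @ w_star))         # its margin
R = np.max(np.linalg.norm(X, axis=1))

w, t = np.zeros(2), 0
for _ in range(1000):                    # epochs until no update is needed
    updated = False
    for x_i, y_i in zip(X, y):
        if y_i * np.dot(w, x_i) <= 0:    # misclassified: apply the update
            w += y_i * x_i
            t += 1
            # The two invariants established in steps (1)/(2) and (3):
            assert np.dot(w, w_star) >= t * gamma - 1e-9
            assert np.dot(w, w) <= t * R**2 + 1e-9
            updated = True
    if not updated:
        break
print("updates:", t, " bound (R/gamma)^2:", (R / gamma) ** 2)
```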