[Math] Batch vs incremental gradient descent

Tags: convex optimization, machine learning, optimization

I am studying Machine Learning, but I believe you guys should be able to help me with this!

Basically, we are given a set of training data $\{(x_1,y_1), (x_2,y_2), \ldots, (x_n, y_n)\}$, and we need to train a perceptron to fit the data as well as possible. A perceptron here is a simple model consisting of a weight vector $w$; the perceptron outputs $w^Tx$ for an input $x$.

We define an error function $E(w) = \frac{1}{2N} \sum_{d=1}^{N}(t_d - o_d)^2$, where $t_d - o_d$ is simply the difference between the ideal target value $t_d$ and our output $o_d = w^Tx_d$.

A way to minimize the error is gradient descent; computing the gradient, we obtain

$\frac{\partial E}{\partial w_i} = \frac{1}{N} \sum_{d=1}^{N}(t_d - o_d)(-x_{id})$
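
(To spell out the step: since $o_d = w^Tx_d$, we have $\frac{\partial o_d}{\partial w_i} = x_{id}$, so the chain rule gives

$\frac{\partial E}{\partial w_i} = \frac{1}{2N} \sum_{d=1}^{N} 2(t_d - o_d)\left(-\frac{\partial o_d}{\partial w_i}\right) = \frac{1}{N} \sum_{d=1}^{N}(t_d - o_d)(-x_{id}).)$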

Now, in a computer algorithm, I can, on each iteration, update each $w_i$ to $w_i - \eta \frac{\partial E}{\partial w_i}$, but the problem is that this is slow to compute, as the sum ranges over the whole training set.
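
For concreteness, here is what that batch update might look like in code. This is just an illustrative sketch; the names `eta`, `n_epochs`, and the use of NumPy are my own choices, not from the question:

```python
import numpy as np

def batch_gradient_descent(X, t, eta=0.01, n_epochs=100):
    """Batch gradient descent for a linear unit o = w^T x.

    X : (N, D) array of inputs, t : (N,) array of targets.
    Every weight update touches the entire training set.
    """
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(n_epochs):
        o = X @ w                      # outputs for all N examples at once
        grad = -(X.T @ (t - o)) / N    # dE/dw = (1/N) * sum_d (t_d - o_d)(-x_d)
        w -= eta * grad                # descend: w <- w - eta * dE/dw
    return w
```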

So, what has been invented is the LMS (Least Mean Squares) rule, which claims that

$\frac{1}{N} \sum_{d=1}^{N}(t_d - o_d)(-x_{id}) \approx (t_d - o_d)(-x_{id})$

which means I can use just the current training example to perform each gradient descent step.
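
And here is the incremental (LMS) version of the same sketch, where the weights move after every single example instead of after a full pass; again the names are illustrative:

```python
import numpy as np

def lms_gradient_descent(X, t, eta=0.01, n_epochs=100):
    """Incremental (LMS) gradient descent for a linear unit o = w^T x.

    The weights are updated after every example, treating
    (t_d - o_d)(-x_d) as a noisy one-sample estimate of the batch gradient.
    """
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(n_epochs):
        for d in np.random.permutation(N):  # shuffle to avoid ordering bias
            o_d = w @ X[d]                  # output for this one example
            w += eta * (t[d] - o_d) * X[d]  # w <- w - eta * (t_d - o_d)(-x_d)
    return w
```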

Now, after this intro, I would like to ask for a bit more intuition and formality behind this LMS rule, and why it is a good enough approximation. I guess I would like a bit more explanation of the $\approx$ part of the above equation: when does it hold, and how well? Thanks for any help.

Best Answer

One thing you're missing is that perceptrons are typically formulated as binary classifiers: there is a threshold on $w^Tx$, e.g. $\mathrm{sign}(w^Tx)$, so that $t_d, o_d \in \{-1, 1\}$ (or equivalently $\{0, 1\}$ if you use $\mathbb{I}(w^Tx > 0)$; it works out effectively the same).

The short answer is that it's not a great approximation, in an absolute sense. It is guaranteed to converge to some weight vector that yields zero classification error, if any such vector exists, but there are no guarantees about how long it will take you to get there, and no guarantee that any single step will make your error rate go down. In general such methods (this is an instance of a more broadly applicable method known as "stochastic gradient descent") have terrible asymptotic convergence properties, but are still used for other reasons.
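
If a concrete check helps, here is a toy illustration (my own, not from any reference): the thresholded perceptron rule run on a linearly separable dataset does reach zero training error, as the convergence guarantee says, though how many epochs that takes depends on the data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linearly separable data: labels come from a hidden vector w_true.
N, D = 200, 3
X = rng.normal(size=(N, D))
w_true = np.array([1.0, -2.0, 0.5])
t = np.sign(X @ w_true)              # targets in {-1, +1}

# Classic perceptron rule: update only on misclassified examples.
w = np.zeros(D)
eta = 0.1
for epoch in range(100):
    errors = 0
    for d in range(N):
        if np.sign(w @ X[d]) != t[d]:
            w += eta * t[d] * X[d]   # push w toward the correct side of example d
            errors += 1
    if errors == 0:                  # zero classification error on the training set
        print("converged after", epoch, "epochs")
        break
```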

The notes from Geoff Hinton's undergrad course have some helpful insight on the matter (with the necessary SVM-bashing). If you want a formal proof, just Google "perceptron convergence proof". It follows from the fact that if the classes are linearly separable, then the set of weight vectors that achieve zero training error (there will typically be infinitely many) forms a convex region in weight space.
