[Math] Step by step LMS for learning a linear function

algorithms, least squares, machine learning

Disclaimer

Since this is an exercise assignment, I'm not looking for a complete solution but for help that enables me to solve it on my own.

The task

Given the error function

$E=\sum_{\langle b,V_{train}(b)\rangle} (V_{train}(b)-V'(b))^2$

with

$b$ being the input for which our function predicts a value

$V_{train}(b)$ being the training value

$V'(b)$ being the learned function

try to learn a linear function to approximate the true function.
There are 5 given training examples: $(1,1), (3,3), (5,4), (7,3), (9,5)$.

a) Learn the function by using the LMS algorithm (η = 0.1). Only
present each example once, in the order given by the above list. As
initialization use the following linear function: y = x.

b) If all 5 training examples were given in advance, how can the best approximated
linear function be directly calculated? What is it? (Hint: Refresh your
mathematical knowledge on determining the minimum of a function, as you want
to minimize the error function.)

c) Compare both results. What do you observe? Explain.

Guidance from the lecture

The lecture slides propose the following course of action for a similar task:

For each training example $\langle b,V_{train}(b)\rangle$:

  • Calculate: $error(b)= V_{train}(b)-V'(b)$

  • For each weight $w_i$ calculate:

    $w_i \leftarrow w_i + \eta \cdot x_i \cdot error(b)$

In addition to this, it's mentioned earlier that

$V'(b)= w_0 + w_1 x_1 \dots$

where I'm leaving out the rest of the $x_i$ because we only have $x_1$ in this task.

So, looking back at all this, we seem to have everything except $V'(b)$, since the weights are unknown.

Question

How do I get started?

Putting in all known variables, I'm left with the weights unknown. How am I to get them?

Edit: Related to LMS I found this formula (adapted to the variables here):

$sgn(w^T x_i + w_0 ) = V_{train}(b) $

What is $w^T$ here?

If anything is unclear or I forgot something please leave a comment and I'll take care of it.

Best Answer

I will make some changes in notation to clarify. For a brief answer, go to the end of this answer (Summarizing).

We need to find the parameters $w_i$, $i=0,1,\dotsc$, of a linear function (known as the hypothesis function) $$h_w(x)=w_0+w_1x,$$ where $x$ is the input (your $b$) and $h_w(x)$ is the output of your learned function, $V'(b)$. Note that I am using only $w_0,w_1$, since your training set has the form $(x,y)$; for linear regression with $n$ input variables you would need a hypothesis function like $$h_w(x)=w_0+w_1x_1+w_2x_2+\dotsm+w_nx_n.$$ Also, $w_0$ is your bias, axis intersection or intercept term, i.e., if the formula of a linear function is $y=a+bx$, then here $w_0$ is equivalent to $a$.

Now you need to find the values of $w_0$ and $w_1$. To work less, you can use matrix-vector operations. So if you have two vectors $\mathbf w, \mathbf x$ such that $$\mathbf w=\begin{pmatrix}w_0\\w_1\\\vdots\\w_n\end{pmatrix}\quad\text{and}\quad\mathbf x=\begin{pmatrix}x_0\\x_1\\\vdots\\x_n\end{pmatrix},$$ you can obtain the weighted sum as their inner product $$\mathbf w^T\mathbf x=\begin{pmatrix}w_0\;w_1\;\dotsm\;w_n\end{pmatrix}\begin{pmatrix}x_0\\x_1\\\vdots\\x_n\end{pmatrix}=w_0x_0+w_1x_1+w_2x_2+\dotsm+w_nx_n,$$ where $\mathbf w^T$ is the transpose of $\mathbf w$ (this also answers the question in your edit). Note that we use $x_0=1$ to fix the intercept term $w_0$ (since $w_0x_0=w_0\cdot 1=w_0$). Thus, to simplify our notation, we can compute every predicted value by $$h(\mathbf x)=\sum_{i=0}^n w_ix_i=\mathbf w^T\mathbf x.$$ Finally, suppose our training set has $m$ elements. We define the cost function (error function), which maps each value of the parameters $w$ to an error in $\mathbf R$ (the real numbers): $$E(w)=\frac{1}{2}\sum_{i=1}^m(h_w(x^{(i)}) - y^{(i)})^2,$$ where $(x^{(i)},y^{(i)})\in \text{Training set}$, i.e., $(x^{(i)}, y^{(i)})$ is the $i^{\text{th}}$ element of the training set. (To avoid confusion: an upper index refers to the $i^{\text{th}}$ element of the training set, a lower index to the $i^{\text{th}}$ parameter of $h_w$.) The extra factor $\frac{1}{2}$ only simplifies the derivative below and does not change where the minimum is.
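As a concrete illustration, here is a minimal Python/NumPy sketch of $h_w(\mathbf x)=\mathbf w^T\mathbf x$ and of this cost function on your five training examples; the names `hypothesis` and `cost` are mine, not from the lecture:

```python
import numpy as np

# Training data from the task; a column of ones (x_0 = 1) fixes the intercept w_0.
X = np.array([[1, 1], [1, 3], [1, 5], [1, 7], [1, 9]], dtype=float)
y = np.array([1, 3, 4, 3, 5], dtype=float)

def hypothesis(w, x):
    """h_w(x) = w^T x, the dot product of the weights with the augmented input."""
    return w @ x

def cost(w, X, y):
    """E(w) = 1/2 * sum_i (h_w(x^(i)) - y^(i))^2."""
    residuals = X @ w - y
    return 0.5 * residuals @ residuals

w_init = np.array([0.0, 1.0])   # the given initialization y = x, i.e. w_0 = 0, w_1 = 1
print(cost(w_init, X, y))       # error of the initial linear function
```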

LMS algorithm. We consider the gradient descent algorithm, which starts with some initial $w$ and repeatedly performs the update $$w_j:=w_j-\eta\frac{\partial}{\partial w_j}E(w).$$ (This update is performed simultaneously for all $j$, i.e., first compute the derivative for every $j$, then update all the weights.)

In order to apply this algorithm to the linear function case, we compute the partial derivative: $$\begin{array}{rcl}\frac{\partial}{\partial w_j}E(w)&=&\frac{\partial}{\partial w_j}\frac{1}{2}\sum_{i=1}^m(h_w(x^{(i)}) - y^{(i)})^2\\&=&\frac{1}{2}\sum_{i=1}^m\frac{\partial}{\partial w_j}(h_w(x^{(i)}) - y^{(i)})^2\\&=&\frac{1}{2}\sum_{i=1}^m2\,(h_w(x^{(i)}) - y^{(i)})\,x_j^{(i)}\\&=&\sum_{i=1}^m(h_w(x^{(i)}) - y^{(i)})\,x_j^{(i)}.\end{array}$$ Thus, you must update each weight by $$w_j:=w_j+\eta\sum_{i=1}^m(y^{(i)}-h_w(x^{(i)}))x_j^{(i)}.$$
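Note that if you update with a single training example at a time (take $m=1$ and use only the current example), this rule reduces exactly to the one from your lecture slides, $w_j := w_j + \eta\, x_j\, error(b)$; that incremental version is what part (a) asks for, while summing over all $m$ examples gives batch gradient descent. For instance, starting from the given initialization $y=x$ (so $w_0=0$, $w_1=1$, and $x_0=1$), a single incremental step on the example $(5,4)$ would be $$w_0 := 0 + 0.1\,(4-5)\cdot 1 = -0.1,\qquad w_1 := 1 + 0.1\,(4-5)\cdot 5 = 0.5.$$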

In a similar spirit to $\mathbf w^T\mathbf x$, we can compute the error with matrix-vector operations (your concrete case is written out in the summary below). Suppose that your training inputs are stacked as the rows of $$X=\begin{pmatrix}-(x^{(1)})^T-\\-(x^{(2)})^T-\\\vdots\\-(x^{(m)})^T-\end{pmatrix}.$$ Also, let $\mathbf y$ be the $m$-dimensional vector with your target values: $$\mathbf y=\begin{pmatrix}y^{(1)}\\y^{(2)}\\\vdots\\y^{(m)}\end{pmatrix}.$$ Now, since $h_w(x^{(i)})=(x^{(i)})^T\mathbf w$, we can verify that $$\begin{array}{rcl}X\mathbf w -\mathbf y&=&\begin{pmatrix}(x^{(1)})^T\mathbf w\\\vdots\\(x^{(m)})^T\mathbf w\end{pmatrix}-\begin{pmatrix}y^{(1)}\\\vdots\\y^{(m)}\end{pmatrix}\\&=&\begin{pmatrix}(x^{(1)})^T\mathbf w-y^{(1)}\\\vdots\\(x^{(m)})^T\mathbf w-y^{(m)}\end{pmatrix}.\end{array} $$ Thus, using $\mathbf v^T\mathbf v = \sum_i\mathbf v_i^2$ for a vector $\mathbf v$, we obtain $$\begin{array}{rcl}E(w)&=&\frac{1}{2}\sum_{i=1}^m(h_w(x^{(i)}) - y^{(i)})^2\\&=&\frac{1}{2}(X\mathbf w-\mathbf y)^T(X\mathbf w-\mathbf y).\end{array}$$

Summarizing. To obtain $$\mathbf w=\begin{pmatrix}w_0\\w_1\end{pmatrix},$$ proceed:

  1. Set your collections $x$ and $y$: $$x=\begin{pmatrix}1\\3\\5\\7\\9\end{pmatrix}\qquad y=\begin{pmatrix}1\\3\\4\\3\\5\end{pmatrix}.$$
  2. Loop until you think that the $w_j$ have converged (try $10$ iterations; for part (a) of your task, present each example exactly once, in the given order), as in the code sketch after this list: {

    for $i=1$ to $m$ { $$w_j:=w_j+\eta\,(y^{(i)}-h_w(x^{(i)}))\,x_j^{(i)}\qquad(\text{for each }j\text{, with }x_0^{(i)}=1)$$ }

}
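To make step 2 concrete, here is a minimal Python/NumPy sketch of that per-example (incremental) loop; the variable names are mine, and for part (a) of your task the loop body runs exactly once per example, in the given order:

```python
import numpy as np

# Training set from the task; x_0 = 1 is prepended so that w_0 acts as the intercept.
X = np.array([[1, 1], [1, 3], [1, 5], [1, 7], [1, 9]], dtype=float)
y = np.array([1, 3, 4, 3, 5], dtype=float)

eta = 0.1                    # learning rate given in the task
w = np.array([0.0, 1.0])     # initialization y = x, i.e. w_0 = 0, w_1 = 1

# Incremental LMS: update the weights after each single example.
# One pass over the data in the given order corresponds to part (a).
for x_i, y_i in zip(X, y):
    error = y_i - w @ x_i    # V_train(b) - V'(b)
    w += eta * error * x_i   # w_j := w_j + eta * error * x_j  (for each j)

print(w)                     # weights after one pass
```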

Instead, you can use the vectorized form:

1'. Set your design matrix $X$ (with a first column of ones) and vector $\mathbf y$: $$X=\begin{pmatrix}1&1\\1&3\\1&5\\1&7\\1&9\end{pmatrix}\qquad\mathbf y=\begin{pmatrix}1\\3\\4\\3\\5\end{pmatrix}.$$ 2'. Loop until you think that $\mathbf w$ has converged (try $10$ iterations) { $$\mathbf w:=\mathbf w-\frac{\eta}{m}\,X^T(X\mathbf w-\mathbf y)$$ }
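Likewise, a minimal sketch of the vectorized loop from 2'. One caveat: with unscaled inputs like these, the batch update can diverge if the step size is too large, so I use a smaller learning rate here ($\eta = 0.02$ is my own choice, not part of the task):

```python
import numpy as np

X = np.array([[1, 1], [1, 3], [1, 5], [1, 7], [1, 9]], dtype=float)
y = np.array([1, 3, 4, 3, 5], dtype=float)

eta, m = 0.02, len(y)        # smaller learning rate so the batch iteration converges
w = np.zeros(2)              # any starting point works for batch gradient descent

for _ in range(10000):       # iterate until the weights stop changing noticeably
    gradient = X.T @ (X @ w - y)      # sum_i (h_w(x^(i)) - y^(i)) * x^(i)
    w -= (eta / m) * gradient

print(w)                     # should approach the least-squares solution
```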

Note. From the above, one can derive an analytical formula for the parameters: the normal equations give $\mathbf w$ directly as $$\mathbf w=(X^TX)^{-1}X^T\mathbf y.$$ This is the direct calculation asked about in part (b).
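For this direct calculation, a short NumPy sketch (solving the normal equations with `np.linalg.solve` instead of forming the explicit inverse, which is equivalent here but numerically nicer):

```python
import numpy as np

X = np.array([[1, 1], [1, 3], [1, 5], [1, 7], [1, 9]], dtype=float)
y = np.array([1, 3, 4, 3, 5], dtype=float)

# Normal equations: solve (X^T X) w = X^T y, i.e. w = (X^T X)^{-1} X^T y.
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)   # the directly computed least-squares line for part (b)
```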

You will probably want to use software (like Python or Octave) or a calculator with matrix-vector support to compute the values.

I have discussed two approaches: using iterations and using vectors. I have also skipped some steps specific to your data. If you have any questions or need additional steps, please ask.
