[Math] Implementing gradient descent based on formula

gradient descent, linear algebra, machine learning

The gradient descent algorithm is given as:

repeat
{

$$\displaystyle \theta_j := \theta_j - \frac{1}{m} \alpha \sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right) x^{(i)}_j $$

}
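
For concreteness, here is a minimal NumPy sketch (my own naming, not part of the original question) of one simultaneous update of all $\theta_j$ under this rule, assuming the design matrix `X` already carries the bias column $x_0 = 1$:

```python
import numpy as np

def gradient_descent_step(theta, X, y, alpha):
    """One simultaneous update of every theta_j.

    theta : (n+1,) parameter vector, theta[0] is the intercept theta_0
    X     : (m, n+1) design matrix whose first column is all ones (x_0 = 1)
    y     : (m,)   target values
    alpha : learning rate
    """
    m = len(y)
    errors = X @ theta - y           # h_theta(x^(i)) - y^(i) for every example i
    gradient = (X.T @ errors) / m    # (1/m) * sum_i (h - y) * x_j^(i), for each j
    return theta - alpha * gradient  # update all theta_j at once
```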

Given these values:

$$\begin{array}[t]{c|c}
x & y\\
\hline
1 & 1\\
2 & 2\\
3 & 3\\
4 & 4
\end{array}
$$

$\alpha$ (learning rate) $= 1$

$m$ (number of training examples) $= 4$

$h_\theta(x) = \theta_0 + \theta_1x_1$

Setting $\theta_0$ to $0$ and $\theta_1$ to $1$

Is this a correct implementation of this algorithm?

$$\theta_1 := \theta_1 - (1) \frac{1}{4}\big(0 + ((1)(1) - 1) + ((1.5)(2) - 2) + ((2)(3) - 3) + ((2.5)(4) - 4)\big)$$

This excludes the $x_j^{(i)}$ part, as I'm not sure what value it should take. Since $j$ in $\theta_j$ is the feature number and there is just one feature in this example, $x$, it seems $x_j^{(i)}$ should be $1$; but this does not make sense, as it only makes use of one feature parameter?

Update:

Take this extended example; is it still correct?

Updated gradient descent algorithm to be more explicit: replacing $h_\theta(x^{(i)})$ with $h_\theta(x_j^{(i)})$

repeat
{

$$\displaystyle \theta_j := \theta_j - \frac{1}{m} \alpha \sum_{i=1}^m \left(h_\theta(x_j^{(i)}) - y^{(i)}\right) x^{(i)}_j $$

}

Data set with 2 examples, each of which has 3 features:

$$\begin{array}[t]{c|c|c|c}
x_1 & x_2 & x_3 & y\\
\hline
1 & 2 & 1 & 1\\
4 & 2 & 5 & 2
\end{array}
$$

As $x$ now contains multiple features, the hypothesis function is now: $h_\theta(x) = \theta^Tx$

Initialising the $\theta$ matrix:
$$\begin{array}[t]{c|c|c|c}
\theta_0 & \theta_1 & \theta_2 & \theta_3 \\
\hline
0 & 1 & 2 & 1\\
0 & 4 & 2 & 5
\end{array}
$$

In this example I have set the learning rate to $0.5$.

1st iteration:

$$
\theta_{0}:= -\frac 12 (.5) \big[((1\cdot 4) + (4\cdot 1) + (2\cdot 2) - 1) + ((2\cdot 2) + (5\cdot 1) + (1\cdot 5) - 2)\big](1) = -5.75
$$
$$
\theta_{0} = -5.75
$$

How is $\theta_{1}$ calculated?

Update 2:

Each theta value has been picked at random:

$
\theta=\begin{bmatrix}
1\\
3\\
2\\
1
\end{bmatrix}
$

Dataset including a column of ones:

$$\begin{array}[t]{c|c|c|c|c}
x_0 & x_1 & x_2 & x_3 & y\\
\hline
1 & 1 & 2 & 1 & 1\\
1 & 4 & 2 & 5 & 2
\end{array}
$$

Again:

Gradient descent algorithm, including parentheses around what is summed:

repeat
{

$$\displaystyle \theta_j := \theta_j - \frac{1}{m} \alpha \bigg(\sum_{i=1}^m \left(\left(h_\theta(x_j^{(i)}) - y^{(i)}\right) x^{(i)}_j\right)\bigg) $$

}

$h_\theta(x) = \theta^Tx$

After the first iteration there should be computed values for $\theta_{0}, \theta_{1}, \theta_{2}, \theta_{3}$?

Is this then correct?

$x_0^{(0)}$=1

$x_0^{(1)}$=1

$x_1^{(0)}$=1

$x_1^{(1)}$=4

$x_2^{(0)}$=2

$x_2^{(1)}$=2

$x_3^{(0)}$=1

$x_3^{(1)}$=5

$$
\theta_{0}:= -\frac 12 (.5) \big[((1\cdot 1) + (3\cdot 1) + (2\cdot 2) + (1\cdot 1) - 1\cdot 1) + ((1\cdot 1) + (3\cdot 4) + (2\cdot 2) + (1\cdot 5) - 2\cdot 1)\big]
$$
$$
\theta_{1}:= -\frac 12 (.5) \big[((1\cdot 1) + (3\cdot 1) + (2\cdot 2) + (1\cdot 1) - 1\cdot 1) + ((1\cdot 1) + (3\cdot 4) + (2\cdot 2) + (1\cdot 5) - 2\cdot 4)\big]
$$
$$
\theta_{2}:= -\frac 12 (.5) \big[((1\cdot 1) + (3\cdot 1) + (2\cdot 2) + (1\cdot 1) - 1\cdot 2) + ((1\cdot 1) + (3\cdot 4) + (2\cdot 2) + (1\cdot 5) - 2\cdot 2)\big]
$$
$$
\theta_{3}:= -\frac 12 (.5) \big[((1\cdot 1) + (3\cdot 1) + (2\cdot 2) + (1\cdot 1) - 1\cdot 1) + ((1\cdot 1) + (3\cdot 4) + (2\cdot 2) + (1\cdot 5) - 2\cdot 5)\big]
$$

So every theta calculation is the same, except that $x_j^{(i)}$ is updated?

Best Answer

Firstly, let's make the convention that $x = x_1$ and that $x_1^{(i)}$ denotes the value of $x_1$ in the $i$-th row (see the example); equivalently, $x_1^{(i)}$ is the value of $x_1$ of the $i$-th training example. We notice that $x^{(1)}_1 = 1.$ We denote by $^{k}\theta_j$ the value of $\theta_j$ after $k$ updates (i.e. after $k$ repetitions of the algorithm). After one update we have:
$$ \ ^{1}\theta_{0}:= \, ^{0}\theta_{0} - \frac 14 \bigg[\left(^{0}\theta_0+\,^{0}\theta_1x_1^{(1)}-y^{(1)}\right)\cdot x_0^{(1)} + \left(^{0}\theta_0+\,^{0}\theta_1x_1^{(2)}-y^{(2)}\right)\cdot x_0^{(2)} + \left(^{0}\theta_0+\,^{0}\theta_1x_1^{(3)}-y^{(3)}\right)\cdot x_0^{(3)} + \left(^{0}\theta_0+\,^{0}\theta_1x_1^{(4)}-y^{(4)}\right)\cdot x_0^{(4)} \bigg]= 0, $$
since $^{0}\theta_0 = 0,$ $^{0}\theta_1 = 1$ and $x_0^{(i)} = 1.$
$$ \ ^{1}\theta_{1}:= \, ^{0}\theta_{1} - \frac 14 \bigg[\left(^{0}\theta_0+\,^{0}\theta_1x_1^{(1)}-y^{(1)}\right)\cdot x_1^{(1)} + \left(^{0}\theta_0+\,^{0}\theta_1x_1^{(2)}-y^{(2)}\right)\cdot x_1^{(2)} + \left(^{0}\theta_0+\,^{0}\theta_1x_1^{(3)}-y^{(3)}\right)\cdot x_1^{(3)} + \left(^{0}\theta_0+\,^{0}\theta_1x_1^{(4)}-y^{(4)}\right)\cdot x_1^{(4)} \bigg]= 1. $$

Thus, $^{1}\theta_1 = 1$ and $^{1}\theta_0 = 0.$ Notice that if we want to proceed to the next iteration, we need both $\theta_0$ and $\theta_1$, which we found at the previous step. So:
$$^{2}\theta_1 := \, ^{1}\theta_{1} - \frac 14 \bigg[\left(^{1}\theta_0+\,^{1}\theta_1x_1^{(1)}-y^{(1)}\right)\cdot x_1^{(1)} + \left(^{1}\theta_0+\,^{1}\theta_1x_1^{(2)}-y^{(2)}\right)\cdot x_1^{(2)} + \left(^{1}\theta_0+\,^{1}\theta_1x_1^{(3)}-y^{(3)}\right)\cdot x_1^{(3)} + \left(^{1}\theta_0+\,^{1}\theta_1x_1^{(4)}-y^{(4)}\right)\cdot x_1^{(4)} \bigg]= 1$$

So, regardless of how many updates we apply, the value of $\theta_1$ will remain constantly equal to $1,$ since at every iteration we have $\theta_0 = 0$ and $\theta_1 = 1.$
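
As a quick numerical check (a sketch with variable names of my choosing, not part of the original answer), running this update on the data $x = y = (1,2,3,4)$ with $\alpha = 1$ and the initial $\theta_0 = 0,$ $\theta_1 = 1$ indeed leaves $\theta$ unchanged:

```python
import numpy as np

X = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])           # bias column x_0 = 1, then x_1
y = np.array([1.0, 2.0, 3.0, 4.0])
theta = np.array([0.0, 1.0])         # theta_0 = 0, theta_1 = 1
alpha, m = 1.0, len(y)

for _ in range(3):                   # repeat the update a few times
    theta = theta - alpha / m * (X.T @ (X @ theta - y))
    print(theta)                     # stays [0. 1.]: the hypothesis already fits the data
```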


About Update 2: here is what I would do if I were in your shoes. First of all, I would calculate separately $h_\theta(x^{(1)})$ and $h_\theta(x^{(2)}),$ where $\theta$ is our initial vector $\theta=[ 1 \quad 3 \quad 2 \quad 1]^T.$

We have:

$$h_\theta(x^{(1)})= 1\cdot 1 + 3\cdot 1 +2\cdot 2 + 1\cdot 1 = 9 $$

$$h_\theta(x^{(2)})= 1\cdot 1 + 3\cdot 4 + 2\cdot 2 + 1\cdot 5 = 22. $$

Thus, if we implement the algorithm, we get: $$\begin{array}[t]{l} \theta_0 = 1-\frac{0.5}{2}\cdot \left[(9-1)\cdot 1 +(22-2)\cdot 1 \right]=-6\\ \theta_1 = 3 - \frac{0.5}{2}\cdot\left[(9-1)\cdot 1 + (22-2)\cdot 4\right]=-19\\ \theta_2 = 2 -\frac{0.5}{2}\cdot \left[(9-1)\cdot 2+(22-2)\cdot 2\right]=-12\\ \theta_3 = 1-\frac{0.5}{2}\cdot \left[(9-1)\cdot 1+(22-2)\cdot 5\right]=-26 \end{array} $$

Thus, after one update the new $\theta=[-6 \quad -19 \quad -12 \quad -26]^T.$

If you want to apply the algorithm once again, evaluate the new $h_\theta(x^{(1)})$ and $h_\theta(x^{(2)})$ and proceed as before.
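
Translating this into code (again just a sketch, with names of my choosing), one vectorised update on the two-example data reproduces $\theta=[-6 \quad -19 \quad -12 \quad -26]^T,$ and a second update simply recomputes $h_\theta$ with the new $\theta$ and repeats the same step:

```python
import numpy as np

X = np.array([[1.0, 1.0, 2.0, 1.0],     # example 1: x_0 = 1, x_1, x_2, x_3
              [1.0, 4.0, 2.0, 5.0]])    # example 2
y = np.array([1.0, 2.0])
theta = np.array([1.0, 3.0, 2.0, 1.0])  # the randomly picked initial theta
alpha, m = 0.5, len(y)

h = X @ theta                           # h_theta(x^(1)) = 9, h_theta(x^(2)) = 22
theta = theta - alpha / m * (X.T @ (h - y))
print(theta)                            # [ -6. -19. -12. -26.]

# second update: recompute h with the new theta and apply the same step
h = X @ theta
theta = theta - alpha / m * (X.T @ (h - y))
```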

Notice that the professor uses the convention: $$\begin{array}[t]{c|c|c|c} x_0 & x_1 & x_2 & x_3\\ \hline x_0^{(1)} = 1 & x_1^{(1)} = 1 & x_2^{(1)} = 2 & x_3^{(1)} = 1\\ x_0^{(2)} = 1 & x_1^{(2)}=4 & x_2^{(2)} = 2 & x_3^{(2)} =5 \end{array} $$
