There are several options available to you:
1. Compute the derivatives by hand and then implement them in code.
2. Use a symbolic computation package (Maple, Mathematica, Wolfram Alpha, etc.) to find the derivatives. Some of these packages will translate the resulting formulas directly into code.
3. Use an automatic differentiation (AD) tool that takes a program for computing the cost function and (using compiler-like techniques) produces a program that computes the derivatives as well as the cost function.
4. Use finite difference formulas to approximate the derivatives.
For anything other than the simplest problems (like ordinary least squares), option 1 is a poor choice. Most experts on optimization will tell you that it is very common for users of optimization software to supply incorrect derivative formulas to optimization routines. This typically leads to slow convergence or no convergence at all.
Option 2 is a good one for most relatively simple cost functions. It doesn't require really exotic tools.
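For instance, base R can do a little symbolic differentiation on its own; here is a minimal sketch (the squared-residual expression is my own toy example, not from any particular problem):

```r
# D() symbolically differentiates an expression with respect to a named variable.
f <- expression((y - a * exp(-b * x))^2)  # a toy squared-residual cost
D(f[[1]], "a")  # derivative with respect to parameter a
D(f[[1]], "b")  # derivative with respect to parameter b
# deriv() additionally builds an R function that evaluates value and gradient:
g <- deriv(f[[1]], c("a", "b"), function.arg = c("a", "b", "x", "y"))
```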
Option 3 really shines when the cost function is computed by a fairly complicated program for which you have the source code. However, AD tools are specialized, and not many users of optimization software are familiar with them.
Option 4 is sometimes a necessary choice. If you have a "black box" function that you can't get source code for (or that is so badly written that AD tools can't handle it), finite difference approximations can save the day. However, using finite difference approximations has a significant cost in run time and in the accuracy of the derivatives and, ultimately, of the solutions obtained.
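As a concrete illustration of option 4, here is a minimal central-difference sketch in R (the helper name numeric_gradient and the step size h are my own choices, not from any particular library):

```r
# Central differences: O(h^2) accurate, but cost 2 * length(x) function calls,
# and the choice of h trades truncation error against rounding error.
numeric_gradient <- function(f, x, h = 1e-6) {
  g <- numeric(length(x))
  for (i in seq_along(x)) {
    e <- numeric(length(x))
    e[i] <- h
    g[i] <- (f(x + e) - f(x - e)) / (2 * h)
  }
  g
}
numeric_gradient(function(x) sum(x^2), c(1, 2, 3))  # true gradient is c(2, 4, 6)
```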
For most machine learning applications, options 1 and 2 are perfectly adequate. The loss functions (least squares, logistic regression, etc.) and penalties (one-norm, two-norm, elastic net, etc.) are simple enough that the derivatives are easy to find.
Options 3 and 4 come into play more often in engineering optimization where the objective functions are more complicated.
The Adam paper says, "...many objective functions are composed of a sum of subfunctions evaluated at different subsamples of data; in this case optimization can be made more efficient by taking gradient steps w.r.t. individual subfunctions..." Here, they just mean that the objective function is a sum of errors over training examples, and training can be done on individual examples or minibatches. This is the same as in stochastic gradient descent (SGD), which is more efficient for large scale problems than batch training because parameter updates are more frequent.
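In code, a sketch of that idea might look like this (the names sgd and grad and the minibatch loop are hypothetical illustrations, not from the paper):

```r
# Minibatch SGD sketch: grad(theta, batch) is assumed to return the gradient
# of the loss summed over the training examples indexed by batch.
sgd <- function(theta, n, grad, lr = 0.01, batch_size = 32, epochs = 10) {
  for (epoch in seq_len(epochs)) {
    idx <- sample(n)                                  # reshuffle each epoch
    for (start in seq(1, n, by = batch_size)) {
      batch <- idx[start:min(start + batch_size - 1, n)]
      theta <- theta - lr * grad(theta, batch)        # update after each minibatch
    }
  }
  theta
}
```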
As for why Adam works, it uses a few tricks.
One of these tricks is momentum, which can give faster convergence. Imagine an objective function that's shaped like a long, narrow canyon that gradually slopes toward a minimum. Say we want to minimize this function using gradient descent. If we start from some point on the canyon wall, the negative gradient will point in the direction of steepest descent, i.e. mostly toward the canyon floor. This is because the canyon walls are much steeper than the gradual slope of the canyon toward the minimum. If the learning rate (i.e. step size) is small, we could descend to the canyon floor, then follow it toward the minimum. But, progress would be slow. We could increase the learning rate, but this wouldn't change the direction of the steps. In this case, we'd overshoot the canyon floor and end up on the opposite wall. We would then repeat this pattern, oscillating from wall to wall while making slow progress toward the minimum. Momentum can help in this situation.
Momentum simply means that some fraction of the previous update is added to the current update, so that repeated updates in a particular direction compound; we build up momentum, moving faster and faster in that direction. In the case of the canyon, we'd build up momentum in the direction of the minimum, since all updates have a component in that direction. In contrast, moving back and forth across the canyon walls involves constantly reversing direction, so momentum would help to damp the oscillations in those directions.
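Here is a minimal sketch of gradient descent with classical momentum on a toy "canyon" quadratic of my own choosing (steep in the x2 direction, gently sloped in x1):

```r
# f(x) = x1^2 / 2 + 10 * x2^2: a narrow canyon along the x1 axis.
grad <- function(x) c(x[1], 20 * x[2])
x <- c(10, 1)                  # start on the canyon wall
v <- c(0, 0)                   # velocity: the accumulated update
lr <- 0.08; beta <- 0.9        # step size and momentum coefficient
for (t in 1:200) {
  v <- beta * v - lr * grad(x) # keep a fraction beta of the previous update
  x <- x + v                   # oscillations across the canyon largely cancel
}
x                              # close to the minimum at c(0, 0)
```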
Another trick that Adam uses is to adaptively select a separate learning rate for each parameter. Parameters that would ordinarily receive smaller or less frequent updates receive larger updates with Adam (and the reverse is also true). This speeds learning in cases where the appropriate learning rates vary across parameters. For example, in deep networks, gradients can become small at early layers, and it makes sense to increase learning rates for the corresponding parameters. Another benefit of this approach is that, because learning rates are adjusted automatically, manual tuning becomes less important. Standard SGD requires careful tuning (and possibly online adjustment) of learning rates, but this is less true with Adam and related methods. It's still necessary to select hyperparameters, but performance is less sensitive to them than to SGD learning rates.
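And here is a minimal sketch of the Adam update itself, following Kingma and Ba (2014) with their default hyperparameters, on the same toy quadratic as in the momentum sketch above:

```r
grad <- function(x) c(x[1], 20 * x[2])        # same toy canyon gradient
x <- c(10, 1)
m <- c(0, 0); v <- c(0, 0)                    # first and second moment estimates
lr <- 0.1; beta1 <- 0.9; beta2 <- 0.999; eps <- 1e-8
for (t in 1:500) {
  g <- grad(x)
  m <- beta1 * m + (1 - beta1) * g            # momentum-like average of gradients
  v <- beta2 * v + (1 - beta2) * g^2          # running average of squared gradients
  m_hat <- m / (1 - beta1^t)                  # bias corrections for the zero
  v_hat <- v / (1 - beta2^t)                  # initialization of m and v
  x <- x - lr * m_hat / (sqrt(v_hat) + eps)   # per-parameter effective step size
}
x                                             # close to the minimum at c(0, 0)
```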
Related methods:
Momentum is often used with standard SGD. An improved version is called Nesterov momentum or Nesterov accelerated gradient. Other methods that use automatically tuned learning rates for each parameter include: Adagrad, RMSprop, and Adadelta. RMSprop and Adadelta solve a problem with Adagrad that could cause learning to stop. Adam is similar to RMSprop with momentum. Nadam modifies Adam to use Nesterov momentum instead of classical momentum.
References:
Kingma and Ba (2014). Adam: A Method for Stochastic Optimization.
Goodfellow et al. (2016). Deep learning, chapter 8.
Hinton. Neural Networks for Machine Learning (Coursera course slides).
Dozat (2016). Incorporating Nesterov Momentum into Adam.
Answer to your question 1
Appending a column of $\bf 1$s to the matrix $\bf X$ adds the "intercept" term. Suppose you have $p$ features in your data. Without the appended $\bf 1$ column, you are actually fitting $$y=\theta_1x_1+\theta_2x_2+\cdots+\theta_px_p$$
With the $\bf 1$ column appended, you are fitting
$$y=\theta_0+\theta_1x_1+\theta_2x_2+\cdots+\theta_px_p.$$ Similarly, you can try the following two models in R (using, say, the built-in mtcars dataset) to see the difference:
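```r
lm(mpg ~ wt, data = mtcars)       # fits mpg = theta_0 + theta_1 * wt
```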
vs.
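```r
lm(mpg ~ wt - 1, data = mtcars)   # drops the intercept: mpg = theta_1 * wt
```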
where the first formula gives a fit of $mpg=\theta_0+\theta_1 \cdot wt$ and the second gives a fit of $mpg=\theta_1 \cdot wt$.
Answer to your question 2
In the Coursera course, Andrew Ng spent a lot of time on iterative methods rather than the "analytical solution" / "normal equations". That is, he is teaching you how to obtain the weights with algorithms that run over many iterations, rather than deriving the answer algebraically and calculating the weights from a closed-form formula.
In the linear regression case, an iterative method may not be necessary (in fact, R does not use one: it uses a QR decomposition and solves directly instead of gradient descent), but for many more complicated models, such as neural networks, iterative methods are extremely useful.
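To make "solves directly" concrete, here is a minimal sketch contrasting the QR route with the normal-equations formula, reusing the mtcars example from above:

```r
X <- cbind(1, mtcars$wt)          # design matrix with an appended 1 column
y <- mtcars$mpg
qr.solve(X, y)                    # least squares via QR, as lm() does internally
solve(t(X) %*% X, t(X) %*% y)     # normal equations: (X'X)^{-1} X'y
```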
P.S. I just found that more information can be found in this related post.