Solved – Derivation of Regularized Linear Regression Cost Function per Coursera Machine Learning Course

regression, self-study

I took Andrew Ng's course "Machine Learning" via Coursera a few months back, not paying attention to most of the math/derivations and instead focusing on implementation and practicality. Since then I have started going back to study some of the underlying theory, and have revisited some of Prof. Ng's lectures. I was reading through his lecture on "Regularized Linear Regression", and saw that he gave the following cost function:

$$J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^m\left(h_\theta (x^{(i)}) - y^{(i)}\right)^2 + \lambda\sum_{j=1}^n\theta^2_j\right]$$

Then, he gives the following gradient for this cost function:

$$\frac{\partial}{\partial \theta_j}J(\theta) = \frac{1}{m}\left[\sum_{i=1}^m\left(h_\theta (x^{(i)}) - y^{(i)}\right)x^{(i)}_j - \lambda\theta_j\right]$$

I am a little confused about how he gets from one to the other. When I tried to do my own derivation, I had the following result:

$$\frac{\partial}{\partial \theta_j}J(\theta) = \frac{1}{m}\left[\sum_{i=1}^m\left(h_\theta (x^{(i)}) - y^{(i)}\right)x^{(i)}_j + \lambda\theta_j\right]$$

The difference is that the 'plus' sign between the squared-error term and the regularization term in Prof. Ng's cost function turns into a 'minus' sign in his gradient, whereas no such sign change appears in my result.

Intuitively I understand why it would be negative: we update each theta parameter by subtracting the gradient, and we want the regularization term to reduce how much we change the parameter, to avoid overfitting. I am just a little stuck on the calculus that backs this intuition.
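For concreteness, here is the term-by-term differentiation I did (assuming the usual linear hypothesis $h_\theta(x) = \theta^T x$, so that $\frac{\partial}{\partial\theta_j}h_\theta(x^{(i)}) = x^{(i)}_j$), which is where the '+' in my result comes from:

$$\frac{\partial}{\partial \theta_j}\left[\frac{1}{2m}\sum_{i=1}^m\left(h_\theta (x^{(i)}) - y^{(i)}\right)^2\right] = \frac{1}{m}\sum_{i=1}^m\left(h_\theta (x^{(i)}) - y^{(i)}\right)x^{(i)}_j$$

$$\frac{\partial}{\partial \theta_j}\left[\frac{\lambda}{2m}\sum_{k=1}^n\theta^2_k\right] = \frac{\lambda}{m}\theta_j$$

Adding these two pieces gives my result above, with $+\,\lambda\theta_j$, and I do not see which step would flip that sign.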

FYI, you can find the deck here, on slides 15 and 16.
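As a sanity check, a finite-difference comparison also agrees with the '+' sign. This is just a rough NumPy sketch I put together (not course code; the data, `theta`, and `lam` are made up):

```python
import numpy as np

# Rough sketch: compare the analytic gradient (with '+ lambda*theta_j') against
# a central finite-difference approximation of J(theta).

def cost(theta, X, y, lam):
    m = len(y)
    residual = X @ theta - y
    # theta_0 (the intercept) is conventionally not regularized, hence theta[1:].
    return (residual @ residual + lam * np.sum(theta[1:] ** 2)) / (2 * m)

def gradient(theta, X, y, lam):
    m = len(y)
    grad = X.T @ (X @ theta - y) / m
    grad[1:] += (lam / m) * theta[1:]   # note the '+' sign
    return grad

rng = np.random.default_rng(0)
X = np.c_[np.ones(20), rng.normal(size=(20, 3))]   # 20 examples, intercept + 3 features
y = rng.normal(size=20)
theta = rng.normal(size=4)
lam = 1.5

eps = 1e-6
numeric = np.array([
    (cost(theta + eps * e, X, y, lam) - cost(theta - eps * e, X, y, lam)) / (2 * eps)
    for e in np.eye(theta.size)
])
print(np.allclose(numeric, gradient(theta, X, y, lam)))   # prints True
```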

Best Answer

Actually, if you check the lecture notes that come just after the video, they show the formula correctly. The deck you have linked here simply reproduces the slide exactly as it appears in the video.

[Screenshot of the formula from the lecture notes]
