Solved – Derivation of Regularized Linear Regression Cost Function per Coursera Machine Learning Course

regression, self-study

I took Andrew Ng's course "Machine Learning" via Coursera a few months back, not paying attention to most of the math/derivations and instead focusing on implementation and practicality. Since then I have started going back to study some of the underlying theory, and have revisited some of Prof. Ng's lectures. I was reading through his lecture on "Regularized Linear Regression", and saw that he gave the following cost function:

$$J(\theta) = \frac{1}{2m}\left[\sum_{i=1}^m\left(h_\theta (x^{(i)}) - y^{(i)}\right)^2 + \lambda\sum_{j=1}^n\theta^2_j\right]$$

Then, he gives the following gradient for this cost function:

$$\frac{\partial}{\partial \theta_j}J(\theta) = \frac{1}{m}\left[\sum_{i=1}^m\left(h_\theta (x^{(i)}) - y^{(i)}\right)x^{(i)}_j - \lambda\theta_j\right]$$

I am a little confused about how he gets from one to the other. When I tried to do my own derivation, I had the following result:

$$\frac{\partial}{\partial \theta_j}J(\theta) = \frac{1}{m}\left[\sum_{i=1}^m\left(h_\theta (x^{(i)}) - y^{(i)}\right)x^{(i)}_j + \lambda\theta_j\right]$$

The difference is that the 'plus' sign between the squared-error term and the regularization term in Prof. Ng's cost function turns into a 'minus' sign in his gradient, whereas no such sign change appears in my result.

Intuitively I understand why it would be negative: we update each theta parameter by subtracting the gradient, and we want the regularization term to reduce how much we change the parameter, to avoid overfitting. I am just a little stuck on the calculus that backs this intuition.
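For concreteness, here is the term-by-term differentiation I did (assuming the usual linear hypothesis $h_\theta(x) = \theta^T x$, so that $\frac{\partial}{\partial\theta_j}h_\theta(x^{(i)}) = x^{(i)}_j$), which is where the '+' in my result comes from:

$$\frac{\partial}{\partial \theta_j}\left[\frac{1}{2m}\sum_{i=1}^m\left(h_\theta (x^{(i)}) - y^{(i)}\right)^2\right] = \frac{1}{m}\sum_{i=1}^m\left(h_\theta (x^{(i)}) - y^{(i)}\right)x^{(i)}_j$$

$$\frac{\partial}{\partial \theta_j}\left[\frac{\lambda}{2m}\sum_{k=1}^n\theta^2_k\right] = \frac{\lambda}{m}\theta_j$$

Adding these two pieces gives my result above, with $+\,\lambda\theta_j$, and I do not see which step would flip that sign.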

FYI, you can find the deck here, on slides 15 and 16.
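As a sanity check, a finite-difference comparison also agrees with the '+' sign. This is just a rough NumPy sketch I put together (not course code; the data, `theta`, and `lam` are made up):

```python
import numpy as np

# Rough sketch: compare the analytic gradient (with '+ lambda*theta_j') against
# a central finite-difference approximation of J(theta).

def cost(theta, X, y, lam):
    m = len(y)
    residual = X @ theta - y
    # theta_0 (the intercept) is conventionally not regularized, hence theta[1:].
    return (residual @ residual + lam * np.sum(theta[1:] ** 2)) / (2 * m)

def gradient(theta, X, y, lam):
    m = len(y)
    grad = X.T @ (X @ theta - y) / m
    grad[1:] += (lam / m) * theta[1:]   # note the '+' sign
    return grad

rng = np.random.default_rng(0)
X = np.c_[np.ones(20), rng.normal(size=(20, 3))]   # 20 examples, intercept + 3 features
y = rng.normal(size=20)
theta = rng.normal(size=4)
lam = 1.5

eps = 1e-6
numeric = np.array([
    (cost(theta + eps * e, X, y, lam) - cost(theta - eps * e, X, y, lam)) / (2 * eps)
    for e in np.eye(theta.size)
])
print(np.allclose(numeric, gradient(theta, X, y, lam)))   # prints True
```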

Best Answer

Actually, if you check the lecture notes that come just after the video, they show the formula correctly. The deck you have linked here simply reproduces the slide exactly as it appears in the video.

[Screenshot of the formula from the lecture notes]
