Solved – Motivation for the gradient descent method over the closed-form (OLS/MLE) solution for simple linear regression

gradient descent · least squares · maximum likelihood

I am a beginner in machine learning, and I am currently trying to understand the motivation for the gradient descent method.
I am confused about why we would employ gradient descent for linear regression: the cost function is the same as the OLS objective, and gradient descent actually takes more effort here than simply setting the derivatives equal to zero.
So why do we always reach for gradient descent? I mean, when the model gets more complicated, and when we make more assumptions about the prior distribution of theta (the parameters), the optimization problem becomes much harder. Will the gradient descent method still survive in that setting, while the OLS/MLE method will no longer be able to estimate the parameters?
I see OLS as minimizing the cost and MLE as maximizing the probability, which are in essence the same thing (reference: http://www.cs.ubc.ca/~nando/540-2013/lectures/l3.pdf). Should I think of gradient descent as an improvement on the OLS method, and of the EM method (maximizing the expected likelihood) as an improvement on the MLE method?
Thanks in advance!

Best Answer

For ordinary linear regression, maximum likelihood and least squares give the same answer: the maximum likelihood solution is the least squares solution. If you derive the so-called "normal equations" you will see this; The Elements of Statistical Learning also discusses it.
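
For concreteness, here is the standard argument in symbols (a sketch of the derivation the answer alludes to, assuming Gaussian noise):

```latex
% Model: y = X\beta + \varepsilon with \varepsilon \sim N(0, \sigma^2 I).
% Up to additive constants, the log-likelihood is
%   \log L(\beta) = -\frac{1}{2\sigma^2}\|y - X\beta\|^2,
% so maximizing the likelihood is minimizing the squared error:
\[
  \hat\beta_{\mathrm{MLE}}
  = \arg\max_{\beta} \log L(\beta)
  = \arg\min_{\beta} \|y - X\beta\|^2
  = \hat\beta_{\mathrm{OLS}}.
\]
% Setting the gradient to zero gives the normal equations:
\[
  \nabla_{\beta}\,\|y - X\beta\|^2 = -2X^{\top}(y - X\beta) = 0
  \quad\Longrightarrow\quad
  X^{\top}X\,\hat\beta = X^{\top}y.
\]
```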

But this is separate from how you find that solution. Gradient descent is only one method for finding it, and actually quite a bad one at that (it converges slowly). Newton's method, for example, is much better for OLS; because the least-squares cost is quadratic, a single Newton step lands exactly on the solution (and various numerical algorithms avoid inverting the Hessian directly).
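
To make the comparison concrete, here is a minimal sketch (my own toy example, not from the original answer) that fits the same OLS problem both ways: one linear solve of the normal equations versus thousands of gradient descent iterations.

```python
import numpy as np

# Toy OLS problem: design matrix with an intercept column, Gaussian noise.
rng = np.random.default_rng(0)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = np.array([2.0, -1.0, 0.5, 3.0])
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Closed form: solve the normal equations X'X beta = X'y in one linear solve.
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient descent on the least-squares cost J(beta) = ||y - X beta||^2 / (2n).
beta_gd = np.zeros(X.shape[1])
lr = 0.1
for _ in range(5000):  # many iterations; the closed form needed none
    grad = X.T @ (X @ beta_gd - y) / n
    beta_gd -= lr * grad

# Both routes reach the same minimizer of the same convex cost.
print(np.allclose(beta_ols, beta_gd, atol=1e-6))
```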

But you are right in the sense that for very big problems gradient descent becomes more useful, because second-order methods like Newton's method can be computationally very expensive (again, there are cheaper approximations for that too, such as quasi-Newton methods).
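
For scale, this is roughly what the large-n variant looks like in practice: mini-batch stochastic gradient descent, sketched here on synthetic data (the sizes, step size, and batch size are illustrative assumptions).

```python
import numpy as np

# Mini-batch SGD for OLS: each step touches only `batch` rows, never the
# full n-by-p matrix, which is the point when n is very large.
rng = np.random.default_rng(1)
n, p = 100_000, 5
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + 0.1 * rng.normal(size=n)

beta = np.zeros(p)
lr, batch = 0.05, 256
for step in range(2000):
    idx = rng.integers(0, n, size=batch)          # sample a random mini-batch
    Xb, yb = X[idx], y[idx]
    beta -= lr * Xb.T @ (Xb @ beta - yb) / batch  # noisy gradient step

# With a constant step size, SGD hovers near (not exactly at) the minimum.
print(np.linalg.norm(beta - np.linalg.lstsq(X, y, rcond=None)[0]))
```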

I don't think EM is relevant for OLS. EM can be useful for optimizing non-convex likelihoods, typically those involving latent variables, but OLS is convex, so there is nothing for EM to improve on here.