Solved – Gradient descent or not for simple linear regression

gradient descent, regression, scikit-learn

A number of websites describe using gradient descent to find the parameters of simple linear regression (here is one of them). Google also describes it in its new (to the public) ML course.

However, Wikipedia supplies the following closed-form formulae for calculating the parameters:
$$
\begin{aligned}
\hat{\alpha} &= \bar{y} - \hat{\beta}\,\bar{x}, \\
\hat{\beta} &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
\end{aligned}
$$

Also, scikit-learn's LinearRegression estimator does not have an n_iter_ (number of iterations) attribute, as many of its other estimators do, which I suppose suggests gradient descent isn't being used?
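For concreteness, here is a quick numerical check (the toy data and variable names are my own) that the closed-form formulae above and scikit-learn's LinearRegression give the same answer:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=100)  # toy data: alpha=2, beta=3

# Closed-form estimates from the formulae above
beta_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
alpha_hat = y.mean() - beta_hat * x.mean()

# scikit-learn solves the same least-squares problem directly (no iterations)
model = LinearRegression().fit(x.reshape(-1, 1), y)

print(alpha_hat, beta_hat)               # closed form
print(model.intercept_, model.coef_[0])  # matches to numerical precision
```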

Questions:

  1. Are the websites describing gradient descent for simple linear regression only doing so to teach the concept on the most basic ML model? Is the formula on Wikipedia what most statistical software would actually use to calculate the parameters (at least scikit-learn does not seem to be using gradient descent)?
  2. What is typically used for multiple linear regression?
  3. For what types of statistical learning models is gradient descent typically used to find the parameters, rather than other methods? That is, is there some rule of thumb?

Best Answer

  1. Linear regression is commonly used as a way to introduce the concept of gradient descent.
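As an illustration of that pedagogical use, here is a minimal gradient-descent sketch for simple linear regression (toy data; the learning rate and iteration count are arbitrary choices for this example, not a recommendation):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 + 3.0 * x + rng.normal(scale=0.5, size=100)

alpha, beta = 0.0, 0.0   # initial guesses
lr = 0.1                 # learning rate, hand-tuned for this toy data
for _ in range(1000):
    resid = y - (alpha + beta * x)
    # Gradients of the mean squared error with respect to alpha and beta
    alpha -= lr * (-2.0 * resid.mean())
    beta -= lr * (-2.0 * (resid * x).mean())

print(alpha, beta)  # converges toward the closed-form estimates
```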

  2. QR factorization is the most common strategy. SVD and Cholesky factorization are other options. See "Do we need gradient descent to find the coefficients of a linear regression model?"

In particular, note that the equations you have written can suffer from poor numerical conditioning and/or be expensive to compute. QR factorization is less susceptible to conditioning issues (though not immune) and is not too expensive. A small sketch is below.
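To make the QR approach concrete: a thin QR factorization reduces the least-squares problem to a triangular solve, without ever forming the poorly conditioned product $X^\top X$. The data and names here are my own toy example:

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(0)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # design with intercept
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Thin QR: X = QR, so the normal equations reduce to R beta = Q^T y
Q, R = np.linalg.qr(X)  # 'reduced' mode by default
beta_hat = solve_triangular(R, Q.T @ y)

print(beta_hat)  # close to beta_true
```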

  3. Neural networks are the most prominent example of gradient descent in applied use, but they are far from the only one. Logistic regression is another problem that requires iterative updates: it has no closed-form solution, so Newton-Raphson is typically used (though gradient descent or its variants can also be applied). A minimal sketch follows.
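Here is a minimal Newton-Raphson sketch for logistic regression, assuming simulated data of my own choosing; real implementations add step-size control and convergence checks:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([-0.5, 1.5])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta_true)))

beta = np.zeros(2)
for _ in range(25):                  # Newton-Raphson iterations
    p = 1 / (1 + np.exp(-X @ beta))  # current fitted probabilities
    W = p * (1 - p)                  # diagonal of the weight matrix
    grad = X.T @ (y - p)             # score (gradient of the log-likelihood)
    hess = X.T @ (X * W[:, None])    # observed information: X^T W X
    beta = beta + np.linalg.solve(hess, grad)

print(beta)  # approaches the MLE; there is no closed-form solution
```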