Unless the closed form solution is extremely expensive to compute, it is generally the way to go when it is available. However, for most nonlinear regression problems there is no closed form solution.
Even in linear regression (one of the few cases where a closed form solution is available), it may be impractical to use the formula. The following example shows one way in which this can happen.
For linear regression on a model of the form $y=X\beta$, where $X$ is a matrix with full column rank, the least squares solution,
$\hat{\beta} = \arg\min_{\beta} \| X\beta - y \|_{2}$
is given by
$\hat{\beta}=(X^{T}X)^{-1}X^{T}y$
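For concreteness, here is a minimal NumPy sketch (on made-up data, not from the original question) that evaluates this closed form directly; as discussed further below, solving the normal equations this way is only advisable when $X$ is well conditioned:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up small design matrix and response, purely to illustrate the formula.
X = rng.standard_normal((50, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.standard_normal(50)

# Closed form beta_hat = (X^T X)^{-1} X^T y, evaluated with a linear solve
# rather than an explicit matrix inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # close to beta_true
```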
Now, imagine that $X$ is a very large but sparse matrix. For example, $X$ might have 100,000 columns and 1,000,000 rows, but only 0.001% of the entries in $X$ are nonzero. There are specialized data structures for storing only the nonzero entries of such sparse matrices.
Also imagine that we're unlucky, and $X^{T}X$ is a fairly dense matrix with a much higher percentage of nonzero entries. Storing a dense 100,000 by 100,000 $X^{T}X$ matrix would then require $1 \times 10^{10}$ floating point numbers (at 8 bytes per number, this comes to 80 gigabytes). This would be impractical to store on anything but a supercomputer. Furthermore, the inverse of this matrix (or, more commonly, a Cholesky factor) would also tend to have mostly nonzero entries.
However, there are iterative methods for solving the least squares problem that require no more storage than $X$, $y$, and $\hat{\beta}$ and never explicitly form the matrix product $X^{T}X$.
In this situation, using an iterative method is much more computationally efficient than using the closed form solution to the least squares problem.
This example might seem absurdly large. However, large sparse least squares problems of this size are routinely solved by iterative methods on desktop computers in seismic tomography research.
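As a scaled-down sketch of that iterative approach (the matrix here is far smaller than the 1,000,000 by 100,000 example, and the data are made up), SciPy's LSQR works directly on a sparse $X$ and only ever applies $X$ and $X^{T}$ to vectors:

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import lsqr

rng = np.random.default_rng(1)

# Toy sparse design matrix, stored in a sparse format so that only the
# nonzero entries are kept in memory.
X = sparse_random(10_000, 1_000, density=1e-3, format="csr", random_state=1)
beta_true = rng.standard_normal(1_000)
y = X @ beta_true + 0.01 * rng.standard_normal(10_000)

# LSQR iterates using only matrix-vector products with X and X^T;
# the (potentially dense) product X^T X is never formed or stored.
beta_hat = lsqr(X, y)[0]
print(np.linalg.norm(X @ beta_hat - y))  # residual norm of the fitted solution
```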
Gradient descent is actually a pretty poor way of solving a linear regression problem. The lm() function in R internally uses a form of QR decomposition, which is considerably more efficient. However, gradient descent is a generally useful technique, and it is worth introducing in this simple context so that it's clearer how to apply it in more complex problems. Implementing your own version as a learning exercise is worthwhile, but lm() is a better choice if all you want is a tool to do linear regression.
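If you do take it on as a learning exercise, a minimal sketch of plain gradient descent on the least squares objective might look like the following (Python rather than R, with an arbitrary step size and iteration count chosen purely for illustration); the direct solver np.linalg.lstsq plays roughly the role that lm() does in R:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((200, 3))
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + 0.1 * rng.standard_normal(200)

# Plain gradient descent on f(beta) = 0.5 * ||X beta - y||^2.
# The gradient is X^T (X beta - y); step size and iteration count are
# arbitrary illustrative choices, not tuned recommendations.
beta = np.zeros(3)
step = 1e-3
for _ in range(5_000):
    beta -= step * (X.T @ (X @ beta - y))

# Compare with a direct solver (what you should actually use in practice).
beta_direct = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta, beta_direct)
```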
Best Answer
Linear least squares can be solved by:
0) Using a high quality linear least squares solver, based on either SVD or QR for unconstrained linear least squares, or on a version of Quadratic Programming or Conic Optimization for bound or linearly constrained least squares, as described below. Such a solver is pre-canned, heavily tested, and ready to go - use it.
1) SVD, which is the most reliable and numerically accurate method, but also takes more computing than the alternatives. In MATLAB, the SVD solution of the unconstrained linear least squares problem A*X = b is pinv(A) * b, which is very accurate and reliable. (A small numerical sketch comparing SVD, QR, and the normal equations follows this list.)
2) QR, which is fairly reliable and numerically accurate, but not as much as SVD, and is faster than SVD. In MATLAB, the QR solution of the unconstrained linear least squares problem A*X = b is A\b, which is fairly accurate and reliable, except when A is ill-conditioned, i.e., has a large condition number. A\b is faster to compute than pinv(A) * b, but not as reliable or accurate.
3) Forming the normal equations (TERRIBLE from a reliability and numerical accuracy standpoint, because it squares the condition number, which is a very bad thing to do) and then
3a) solving by Cholesky Factorization (not good)
3b) explicitly inverting matrix (HORRIBLE)
4) Solving as a Quadratic Programming problem or Second Order Cone problem
4a) Solve using high quality Quadratic Programming software. This is reliable and numerically accurate, but takes longer than SVD or QR. However, it is easy to add bound or general linear constraints, or linear or quadratic (two norm) penalty or regularization terms to the objective function, and still solve the problem using Quadratic Programming software. (A small bound-constrained sketch also follows this list.)
4b) Solve as a Second Order Cone problem using high quality Conic Optimization software. Remarks are the same as for Quadratic Programming software, but you can also add bound or general linear constraints and other conic constraints or objective function terms, such as penalty or regularization terms in various norms.
5) Solve using high quality general purpose nonlinear optimization software. This may still work well, but will in general be slower than Quadratic Programming or Conic Optimization software, and maybe not quite as reliable. However, it may be possible to include not only bound and general linear constraints, but also nonlinear constraints in the least squares optimization. It can also be used for nonlinear least squares, and when other nonlinear terms are added to the objective function.
6) Solve using lousy general purpose nonlinear optimization algorithms --> DON'T EVER DO THIS.
7) Solve using THE WORST POSSIBLE general purpose nonlinear optimization algorithm there is, i.e., gradient descent. Use this only if you want to see how bad and unreliable a solution method can be. If someone tells you to use gradient descent to solve linear least squares problems:
7 i) Learn about statistical computing from someone who knows something about it.
7 ii) Learn optimization from someone who knows something about it.
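To make points 1) through 3) concrete, here is a small NumPy sketch (made-up, deliberately ill-conditioned data) comparing the SVD route, the QR route, and the normal equations; on this kind of problem the normal equations lose several digits of accuracy because the condition number gets squared:

```python
import numpy as np

# Deliberately ill-conditioned least squares problem: the columns of A
# (increasing powers of t) are nearly collinear.
t = np.linspace(0.0, 1.0, 100)
A = np.vander(t, 10, increasing=True)
x_true = np.ones(10)
b = A @ x_true

# 1) SVD-based solution (the analogue of MATLAB's pinv(A) * b).
x_svd = np.linalg.pinv(A) @ b

# 2) QR-based solution (the analogue of MATLAB's A\b).
Q, R = np.linalg.qr(A)
x_qr = np.linalg.solve(R, Q.T @ b)

# 3) Normal equations: forming A^T A squares the condition number.
x_ne = np.linalg.solve(A.T @ A, A.T @ b)

for name, x in [("SVD", x_svd), ("QR", x_qr), ("normal equations", x_ne)]:
    print(f"{name:>16s}  error = {np.linalg.norm(x - x_true):.2e}")
```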
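And as a small illustration of the point in 4) that constraints are easy to add once the problem is treated as an optimization, the sketch below uses SciPy's dedicated bounded least squares routine on made-up data (this is not a general QP or conic solver, but it shows how naturally bound constraints fit the formulation):

```python
import numpy as np
from scipy.optimize import lsq_linear

rng = np.random.default_rng(4)
A = rng.standard_normal((100, 5))
x_true = np.array([0.8, -0.3, 1.5, 0.0, 0.2])
b = A @ x_true + 0.05 * rng.standard_normal(100)

# Bound-constrained least squares: minimize ||A x - b||_2 subject to 0 <= x <= 1.
res = lsq_linear(A, b, bounds=(0.0, 1.0))
print(res.x)
```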