Unless the closed form solution is extremely expensive to compute, it is generally the way to go when it is available. However, for most nonlinear regression problems there is no closed form solution.
Even in linear regression (one of the few cases where a closed form solution is available), it may be impractical to use the formula. The following example shows one way in which this can happen.
For linear regression on a model of the form $y=X\beta$, where $X$ is a matrix with full column rank, the least squares solution,
$\hat{\beta} = \arg \min_{\beta} \| X \beta - y \|_{2}$
is given by
$\hat{\beta}=(X^{T}X)^{-1}X^{T}y$
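As a quick sanity check, here is a minimal R sketch (on made-up data) that computes $\hat{\beta}$ from the normal equations and compares it against lm():

```r
set.seed(1)
n <- 100
X <- cbind(1, matrix(rnorm(n * 3), n, 3))      # design matrix with an intercept column
beta_true <- c(2, -1, 0.5, 3)
y <- as.vector(X %*% beta_true + rnorm(n, sd = 0.1))

# Solve the normal equations X'X b = X'y; solve(A, b) is preferable to
# explicitly forming the inverse of X'X.
beta_hat <- solve(crossprod(X), crossprod(X, y))

# lm() (which uses QR internally) should agree to numerical precision.
cbind(normal_eq = as.vector(beta_hat), lm = unname(coef(lm(y ~ X - 1))))
```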
Now, imagine that $X$ is a very large but sparse matrix. For example, $X$ might have 100,000 columns and 1,000,000 rows, but only 0.001% of the entries in $X$ are nonzero. There are specialized data structures for storing only the nonzero entries of such sparse matrices.
Also imagine that we're unlucky, and $X^{T}X$ is a fairly dense matrix with a much higher percentage of nonzero entries. Storing a dense 100,000 by 100,000 element $X^{T}X$ matrix would then require $1 \times 10^{10}$ floating point numbers (at 8 bytes per number, this comes to 80 gigabytes). This would be impractical to store on anything but a supercomputer. Furthermore, the inverse of this matrix (or more commonly a Cholesky factor) would also tend to have mostly nonzero entries.
However, there are iterative methods for solving the least squares problem that require no more storage than $X$, $y$, and $\hat{\beta}$ and never explicitly form the matrix product $X^{T}X$.
In this situation, using an iterative method is much more computationally efficient than using the closed form solution to the least squares problem.
This example might seem absurdly large. However, large sparse least squares problems of this size are routinely solved by iterative methods on desktop computers in seismic tomography research.
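To make the idea concrete, here is a minimal sketch of one such method, CGLS (conjugate gradients applied to the normal equations), in R. The example below is toy-sized rather than the 1,000,000 by 100,000 case, but the key property is the same: the algorithm only ever touches $X$ through matrix-vector products, so $X^{T}X$ is never formed.

```r
library(Matrix)   # sparse matrix storage (only nonzero entries are kept)

# CGLS: conjugate gradients on the normal equations. X appears only in the
# products X %*% p and t(X) %*% r; the matrix X'X is never formed.
cgls <- function(X, y, maxit = 1000, tol = 1e-10) {
  b <- numeric(ncol(X))
  r <- y                            # residual y - X b (b starts at zero)
  s <- as.vector(crossprod(X, r))   # s = X'r
  p <- s
  gamma <- sum(s^2)
  for (k in seq_len(maxit)) {
    q <- as.vector(X %*% p)
    alpha <- gamma / sum(q^2)
    b <- b + alpha * p
    r <- r - alpha * q
    s <- as.vector(crossprod(X, r))
    gamma_new <- sum(s^2)
    if (sqrt(gamma_new) < tol) break
    p <- s + (gamma_new / gamma) * p
    gamma <- gamma_new
  }
  b
}

set.seed(1)
X <- rsparsematrix(10000, 500, density = 0.01)  # toy-sized random sparse design
y <- rnorm(10000)
b_hat <- cgls(X, y)
```

In practice you would reach for a battle-tested implementation such as LSQR rather than rolling your own, but the storage argument is the same: the solver needs $X$, $y$, and a handful of working vectors, never $X^{T}X$.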
Gradient descent is actually a pretty poor way of solving a linear regression problem: the lm() function in R internally uses a form of QR decomposition, which is considerably more efficient. However, gradient descent is a generally useful technique, and it's worth introducing in this simple context so that it's clearer how to apply it to more complex problems. Implementing your own version is a worthwhile learning exercise, but lm() is a better choice if all you want is a tool for doing linear regression.
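For illustration, a bare-bones gradient descent for least squares might look like the sketch below. The step size rule used here is one simple choice that is guaranteed to converge for this quadratic cost (the gradient of $\|X\beta - y\|_{2}^{2}$ has Lipschitz constant $2\lambda_{\max}(X^{T}X)$); it is not what you'd use in general, and computing $\lambda_{\max}$ defeats the purpose at scale.

```r
gd_ls <- function(X, y, maxit = 50000, tol = 1e-8) {
  # Constant step of 1 / (2 * lambda_max(X'X)) is safe for this quadratic cost.
  L <- 2 * max(eigen(crossprod(X), symmetric = TRUE, only.values = TRUE)$values)
  b <- numeric(ncol(X))
  for (k in seq_len(maxit)) {
    g <- as.vector(2 * crossprod(X, X %*% b - y))  # gradient of ||Xb - y||^2
    if (max(abs(g)) < tol) break
    b <- b - g / L
  }
  b
}

set.seed(1)
X <- cbind(1, matrix(rnorm(200 * 2), 200, 2))
y <- as.vector(X %*% c(1, -2, 0.5) + rnorm(200, sd = 0.1))
cbind(gd = gd_ls(X, y), lm = unname(coef(lm(y ~ X - 1))))  # agree, but gd is slow
```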
Best Answer
There are several options available to you:
1. Try to compute the derivatives by hand and then implement them in code.
2. Use a symbolic computation package such as Maple, Mathematica, or Wolfram Alpha to find the derivatives. Some of these packages will translate the resulting formulas directly into code.
3. Use an automatic differentiation (AD) tool that takes a program for computing the cost function and (using compiler-like techniques) produces a program that computes the derivatives as well as the cost function.
4. Use finite difference formulas to approximate the derivatives.
For anything other than the simplest problems (like ordinary least squares), option 1 is a poor choice. Most experts on optimization will tell you that it is very common for users of optimization software to supply incorrect derivative formulas to optimization routines. This typically leads to slow convergence or no convergence at all.
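One cheap safeguard, if you do go the hand-derivation route, is to compare the hand-coded gradient against a finite difference approximation at a few random points before handing it to an optimizer. A sketch using the numDeriv package, with made-up data:

```r
library(numDeriv)   # grad() computes accurate finite difference gradients

set.seed(1)
X <- matrix(rnorm(50 * 4), 50, 4)   # made-up data for the check
y <- rnorm(50)

f  <- function(b) sum((X %*% b - y)^2)                      # cost function
gf <- function(b) as.vector(2 * crossprod(X, X %*% b - y))  # hand-derived gradient

b0 <- rnorm(4)
max(abs(gf(b0) - grad(f, b0)))   # should be tiny; a large value means gf is wrong
```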
Option 2 is a good choice for most relatively simple cost functions, and it doesn't require exotic tools.
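For simple cases you don't even need to leave R: base R's D() differentiates straightforward expressions symbolically, and deriv() can emit a function that evaluates both the expression and its gradient:

```r
# Symbolic differentiation of a simple expression:
D(expression(x^2 * exp(-x)), "x")

# deriv() builds an R function that returns the value with the
# gradient attached as an attribute.
g <- deriv(~ x^2 * exp(-x), "x", function.arg = TRUE)
g(1.5)
```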
Option 3 really shines when the cost function is computed by a fairly complicated program for which you have the source code. However, AD tools are specialized, and not many users of optimization software are familiar with them.
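As a taste of what AD looks like in practice, here is a sketch assuming the torch package for R, which provides reverse-mode automatic differentiation (base R itself has no AD):

```r
library(torch)   # assumed here for its autograd support

x <- torch_tensor(1.5, requires_grad = TRUE)
y <- x^2 * torch_exp(-x)   # any composition of supported operations works
y$backward()               # reverse-mode pass computes dy/dx
x$grad                     # matches the symbolic derivative evaluated at 1.5
```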
Option 4 is sometimes a necessary choice. If you have a "black box" function that you can't get source code for (or one so badly written that AD tools can't handle it), finite difference approximations can save the day. However, they come at a significant cost in run time and in the accuracy of the derivatives, and ultimately of the solutions obtained.
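In R this is in fact the default behavior: optim() approximates the gradient by finite differences internally when no gr function is supplied. A sketch, where blackbox is a hypothetical stand-in for a cost function you cannot differentiate:

```r
library(numDeriv)

set.seed(1)
X <- matrix(rnorm(50 * 4), 50, 4)
y <- rnorm(50)
blackbox <- function(b) sum((X %*% b - y)^2)   # stand-in for an opaque cost

# With gr = NULL, optim() approximates the gradient by finite differences.
fit1 <- optim(numeric(4), blackbox, method = "BFGS")

# numDeriv::grad() uses Richardson extrapolation and is typically more
# accurate than optim()'s simple internal differences, at extra run-time cost.
fit2 <- optim(numeric(4), blackbox, gr = function(b) grad(blackbox, b),
              method = "BFGS")
```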
For most machine learning applications, options 1 and 2 are perfectly adequate. The loss functions (least squares, logistic regression, etc.) and penalties (one-norm, two-norm, elastic net, etc.) are simple enough that the derivatives are easy to find.
Options 3 and 4 come into play more often in engineering optimization where the objective functions are more complicated.