Unless the closed-form solution is extremely expensive to compute, it is generally the way to go when it is available. However, for most nonlinear regression problems there is no closed-form solution.
Even in linear regression (one of the few cases where a closed-form solution is available), it may be impractical to use the formula. The following example shows one way in which this can happen.
For linear regression on a model of the form $y=X\beta$, where $X$ is a matrix with full column rank, the least squares solution,
$\hat{\beta} = \arg\min_{\beta} \| X \beta - y \|_{2}$
is given by
$\hat{\beta}=(X^{T}X)^{-1}X^{T}y$
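As a quick worked instance of this formula (toy data and arbitrary sizes of my choosing; forming the explicit inverse is fine at this scale, and the rest of this answer is about when it is not):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))          # 50 observations, 3 predictors
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.standard_normal(50)

# Closed form: beta_hat = (X^T X)^{-1} X^T y
beta_closed = np.linalg.inv(X.T @ X) @ X.T @ y

# The same answer from a library least squares routine.
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(beta_closed, beta_lstsq))  # True
```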
Now imagine that $X$ is a very large but sparse matrix: for example, $X$ might have 1,000,000 rows and 100,000 columns, with only 0.001% of its entries nonzero. There are specialized data structures that store only the nonzero entries of such sparse matrices.
Also imagine that we're unlucky, and $X^{T}X$ is a fairly dense matrix with a much higher percentage of nonzero entries. Storing a dense 100,000 by 100,000 $X^{T}X$ matrix would then require $1 \times 10^{10}$ floating point numbers (at 8 bytes per number, this comes to 80 gigabytes). This would be impractical to store on anything but a supercomputer. Furthermore, the inverse of this matrix (or, more commonly, a Cholesky factor) would also tend to have mostly nonzero entries.
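To see this fill-in effect at a toy scale (sizes far below those above, and with a random sparsity pattern, so the exact densities will vary):

```python
import scipy.sparse as sp

# A small random sparse matrix: about 1% of entries are nonzero.
X = sp.random(2000, 200, density=0.01, format="csr", random_state=0)
print(f"density of X:     {X.nnz / (X.shape[0] * X.shape[1]):.3f}")

# The Gram matrix X^T X is typically far denser, since entry (i, j)
# is nonzero whenever columns i and j of X share any nonzero row.
G = (X.T @ X).tocsr()
print(f"density of X^T X: {G.nnz / (G.shape[0] * G.shape[1]):.3f}")
```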
However, there are iterative methods for solving the least squares problem that require no more storage than $X$, $y$, and $\hat{\beta}$ and never explicitly form the matrix product $X^{T}X$.
In this situation, using an iterative method is much more computationally efficient than using the closed-form solution to the least squares problem.
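A minimal sketch of this approach using SciPy's LSQR solver (toy sizes again; LSQR touches $X$ only through the products $Xv$ and $X^{T}u$, so $X^{T}X$ is never formed):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import lsqr

rng = np.random.default_rng(0)
X = sp.random(100_000, 1_000, density=0.0001, format="csr", random_state=0)
beta_true = rng.standard_normal(1_000)
y = X @ beta_true + 1e-3 * rng.standard_normal(100_000)

# Storage stays proportional to X, y, and beta throughout.
beta_hat = lsqr(X, y)[0]
print("max coefficient error:", np.abs(beta_hat - beta_true).max())
```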
This example might seem absurdly large, but sparse least squares problems of this size are routinely solved by iterative methods on desktop computers in seismic tomography research.
I found that the bootstrap gives estimates that are pretty close to those from OLS, but it works with literally any training algorithm.
The bootstrap is a kind of Monte Carlo method: roughly, it boils down to repeatedly sampling with replacement from the original dataset and collecting the values of a target statistic. Given a set of statistic values, it becomes trivial to calculate their mean and standard error. G. James et al. provide experimental evidence that OLS and bootstrap results are close. Without further explanation, I'm giving a link to their excellent work (see pages 187-190 for the bootstrap explanation and 195-197 for the experiments):
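A minimal sketch of that recipe (pairs resampling on synthetic data; the helper names here are mine, not from the book):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from y = 2 + 3x + noise.
n = 200
x = rng.uniform(0.0, 1.0, n)
y = 2.0 + 3.0 * x + rng.normal(0.0, 0.5, n)
X = np.column_stack([np.ones(n), x])

def fit(X, y):
    # Any training algorithm could go here; OLS keeps the comparison direct.
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Resample (x, y) pairs with replacement and refit each time.
B = 1000
coefs = np.empty((B, X.shape[1]))
for b in range(B):
    idx = rng.integers(0, n, size=n)
    coefs[b] = fit(X[idx], y[idx])

print("bootstrap means:          ", coefs.mean(axis=0))
print("bootstrap standard errors:", coefs.std(axis=0, ddof=1))
```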
Linear regression is commonly used as a way to introduce the concept of gradient descent.
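For illustration, a bare-bones gradient descent on the least squares objective might look like this (the step size and iteration count are arbitrary choices, not tuned recommendations):

```python
import numpy as np

def gd_least_squares(X, y, lr=0.1, n_iters=10_000):
    """Minimize (1/n) * ||X @ beta - y||^2 by gradient descent."""
    n = len(y)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = (2.0 / n) * X.T @ (X @ beta - y)  # gradient of the mean squared loss
        beta -= lr * grad
    return beta

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(100)
print(gd_least_squares(X, y))  # close to [1.0, -2.0, 0.5]
```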
In practice, QR factorization is the most common strategy; SVD and Cholesky factorization are other options. See the question "Do we need gradient descent to find the coefficients of a linear regression model?"
In particular, note that the equations you have written can suffer from poor numerical conditioning and/or be expensive to compute. QR factorization is less susceptible to conditioning issues (though not immune) and is not too expensive.
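A short sketch of the QR route (NumPy's reduced QR; `solve_triangular` does back substitution on $R$ instead of forming any inverse):

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 4))
y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + 0.1 * rng.standard_normal(500)

# With X = QR (Q orthonormal, R upper triangular), the normal
# equations reduce to the triangular system R beta = Q^T y.
Q, R = np.linalg.qr(X)                   # reduced QR: Q is 500x4, R is 4x4
beta_hat = solve_triangular(R, Q.T @ y)  # back substitution
print(beta_hat)
```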