Unless the closed form solution is extremely expensive to compute, it is generally the way to go when it is available. However, for most nonlinear regression problems there is no closed form solution.
Even in linear regression (one of the few cases where a closed form solution is available), it may be impractical to use the formula. The following example shows one way in which this can happen.
For linear regression on a model of the form $y=X\beta$, where $X$ is a matrix with full column rank, the least squares solution,
$\hat{\beta} = \arg \min_{\beta} \| X \beta - y \|_{2}$
is given by
$\hat{\beta}=(X^{T}X)^{-1}X^{T}y$
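For a small, dense design matrix this really is a one-liner. A minimal NumPy sketch with made-up data (solving the normal equations rather than forming the inverse explicitly):

```python
import numpy as np

# Made-up full-column-rank design matrix and response, purely for illustration.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.standard_normal(50)

# (X^T X)^{-1} X^T y, computed by solving the normal equations
# instead of explicitly inverting X^T X.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# np.linalg.lstsq gives the same answer more stably (QR/SVD under the hood).
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```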
Now, imagine that $X$ is a very large but sparse matrix. For example, $X$ might have 100,000 columns and 1,000,000 rows, but only 0.001% of the entries in $X$ are nonzero. There are specialized data structures for storing only the nonzero entries of such sparse matrices.
Also imagine that we're unlucky, and $X^{T}X$ is a fairly dense matrix with a much higher percentage of nonzero entries. Storing a dense 100,000 by 100,000 $X^{T}X$ matrix would then require $1 \times 10^{10}$ floating point numbers (at 8 bytes per number, this comes to 80 gigabytes). This would be impractical to store on anything but a supercomputer. Furthermore, the inverse of this matrix (or more commonly a Cholesky factor) would also tend to have mostly nonzero entries.
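A quick back-of-the-envelope check of those numbers, assuming a coordinate-style sparse layout with one 8-byte value plus two 4-byte indices per nonzero entry:

```python
# Storage comparison for the sizes quoted above.
n_rows, n_cols = 1_000_000, 100_000
density = 1e-5                          # 0.001% of entries are nonzero

nnz = n_rows * n_cols * density         # about 1e6 nonzero entries in X
sparse_X_bytes = nnz * (8 + 4 + 4)      # value + row index + column index
dense_XtX_bytes = n_cols ** 2 * 8       # 1e10 doubles

print(f"sparse X:    {sparse_X_bytes / 1e6:.0f} MB")   # roughly 16 MB
print(f"dense X^T X: {dense_XtX_bytes / 1e9:.0f} GB")  # 80 GB
```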
However, there are iterative methods for solving the least squares problem that require no more storage than $X$, $y$, and $\hat{\beta}$ and never explicitly form the matrix product $X^{T}X$.
In this situation, using an iterative method is much more computationally efficient than using the closed form solution to the least squares problem.
This example might seem absurdly large. However, large sparse least squares problems of this size are routinely solved by iterative methods on desktop computers in seismic tomography research.
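As a concrete illustration, SciPy's `lsqr` is one such iterative solver: it only needs products of the form $Xv$ and $X^{T}u$, so $X^{T}X$ is never formed. A minimal sketch on a much smaller, randomly generated sparse problem:

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import lsqr

# A randomly generated sparse design matrix, far smaller than the example
# above but with the same flavor: most entries are zero and never stored.
X = sparse_random(100_000, 1_000, density=1e-4, format="csr", random_state=0)
rng = np.random.default_rng(0)
y = rng.standard_normal(100_000)

# lsqr iteratively minimizes ||X b - y||_2 using only X @ v and X.T @ u,
# so memory use stays proportional to the nonzeros of X.
beta_hat = lsqr(X, y)[0]
```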
Distortion of statistical properties can occur when you "fit to the data", so I think of this more in terms of specifying the number of parameters that I can afford to estimate and that I want to devote to the portion of the model that pertains to that one predictor. I use regression splines, place knots where $X$ is dense, and specify the number of knots (or the number of parameters and back-calculate the number of knots) by asking (1) what do the sample size and distribution of $Y$ support and (2) what is the signal:noise ratio in this dataset. When $n \uparrow$ or the signal:noise ratio $\uparrow$ I can use more knots. There is no set formula for the number of parameters that should be fitted, although in a minority of situations you can use cross-validation or AIC to determine this. As you mentioned, shrinkage is a great alternative, because you can start out with many parameters then shrink the coefficients down to what cross-validation or effective AIC dictate.
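To make that workflow concrete, here is a rough sketch (not the exact procedure described above, and with made-up data) of fitting a regression spline with a fixed number of parameters and comparing a few choices of degrees of freedom by AIC, using the patsy `bs()` basis inside a statsmodels formula; when only `df` is given, the interior knots are placed at quantiles of x, i.e. where x is dense:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Made-up data purely for illustration.
rng = np.random.default_rng(0)
d = pd.DataFrame({"x": rng.uniform(0, 10, 300)})
d["y"] = np.sin(d["x"]) + rng.normal(scale=0.5, size=300)

# B-spline basis with a fixed number of parameters (df); the interior knots
# are placed at quantiles of x, i.e. where the data are dense.
for df in (3, 4, 5, 6):
    fit = smf.ols(f"y ~ bs(x, df={df})", data=d).fit()
    print(df, round(fit.aic, 1))   # AIC as one (occasionally appropriate) guide
```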
Best Answer
Natural language processing comes to mind. For instance, you might predict the amount of money someone spends on your website from their review. The review is text, encoded by an n-gram model: the (i, j) element of your training matrix is the count of the jth n-gram in the ith customer's review. An n-gram is a string of contiguous words, like "the product was excellent."
Typically n-gram data is encoded in a sparse matrix. Several data structures are suitable; one of the easiest to explain is a coordinate list, i.e. a list of tuples (i, j, value), one per nonzero entry. Since the matrix is mostly zero, this is much more efficient than allocating a dense array.
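A minimal sketch of that idea with SciPy (the customer and n-gram indices below are made up): `coo_matrix` stores exactly the (row, column, value) triples described above.

```python
from scipy.sparse import coo_matrix

# Hypothetical coordinate list: (customer index, n-gram index, count),
# one tuple per nonzero entry of the training matrix.
coords = [(0, 2, 1), (0, 7, 3), (1, 2, 2), (3, 5, 1)]
rows, cols, vals = zip(*coords)

# 4 customers x 10 n-grams, but only 4 entries actually stored.
X = coo_matrix((vals, (rows, cols)), shape=(4, 10))

print(X.nnz)        # number of stored nonzero entries: 4
print(X.toarray())  # dense view, only sensible for tiny examples
```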