Unless the closed form solution is extremely expensive to compute, it is generally the way to go when it is available. However, for most nonlinear regression problems there is no closed form solution.
Even in linear regression (one of the few cases where a closed form solution is available), it may be impractical to use the formula. The following example shows one way in which this can happen.
For linear regression on a model of the form $y=X\beta$, where $X$ is a matrix with full column rank, the least squares solution,
$\hat{\beta} = \arg\min_{\beta} \| X \beta - y \|_{2}$
is given by
$\hat{\beta}=(X^{T}X)^{-1}X^{T}y$
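As a minimal sketch of the closed form route (NumPy on a small synthetic problem; the sizes and variable names are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Small synthetic problem in which X has full column rank.
n_samples, n_features = 200, 5
X = rng.standard_normal((n_samples, n_features))
beta_true = rng.standard_normal(n_features)
y = X @ beta_true + 0.1 * rng.standard_normal(n_samples)

# Closed form via the normal equations: solve (X^T X) beta = X^T y.
# Solving the linear system is preferable to explicitly inverting X^T X.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)
```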
Now, imagine that $X$ is a very large but sparse matrix. For example, $X$ might have 100,000 columns and 1,000,000 rows, but only 0.001% of the entries in $X$ are nonzero. There are specialized data structures for storing only the nonzero entries of such sparse matrices.
Also imagine that we're unlucky, and $X^{T}X$ is a fairly dense matrix with a much higher percentage of nonzero entries. Storing a dense 100,000 by 100,000 $X^{T}X$ matrix would then require $1 \times 10^{10}$ floating point numbers (at 8 bytes per number, this comes to 80 gigabytes), which would be impractical on anything but a supercomputer. Furthermore, the inverse of this matrix (or more commonly a Cholesky factor) would also tend to have mostly nonzero entries.
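To make the storage arithmetic concrete, here is a small sketch using SciPy's sparse matrices (the sizes are scaled down from the example above; the 0.001% density is the same):

```python
import numpy as np
import scipy.sparse as sp

# Scaled-down stand-in for the example above (the real problem would be
# 1,000,000 x 100,000 with density 1e-5, i.e. 0.001% nonzero).
m, n, density = 100_000, 10_000, 1e-5
X = sp.random(m, n, density=density, format="csr", dtype=np.float64)

# CSR stores only the nonzero values plus index arrays, so memory scales
# with the number of nonzeros, not with m * n.
csr_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
dense_bytes = m * n * 8  # 8 bytes per float64 entry if stored densely
print(f"nonzeros: {X.nnz}")
print(f"CSR storage:   {csr_bytes / 1e6:.2f} MB")
print(f"dense storage: {dense_bytes / 1e9:.2f} GB")
```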
However, there are iterative methods for solving the least squares problem that require no more storage than $X$, $y$, and $\hat{\beta}$ and never explicitly form the matrix product $X^{T}X$.
In this situation, using an iterative method is much more computationally efficient than using the closed form solution to the least squares problem.
This example might seem absurdly large. However, large sparse least squares problems of this size are routinely solved by iterative methods on desktop computers in seismic tomography research.
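A minimal sketch of the iterative route, using LSQR from SciPy on a smaller sparse problem (again, the sizes are illustrative, not the full-scale case):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import lsqr

rng = np.random.default_rng(0)

# Smaller stand-in for the large sparse least squares problem.
m, n = 20_000, 2_000
X = sp.random(m, n, density=1e-3, format="csr", random_state=rng)
beta_true = rng.standard_normal(n)
y = X @ beta_true + 0.01 * rng.standard_normal(m)

# LSQR only needs the products X @ v and X.T @ u at each iteration; it never
# forms X^T X, so storage stays on the order of X, y, and the current iterate.
beta_hat = lsqr(X, y, atol=1e-10, btol=1e-10)[0]

print("relative error:",
      np.linalg.norm(beta_hat - beta_true) / np.linalg.norm(beta_true))
```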
For ordinary linear regression, maximum likelihood and least squares give the same answer: the maximum likelihood solution is the least squares solution. You can see this by deriving the so-called ``normal equations''; The Elements of Statistical Learning also discusses this.
But this is separate from how you find that solution. Gradient descent is only one method to find the solution, and it's actually quite a bad one at that (slow to converge). For example, Newton's method is much better for OLS (using various numerical algorithms to avoid inverting the Hessian directly).
But you are right in the sense that for very big problems, gradient descent becomes more useful because 2nd order methods like Newton's method can be computationally very expensive (again, there are approximations to that too).
I don't think EM is relevant for OLS; it can be useful for optimizing non-convex problems (OLS is convex).
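To make the Newton-versus-gradient-descent point above concrete, here is a minimal sketch on a small synthetic OLS problem (the data and step size are illustrative). Because the OLS objective is exactly quadratic, a single Newton step from any starting point lands on the least squares solution, while gradient descent needs many iterations:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 4))
y = X @ rng.standard_normal(4) + 0.1 * rng.standard_normal(500)

grad = lambda b: X.T @ (X @ b - y)   # gradient of 0.5 * ||X b - y||^2
hess = X.T @ X                       # Hessian, constant for OLS

# Newton's method: one step solves the quadratic problem exactly.
b0 = np.zeros(4)
b_newton = b0 - np.linalg.solve(hess, grad(b0))

# Gradient descent: many iterations with a step size tied to the Hessian's
# largest eigenvalue (chosen here just to guarantee convergence).
step = 1.0 / np.linalg.eigvalsh(hess).max()
b_gd = np.zeros(4)
for _ in range(1000):
    b_gd -= step * grad(b_gd)

print(np.allclose(b_newton, b_gd, atol=1e-6))  # both reach the same solution
```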
Best Answer
Gradient descent and gradient ascent are the same algorithm. More precisely, gradient ascent applied to $f(x)$,
$$ x_{n+1} = x_{n} + \gamma \nabla f(x_{n}), $$
is the same as gradient descent applied to $-f(x)$,
$$ x_{n+1} = x_{n} - \gamma \nabla \left(-f\right)(x_{n}). $$
This is true in the sense that gradient ascent in the first case and gradient descent in the second generate the same sequence of points, the first converges if and only if the second converges, and in case they both converge, they both converge to the same place.
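A tiny numerical check of that claim (the objective and step size are made up for illustration):

```python
# Concave toy objective f(x) = -(x - 3)^2; its negation is convex with the
# same optimizer at x = 3.
f_grad = lambda x: -2.0 * (x - 3.0)       # gradient of f
neg_f_grad = lambda x: -f_grad(x)         # gradient of -f

gamma = 0.1
x_ascent = x_descent = 0.0
for _ in range(50):
    x_ascent = x_ascent + gamma * f_grad(x_ascent)          # gradient ascent on f
    x_descent = x_descent - gamma * neg_f_grad(x_descent)   # gradient descent on -f

print(x_ascent, x_descent)  # identical iterates, both approaching 3
```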
For logistic regression, the cost function is
$$ \pm \sum_i \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right] $$
You get to choose one of these two options; it doesn't matter which, as long as you are consistent.
Since $p_i$ is between zero and one, $\log(p_i)$ is negative; hence, with the plus sign, every term in the sum is non-positive and the cost function is bounded above by zero.
Further, by letting $p_i \rightarrow 0$ for a point with $y_i = 1$, we can drive this cost function all the way to $- \infty$ (which can also be accomplished by letting $p_i \rightarrow 1$ for a point with $y_i = 0$). So this cost function has the shape of an upside-down bowl, hence it should be maximized, using gradient ascent.
If we use the negative of this cost function (the minus sign), we get exactly the opposite behavior: the cost is bounded below by zero, and we can force it to $+ \infty$. So this is a right-side-up bowl, and we should use gradient descent to minimize it.
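A minimal sketch of the descent option (minimizing the negated cost, i.e. the negative log-likelihood) on made-up synthetic data; the step size and iteration count are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical binary classification data.
n, d = 500, 3
X = rng.standard_normal((n, d))
w_true = np.array([1.5, -2.0, 0.5])
p_true = 1.0 / (1.0 + np.exp(-(X @ w_true)))
y = (rng.random(n) < p_true).astype(float)

def neg_log_likelihood(w):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def gradient(w):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return X.T @ (p - y)  # gradient of the negative log-likelihood

# Gradient descent on the "right-side-up bowl".
w = np.zeros(d)
step = 0.005
for _ in range(5000):
    w -= step * gradient(w)

print("estimate:", w)
print("final cost:", neg_log_likelihood(w))
```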