Unless the closed form solution is extremely expensive to compute, it is generally the way to go when it is available. However, for most nonlinear regression problems there is no closed form solution.
Even in linear regression (one of the few cases where a closed form solution is available), it may be impractical to use the formula. The following example shows one way in which this can happen.
For linear regression on a model of the form $y=X\beta$, where $X$ is a matrix with full column rank, the least squares solution,
$\hat{\beta} = \arg \min_{\beta} \| X \beta - y \|_{2}$
is given by
$\hat{\beta}=(X^{T}X)^{-1}X^{T}y$
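As a quick sketch (not part of the original answer), the normal-equations formula can be checked against a library least squares solver on a small synthetic problem; the dimensions and data here are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))          # full column rank with probability 1
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=50)

# Closed form: beta_hat = (X^T X)^{-1} X^T y, computed via a linear solve
# rather than an explicit inverse.
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Library solver for comparison (uses an SVD internally, not the normal equations).
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_normal, beta_lstsq))
```

Note that even here, good practice is `solve` rather than forming $(X^{T}X)^{-1}$ explicitly; the storage problem described next is a separate, more serious issue.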
Now imagine that $X$ is a very large but sparse matrix; for example, $X$ might have 100,000 columns and 1,000,000 rows, with only 0.001% of its entries nonzero. There are specialized data structures that store only the nonzero entries of such sparse matrices.
Also imagine that we're unlucky, and $X^{T}X$ is a fairly dense matrix with a much higher percentage of nonzero entries. Storing a dense 100,000 by 100,000 element $X^{T}X$ matrix would require $1 \times 10^{10}$ floating point numbers (at 8 bytes per number, this comes to 80 gigabytes), which would be impractical to store on anything but a supercomputer. Furthermore, the inverse of this matrix (or, more commonly, a Cholesky factor) would also tend to have mostly nonzero entries.
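This fill-in effect is easy to demonstrate on a scaled-down stand-in for the example above (the dimensions and density here are my own, chosen so the script runs quickly):

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
# A 10,000 x 1,000 sparse matrix with ~0.1% nonzero entries, stored in CSR
# format, which keeps only the nonzero values and their indices.
X = sp.random(10_000, 1_000, density=0.001, format="csr", random_state=rng)

XtX = (X.T @ X).tocsr()

density_X = X.nnz / (X.shape[0] * X.shape[1])
density_XtX = XtX.nnz / (XtX.shape[0] * XtX.shape[1])
# X^T X is markedly denser than X: entry (i, j) is nonzero whenever
# columns i and j of X share any nonzero row.
print(density_X, density_XtX)
```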
However, there are iterative methods for solving the least squares problem that require no more storage than $X$, $y$, and $\hat{\beta}$ and never explicitly form the matrix product $X^{T}X$.
In this situation, using an iterative method is much more computationally efficient than using the closed form solution to the least squares problem.
This example might seem absurdly large. However, large sparse least squares problems of this size are routinely solved by iterative methods on desktop computers in seismic tomography research.
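One such iterative method is LSQR, available in scipy. The sketch below (my own illustration, with smaller dimensions than the example above) solves a sparse least squares problem while touching $X$ only through the products $Xv$ and $X^{T}u$, so $X^{T}X$ is never formed or stored:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import lsqr

rng = np.random.default_rng(1)
# 100,000 x 1,000 sparse matrix with ~0.1% nonzero entries.
X = sp.random(100_000, 1_000, density=0.001, format="csr", random_state=rng)
beta_true = rng.normal(size=1_000)
y = X @ beta_true                      # consistent system, so residual -> 0

# lsqr uses X only through matrix-vector products; total extra storage is
# a handful of vectors the size of y and beta.
beta_hat = lsqr(X, y, atol=1e-10, btol=1e-10)[0]

print(np.linalg.norm(X @ beta_hat - y) / np.linalg.norm(y))  # small relative residual
```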
Maximizing correlation is useful when the output is highly noisy, i.e., when the relationship between inputs and outputs is very weak. In such a case, minimizing MSE will tend to drive the output close to zero, so that the prediction error is the same as the variance of the training output.
Directly using correlation as the objective function is possible with a gradient descent approach (simply minimize minus the correlation). However, I do not know how to optimize it with an SGD approach, because the cost function and its gradient involve the outputs of all training samples.
Another way to maximize correlation is to minimize MSE while constraining the output variance to equal the training output variance. However, the constraint also involves all outputs, so (in my opinion) there is no way to take advantage of an SGD optimizer.
EDIT:
If the top layer of the neural network is a linear output layer, we can minimize MSE and then adjust the weights and bias in the linear layer to maximize the correlation. The adjustment can be done similarly to CCA (https://en.wikipedia.org/wiki/Canonical_analysis).
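For a scalar target this adjustment has a simple special case (my own sketch, not from the original answer): with an intercept, $R^2$ equals the squared correlation, so re-fitting the final linear layer's weights on the last hidden activations by least squares already maximizes the correlation over all linear readouts. Here `H` stands in for the hidden activations:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 1_000, 8
H = rng.normal(size=(n, k))             # stand-in for last-hidden-layer activations
y = H @ rng.normal(size=k) + rng.normal(size=n)

def corr(a, b):
    a, b = a - a.mean(), b - b.mean()
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Least squares readout with an intercept: since R^2 = corr^2 for OLS with
# an intercept, these weights maximize |corr| over all linear readouts of H.
H1 = np.column_stack([H, np.ones(n)])
w_ols, *_ = np.linalg.lstsq(H1, y, rcond=None)

w_rand = rng.normal(size=k + 1)          # any other readout does no better
print(corr(H1 @ w_ols, y), corr(H1 @ w_rand, y))
```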
Best Answer
In some cases, yes. If you're using logistic regression as a classifier, then optimizing the cost is the same as optimizing the log likelihood. Not every model is like this: deep neural networks do not map nicely onto a statistical counterpart (though I imagine there is some active research on their statistical properties).
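The logistic regression equivalence is easy to verify numerically (my own sketch on synthetic data): the cross-entropy cost minimized during training is exactly the negative Bernoulli log likelihood divided by the sample size.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w = rng.normal(size=3)
p = 1 / (1 + np.exp(-(X @ w)))           # model probabilities
y = (rng.random(200) < p).astype(float)  # labels drawn from the model

# Cross-entropy cost, as minimized when training a logistic classifier.
cost = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Bernoulli log likelihood of the labels under the same model.
loglik = np.sum(np.log(np.where(y == 1, p, 1 - p)))

print(np.isclose(cost, -loglik / len(y)))
```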
As to variable selection via AIC or similar, that would be a form of feature selection. I think that is something to ask about on its own.
As to your titular question, inference is not our main concern; predictive capability is. Besides, most ML problems (most, not all) work with data sets so large that significance becomes a straw man: the sheer size of the data allows for high-precision estimates, and since no effect is truly zero, you would find that all effects are significant.